Graduate essay: Factor Language Model Programming

Factor Language stick ProgrammingLanguage ModelLanguage ideal stand bys a speech recognizer figure come forth how likely a get hold back sequence is, independent of acoustics. T here is a linguistic and statistical fire to calculate the probability. The linguistic technique tries to understand the syntactical and semantic body structure of a language and derive the probabilities of condition sequences using this knowledge. The ch entirelyenge here is to have proper co fact statistics of the unit of recognition. The approach in rehearse evaluates a huge text corpus in a statistical way and intelligence activity transitions. Current language models make no use of the syntactic properties of natural language but rather use very simple statistics such as countersignature co-occurrences. Recent results show that incorporating syntactic constraints in a statistical language model reduces the pronounce misapprehension rate on a conventional dictation task by 10% M.S.Salam, 2 009.Proposed Language ModelThe approach proposed here uses factored language model which incorporates the geomorphological knowledge. Factored language models have recently been proposed for incorporating morphological knowledge in the modeling lexicon. As suffix and compound language ar the cause of the growth of the vocabulary, a logical creative thinker is to split the row into shorter units.The language model proposed in this look into is base on morphology. A morphological analyser obtains and verifies the internal structure of a condition over complete word form Rosenfield, 2000. Building a morphological analyser for highly inflecting, agglutinative languages is a ch eitherenging task. It is very difficult to build a high performance analyser for such languages. The main idea here is to divide a given word form into a stem and single suffix.Morphology plays a much greater role in Telugu. An inflected Telugu word starts with a stem and may have suffix(s) added to the right according to complex rules of saMdhi. This re calculate proposes a new info structure based on Inverted Index and an efficient algorithmic program for accessing its elements. Few researchers have used tries for efficient retrieval from dictionary, earlier. This research kick the bucket is different from earlier work in two waysa) variation to the structure of trieb) the method of identifying and combining flexures.Modified Trie StructureA trie is a tree based data structure for storing depicts in order to support fast pattern matching. A trie T re states the strings of narrow down S of n strings with paths from stemma to the external boss of T. Fig 5.1 Original Trie StructureThe trie considered here is different from standard trie in two ways1) A standard trie does not wholeow a word to be prefix of another, but the proposed trie structure every(prenominal)ows a word to be prefix of another word. The node structure and search algorithm withal is given according to t his new property.2) Each word in a standard trie ends at an external node, where as in the modified trie a word may end at either an external node, or the internal node. Irrespective of whether the word ends at internal node or external node, the node stores the index of the associated word in the occurrence cite.The node structure is changed such that, each node of the trie is represented by a triplet C,R,Ind.C represents role stored at that node. R represents whether the concatenation of characters from root till that node forms a meaningful stem word. Its value is 1, if characters from root node to that node form a stem, 0 otherwise.Ind represents index of the occurrence list. Its value depends on the value of R. Its value is -1 (negative 1), if R=0, indicating it is not a valid stem. So no index of occurrence list matches with it. If R=1, its value is index of occurrence list of associated stem.Fig 5.2 Modified Trie StructureAdvantages relative to binary search treeThe followi ng be the main advantages of tries overbinary search trees(BSTs)Looking up fundamentals is faster. Looking up a key of lengthmtakes worst caseO(m) time. A BST performs O(log(n)) similes of keys, wherenis the number of elements in the tree, because lookups depend on the foresight of the tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(mlogn) time. Moreover, in the worst case log(n) result approachm. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines.Tries can require less space when they endure a large number of short strings, because the keys are not stored explicitly and nodes are shared between keys with common initial subsequences.Tries facilitatinglongest-prefix matching, helping to find the key sharing the longest possible prefix of characters all unique.Corpus structure of proposed Language ModelThe corpus consists of the following modules Stem word dictionaryThis accommodates all the stems of the language. Stem word dictionary is implemented as an Inverted Index for better efficiency. The Inverted index will have the following two data structures in it1) Occurrence list It is an array of equates, 2) Stem trie consisting of stem speech communicationOccurrence list is constructed based on the grammar of the language, where each entry of the list contains the pair (ii) Inflection DictionaryThis dictionary contains the list of all possible inflections of the Telugu language. Each entry of Stem word dictionary lists the indexes of this dictionary to indicate which all inflections are possible with that stem.The proposed corpus structure helps in reducing the corpus size drastically. Every stem word may have number of inflections possible. If the inflected words are stored as it is, then corpus size would be m*n, where m is number of stem words and n is number of inflections. Instead of storing all the inflected words, the proposed corpus structure stores stem words and inflections separately, and handles the inflected words through morphology. Hence the corpus size required is for m stem words and n inflections i.e., m+n. Thus there is a great reduction in the corpus size. For a corpus of 1000 stem words and 10 inflections, the required corpus size is 1000+10=1010, which otherwise would have required 1000*10=10000.Fig 5.3 Corpus structure of proposed Language ModelTextual Word subdivisionation using Proposed Language ModelThe proposed language model is used to develop a textual word segmenter. A word segmenter is used to divide the given inflected word into a stem and single inflection. This is required as the corpus stores stems and inflections separately.Input the word segmenter is an Inflected word. Syllabifier takes this word and divides the word into syllables and identifies if the letter is a vowel or a consonant. After applying the rules syllabified form of the input will be obtained. On ce the process of syllabification is d hotshot, this will be taken up by the analyzer. Analyzer separates the stem and inflection interpreter of the given word. This stem word will be validated by examine it with the stem words present in stem dictionary. If the stem word is present, then the inflection of the input word will be compared with the inflections present in inflection dictionary of the given stem word. If both the inflections get matched then it will directly displays the output otherwise it takes the allow inflection(s) through comparison and then displays.Syllabification is the separation of the words into syllables, where syllables are considered as phonological building blocks of words. It is dividing the word in the way of our pronunciation. The separation is marked by hyphen. In the morphological analyzer, the main objective is to divide the given word into root word and the inflection. For this, we divide the given input word into syllables and we compare the s yllables with the root words and inflections to get the root word and appropriate inflection.Fig 5.4 Block diagram of Word Segmentr for textSteps for word segmentationReceiving the inflected word as an input from the exploiter.Syllabify the inputAnalyze the input and validating the stem word.Identify the appropriate inflection for the given stem word by comparing the inflection of given word with the inflections present in inflection dictionary of the stem word.Displaying the appropriate inflected word.For example, considering the word nAnnagariki (-) meaning to father, the input is given the user in Roman transliteration format. This input is basically divided into lexemes as directly, the array is processed which gives the type of lexeme by applying the rules of syllabification one by one.Applying come up 1 No two vowels come together in Telugu literature.The given user input does not have two vowels together. Hence this rule is meet by the given user input. The output after ap plying this rule is same as above. If the rule is not satisfied, an error message is displayed that the given input is incorrect. Now the array is c v c c v c v c v c vApplying line up 2 initial and final consonants in a word go with the first and last vowel respectively.Telugu literature rarely has the words which end up with a consonant. mostly all the Telugu words end with a vowel. So this rule does not mean the consonant that ends up with the string, but it means the last consonant in string. The diligence of this rule2 changes the array as followingc v c c v c v c v c vcv c c v c v c v cvThis generated output is further processed by applying the other rules.Applying Rule 3 VCV The C goes with the right vowel.The string wherever has the form of VCV, then this rule is applied by dividing it as V CV. In the above rule the consonant is connectd with the vowel, but here in this rule the consonant is combined with the right vowel and separated from the odd vowel. To the output generated by the application of rule2, this rule is applied and the output will be as cv c c v c v c v cvcv c c v cv cv cvThis output is not yet completely syllabified, one more than rule is to be applied which finishes the syllabification of the given user input word.Applying Rule 4 Two or more Cs between Vs First C goes to the left and the rest to right.It is the string which is in the form of VCCC*V, then according to this rule it is split as VC CC*V. In the above output VCCV in the string can be syllabified as VC CV. Then the output becomescv c c v cv cv cv cvc cv cv cv cvNow this output is converted to the respective consonants and vowels. Thus giving the complete syllabified form of the given user input.nAn na cA ri ku cvc cv cv cv cvHence, for the given user input, nAnnagAriki, the generated syllabified form is, nAn na gA ri ki.Fig 5.5 Word Segmenter showing an inflected word without change in stem formFig 5.6 W ord Segmenter showing an inflected word with a change in stem formSCIL bringing Corrector for Indian LanguagesIn inflectional language every word consists of one or several morphemes into which the word can be segmented. The approach used here aims at reducing the above mentioned problem of having a very huge corpus for good recognition accuracy. It exploits the characteristic of Telugu language that every word consists of one or several morphemes into which the word can be segmented.SCIL is a procedureTo deal with complex word formsapplied after recognitionUsing which misrecognized words are correctedArchitecture of SCILThe design of Speech Corrector for Indian Languages, consists of the Syllable Identifier, Phone Sequence Generator, Word Segmenter, and Morpho- Syntactic Analyzer modules. Input speech is decoded by a normal ASR system which gives the identified word as a string. The sequence of phones would be the input to the Word Segmenter module which matches the phonetized in put with the root words stored in dictionary module, and generates a possible set of root words. Morpho-Syntactic Analyzer compares the inflection part of the point out with the possible inflections list from the database and gives correct inflection. This will be given to Morph Analyzer to apply morpho-syntactic rules of the language and gives the correct inflected word.Fig 5.7 Block diagram of SCILi) Syllable IdentifierSyllable identifier marks the rough boundaries of the syllables and labels them. At this stage , we get list of syllables separated with hyphen. The user input is syllabified and this would be the input to the next module. E.g. dE-vA-la-yA-kuii) Phone Sequence GeneratorAs the words in the dictionary are stored at phone level transcription, this module generates the phone sequences from the syllables. E.g. d-E-v-A-l-a-y-A-k-uiii) Word SegmentorThis module compares the phonetized input from starting with the root words stored in dictionary module and lists the possib le set of root words. The possible root word is dEvAlayamu.iv) DictionaryDictionary contains stems and inflections separately. It does not store inflected words as it is very difficult, if not impossible, to cover all inflected words of the language. The database consists of 2 dictionariesStem DictionaryInflection DictionaryStem dictionary contains the stem words of the language, signal information for that stem which includes the sequence and mess of that utterance and list of indices of inflection dictionary which are possible with that stem word.Inflection Dictionary contains the inflections of the language, signal information for that inflection which includes the duration and location of that utterance. Both the dictionaries are implemented using trie structure in order to reduce the search space.v) Morpho Syntactic AnalyzerThis module compares the inflection part of the signal with the possible inflections list from the database and gives correct inflection. This will be giv en to Morph Analyzer to apply morpho-syntactic rules of the language and gives the correct inflected word.Post Recognition functionCapture the utterance, an isolated inflected word.Get its syllabified form.Generate phone sequence from the syllabified word.Compare the phone sequences with stem words in the dictionary and identify the stem.Segment the word into stem and inflection.Get the list of possible inflections.Compare the inflection signals possible with that stem one by one and apply morpho-syntactic rules of the language to combine stem and inflection.Display the inflected word.Using the rules the possible set of root words are combined with possible set of inflections and the obtained results are compared with the given user input and the nearest possible root word and inflection are displayed if the given input is correct. If the given input is not correct then the inflection part of the given input word is compared with the inflections of that particular root word and ide ntifies the nearest possible inflection and combines the root word with those identified inflections, applies sandhi rules and displays the output. When there is more than one root word or more than one inflection has minimum edit distance then the model will display all the possible options. User can choose the correct one from that. For example, when the given word is pustakaMdO (), the inflections tO making it pustakaMtO () meaning with the book and lO making it pustakaMlO () meaning in the book) mis are possible. Present work will list both the words and user is given the option. We are working on improving this by selecting the appropriate word based on the context.SCIL AlgorithmW=Utterance.wavSyl=SyllableIdentifier(W)Phone=phonetizer(Syl)Stem=getStem(Syl)Infl=getInflections(Stem)While (not exactMatch)word=MorphAnalyzer(stem,inflMatch)display wordStopWorking of SCILOnce possible root words identified the given word is segmented into two parts, first being the root word and seco nd part inflection. Now the inflection part is compared in the reverse direction for a match in the inflection dictionary. It will consider only the inflections that are mentioned against the possible root words, thus reducing the search space and making the algorithm faster.For example consider nAnnagariki (-) meaning to father, is misrecognized as nAn-na-cA-ri-ku () then SCIL is applied and will correct the recognition error as followsThe output from ASR is nAn-na-cA-ri-ku. The phone sequence generator will generate the phone sequence as n-A-n-n-a-c-A-r-i-k-u. Now, match it with the set of root words stored in dictionary module. This process will identify the possible set of root words from the Stem dictionary as followsOnce possible root words identified the given word is segmented into two parts, first being the root word and second part inflection. Now the inflection part is compared for a match in the inflection dictionary. It will consider only the inflections that are mentio ned against the possible root words, thus reducing the search space and making the algorithm faster.Possible set of inflections in inflections dictionaryAfter getting the possible set of root words and possible set of inflections they are combined with the help of SaMdhi formation rules. Here in this example cA-ri-ku is compared with the inflections of the root word nAnnaAfter comparing it identifies gAriki as the nearest possible inflection and combines the root word with the inflection and displays the output as nAnnagAriki.ConclusionsLanguage model proposed in this work results in reduction in corpus size by using factored approach. The search process is fastened by use of trie based structure. A change to standard trie is proposed.A post recognition procedure SCIL, is designed which uses the proposed language model and corrects the words misrecognized at inflections. The approach is tested using 1500 speech samples. These samples consist of 100 distinct words , each word repeate d 3 propagation and recorded by 5 speakers in the age group 18-50. It is implemented as a speaker dependent system. An average model is built from the third utterances of each word for each speaker. Each speaker is given a unique ID, using which average model of that speaker is used for testing.

Graduate essay

Tuesday, June 4, 2019

Factor Language Model Programming

No comments:

Post a Comment