A Whole-Sentence Input Method Based on Dynamic Programming

In general, when dynamic programming is used to solve this kind of problem, we don't call the approach "dynamic programming"; we call it a "hidden Markov model". But if we simply use a dynamic programming algorithm to solve an ordinary directed acyclic graph, then it can only be called dynamic programming...

What we want to talk about this time is a whole-sentence input method based on a lexicon, not a solution based on the state transitions of a hidden Markov model.


Since there is no model, our whole-sentence input is built on words, so we need a lexicon. The lexicon should record the most ordinary, common words, and one more thing: the word frequencies (weights) must be accurate.


You can think of it this way: when you type a string of pinyin, each syllable corresponds to many characters. If you look at single characters only, far too many candidates match each syllable. This is called duplicate codes. Words in the lexicon behave differently: if the user types a word, we match it against the lexicon first. Because a word's code is longer, the chance of duplicate codes drops sharply. Combined with word frequency, we get a candidate list that better matches common sense for the user to choose from.

But what about sentences? When the user's input is too long to match any word in the lexicon, what do we do?

Sentence calculation

Simple Version

Following the principle above, let's think it through: for such a long sentence, we can consider splitting it up. For example, go from front to back and match the longest word each time, and whole-sentence input is achieved!

For example, you type luo/ge/shu/ru/fa/ke/neng/shi/qi/jin/wei/zhi/zui/hao/yong/de/shu/ru/fa

Then I go through each syllable from front to back and take the most likely match. The result is probably something like this, and how bad it is depends on your lexicon:

Rogge / IME / may be its / Guards / organic is best to use / / input method

Obviously, this is very far from what we want. If you release it like this, your users will want to kill you. So let's change the approach: still match the longest possible word in order, but don't match everything automatically. Starting from the first word that can be matched, pause the computation and let users choose a suitable candidate themselves.

OK, although this is not truly intelligent whole-sentence input, it is barely usable as long as the user has a lexicon, and if your lexicon data is fairly complete, the results shouldn't be bad.
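The fully automatic longest-match pass described in this section can be sketched roughly as follows. This is only a minimal illustration, not the author's implementation: the function name `forward_max_match`, the `max_len` limit and the toy lexicon (pinyin keys mapped to a single best word, using this article's English glosses) are all assumptions made for the example.

```python
def forward_max_match(syllables, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon match."""
    result = []
    i = 0
    while i < len(syllables):
        # Try the longest span first, then shrink it one syllable at a time.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            key = tuple(syllables[i:i + n])
            if key in lexicon:
                result.append(lexicon[key])
                i += n
                break
        else:
            # Nothing matched: keep the raw syllable so the loop still advances.
            result.append(syllables[i])
            i += 1
    return result


# Hypothetical toy lexicon: pinyin key -> best word for that key.
TOY_LEXICON = {
    ("luo", "ge"): "Rogge",
    ("shu", "ru"): "enter",
    ("shu", "ru", "fa"): "Input",
    ("fa",): "hair",
}

print(forward_max_match("luo/ge/shu/ru/fa".split("/"), TOY_LEXICON))
# prints ['Rogge', 'Input']
```

Because each step commits to the longest match without looking ahead, a long match early on can swallow the first syllable of the next word, which is exactly the failure shown in the bad result above.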

Advanced Version

So, is that it? Of course not. You can see we haven't yet used the "dynamic programming" in the title. Think about it carefully: our lexicon comes with accurate word frequencies, so how do we put them to use? Our algorithm plans over the lexicon by word frequency: frequencies are converted into weights, and the right words carry larger weights.

So we enumerate all possible combinations and multiply together the frequencies of the words in each one. The larger the product, the more the combination reads like a real sentence. Finding the optimal combination (the one with the largest overall frequency product) should be about right.

Take luo/ge/shu/ru/fa as an example.

Suppose our lexicon looks like this:

  • luo/ge → Pocketed: 0.11; Rogge: 0.15; Luo Ge: 0.13
  • shu → Number: 0.15; The number of cells: 0.13
  • shu/ru → enter: 0.21; Shu as: 0.19
  • ru/fa → Confucianism law: 0.09
  • shu/ru/fa → Input: 0.29
  • fa → hair: 0.05; law: 0.03

Then the possible results include these (partial list):

  1. Rogge / Input - 0.15*0.29 = 0.0435
  2. Pocketed / Input - 0.11*0.29 = 0.0319
  3. Rogge / enter / hair - 0.15*0.21*0.05 = 0.001575
  4. Pocketed / enter / hair - 0.11*0.21*0.05 = 0.001155

You see, this time we get a somewhat better solution. The simple version can also produce "Rogge input method", but it matches mindlessly, so it is quite likely to tear a later word such as "by far" apart and merge a piece of it into the previous word. In the advanced algorithm, the computed weight of "by far" is higher than that of "may be its", so the word is not torn apart. With weights, we can very likely avoid breaking up high-frequency words.
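The enumerate-and-multiply idea of this section can be sketched as below. It is a rough illustration rather than the actual implementation: the helper names `enumerate_sentences` and `rank` are hypothetical, the words and frequencies are the ones from the example lexicon, and the pinyin keys are inferred from that example.

```python
from math import prod

# Example lexicon: pinyin key -> {word: frequency}.
# Assumption: the pinyin keys are inferred from the luo/ge/shu/ru/fa example.
LEXICON = {
    ("luo", "ge"): {"Pocketed": 0.11, "Rogge": 0.15, "Luo Ge": 0.13},
    ("shu",): {"Number": 0.15, "The number of cells": 0.13},
    ("shu", "ru"): {"enter": 0.21, "Shu as": 0.19},
    ("ru", "fa"): {"Confucianism law": 0.09},
    ("shu", "ru", "fa"): {"Input": 0.29},
    ("fa",): {"hair": 0.05, "law": 0.03},
}


def enumerate_sentences(syllables, lexicon, start=0):
    """Yield (words, frequencies) for every way to cover syllables[start:]."""
    if start == len(syllables):
        yield [], []
        return
    for end in range(start + 1, len(syllables) + 1):
        candidates = lexicon.get(tuple(syllables[start:end]), {})
        for word, freq in candidates.items():
            for words, freqs in enumerate_sentences(syllables, lexicon, end):
                yield [word] + words, [freq] + freqs


def rank(syllables, lexicon):
    """Score every combination by the product of its frequencies, best first."""
    scored = [(prod(freqs), words)
              for words, freqs in enumerate_sentences(syllables, lexicon)]
    return sorted(scored, reverse=True)


ranking = rank("luo/ge/shu/ru/fa".split("/"), LEXICON)
best_score, best_words = ranking[0]
print(best_words, round(best_score, 6), len(ranking))
```

With this lexicon the top-ranked combination is Rogge / Input. Even for five syllables there are already 21 combinations to score, which hints at why plain enumeration does not scale to long input.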


Viterbi Algorithm

Viterbi is probably the most widely used dynamic programming algorithm, found everywhere from digital communications to language and text processing. So we use it here too. However, we are not mathematicians, and studying the algorithm itself in depth is probably unnecessary, so I won't go into its details for now. The main thing is to see how to use it to solve our sentence problem.

After all, computing every possible word combination in a sentence is a dumb brute-force approach. It's fine for short input, but what if the user types 10 characters? 50?

We start from the first syllable of the pinyin string and traverse the whole input in order, finding every starting word that exists in the lexicon, and then use those words as the starting nodes of a graph. For example, still with luo/ge/shu/ru/fa, and assuming the lexicon has no entry for "Rogge input method", the possible starts are:

  • (Omitted: a series of single characters, and so on)
  • Pocketed
  • Rogge
  • Luo Ge
  • Rogge book

Each of these starts then builds paths in turn, and we follow each path through the graph. The process repeats until the end of the input, which gives us all the possible combinations. Because we have converted the problem into a trellis-shaped directed graph, a dynamic programming algorithm greatly reduces the computational complexity and makes our algorithm practical to run.
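A minimal sketch of such a dynamic-programming pass is shown below. It is a simplified, Viterbi-style search over the word lattice that scores a path only by the product of word frequencies (no transition probabilities), so it should be read as an illustration under that assumption rather than as the input method's real code. The name `best_sentence`, the `max_len` limit and the pinyin keys of the lexicon are hypothetical.

```python
# Example lexicon: pinyin key -> {word: frequency} (keys are an assumption).
LEXICON = {
    ("luo", "ge"): {"Pocketed": 0.11, "Rogge": 0.15, "Luo Ge": 0.13},
    ("shu",): {"Number": 0.15, "The number of cells": 0.13},
    ("shu", "ru"): {"enter": 0.21, "Shu as": 0.19},
    ("ru", "fa"): {"Confucianism law": 0.09},
    ("shu", "ru", "fa"): {"Input": 0.29},
    ("fa",): {"hair": 0.05, "law": 0.03},
}


def best_sentence(syllables, lexicon, max_len=4):
    """Return (score, words) of the best segmentation, or None if none exists."""
    n = len(syllables)
    # best[i] holds the best (score, words) covering syllables[:i], or None.
    best = [None] * (n + 1)
    best[0] = (1.0, [])
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            if best[start] is None:
                continue  # no way to reach this position, so nothing to extend
            prev_score, prev_words = best[start]
            candidates = lexicon.get(tuple(syllables[start:end]), {})
            for word, freq in candidates.items():
                score = prev_score * freq
                # Keep only the best path into each position.
                if best[end] is None or score > best[end][0]:
                    best[end] = (score, prev_words + [word])
    return best[n]


score, words = best_sentence("luo/ge/shu/ru/fa".split("/"), LEXICON)
print(words, round(score, 6))
```

Each position keeps only its single best incoming path, so the work grows with the input length times `max_len` times the number of candidates per key, instead of with the number of complete combinations.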


Because a sentence can be very long, the computation may run into a problem: the weights become so small that the computer's floating-point precision is insufficient, which leads to wrong results. Briefly, the fix is to take logarithms. For example, the small value 0.000000000000232 has a base-10 logarithm of -12.6345120151091. This keeps a tiny number from being discarded when it is combined with a large one. We add the logarithms instead of multiplying the original weights.
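The effect is easy to reproduce. The following sketch compares multiplying raw weights with adding their base-10 logarithms; the function names and the weights (80 words at 1e-5 or 1e-6 each) are made up purely for illustration.

```python
import math


def product_score(weights):
    """Multiply raw weights; underflows to 0.0 once the result gets too small."""
    score = 1.0
    for w in weights:
        score *= w
    return score


def log_score(weights):
    """Add base-10 logarithms instead of multiplying the weights."""
    return sum(math.log10(w) for w in weights)


# Two hypothetical long paths: 80 words each, with different per-word weights.
path_a = [1e-5] * 80
path_b = [1e-6] * 80

# Both products fall below the smallest float and cannot be told apart.
print(product_score(path_a), product_score(path_b))

# The log scores stay comparable: path_a is clearly the better one.
print(log_score(path_a), log_score(path_b))

# The single value from the text above.
print(math.log10(0.000000000000232))
```

Since the logarithm is monotonic, the path with the largest sum of logarithms is also the path with the largest product, so the ranking is unchanged.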

Summary

Implementing whole-sentence input with plain dynamic programming plus a lexicon is not hard to understand and very simple to implement. It is also the algorithm the Rogge input method has been using all along. After all, it is fast! Compared with an HMM-based, character-level solution, word-based input obviously has far fewer duplicate codes!

But the prerequisites are also demanding. The lexicon must be as large as possible, because words that are not in the lexicon cannot take part in the computation. The word frequencies must also be realistic and of high quality, otherwise the computed results will be unsatisfactory.

Also, we did not discuss how to split pinyin into syllables. Since the Rogge input method is a shuangpin (double pinyin) input method, thanks to the features of shuangpin I have not had to face this problem so far. Full pinyin spellings are variable-length, so the best approach is to encode them, which also involves pinyin segmentation, and that is another topic.
