How to split the full pinyin of drop-off input method

Five years ago I wrote an article about a sentence-level input method based on dynamic programming. The end of that article mentioned the problem of pinyin splitting. Because the drop-off input method was mainly aimed at Shuangpin at the time, no real splitting was required: just cut the input into pairs of letters. (This is another reason why I admire Shuangpin. After all, it means one less technical difficulty.)
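To make "cut the input into pairs" concrete, here is a minimal Python sketch. The function name `split_shuangpin` and the sample string are illustrative assumptions, not part of the engine. The only fact it relies on is that every Shuangpin syllable is typed with exactly two keys.

```python
def split_shuangpin(code: str) -> list[str]:
    """Cut a Shuangpin key sequence into two-letter syllable codes.

    A trailing odd letter is kept as an incomplete syllable.
    """
    return [code[i:i + 2] for i in range(0, len(code), 2)]


# The letters below are arbitrary; which syllable each pair stands for
# depends on the Shuangpin scheme in use.
print(split_shuangpin("fhanx"))  # ['fh', 'an', 'x']
```

No dictionary lookup is involved, which is why Shuangpin sidesteps the ambiguity discussed in the rest of this article.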

Later, the drop-off input method added support for full pinyin. Only when I started optimizing full pinyin did I discover that pinyin segmentation is even harder than Chinese word segmentation.

When pinyin segmentation comes up, many people first compare it to English word segmentation. That is not really accurate. The two look similar in form (both use Latin letters, both are separated by spaces, and a single letter carries no meaning), but English has a very large vocabulary, while Chinese pinyin has only a few hundred syllables. What pinyin segmentation really corresponds to is Chinese word segmentation: at least both deal with the same language, the number of units is not that large, and context matters.

Popular schemes online

If you search for "pinyin splitting algorithm" on the Internet, most of what you find is written for search engines. Nobody studies pinyin splitting for input methods. These splitting schemes all share a fatal problem: the split is lossy, and it cannot handle ambiguous splits.

such as:

Where the problem lies

All of these schemes "insert spaces into the pinyin string", that is, they produce a single possible split. But pinyin splitting is ambiguous. For example, zhanan can be zha'nan (scumbag), but it can also be zhan'an (standing press). Both are reasonable, legal pinyin. If you think the latter is not in common use (it is not a word at all), take another example: fangan can be fan'gan (disgusted) or fang'an (plan). There are far too many pinyin combinations like this, and they are a disaster for Chinese input methods.

"Inserting spaces into the pinyin string" describes the final output form, not the algorithm itself.

Another kind of ambiguity changes the number of syllables in the split. Typical cases: xian can also be xi'an, and lian can also be li'an ……
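Both kinds of ambiguity show up as soon as every legal split is enumerated instead of just one. The following Python sketch does that by brute force. `SYLLABLES` is a tiny hand-picked subset of legal pinyin chosen only to cover the examples above (a real table has a few hundred entries), and `all_splits` is a hypothetical helper name.

```python
# Toy syllable table: an assumption for illustration, not a complete pinyin list.
SYLLABLES = {"fan", "fang", "gan", "an", "xi", "xian", "li", "lian"}


def all_splits(s: str) -> list[list[str]]:
    """Return every way to cut s into syllables from SYLLABLES."""
    if not s:
        return [[]]
    results = []
    for end in range(1, len(s) + 1):
        head = s[:end]
        if head in SYLLABLES:
            for rest in all_splits(s[end:]):
                results.append([head] + rest)
    return results


print(all_splits("fangan"))  # [['fan', 'gan'], ['fang', 'an']]  same length
print(all_splits("xian"))    # [['xi', 'an'], ['xian']]          different length
```

fangan yields two splits with the same number of syllables, while xian yields splits of different lengths. A scheme that returns a single string with spaces inserted has to throw one of them away.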

The easy way

Actually, if you look closely, you will find that the problem is very regular. Take fangan: if we split it using the longest-match principle, the forward split gives fang'an and the reverse split gives fan'gan. Perfect. But the reverse split goes wrong easily. Take the phrase susongan: the forward split is su'song'an, but the reverse split becomes su's'o'n'gan, a total failure.
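The forward and reverse longest-match splits can be sketched as follows. This is a minimal Python illustration, not the engine's code: `SYLLABLES` is a toy subset of legal pinyin, `MAX_LEN` assumes the longest syllable has six letters (such as zhuang), and a letter that matches nothing is emitted on its own.

```python
# Toy syllable table: an assumption for illustration, not a complete pinyin list.
SYLLABLES = {"fan", "fang", "gan", "an", "su", "song"}
MAX_LEN = 6  # assumed length of the longest legal syllable


def forward_split(s: str) -> list[str]:
    """Greedy longest match scanning from the left."""
    out, i = [], 0
    while i < len(s):
        for size in range(min(MAX_LEN, len(s) - i), 0, -1):
            piece = s[i:i + size]
            if piece in SYLLABLES or size == 1:  # single letter is the fallback
                out.append(piece)
                i += size
                break
    return out


def backward_split(s: str) -> list[str]:
    """Greedy longest match scanning from the right."""
    out, j = [], len(s)
    while j > 0:
        for size in range(min(MAX_LEN, j), 0, -1):
            piece = s[j - size:j]
            if piece in SYLLABLES or size == 1:
                out.append(piece)
                j -= size
                break
    return out[::-1]


print(forward_split("fangan"), backward_split("fangan"))
# ['fang', 'an'] ['fan', 'gan']
print(forward_split("susongan"), backward_split("susongan"))
# ['su', 'song', 'an'] ['su', 's', 'o', 'n', 'gan']
```

The reverse scan grabs gan first, and nothing legal can be built from the remaining suson, so it falls apart into single letters.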

Transition matrix

OK, since the approach above does not work, I thought of another way: for Chinese text I already use a transition matrix to record transition probabilities and then solve for the best path, so why not do the same for pinyin splitting? The cost is that solving twice is a bit slow.

So I collected statistics on pinyin transitions. Pinyin is not English, after all: it has too few units, so first-order and second-order transitions make no real difference. Overall, though, the result was better than longest match, and combined with pinyin frequencies it occasionally produced the correct split. I used this scheme for over two years and improved it many times along the way, but most of the ambiguous splits ended up being hard-coded by hand.
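A first-order version of this idea can be sketched as a small dynamic program over split points. Everything numeric here is invented for illustration: the `TRANSITIONS` values are made-up toy probabilities, not statistics from any corpus, and `FLOOR`, `SYLLABLES` and `best_split` are hypothetical names.

```python
import math

# Toy syllable table and made-up transition probabilities P(next | previous).
# "<s>" marks the start of the input. None of these numbers are real statistics.
SYLLABLES = {"fan", "fang", "gan", "an"}
TRANSITIONS = {
    ("<s>", "fan"): 0.4, ("<s>", "fang"): 0.6,
    ("fan", "gan"): 0.2, ("fang", "an"): 0.5,
}
FLOOR = 1e-6  # probability assumed for transitions that were never seen


def best_split(s: str):
    """Return the highest-scoring split of s, or None if no split exists."""
    # best[i] maps the last syllable of a split of s[:i] to (log score, split).
    best = [dict() for _ in range(len(s) + 1)]
    best[0]["<s>"] = (0.0, [])
    for i in range(len(s)):
        for prev, (score, path) in best[i].items():
            for j in range(i + 1, len(s) + 1):
                syl = s[i:j]
                if syl not in SYLLABLES:
                    continue
                new = score + math.log(TRANSITIONS.get((prev, syl), FLOOR))
                if syl not in best[j] or new > best[j][syl][0]:
                    best[j][syl] = (new, path + [syl])
    if not best[len(s)]:
        return None
    return max(best[len(s)].values(), key=lambda item: item[0])[1]


print(best_split("fangan"))  # ['fang', 'an'] with the toy numbers above
```

Note that the function still returns a single split. The statistics only choose which of the ambiguous candidates wins; the other one is still lost.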

Back to the essence

I don't know whether you noticed, but my examples follow another rule: after the forward split, the last pinyin is always a bare final! Whether it is fang'an, zhan'an, or even gang'a (where the ideal split would be "gan'ga"), the split ends with a final (strictly speaking, with a syllable that has no initial). In these cases the initial has been taken away by the previous pinyin, where it combined with that pinyin's final to form a different final: the g of gan, for example, turns fan into fang. So we can simply run the longest match and check whether each pinyin has an initial. If it does not, take the previous pinyin, remove its last letter, and prepend that letter to the current pinyin. If the combination is a legal pinyin, and the previous pinyin is still legal without that letter, we have found an ambiguous split. Just add them all to the list and you're good to go!

No statistical language model is needed at all (I actually tried one: it was not much better than pure transition probabilities, and it was even slower). Rules are enough. This completely solves the fixed-length ambiguous splitting problem. In the subsequent full-sentence query, we can feed all of these pinyin straight into the model and let it find the candidate that best fits the context on its own.
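The rule can be sketched in a few lines of Python. This is an illustration written for this explanation, not the engine's code. `SYLLABLES` is a toy subset of legal pinyin, `ZERO_INITIAL_STARTS` assumes that a syllable without an initial begins with a, o or e, and for simplicity each ambiguous boundary produces one alternative split rather than every combination.

```python
# Toy syllable table: an assumption for illustration, not a complete pinyin list.
SYLLABLES = {"fan", "fang", "gan", "gang", "ga", "an", "a",
             "zha", "zhan", "nan", "su", "song"}
MAX_LEN = 6                       # assumed length of the longest legal syllable
ZERO_INITIAL_STARTS = set("aoe")  # assumed first letters of syllables with no initial


def forward_split(s: str) -> list[str]:
    """Greedy longest match from the left; unknown letters stand alone."""
    out, i = [], 0
    while i < len(s):
        for size in range(min(MAX_LEN, len(s) - i), 0, -1):
            piece = s[i:i + size]
            if piece in SYLLABLES or size == 1:
                out.append(piece)
                i += size
                break
    return out


def ambiguous_splits(s: str) -> list[list[str]]:
    """Forward split plus one alternative per ambiguous syllable boundary."""
    base = forward_split(s)
    results = [base]
    for k in range(1, len(base)):
        prev, cur = base[k - 1], base[k]
        if cur[0] not in ZERO_INITIAL_STARTS:
            continue  # this syllable already has an initial
        shorter, moved = prev[:-1], prev[-1] + cur
        if shorter in SYLLABLES and moved in SYLLABLES:
            results.append(base[:k - 1] + [shorter, moved] + base[k + 1:])
    return results


print(ambiguous_splits("fangan"))    # [['fang', 'an'], ['fan', 'gan']]
print(ambiguous_splits("ganga"))     # [['gang', 'a'], ['gan', 'ga']]
print(ambiguous_splits("susongan"))  # [['su', 'song', 'an']]
```

Every returned split has the same number of syllables as the forward split, which is what makes this the fixed-length case. susongan produces no alternative because son is not a legal syllable.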

Here is the actual Swift code used in my engine. It will not work if you copy and paste it directly, because the related object declarations are missing, but it can be read as pseudocode:



The variable-length ambiguous split problem

The discussion above focuses on fixed-length ambiguous splitting. That is due to a limitation of the drop-off input method engine itself: it cannot mix and handle pinyin strings of different lengths, a functional limitation left over from its early Shuangpin-based development. So to deal with variable-length ambiguity, my approach is to merge these potentially ambiguous words together when processing the lexicon. For example, a query for xian returns words like "Xian", but also words like "Xi'an", because they share the same code. In actual testing, however, the result is not ideal. For example, when the user types xinlianwei, it is split into xin'lian'wei. What the user wants here is the "li" ("reason") of xin'li'an'wei ("psychological comfort"), but it can never rank higher than "lian" ("even")... Regrettably, the drop-off input method cannot query xin'lian'wei and xin'li'an'wei at the same time, because one has length 3 and the other has length 4. I have not found a better solution for this yet. I will come back and add to this article in the future.
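The lexicon-side workaround might look roughly like the Python sketch below. It is only a guess at the general shape, under stated assumptions: `SYLLABLES` is a toy subset of legal pinyin, the three lexicon entries are sample data, and `merged_codes` and `build_lexicon` are hypothetical names. A word is indexed under its own code and also under any code where two adjacent syllables fuse into one legal syllable.

```python
from collections import defaultdict

# Toy syllable table: an assumption for illustration, not a complete pinyin list.
SYLLABLES = {"xi", "an", "xian", "xin", "li", "lian", "wei"}


def merged_codes(syllables: list[str]) -> list[tuple[str, ...]]:
    """The word's own code, plus codes where one adjacent pair fuses."""
    codes = [tuple(syllables)]
    for k in range(len(syllables) - 1):
        fused = syllables[k] + syllables[k + 1]
        if fused in SYLLABLES:
            codes.append(tuple(syllables[:k]) + (fused,) + tuple(syllables[k + 2:]))
    return codes


def build_lexicon(entries):
    """Index each word under every code produced by merged_codes."""
    lexicon = defaultdict(list)
    for word, syllables in entries:
        for code in merged_codes(syllables):
            lexicon[code].append(word)
    return lexicon


lexicon = build_lexicon([
    ("先", ["xian"]),
    ("西安", ["xi", "an"]),
    ("心理安慰", ["xin", "li", "an", "wei"]),
])
print(lexicon[("xian",)])                # ['先', '西安']
print(lexicon[("xin", "lian", "wei")])   # ['心理安慰']
```

This makes the four-syllable word reachable from the three-syllable split, but it does nothing about ranking: the merged entry still competes with genuine lian words under the same code, which is the weakness described above.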



Original article written by Gerber drop-off: R0uter's Blog » How to split the full pinyin of drop-off input method

About the Author

R0uter's Blog

Unless otherwise stated, the articles I write are original. If you reproduce them, please include the link to this page and my name.
