I've recently been working on the input method's lexicon, implementing a new sentence-input model (the word-based input model I'll come back to another time). The new sentence model is built on an HMM (a Hidden Markov Model). Of course, with my limited budget and personal hardware, only a second-order transition matrix is feasible. Even so, the model still needs to be trained.
Of course, it's not that training on novels is bad; it's just hard to find novels covering the relevant domains. After all, the topics they touch on are too narrow, so they don't make a truly high-quality corpus. Speaking of high quality, nothing beats Wikipedia. So now we need to fetch the entire content of Chinese Wikipedia and export it as a corpus for training the model.
Download Data
No need to write a crawler: Wikipedia is open and provides its own downloadable dumps, which is very considerate. The download link is: https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
This is the XML data that Wikipedia officially dumps on a regular schedule. The download is about 1 GB — the Chinese edition really doesn't have that much content. Uncompressed it becomes an XML file of more than 6 GB, but don't rush off to decompress it: there is a handy tool that exports the content straight from the compressed dump.
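If you'd rather script the download than grab the link in a browser, here is a minimal Python sketch. The URL is the same one given above; the local folder and file name are just my own example choices, matching the paths used in the export script later on.

import os
import urllib.request

# Official dump URL (same link as above) and an example local path.
DUMP_URL = ("https://dumps.wikimedia.org/zhwiki/latest/"
            "zhwiki-latest-pages-articles.xml.bz2")
LOCAL_PATH = "./article/zhwiki-latest-pages-articles.xml.bz2"

os.makedirs("./article", exist_ok=True)           # folder used by the export script below
urllib.request.urlretrieve(DUMP_URL, LOCAL_PATH)  # ~1 GB, so this takes a while
print("saved to", LOCAL_PATH)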
Export the Data
We'll do the export with Python. First, install gensim, the library whose WikiCorpus class does the heavy lifting:
pip3 install gensim
Once it's installed, write a short Python script.
⚠️ Note: the code below only works with Python 3.
import gensim

# Path to the compressed dump downloaded above, and the output corpus file.
input_file = "./article/zhwiki-latest-pages-articles.xml.bz2"
f = open('./article/zhwiki.txt', encoding='utf8', mode='w')

# WikiCorpus reads the dump directly from the .bz2 file; no need to decompress it.
wiki = gensim.corpora.WikiCorpus(input_file, lemmatize=False, dictionary={})

# get_texts() yields each article as a list of tokens; write one article per line.
for text in wiki.get_texts():
    str_line = ' '.join(text)
    f.write(str_line + '\n')

f.close()
You may see the warning "UserWarning: Pattern library is not installed, lemmatization won't be available." Just ignore it; we don't use lemmatization.
On my 2015 13-inch rMBP this used to take about ten minutes, but with the current code it is quite slow — expect to wait roughly half an hour. The export comes out to about 1.09 GB of text (it was around 950 MB with the older code), one article per line.
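Before moving on, it's worth a quick sanity check that the export looks right. This small sketch (using the output path from the script above) prints the start of the first two articles and counts how many lines were written:

# Quick sanity check: preview the first two articles and count how many were exported.
count = 0
with open('./article/zhwiki.txt', encoding='utf8') as corpus:
    for line in corpus:
        if count < 2:
            print(line[:200], '...')   # first 200 characters of the article
        count += 1
print('articles exported:', count)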
Organize the Text
Now, the exported text is far too large to open in an ordinary text editor, but it obviously contains a mix of Simplified and Traditional Chinese, and that still needs to be dealt with. Here I'll convert Traditional to Simplified (the reverse works just as well), and we'll use OpenCC to do the job.
Install OpenCC
I'm on macOS, so it's a one-command install: brew install opencc
After installing, you also need to write a configuration file. Save it in the same directory as your corpus:
{
  "name": "Traditional Chinese to Simplified Chinese",
  "segmentation": {
    "type": "mmseg",
    "dict": {
      "type": "ocd",
      "file": "TSPhrases.ocd"
    }
  },
  "conversion_chain": [{
    "dict": {
      "type": "group",
      "dicts": [{
        "type": "ocd",
        "file": "TSPhrases.ocd"
      }, {
        "type": "ocd",
        "file": "TSCharacters.ocd"
      }]
    }
  }]
}
Save it as zht2zhs_config.json for later.
Convert
Then run this command in that directory:
opencc -i zhwiki.txt -o zhswiki.txt -c zht2zhs_config.json
And that's it.
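If you'd rather stay in Python instead of dropping to the command line, there are also OpenCC bindings on PyPI. The package name and the built-in 't2s' configuration in this sketch are assumptions on my part (check the documentation of whichever binding you install); the article's CLI route above is what I actually used.

# Assumed package: pip3 install opencc-python-reimplemented (one of several OpenCC bindings)
from opencc import OpenCC

cc = OpenCC('t2s')  # built-in Traditional-to-Simplified configuration, no custom JSON needed

with open('zhwiki.txt', encoding='utf8') as src, \
     open('zhswiki.txt', encoding='utf8', mode='w') as dst:
    for line in src:
        dst.write(cc.convert(line))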
Conclusion
And with that, we have a Simplified Chinese Wikipedia corpus, free of punctuation marks and digits. Throw it at the training machine and let it read!
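As a rough illustration of how such a corpus feeds the model described at the top — this is only a minimal sketch, not the input method's actual training code — the heart of estimating a second-order transition matrix is counting how often one token follows another across the corpus:

from collections import Counter

# One pass over the corpus, counting adjacent-token transitions.
transitions = Counter()
with open('zhswiki.txt', encoding='utf8') as corpus:
    for line in corpus:
        tokens = line.split()
        for prev, cur in zip(tokens, tokens[1:]):
            transitions[(prev, cur)] += 1

# Normalizing each row of these counts gives the P(current | previous)
# estimates that a second-order transition matrix is built from.
print(transitions.most_common(5))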
Further reading
Analysis of Chinese Wikipedia text data, part 1: data acquisition and preprocessing
Word2Vec experiments on the English Wikipedia corpus
Original article by LogStudio: R0uter's Blog » Making a Chinese Wikipedia Corpus
When reproducing, please keep the source and this link: https://www.logcg.com/archives/2240.html
Hello,
I tried your program, but on line 6:
str_line = bytes.join(b' ', text).decode()
I got the following error:
sequence item 0: expected a bytes-like object, str found
It looks like a type problem. I searched online for relevant answers, but even after small changes it still reports the same error. How can I solve this?
I also tried changing it to:
str_line = text
but then it returns:
can only concatenate list (not "str") to list
Don't worry, emmm — I'll download a fresh copy of the data to help test the code, then update this article and let you know.
Hello, I've corrected the code. They changed the API's behavior: it now processes the text content directly, which is more convenient, at the cost of slower processing...
You should now be able to export the text with the code in the article.
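For anyone still hitting that error on an older gensim, here is a small sketch of the export loop that handles both behaviors of get_texts() — older versions yielded lists of bytes, newer ones yield lists of str. The branch logic is my own addition, not part of the article's script; wiki and f are the objects defined in the script above.

for text in wiki.get_texts():
    if text and isinstance(text[0], bytes):           # older gensim: tokens are bytes
        str_line = b' '.join(text).decode('utf-8')
    else:                                             # newer gensim: tokens are str
        str_line = ' '.join(text)
    f.write(str_line + '\n')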
Thank you, it really worked!
Thanks for the enthusiastic and patient guidance and help.