Run and train Moses on macOS

The official Moses website does provide a binary package for macOS, so in principle you don't need to compile from source. However, the Moses developers no longer use Macs and cannot keep the package updated, and a bug in the latest version (4.0) means the binaries cannot be used as they are. The author's response was "It's not difficult to compile from source anyway...", but on Big Sur compiling Moses is close to impossible: you run into all kinds of strange errors, which is a real headache.

In fact, we can fix the errors in the binaries directly and then run them.

Fix the error

Download the Moses binary package and run any of the executables, and you will hit the following errors:

and

Let's deal with the second error first. It is a bit involved: the binaries are evidently linked against dynamic libraries that do not exist on this machine. That is tricky, because the code has already been compiled into binaries, so we cannot fix it by changing the source code or the compiler parameters.

Analysis

Inspect the consolidate executable with the command otool -L ./moses/bin/consolidate, and we get the following result:

These are all the dynamic libraries this file links against. In theory we only need to change the non-existent paths to paths that do exist for the program to run normally. That takes two steps:

  1. Find the dynamic libraries that are actually linked;
  2. Change the library paths recorded in the binary.
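The first step can be sketched in shell. This is a minimal sketch rather than the author's own method: the helper name missing_libs is hypothetical, and it assumes that entries under /usr/lib and /System can be skipped, because on Big Sur system libraries live in the dyld shared cache and no longer exist as separate files on disk.

```shell
# List the libraries a binary links against whose files are missing on disk.
missing_libs() {
    # otool -L prints the file name on the first line, then one tab-indented
    # entry per library: "<path> (compatibility version ..., current version ...)".
    otool -L "$1" | tail -n +2 | awk '{print $1}' | while read -r lib; do
        case "$lib" in
            /usr/lib/*|/System/*|@*) ;;          # system or loader-relative entries: skipped (assumption)
            *) [ -e "$lib" ] || echo "$lib" ;;   # report paths that do not exist
        esac
    done
}

# Example: check one of the Moses binaries if it is present.
if [ -f ./moses/bin/consolidate ]; then
    missing_libs ./moses/bin/consolidate
fi
```

Every path this prints is a candidate for the second step.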

For the second step we can use install_name_tool, a command that ships with Xcode. For the first step we need xmlrpc-c, preferably version 1.39.07. (I tried installing the latest version directly with brew, but it is so new that two of the dynamic libraries have been removed outright, so it is better to use the same version and be sure the API is unchanged.) Fortunately xmlrpc-c keeps its official historical releases available, so we can download the 1.39.07 source from there and compile it.

Fix the error

Compiling xmlrpc-c requires gcc-10. If you haven't installed it, the single command brew install gcc will do. Then compile and install:

You can install it anywhere you like, but remember the location, because we will need this path later.

The next step is to modify the link paths:

Taking libxmlrpc_xmltok.3.39.dylib as an example, we simply replace its path like this, so that it points to a dynamic library that actually exists. But every binary has 11 wrong links to replace, which is tedious to do by hand, so I wrote a simple script. Copy it into a .sh file and run it in the form sh xxx.sh ./moses/bin; it processes every executable in the given directory:

do
    echo "processing $file ..."
    updateLink "${1}/${file}"
done
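Only the tail of the loop survives above. A minimal sketch of a complete script in the same spirit could look like the following. It is a hypothetical reconstruction, not the author's original: NEW_LIB_DIR and linked_libs are names introduced here, and it assumes that the broken entries are exactly those whose file name starts with libxmlrpc.

```shell
#!/bin/sh
# Assumption: the lib directory of your own xmlrpc-c 1.39.07 installation.
NEW_LIB_DIR="${NEW_LIB_DIR:-$HOME/xmlrpc-c/lib}"

# Print the library paths recorded in a binary (skips otool's header line).
linked_libs() {
    otool -L "$1" | tail -n +2 | awk '{print $1}'
}

# Point every libxmlrpc* entry of one binary at NEW_LIB_DIR.
updateLink() {
    for old in $(linked_libs "$1"); do
        case "$(basename "$old")" in
            libxmlrpc*)
                install_name_tool -change "$old" "$NEW_LIB_DIR/$(basename "$old")" "$1" ;;
        esac
    done
}

# Process every file in the directory given as the first argument.
if [ -d "${1:-}" ]; then
    for path in "$1"/*; do
        file=$(basename "$path")
        echo "processing $file ..."
        updateLink "${1}/${file}"
    done
fi
```

Run it as NEW_LIB_DIR=/your/xmlrpc-c/lib sh xxx.sh ./moses/bin, then re-check a binary with otool -L to confirm the paths changed.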
With the broken links repaired, Moses can be used normally.

Now for the first error. It comes from Gatekeeper, the macOS security mechanism that by default prevents you from running any unsigned binary, and obviously none of the Moses binaries are signed. However, since we have just modified these binaries with a script, the system now treats them as generated locally and no longer blocks them. If you do run into it, open the directory in Finder, right-click the binary and choose "Open". The system will run the binary in its own Terminal window; then go back to your terminal and re-run the command, and it will execute normally. (Each new command has to be handled this way once, but only the first time.)

Note that there is also a libirstlm.0.dylib that we did not actually build. That does not matter, because we don't use this library: just point its entry at any path that can be found. As long as you don't use that part of the functionality, in theory nothing is affected.

Prepare data

Here we use the United Nations Parallel Corpus for the experiment. Because the full data set is too large, we take only 600,000 lines of Chinese and English. The download is a tar.gz archive split into parts; merge the parts with the command cat UNv1.0.en-zh.tar.gz.* >> a.en-zh.tar.gz and then extract the result.
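The merge and the 600,000-line cut can be sketched as follows. The helper names are hypothetical, and the file paths inside the archive are an assumption; list the archive with tar -tzf first to see the real ones.

```shell
# Join numbered parts (NAME.00, NAME.01, ...) into one file.
# The shell expands the glob in sorted order, so the parts stay in sequence.
merge_parts() { cat "$1".* > "$2"; }

# Keep only the first N lines of a file.
first_lines() { head -n "$1" "$2"; }

# Run only when the downloaded parts are in the current directory.
if ls UNv1.0.en-zh.tar.gz.* >/dev/null 2>&1; then
    merge_parts UNv1.0.en-zh.tar.gz a.en-zh.tar.gz
    tar -xzf a.en-zh.tar.gz
    # Assumed paths inside the archive:
    first_lines 600000 en-zh/UNv1.0.en-zh.en > en60w.txt
    first_lines 600000 en-zh/UNv1.0.en-zh.zh > zh60w.txt
fi
```

Writing with > instead of >> avoids appending to a leftover file from an earlier attempt.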

We name the extracted subsets en60w.txt and zh60w.txt. Both files hold essentially one sentence per line, in the same format and with the same content line by line; the only difference is the language.

Word segmentation

We need to segment the data into words. Note that the English corpus needs tokenizing too, which separates punctuation from English words and numbers and makes the later steps easier.

For Chinese word segmentation we use jieba:

Note the -d " " parameter used here, which changes jieba's default slash delimiter to a space. This gives us the segmented file zh60w_cuted.txt.

For English tokenization we use the tool that comes with Moses:

Here I used -time to show the total time taken, -threads 6 to process with multiple threads, and -lines 20000 to have each thread process 20000 lines at a time (the default is 2000). This gives us the English tokenization result en60w.tok.txt.
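The two segmentation commands described above might look like this. It is a sketch under assumptions: the MOSES and PYTHON variables and the function names are introduced here, and the scripts directory is assumed to sit next to bin inside the unpacked package.

```shell
MOSES="${MOSES:-./moses}"       # assumption: where the binary package was unpacked
PYTHON="${PYTHON:-python3}"     # assumption: an interpreter with jieba installed

# Chinese: jieba's command-line mode, with a space as the word delimiter.
segment_zh() { "$PYTHON" -m jieba -d " " "$1" > "$2"; }

# English: the Moses tokenizer with timing, 6 threads and 20000 lines per batch.
tokenize_en() {
    "$MOSES/scripts/tokenizer/tokenizer.perl" -l en -time -threads 6 -lines 20000 < "$1" > "$2"
}

# Run only when the input files are present.
if [ -f zh60w.txt ] && [ -f en60w.txt ]; then
    segment_zh zh60w.txt zh60w_cuted.txt
    tokenize_en en60w.txt en60w.tok.txt
fi
```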

At this point en60w.txt and zh60w.txt can be deleted.

Handle case

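Next we normalize the capitalization of the English data. Truecasing converts each word to its most likely case (for example, lowercasing an ordinary word at the start of a sentence) rather than simply lowercasing everything, which reduces data sparsity. We first need to train the truecaser and then use it to process the corpus: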

This gives us two models, truecase-model.en and truecase-model.cn, which we then use to process the segmented corpus:

This gives us the processed en-zh60w.true.en and en-zh60w.true.cn. Note that from here on the file names follow a fixed pattern, because the later commands rely on it.
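The train-then-apply sequence can be sketched as below. The function names and the MOSES variable are hypothetical, and the input file names are assumed from the segmentation step.

```shell
MOSES="${MOSES:-./moses}"   # assumption: scripts/ sits next to bin/ in the package

# Learn casing statistics from a tokenized corpus and write them to a model file.
train_truecaser() {
    "$MOSES/scripts/recaser/train-truecaser.perl" --model "$1" --corpus "$2"
}

# Apply a trained model to a tokenized file.
apply_truecase() {
    "$MOSES/scripts/recaser/truecase.perl" --model "$1" < "$2" > "$3"
}

# Run only when the segmented files are present.
if [ -f en60w.tok.txt ] && [ -f zh60w_cuted.txt ]; then
    train_truecaser truecase-model.en en60w.tok.txt
    train_truecaser truecase-model.cn zh60w_cuted.txt
    apply_truecase truecase-model.en en60w.tok.txt en-zh60w.true.en
    apply_truecase truecase-model.cn zh60w_cuted.txt en-zh60w.true.cn
fi
```

The shared stem en-zh60w.true with the suffixes .en and .cn is what the later cleaning and training commands expect.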

At this point en60w.tok.txt and zh60w.tok.txt can be deleted.

Remove long sentences

Finally, we trim the corpus once more. Overly long sentences, for example, significantly slow down training and hurt the final accuracy:

This gives us two cleaned corpus files, en-zh60w.clean.cn and en-zh60w.clean.en.
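A sketch of the cleaning call, using the clean-corpus-n.perl script from the Moses scripts directory. The wrapper name is hypothetical and the length limits 1 and 80 are an assumed example, not values taken from this article.

```shell
MOSES="${MOSES:-./moses}"   # assumption: location of the unpacked package

# Arguments: corpus stem, the two language suffixes, output stem,
# then the minimum and maximum sentence length in tokens.
clean_corpus() {
    "$MOSES/scripts/training/clean-corpus-n.perl" "$1" "$2" "$3" "$4" "$5" "$6"
}

# Run only when the truecased files are present.
if [ -f en-zh60w.true.en ] && [ -f en-zh60w.true.cn ]; then
    clean_corpus en-zh60w.true en cn en-zh60w.clean 1 80
fi
```

Sentence pairs are dropped together, so the two output files stay aligned line by line.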

Generate the language model

The language model ensures that the translated output is fluent and readable:

Then convert the generated model to binary to speed up queries:

This gives us the model file en-zh60w.blm.cn; en-zh60w.arpa.cn can be deleted.
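The two language-model steps can be sketched with the KenLM tools lmplz and build_binary. The function names are hypothetical, the n-gram order 3 is an assumed example, and the tools are assumed to be in the package's bin directory alongside query.

```shell
MOSES="${MOSES:-./moses}"   # assumption: location of the unpacked package

# Estimate an n-gram model from the target-language text, in ARPA text format.
build_arpa() { "$MOSES/bin/lmplz" -o 3 < "$1" > "$2"; }

# Convert the ARPA file into KenLM's binary format for faster loading.
binarize_lm() { "$MOSES/bin/build_binary" "$1" "$2"; }

# Run only when the truecased Chinese file is present.
if [ -f en-zh60w.true.cn ]; then
    build_arpa en-zh60w.true.cn en-zh60w.arpa.cn
    binarize_lm en-zh60w.arpa.cn en-zh60w.blm.cn
fi
```

The model is built on the Chinese side because Chinese is the output language of this system.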

Test the model with the command echo "I love Beijing Tiananmen" | ./moses/bin/query en-zh60w.blm.cn, which produces this output:

At this point en-zh60w.true.en and en-zh60w.true.cn can be deleted.

Training the translation model

Now everything is ready and we can start training the translation model:

First create a separate directory and run the training command inside it:

Note that -mgiza -mgiza-cpus 6 must be included here. The macOS package of Moses only ships the mgiza tool, so without these flags the script falls back to the default single-threaded aligner, cannot find that command and fails with an error. -mgiza-cpus 6 means I want to train with 6 threads.
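A training call with these flags might look like the sketch below. Several things here are assumptions rather than details from this article: the working directory name, the alignment and reordering settings (taken from the standard Moses baseline), the 3-gram order in -lm, and above all EXTERNAL_BIN, which must point at whichever directory of your package actually contains mgiza.

```shell
ROOT="$PWD"
MOSES="${MOSES:-$ROOT/moses}"                            # assumption: package location
EXTERNAL_BIN="${EXTERNAL_BIN:-$MOSES/training-tools}"    # assumption: directory holding mgiza
LM="${LM:-$ROOT/en-zh60w.blm.cn}"                        # binary language model built earlier

# $1 is the corpus stem; English (.en) is the source, Chinese (.cn) the target.
train_model() {
    "$MOSES/scripts/training/train-model.perl" \
        -root-dir train \
        -corpus "$1" -f en -e cn \
        -alignment grow-diag-final-and \
        -reordering msd-bidirectional-fe \
        -lm "0:3:$LM:8" \
        -external-bin-dir "$EXTERNAL_BIN" \
        -mgiza -mgiza-cpus 6
}

# Run only when the cleaned corpus is present.
if [ -f en-zh60w.clean.en ] && [ -f en-zh60w.clean.cn ]; then
    mkdir -p working && cd working
    train_model "$ROOT/en-zh60w.clean"
fi
```

In the -lm value, the fields are factor, n-gram order, model path and model type, where type 8 selects KenLM.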

Tuning parameters

Training is complete, but the model weights are still at their default values, which are not optimal, so we need to tune them.

First take another 100,000 lines from the parallel corpus downloaded at the beginning. Here I took lines 600,000 to 700,000 and saved them as the two small corpora opt.en and opt.cn. These 100,000 sentence pairs serve as the development set for tuning.
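Cutting a line range out of a large file can be sketched with sed. The helper name is hypothetical, the archive paths are an assumption, and the range 600001 to 700000 is chosen here so that the set holds exactly 100,000 lines and does not overlap the first 600,000 used for training.

```shell
# Print lines FIRST..LAST (1-based, inclusive) of a file, then stop reading.
line_range() {
    sed -n "${1},${2}p;${2}q" "$3"
}

# Run only when the extracted corpus is present (paths are assumed).
if [ -f en-zh/UNv1.0.en-zh.en ] && [ -f en-zh/UNv1.0.en-zh.zh ]; then
    line_range 600001 700000 en-zh/UNv1.0.en-zh.en > opt.en
    line_range 600001 700000 en-zh/UNv1.0.en-zh.zh > opt.cn
fi
```

The q command makes sed quit at the last wanted line instead of scanning the rest of a very large file.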

Word segmentation and handling case

The steps are much the same as before. Process the data:

Now we have opt.true.cn and opt.true.en, which can be used for tuning.

Tuning

Pay attention to the parameter --mertdir /[absolute path]/moses/bin/: it must be written as an absolute path. The program also accepts a relative path, but then the script it exports at the end of tuning is not generated correctly and points to the wrong directory.

Here we use the --multi-moses parameter to start multiple processes, and the parameter --decoder-flags='-threads 6' to pass the option straight to the decoder, meaning 6 processes are used to speed things up.

Although the parameter is named threads, its format is actually number of processes:threads per process:threads in the extra process. If you give just a single number as I did, it means 6:1:0, that is, 6 processes with 1 thread each and no extra process. This is the fastest option and also the one that uses the most memory.

In my tests, without --multi-moses only 2 threads were actually used even with 6 set, and memory use was 4.2GB. (This figure depends on the size of the model, so adjust to your own situation. My machine has 32GB of memory, for example; once you know the footprint you can stop the process and start again with 6 processes to speed things up.) Tuning is very, very slow. I suggest choosing a tuning sample whose size is a multiple of 10, so that you can work out the current progress from the count.

The program first filters the model against the data being tested, removing entries that will certainly not be needed. This greatly speeds up loading without affecting the results, but do not use this filtered model in real use, and if the test data changes the filtered model must be regenerated.

Note that in my experience tuning does not stop on its own. It iterates again and again; stop it when you judge that the results have more or less converged, and use the best result.

Binary model compression

The generated model is in text form. We can compress it into binary data, which greatly improves the loading speed of Moses. Create a directory to hold the binary model: mkdir working/binarised-model

Then use two commands to generate two model files:

Copy train/model/moses.ini to binarised-model/moses.ini.

Edit it and find the # feature functions section. In the LexicalReordering entry, set path= to the absolute path of binarised-model/reordering-table. Change the PhraseDictionaryMemory entry to PhraseDictionaryCompact and set its path= to the absolute path of binarised-model/phrase-table.minphr.
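The same edit can be scripted instead of done by hand. This is a minimal sketch under assumptions: binarise_ini is a hypothetical helper, and it assumes each feature sits on one line that starts with its name and carries a space-free path= value.

```shell
# Rewrite the two feature lines of a moses.ini read from stdin.
# $1 is the absolute path of the binarised-model directory.
binarise_ini() {
    sed -e "/^PhraseDictionaryMemory /s|path=[^ ]*|path=$1/phrase-table.minphr|" \
        -e "s|^PhraseDictionaryMemory |PhraseDictionaryCompact |" \
        -e "/^LexicalReordering /s|path=[^ ]*|path=$1/reordering-table|"
}

# Run from the working directory once training has produced the ini file.
if [ -f train/model/moses.ini ] && [ -d binarised-model ]; then
    binarise_ini "$PWD/binarised-model" < train/model/moses.ini > binarised-model/moses.ini
fi
```

This replaces the copy-and-edit pair in one pass; compare the result against the original with diff before using it.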

Then we can start Moses with the command ../moses/bin/moses -f binarised-model/moses.ini.

Batch test

Batch testing also needs a parallel corpus: the English side is translated, and the Chinese side is the reference used to measure accuracy at the end. As before, tokenize the English with tokenizer.perl, segment the Chinese with jieba or another segmenter, and then process both with truecase.perl.

Prepare the model

As before, we first filter the model against the test set, removing entries that are never used. This greatly speeds up the test without affecting the results:

Here I tested directly on the tuning data.

Batch processing

Use this command to have moses translate everything in one batch:

Calculate BLEU

BLEU is an algorithm for judging the accuracy of translation output; the result is a percentage:
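Scoring can be sketched with the multi-bleu.perl script from the Moses scripts directory. The wrapper name and the file name opt.translated.cn are hypothetical; substitute whatever file your batch translation wrote.

```shell
MOSES="${MOSES:-../moses}"   # assumption: package location relative to the working directory

# $1 is the reference file, $2 the system output; -lc compares case-insensitively.
bleu() {
    "$MOSES/scripts/generic/multi-bleu.perl" -lc "$1" < "$2"
}

# Run only when both files are present.
if [ -f opt.true.cn ] && [ -f opt.translated.cn ]; then
    bleu opt.true.cn opt.translated.cn
fi
```

Both files must have the same number of lines, with line N of the output translating line N of the source.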

For example, with the corpus used in this article, we get the following result:

 


Original article by LogStudio: R0uter's Blog » Run and train Moses on macOS

When reposting, please keep the source and link: https://www.logcg.com/archives/3487.html
