1 Introduction
Cross-lingual sentence retrieval aims to align parallel sentence pairs, i.e., sentences that are translations of each other, mined from unlabeled multilingual documents. Early work on incorporating monolingual data into NMT concentrated on target-side monolingual data: Jean et al. (2015) and Gulcehre et al. (2015) integrated language models trained on target-side monolingual text into the translation system. We also conduct a comprehensive study of how each part of the pipeline works; performance depends mainly on monolingual and parallel data size, up to a certain size threshold, rather than on which language pairs are used for training or evaluation. Our approach achieves state-of-the-art results on the WMT16, WMT17, and WMT18 English$\leftrightarrow$German translation tasks and on WMT19 German$\to$French, which demonstrates the effectiveness of our method.
OPUS-MT is prepared to use Wiki data from various Wikimedia wikis (Wikipedia, Wikiquote, Wikisource, Wikibooks, Wikinews). The next step is to fetch some monolingual data to be back-translated. Run the following command in your terminal: python bitextmining. Modify the following variables in the code based on your own use case: minsentlen = 10, maxsentlen = 200, sourcefile = 'data/so.txt.xz', targetfile = 'data/yi.txt.xz'. The script also filters out sentences that are not between minsentlen and maxsentlen characters long.
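As a concrete illustration, the following is a minimal sketch of what the loading and length-filtering step might look like. The variable names (minsentlen, maxsentlen, sourcefile, targetfile) come from the script described above; the helper load_filtered and the use of Python's lzma module for the .xz files are assumptions about the implementation, not the script itself.

import lzma

# Variables to adjust for your own use case (names from the script above).
minsentlen = 10
maxsentlen = 200
sourcefile = 'data/so.txt.xz'
targetfile = 'data/yi.txt.xz'

def load_filtered(path, min_len=minsentlen, max_len=maxsentlen):
    # Hypothetical helper: read an xz-compressed monolingual file and keep
    # only sentences whose character length lies within [min_len, max_len].
    sentences = []
    with lzma.open(path, mode='rt', encoding='utf-8') as f:
        for line in f:
            sent = line.strip()
            if min_len <= len(sent) <= max_len:
                sentences.append(sent)
    return sentences

source_sents = load_filtered(sourcefile)
target_sents = load_filtered(targetfile)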
Comparable corpora are composed of monolingual texts in different languages that share subject matter; aligning them yields patterns of correspondences and provides quantitative data.
4 Copied Monolingual Data for NMT
We propose a method for incorporating target-side monolingual data into low-resource NMT that does not rely heavily on the amount or quality of the parallel data. First, we generate synthetic bitext by translating monolingual data from each of the two domains into the other domain, using models pretrained on the genuine bitext. Next, a model is trained on a noised version of the concatenated synthetic bitext, where each source sequence is randomly corrupted. Finally, the model is fine-tuned on the genuine bitext and on a clean version of a subset of the synthetic bitext, without adding any noise.
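The random corruption above is not specified in detail, so the sketch below assumes three common noise operations, matching the noise model of Edunov et al. (2018): deleting a token, replacing it with a filler token, and shuffling tokens within a small window. The function name, the filler symbol, and the probabilities are illustrative.

import random

def noise_source(tokens, drop_prob=0.1, blank_prob=0.1, max_shuffle_dist=3):
    # Randomly corrupt a tokenized source sentence: drop tokens, replace
    # tokens with a filler, and locally shuffle the remaining word order.
    out = []
    for tok in tokens:
        r = random.random()
        if r < drop_prob:
            continue                      # delete this token
        elif r < drop_prob + blank_prob:
            out.append('<blank>')         # replace with a filler token
        else:
            out.append(tok)
    # Local shuffle: each surviving token moves at most max_shuffle_dist positions.
    keys = [i + random.uniform(0, max_shuffle_dist) for i in range(len(out))]
    return [tok for _, tok in sorted(zip(keys, out), key=lambda kv: kv[0])]

print(noise_source('we generate synthetic bitext from monolingual data'.split()))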
In existing work (Feng et al., 2020), pre-trained language models are fine-tuned with a translation ranking task.
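One common way to set up this fine-tuning, and the assumption behind the sketch below, is a translation ranking loss with in-batch negatives: each source sentence must score its true translation above every other target in the batch, optionally with an additive margin on the positive pair as in LaBSE (Feng et al., 2020). The code assumes L2-normalized sentence embeddings from some encoder; all names and hyperparameter values are illustrative.

import torch
import torch.nn.functional as F

def translation_ranking_loss(src_emb, tgt_emb, margin=0.3, scale=10.0):
    # In-batch ranking: the diagonal of the similarity matrix holds the
    # positive (true translation) pairs; everything else is a negative.
    logits = scale * (src_emb @ tgt_emb.T)           # pairwise dot products
    eye = torch.eye(logits.size(0), device=logits.device)
    logits = logits - scale * margin * eye           # margin on positives only
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Stand-in usage with random, L2-normalized "encoder outputs":
src = F.normalize(torch.randn(8, 768), dim=-1)
tgt = F.normalize(torch.randn(8, 768), dim=-1)
loss = translation_ranking_loss(src, tgt)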
In this paper, we propose to align sentence representations from different languages into a unified embedding space, where semantic similarities, both cross-lingual and monolingual, can be computed with a simple dot product. While target-side monolingual data has proven very useful for improving neural machine translation (NMT) through back-translation, source-side monolingual data has not been well investigated. In this work, we study how to use both source-side and target-side monolingual data for NMT, and propose an effective strategy that leverages both.
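Such a strategy is commonly realized, and is assumed in the sketch below, as back-translation for the target-side monolingual data combined with forward translation (self-training) for the source-side data. The src2tgt and tgt2src callables are stand-ins for decoding with pretrained translation models; none of the names are taken from the source.

def make_synthetic_bitext(src_mono, tgt_mono, src2tgt, tgt2src):
    # Back-translation: pair (synthetic source, real target) by translating
    # target-language monolingual sentences back into the source language.
    back_translated = [(tgt2src(t), t) for t in tgt_mono]
    # Forward translation: pair (real source, synthetic target) by
    # translating source-language monolingual sentences forward.
    forward_translated = [(s, src2tgt(s)) for s in src_mono]
    return back_translated + forward_translated

# Toy usage with identity functions standing in for real NMT models:
pairs = make_synthetic_bitext(['a source sentence'], ['a target sentence'],
                              src2tgt=lambda s: s, tgt2src=lambda t: t)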