TransIns uses the Okapi framework to parse documents in the supported formats (MS Office, OpenOffice, HTML and plain text) into a representation that preserves the document markup and gives access to the document's
text content at the sentence level. The sentences, stripped of their markup, are translated by the neural machine translation framework MarianNMT using translation models provided by
OPUS-MT. The markup is then reinserted into the translated sentences based on token alignments. Finally, the translated document is provided in the original format.
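The first step, separating markup from sentence text, can be illustrated with a minimal Python sketch. It is not the Okapi API and not the TransIns implementation: the function name `strip_markup`, the whitespace tokenization, and the `(tag, start, end)` span representation are assumptions made for illustration only.

```python
import re

# Matches simple opening and closing inline tags such as <b> or </b>.
# Attributes are matched but discarded; this is a simplification.
TAG = re.compile(r"</?(\w+)[^>]*>")


def strip_markup(sentence):
    """Split a tagged sentence into plain tokens and tag spans.

    Returns (tokens, spans). Each span is (tag_name, start, end) and
    covers tokens[start:end], i.e. the tokens in the tag's scope.
    """
    tokens = []
    spans = []
    open_tags = []  # stack of (tag_name, index of first token in scope)
    pos = 0
    for match in TAG.finditer(sentence):
        # Text between the previous tag and this one becomes tokens.
        tokens.extend(sentence[pos:match.start()].split())
        pos = match.end()
        name = match.group(1)
        if match.group(0).startswith("</"):
            if not open_tags or open_tags[-1][0] != name:
                raise ValueError(f"unbalanced closing tag: {name}")
            _, start = open_tags.pop()
            spans.append((name, start, len(tokens)))
        else:
            open_tags.append((name, len(tokens)))
    # Remaining text after the last tag.
    tokens.extend(sentence[pos:].split())
    return tokens, spans


tokens, spans = strip_markup("Click the <b>red button</b> now")
print(tokens)  # ['Click', 'the', 'red', 'button', 'now']
print(spans)   # [('b', 2, 4)]
```

Only the plain tokens would be passed to the translation system; the spans record which source tokens each tag covers, so that the tag can later be transferred to the translation.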
April 2023: We added a translation model for Wayuu (guc) to Spanish, as provided by Nora Graichen.
We implement the following strategies for reinserting markup into the translated sentence using the tokens' alignments:
mtrain: A strategy that assigns markup to the token next to it, as described in a paper by Matthias Müller and implemented in the Zurich NLP mtrain Python package
mtrain++: An improved version of the mtrain strategy
Complete Mapping Strategy (CMS): A strategy assigning markup to all tokens in the markup's scope
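The idea behind a complete mapping can be sketched in a few lines of Python. This is one possible reading of the description above, not the TransIns code: the function names `map_tags` and `render`, the span format `(tag, start, end)` over source tokens, and the alignment format `(source_index, target_index)` are all hypothetical. The example alignments are written by hand rather than produced by a translation model.

```python
from itertools import groupby


def map_tags(spans, alignments, target_len):
    """Assign each tag to every target token aligned to a source token in its scope.

    spans: list of (tag, start, end) covering source tokens[start:end],
           listed outermost first.
    alignments: iterable of (source_index, target_index) pairs.
    Returns one tuple of tags per target token.
    """
    tags = [[] for _ in range(target_len)]
    for tag, start, end in spans:
        for src, tgt in alignments:
            if start <= src < end and tag not in tags[tgt]:
                tags[tgt].append(tag)
    return [tuple(t) for t in tags]


def render(target_tokens, tags_per_token):
    """Wrap each run of adjacent target tokens that carry the same tags."""
    pieces = []
    pairs = zip(tags_per_token, target_tokens)
    for tags, group in groupby(pairs, key=lambda pair: pair[0]):
        text = " ".join(token for _, token in group)
        # Wrap innermost tag first so the outermost tag ends up outside.
        for tag in reversed(tags):
            text = f"<{tag}>{text}</{tag}>"
        pieces.append(text)
    return " ".join(pieces)


# Source: "Click the <b>red button</b> now" -> tag b covers source tokens 2..3
spans = [("b", 2, 4)]
target = ["Klicken", "Sie", "jetzt", "auf", "den", "roten", "Knopf"]
# Hand-written alignments (source index, target index) for illustration.
alignments = [(0, 0), (1, 4), (2, 5), (3, 6), (4, 2)]

tags_per_token = map_tags(spans, alignments, len(target))
print(render(target, tags_per_token))
# Klicken Sie jetzt auf den <b>roten Knopf</b>
```

In this sketch, if the target tokens in a tag's scope are not adjacent after translation, the tag is emitted once per run of adjacent tokens. Whether and how such cases are merged is a design decision the sketch leaves open.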
TransIns is available as open-source software under the MIT License in our GitHub repository.
For more information, please contact
.