Document Translation with Markup Reinsertion

TransIns uses the Okapi framework to parse documents in supported formats (MS Office, OpenOffice, HTML and plain text) into a representation that preserves the document markup and gives access to the document's text content at the sentence level. The sentences are stripped of markup and translated by the neural machine translation framework MarianNMT, using translation models provided by OPUS-MT. Afterwards, the markup is reinserted into the translated sentences based on token alignments between source and translation. Finally, TransIns returns the translated document in the original format.
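
To make the alignment-based reinsertion step concrete, here is a minimal Python sketch. It is not the TransIns implementation and does not necessarily correspond to any of its strategies. It assumes a simplified setting: markup is given as tag spans over source token indices, alignments are (source index, target index) pairs, and each tag is wrapped around the smallest contiguous range of target tokens aligned to its source span. The function name reinsert_markup and the data layout are hypothetical and chosen only for illustration.

```python
def reinsert_markup(target_tokens, spans, alignments):
    """Wrap target tokens in the tags of their aligned source spans.

    target_tokens: list of translated tokens (no markup)
    spans:         list of (tag, start, end) over source token indices, inclusive
    alignments:    list of (source_index, target_index) pairs
    All three structures are illustrative assumptions, not a TransIns API.
    """
    opens = {}   # target index -> opening tags emitted before that token
    closes = {}  # target index -> closing tags emitted after that token

    for tag, start, end in spans:
        # Collect every target token aligned to a source token inside the span.
        targets = [t for s, t in alignments if start <= s <= end]
        if not targets:
            continue  # no aligned target token: this sketch drops the tag
        opens.setdefault(min(targets), []).append(f"<{tag}>")
        # Insert at the front so spans listed outer-first close inner-first.
        closes.setdefault(max(targets), []).insert(0, f"</{tag}>")

    out = []
    for i, token in enumerate(target_tokens):
        out.extend(opens.get(i, []))
        out.append(token)
        out.extend(closes.get(i, []))
    return " ".join(out)


# Source: "This is a <b>red</b> house ."  (token 3 is "red")
target = ["Esta", "es", "una", "casa", "roja", "."]
alignments = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 3), (5, 5)]

print(reinsert_markup(target, [("b", 3, 3)], alignments))
```

Because the tag follows the alignment rather than the token position, the bold markup ends up on "roja" even though adjective and noun swap places in the translation. A contiguous-range heuristic like this can also enclose unaligned or unrelated tokens when the aligned targets are far apart, which is one reason a real system may offer more than one reinsertion strategy.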

April 2023: We added a Wayuu (guc) to Spanish translation model, provided by Nora Graichen.

We implement the following strategies for reinserting markup into the translated sentence using the tokens' alignments:

TransIns is available as open-source software under the MIT License in our GitHub repository. For more information, please contact