MT Normalizer

Introduction

MT Normalizer is an offline tool developed by Linguitronics (www.linguitronics.com) to facilitate the use of Machine Translation (MT) and improve MT efficiency in localization work.

System Requirements:

64-bit Win7 SP1 and above
Win7/Win8: .NET Framework 4.6.1 must be installed (download)
Win10/11: .NET Framework 4.8 is installed by default, so there is no need to install .NET Framework 4.6.1

MT Normalizer includes the independent modules listed below:

Normalize MT Target Segment
Compare Segment
Modify Source Segment
Extract Text
Mark Color In Target Segment
Pseudo Translate SDLXLIFF
Machine Translate
Generate Dual-Language Translation
Vet Content
Search TM
Align Parallel Corpus
Convert File Encoding

Note:

MT Normalizer only treats lowercase file extensions as valid extensions.

How To Register License?

Before you can use MT Normalizer, you need to obtain a license from Linguitronics. The registration steps are as follows:

Send an email to Linguitronics (amberski.zhao.sh@linguitronics.com) to request registration. You will then receive 2 files (HID.exe and generate_uuid.bat) from Linguitronics.
Run HID.exe on your machine. A HardwareID.txt file will be generated automatically in the same folder as HID.exe.
Run generate_uuid.bat on your machine. A uuid.txt file will be generated automatically in the same folder as generate_uuid.bat.
Send the HardwareID.txt and uuid.txt files generated in the above 2 steps to Linguitronics (amberski.zhao.sh@linguitronics.com). You will then receive a license file (.license).
Download the latest build of MT Normalizer from this page and unzip it on your machine. Put the license file in the folder of MT_Normalizer.exe. You can now launch MT Normalizer.

Normalize MT Target Segment

Feature:

Batch replace text inside target segments of sdlxliff/mxliff/mqxliff/tmx/TW xlz files.

Options:

Skip Locked Segment. This option is applicable to sdlxliff/mxliff/mqxliff/TW xlz files.
Add Space Between Chinese Character And English Letter/Arabic Number.
If user puts a Target_Exception_List.txt file in the folder of MT_Normalizer.exe, each line of the txt file is treated as an exception string: if a target segment contains any of these exception strings, all text replacement for that target segment is skipped.

Note:

If user does not check the "Use Custom Setting" checkbox and specify a valid Custom Replace List in the Settings panel, the tool applies its embedded rules to the target segment text replacement by default.
If user wants to use a self-created list for customized target replacement, please check the "Use Custom Setting" checkbox and specify a valid Custom Replace List in the Settings panel. Please follow the example below to create the custom replace list. Here the arrow symbol represents a tab character in the text file.
The required replacement rule is: Original Target→Target After Replacement→Replacement Type
Plain type stands for plain text replacement. Regx type stands for Regular Expression replacement that complies with C# regular expression syntax.

Home URL→Landing Page→Plain
<span t="(\d+?)">→t=${1}→Regx
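
The rule format above can be illustrated with a minimal Python sketch. It is not the tool's implementation: the function names (parse_replace_list, to_python_template, apply_rules) are hypothetical, the exception handling follows the Target_Exception_List.txt description only loosely, and Python's re module stands in for the C# regex engine, so group references such as ${1} are converted to Python's \g<1> form first.

```python
import re

def parse_replace_list(text):
    """Parse tab-separated rules: original, replacement, type (Plain or Regx)."""
    rules = []
    for line in text.splitlines():
        if not line.strip():
            continue
        original, replacement, rule_type = line.split("\t")[:3]
        rules.append((original, replacement, rule_type.strip()))
    return rules

def to_python_template(replacement):
    """Convert C#-style group references ($1 or ${1}) to Python's \\g<1> form."""
    escaped = replacement.replace("\\", "\\\\")  # keep literal backslashes literal
    return re.sub(r"\$\{?(\d+)\}?", lambda m: "\\g<" + m.group(1) + ">", escaped)

def apply_rules(segment, rules, exceptions=()):
    """Apply every rule in order, unless the segment contains an exception string."""
    if any(exc and exc in segment for exc in exceptions):
        return segment  # assumption: one hit skips all replacements for the segment
    for original, replacement, rule_type in rules:
        if rule_type == "Regx":
            segment = re.sub(original, to_python_template(replacement), segment)
        else:
            segment = segment.replace(original, replacement)
    return segment

# The two example rules from this section, with real tab characters.
rules = parse_replace_list(
    'Home URL\tLanding Page\tPlain\n<span t="(\\d+?)">\tt=${1}\tRegx'
)
print(apply_rules('Open the Home URL <span t="12">', rules))
```

With the two example rules, the sample segment becomes "Open the Landing Page t=12". Patterns that rely on C#-only regex features would need adjusting before they run under Python.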

Compare Segment

Feature:

Compare Source or Target segments between Trados sdlxliff/Phrase mxliff/tmx/MemoQ mqxliff/GlobalLink txlf/Lionbridge TW xlz/Idiom xlz/XTM xlsx files. In addition to the total segment number and modified places, the comparison report also includes the detailed differences between the original target and the revised target of each segment.

Options:

Edit Distance. If this is selected, there will be a column showing the Edit Distance value in the comparison report.
Character Based Target/Word Based Target. Character based languages refer to Simplified Chinese, Traditional Chinese, Japanese, Thai, Lao and Burmese. If the target language is any of these languages, please select Character Based Target, otherwise, select Word Based Target.
Include Identical Segment. If this is selected, the report will also include the comparison information of unchanged source/target segments.

Note:

For sdlxliff files, the comparison report will also include each segment's MT or Locked status.
If a tmx file contains more than 60000 source segments, that file will not be compared.
Before comparing the source of 2 files, please make sure the files contain an equal number of source segments.
When running Edit Distance with Character Based Target selected, user can ignore English letters, spaces, punctuation and Arabic numbers by adjusting the options in the Settings panel.
If the Include Identical Segment checkbox is checked and the reported segment number is greater than the number of segment lines, some segments have only a source without a target.
For XTM xlsx files, only single-sheet xlsx files exported from XTM are supported. The 1st column should be the segment ID, the 2nd column the source, and the 3rd column the target.
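
To illustrate the difference between Character Based Target and Word Based Target, here is a minimal Python sketch of an edit distance calculation. The manual does not state which algorithm the tool uses; this sketch assumes plain Levenshtein distance, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def segment_distance(original, revised, character_based):
    """Compare character by character, or word by word (whitespace-split)."""
    if character_based:
        return edit_distance(original, revised)
    return edit_distance(original.split(), revised.split())

# Word based: one word was replaced.
print(segment_distance("the quick brown fox", "the slow brown fox", False))
# Character based: two characters were replaced.
print(segment_distance("机器翻译很好", "机器翻译不错", True))
```

Splitting on whitespace only makes sense for languages that separate words with spaces, which is why character-based counting is the natural choice for the languages listed under the Character Based Target option.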

Modify Source Segment

Feature:

Batch replace text inside source segments of sdlxliff/mxliff/mqxliff/txt files.

Options:

If user puts a Source_Exception_List.txt file in the folder of MT_Normalizer.exe, each line of the txt file is treated as an exception string: if a source segment contains any of these exception strings, all text replacement for that source segment is skipped.
To process txt files, user must define a txt format replace list in the Settings dialog.

Note:

If user does not check the "Use Custom Setting" checkbox and specify a valid Custom Replace List in the Settings panel, the tool applies its embedded rules to the source segment text replacement by default.
If user wants to use a self-created list for customized source replacement, please check the "Use Custom Setting" checkbox and specify a valid Custom Replace List in the Settings panel. Please follow the example below to create the custom replace list. Here the arrow symbol represents a tab character in the text file.
If the file extension is not txt, the required replacement rule is:
Original Source→Source After Replacement→Replacement Type
Plain type stands for plain text replacement. Regx type stands for Regular Expression replacement that complies with C# regular expression syntax.

Source Text→Revised Source String→Plain
<span s="([a-zA-Z0-9]+?)">→s=${1}→Regx

If file extension is txt, the required replacement rule is:
Search Pattern→Replacement Pattern→Replacement Type→Ignore Case Or Not→Single Line Or Not
Plain type stands for plain text replacement. Regx type stands for Regular Expression replacement that complies with C# regular expression syntax.

Something old→Something new→Plain
<seg-source>(.*?)</seg-source><target>.*?</target>→<seg-source>$1</seg-source><target>$1</target>→Regx→Yes→Yes

If the file extension is txt, the Source_Exception_List.txt must comply with the format:
Exceptional Texts→Applicable To Which Line of the Custom Replace List→Check Which Match Group
The Applicable To Which Line of the Custom Replace List and Check Which Match Group values must be positive integers.

conf="Translated"→1→2

Extract Text

Feature:

Batch extract text from sdlxliff/mxliff/xliff/txt/tmx files into a csv report according to user-defined extraction rules. The generated report will be in the same folder as the rule list.

Options:

Remove Tags From Target Files. If this is selected, all tags inside the target files will be ignored during extraction.
Auto Escape Rules Targeting Match Group 0. If this is selected, all the rules targeting match group 0 will be escaped. If you already escaped the rules, please do not check this option.
SDLXLIFF/MXLIFF/XLIFF/TMX Check Field. User can define whether the field to be checked in these types of files is the source or the target segment.
SDLXLIFF/MXLIFF/XLIFF/TMX Include Context. When extracting text from these types of files, user can define whether context is included.

Note:

If the target file is sdlxliff/mxliff/xliff/tmx (1.4 or above format), only source or target segment content will be scanned for text extraction.
Please follow the example below to create the extraction rule list. You can add a description after each rule, but it is not mandatory. Here the arrow symbol represents a tab character in the rule list.
The required extraction rule is: Regular Expression→To-Be-Extracted Match Group→Ignore Case Or Not→Single Line Or Not

"(http://.*?)"→1→Yes→Yes
"(https://.*?)"→1→Yes→No

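The extraction rule format above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's code: parse_rule and extract are hypothetical names, Python's re module stands in for the C# regex engine, and "Single Line" is mapped to re.DOTALL, which has the same meaning as the C# Singleline option (the dot also matches line breaks).

```python
import csv
import io
import re

def parse_rule(line):
    """Split one rule line: regex, match group, ignore case, single line.

    Anything after the fourth tab-separated field is treated as an
    optional description and ignored.
    """
    pattern, group, ignore_case, single_line = line.split("\t")[:4]
    flags = 0
    if ignore_case.strip().lower() == "yes":
        flags |= re.IGNORECASE
    if single_line.strip().lower() == "yes":
        flags |= re.DOTALL
    return re.compile(pattern, flags), int(group)

def extract(text, rule_lines):
    """Return (pattern, extracted text) rows for every match of every rule."""
    rows = []
    for line in rule_lines:
        regex, group = parse_rule(line)
        for match in regex.finditer(text):
            rows.append((regex.pattern, match.group(group)))
    return rows

# The two example rules from this section, with real tab characters.
rules = ['"(http://.*?)"\t1\tYes\tYes', '"(https://.*?)"\t1\tYes\tNo']
sample = '<a href="https://example.com/a">x</a> <a href="http://example.com/b">y</a>'
rows = extract(sample, rules)

# Write the rows as csv text (kept in memory here instead of a report file).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

The sample text and the example.com URLs are made up for the illustration. Match group 0 refers to the whole match, and group 1 to the first parenthesized part of the pattern.
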
Mark Color In Target Segment

Feature:

Mark color in sdlxliff files (that have already been manually translated) to differentiate the targets of full match (including 100% match and context match), MT and fuzzy match/new segments. As a prerequisite, the sdlxliff file must be generated from docx/xlsx/pptx files. The marked color will also show in the translated output files.

Options:

Do Not Change. If this is selected, the color of the corresponding Full Match or MT text will not be changed.
Two suggested colors are Purplish Grey and Light Orange.
Customize Color. Using this option, user can select a preferred color from all the 16777216 colors supported by Windows.
Sdlxliff file cannot retain all the original pretranslation and MT status information inside itself throughout the dynamic translation process. If user wants to accurately mark color for different categories of segments, a comparison between the current and original sdlxliff files is needed. If a folder of the originally pretranslated or MTed sdlxliff files is provided, this function will mark color accurately based on the comparison mentioned here.

Note:

Regarding the file name, the secondary file extension, which represents the source file format, must be .docx/.xlsx/.pptx. It must be immediately followed by the real file extension .sdlxliff (for example, report.docx.sdlxliff), and all letters of the extensions must be lowercase.
When preparing docx/xlsx/pptx files for translation, it is recommended to use Office 2013 to save the source file as doc/xls/ppt, re-save that doc/xls/ppt file as docx/xlsx/pptx, and then translate the re-saved file. This helps guarantee that color marking works effectively.
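
The file naming rule above can be checked before running the module. The following Python sketch is only an illustration; is_markable is a hypothetical helper, not part of MT Normalizer.

```python
# Accepted endings: lowercase source extension followed by .sdlxliff.
VALID_ENDINGS = (".docx.sdlxliff", ".xlsx.sdlxliff", ".pptx.sdlxliff")

def is_markable(file_name):
    """Return True if the name ends with a valid, lowercase double extension."""
    return file_name.endswith(VALID_ENDINGS)  # str.endswith is case sensitive

for name in ["report.docx.sdlxliff", "Report.DOCX.sdlxliff", "report.sdlxliff"]:
    print(name, is_markable(name))
```

Only the first of the three sample names passes: the second uses uppercase letters in the secondary extension, and the third has no secondary extension at all.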

Pseudo Translate SDLXLIFF

Feature:

Pseudo translate sdlxliff files (that have already been manually translated) to differentiate the targets of full match (including 100% match and context match), MT and fuzzy match/new segments. The pseudo translation marks will also show in the translated output files.

Options:

Two suggested pseudo translation marks are x and y.
An sdlxliff file cannot retain all the original pretranslation and MT status information inside itself throughout the dynamic translation process. If user wants to accurately add pseudo translation marks for different categories of segments, a comparison between the current and original sdlxliff files is needed. If a folder of the originally pretranslated or MTed sdlxliff files is provided, this function will pseudo translate accurately based on that comparison.

Machine Translate

Feature:

Batch machine translate source segments of sdlxliff/mxliff/TW xlz/tmx files. A bilingual tmx file (complying with Trados tmx format) will be generated.

Options:

27 languages are supported.
Skip Locked Segments And Already Translated Segments.
User can define the number of segments per request. If this number is too big, the translation results may be improper.
If the Populate Translation Into Source File option is selected, the MTed target will be directly added into the source sdlxliff/mxliff/TW xlz files.
If the Add MT Marker In SDLXLIFF option is selected, an MT marker will be added to the source sdlxliff files, which helps translators differentiate MT results from other types of translation in the target segments.
If the Translate Repetition option is selected, all the repeated source segments will be translated.

Note:

The input tmx file needs to be 1.4 or above format.
To use this function, user needs to set up a Baidu MT account first.
After registering a Baidu MT account, please enter your APPID and Key in the Settings panel before running this function.
If you have used your Baidu MT account for MT via MT Normalizer and later want to use the same account on any other system/platform, contact Linguitronics (amberski.zhao.sh@linguitronics.com) first so that your Baidu account can be migrated to the new system/platform.
If a tmx file contains more than 60000 source segments, that file will be skipped.
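
The "number of segments per request" option amounts to splitting the source segments into batches. A minimal Python sketch of that idea follows; batch_segments is a hypothetical helper, and the actual request to the MT service is deliberately left out because the manual does not describe it.

```python
def batch_segments(segments, per_request):
    """Split a list of source segments into consecutive batches."""
    if per_request < 1:
        raise ValueError("per_request must be at least 1")
    return [segments[i:i + per_request]
            for i in range(0, len(segments), per_request)]

segments = ["Segment 1", "Segment 2", "Segment 3", "Segment 4", "Segment 5"]
for batch in batch_segments(segments, 2):
    print(batch)  # each batch would be sent as one request
```

Smaller batches mean more requests but less text per request, which is the trade-off behind the warning about setting this number too high.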

Generate Dual-Language Translation

Feature:

Put both source and translation into the target segments of sdlxliff/mxliff files, so that the translated target files show source and translation together for easier reading.

Note:

User can arrange the order of the two languages as needed.
If a back-conversion issue occurs with the dual-language sdlxliff file, please generate a File Verification report in Trados and extract the problematic segment IDs from the report into a list (txt file), then define the Exceptional Segment ID List in MT Normalizer to re-generate the dual-language file.

Vet Content

Feature:

Batch vet content of doc/docx/xls/xlsx/xlsm/ppt/pptx files against checklist (UTF-8 txt file) or user defined sentence character limit. The generated csv report will be in the user defined folder.

Options:

If no issue is found, or if user defined neither a non-empty check point font nor a long sentence font for vetting, no backup file will be created.
The DOC/DOCX Style Type option only works for doc/docx files.

Note:

This function only supports 64-bit Microsoft Office.
For Word files, only unhidden text in paragraphs and native Word text boxes is checked.
For Excel files, only unhidden columns of each sheet are checked.
For PowerPoint files, only text in native PowerPoint text boxes and SmartArt is checked.
Please follow the example below to create the checklist. Here the arrow symbol represents a tab character in the checklist.
Vetting Text/Regular Expression→Checking Type
Plain type stands for plain text. Regx type stands for Regular Expression (case insensitive) that complies with C# regular expression syntax.

Vetting Text→Plain
<span s="([a-zA-Z0-9]+?)">→Regx

Search TM

Feature:

Batch search content of sdltm files against user defined searching texts. The generated xlsx report will be in the MT_Normalizer exe folder.

Options:

Folder Level refers to the subfolder depth of the user defined folder. If it's empty, all subfolders will be searched.
Search Field refers to the Source segment or Target segment of TM.
If the Only Check TU Total Amount option is selected, it will create a csv report in the MT_Normalizer exe folder listing the TU (Translation Unit) total amount of every sdltm.

Note:

The xlsx search report can contain at most 1048576 lines of data.
Letter case and all tags of the searching texts will be ignored.
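
A search that ignores letter case and tags can be sketched in Python as follows. This is an assumption-laden illustration rather than the tool's logic: it treats anything of the form <...> as a tag, strips tags from both the search text and the TM segment, and the function names are hypothetical.

```python
import re

TAG = re.compile(r"<[^>]+>")  # assumption: tags look like <...>

def normalize(text):
    """Remove tags and fold case so comparisons ignore both."""
    return TAG.sub("", text).casefold()

def segment_matches(segment, search_text):
    """Return True if the search text occurs in the segment."""
    return normalize(search_text) in normalize(segment)

print(segment_matches("Click <b>Save</b> now", "click save"))
print(segment_matches("Click <b>Save</b> now", "click cancel"))
```

The first call reports a match because the bold tags and the capital letters are disregarded; the second does not match.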

Align Parallel Corpus

Feature:

Auto align source and translation of sdlxliff/txt files. The generated tmx file will be in the MT_Normalizer exe folder. This feature only supports Simplified Chinese and English.

Options:

Source language code and target language code must be different.
File Type refers to the type of files to be processed.
Higher SDLXLIFF Quality Threshold represents stricter automatic alignment quality standard on SDLXLIFF files.

Note:

Please make sure Python 2.x/3.x is installed and the path of python.exe is added into system environment variables.
In the sdlxliff files, the <seg-source> </seg-source> tag must include the <mrk mtype="seg" mid="xxx"> segment information.
The source file folder and target file folder must have identical structure and the corresponding pair of files must have the same file name.
All tags of the texts will be ignored.
If File Type is .sdlxliff, all unalignable segments will be put into the Unaligned_Source.txt and Unaligned_Target.txt files, which are in the MT_Normalizer exe folder.
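
Because the source and target folders must mirror each other, it can help to verify the pairing before running the alignment. The Python sketch below is a hypothetical pre-check, not part of MT Normalizer; pair_files and its parameters are names chosen for the illustration.

```python
from pathlib import Path

def pair_files(source_root, target_root, extension=".sdlxliff"):
    """Pair files that share the same relative path under both folders.

    Returns (pairs, missing): the matched (source, target) paths, and the
    relative paths of source files that have no counterpart in the target.
    """
    source_root, target_root = Path(source_root), Path(target_root)
    pairs, missing = [], []
    for source_file in sorted(source_root.rglob("*" + extension)):
        relative = source_file.relative_to(source_root)
        target_file = target_root / relative
        if target_file.is_file():
            pairs.append((source_file, target_file))
        else:
            missing.append(relative)
    return pairs, missing

# Usage (folder names are placeholders):
# pairs, missing = pair_files("source_folder", "target_folder")
# for relative in missing:
#     print("No matching target file for", relative)
```

Any entry in the missing list points to a file whose name or subfolder differs between the two folders and would need to be fixed first.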

Convert File Encoding

Feature:

Batch convert encoding of text files.

Options:

If multiple extensions are needed, please use semicolons to separate them in the Valid Extensions field.

Note:

A csv format log will be generated in the MT_Normalizer exe folder.
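
For readers who want to see what batch encoding conversion involves, here is a minimal Python sketch. It is not the tool's implementation: convert_folder and its parameters are hypothetical, the extensions are assumed to be written with or without a leading dot, and the log columns are invented for the example.

```python
import csv
from pathlib import Path

def convert_folder(folder, valid_extensions, source_encoding, target_encoding, log_path):
    """Re-encode matching files in place and write a csv log of the results."""
    # "txt;md" or ".txt;.md" -> {"txt", "md"}; comparison is case sensitive.
    extensions = {ext.strip().lstrip(".") for ext in valid_extensions.split(";") if ext.strip()}
    results = []
    for path in sorted(Path(folder).rglob("*")):
        if not path.is_file() or path.suffix.lstrip(".") not in extensions:
            continue
        try:
            text = path.read_bytes().decode(source_encoding)
            data = text.encode(target_encoding)  # encode first, so a failure leaves the file untouched
            path.write_bytes(data)
            results.append((str(path), "converted"))
        except (UnicodeDecodeError, UnicodeEncodeError) as error:
            results.append((str(path), "failed: %s" % error))
    with open(log_path, "w", newline="", encoding="utf-8") as log_file:
        writer = csv.writer(log_file)
        writer.writerow(["file", "status"])
        writer.writerows(results)
    return results

# Usage (paths and encodings are placeholders):
# convert_folder("my_files", "txt;csv", "gbk", "utf-8", "conversion_log.csv")
```

Working on bytes rather than text mode keeps the original line endings intact, and files that cannot be decoded are logged instead of being overwritten.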