Global data: how to match Latin and non-Latin data

Posted by Esther Labrie, Content and Brand Manager, Monday, November 12, 2018 - 12:55

Esther Labrie is a language specialist and content manager at Quadient. After joining the company in 2010, Esther specialized in emerging themes in online marketing such as digital communications, omni-channel and Big Data. Esther creates content that focuses on building a bridge between online marketing and customer-centric selling. She enjoys music and literature and likes to spend time with friends and family.

The Western world often tends to forget about those parts of the world where languages are not written in the Latin script. This is remarkable, since the language Top 10 according to Babbel (ranked by number of native speakers) contains no fewer than seven languages written in non-Latin scripts! For data quality experts like Quadient, it means that we need to help our customers worldwide face data challenges regardless of the language in which they present themselves. So, following my blog about diacritics in the Latin script, I will now zoom in on non-Latin characters.

How do you make sure you match the right records?

One of the biggest challenges for any business is keeping track of its customer data. Is the data complete, clean, current and reliable? Is the Mr. Muller in our CRM system the same person we see under a slightly different name (Müller) in the billing system? When dealing with customer data, organizations need to make sure they match the right records.

Comparing data recorded in different writing systems poses many challenges. Naturally, a Unicode-enabled environment is necessary to represent the data, but this is not where the real difficulties lie. Assessing the degree of similarity between records in Latin script and records in a non-Latin script is considerably more complex. It requires a set of substitution rules that replace one or more non-Latin characters with one or more Latin characters and vice versa. In addition, normalization rules for orthography and language-specific pronunciation issues have to be applied. For example, an initial letter “w” followed by a vowel will be transliterated into a “u” followed by a vowel in Cyrillic. Finally, robust and advanced matching methods, such as bigrams, trigrams, consonant bigrams and grapheme-phoneme substitution, must be deployed to determine the level of similarity. The general strategy for this matching process is based on the principles of Quadient’s fault-tolerant identification method used by Quadient® DataHub.
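To make the idea of n-gram matching concrete, here is a minimal Python sketch. It is not Quadient's method: the function names, the diacritic folding step and the choice of the Dice coefficient as the similarity score are all assumptions made for illustration.

```python
import unicodedata
from collections import Counter


def fold_diacritics(text):
    """Lowercase the text and strip combining marks (ü -> u, é -> e)."""
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def ngrams(text, n=2):
    """Return the character n-grams of text, padded with one space per side."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-grams (assumed scoring choice)."""
    grams_a = Counter(ngrams(fold_diacritics(a), n))
    grams_b = Counter(ngrams(fold_diacritics(b), n))
    overlap = sum((grams_a & grams_b).values())
    total = sum(grams_a.values()) + sum(grams_b.values())
    return 2 * overlap / total if total else 0.0


print(dice_similarity("Muller", "Müller"))     # 1.0 after diacritic folding
print(dice_similarity("Muller", "Mueller"))    # 0.8 with bigrams
print(dice_similarity("Muller", "Smith"))      # 0.0, no shared bigrams
```

Passing `n=3` switches the same function to trigrams. A production matcher would combine several such scores with language-specific rules rather than rely on a single threshold.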

What is transliteration and transcription?

Transliteration is the representation of the letters of one script by the letters of another; transcription is the representation of the sounds of a language in the letters of a script. The difference is potentially a crucial one. Transcription is generally more readable, because it expresses the sounds of one language using the spelling conventions of a language written in another script. However, transcriptions can lead to wide variations in the output. A good transliteration system should therefore produce results that are as readable as any transcription and that can be read by an end user without anybody needing to consider the rules that produced them. For example, the transliteration of the words “box”, “cadeaux” and the Chinese city “Xi’an” must take the different pronunciations of the letter “x” into account. A good transliteration system preferably handles this in the normalization phase.
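As an illustration of the letter-by-letter nature of transliteration, the sketch below maps Cyrillic letters to Latin ones through a lookup table. The table is deliberately partial and hypothetical: it is not a complete or official romanization standard, and the name `CYRILLIC_TO_LATIN` and the function `transliterate` are made up for this example.

```python
# Partial, illustrative table: one Cyrillic letter maps to one or more
# Latin letters. Not a complete or official romanization scheme.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
    "ж": "zh", "з": "z", "и": "i", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t",
    "у": "u", "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh",
}


def transliterate(text):
    """Replace each known Cyrillic letter; leave other characters untouched."""
    out = []
    for ch in text:
        latin = CYRILLIC_TO_LATIN.get(ch.lower())
        if latin is None:
            out.append(ch)                  # unknown character: keep as-is
        elif ch.isupper():
            out.append(latin.capitalize())  # "Ч" -> "Ch"
        else:
            out.append(latin)
    return "".join(out)


print(transliterate("Москва"))  # Moskva
print(transliterate("Чехов"))   # Chekhov
```

A table like this works purely on letters. Pronunciation-dependent cases, such as the different values of “x” mentioned above, need additional context-sensitive rules that a plain lookup cannot express.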

How does data normalization work?

The example above shows different representations of the same name. As we can see, there are differences not only between the various writing systems (Latin, Cyrillic, Arabic, etc.), but also between the country-specific Latin (orthographic) representations of the name. In Italy, for example, the notation Muammar Gheddafi is used, whereas in Indonesia the notation Moammar Khadafy prevails. Normalization of the name orthography deals with name variants in the same script. For this, specific rewriting or substitution rules are applied to pre-process these strings. The normalization is based on pragmatic linguistic methods (such as diacritic variations and letter equivalents).

The same principle is used for the normalization of transliterated names. Here, of course, some other substitution aspects have to be taken into account. For example, Standard Arabic writes long vowels and omits short ones. Consequently, the transliteration of an Arabic name will lack these short vowels. When a transliterated Arabic name is compared with an Arabic name recorded in Latin script, the degree of similarity will be lower. Therefore, the vowel omission must be balanced out.

Transliteration of names from an ideographic writing system, such as Mandarin Chinese, poses a specific challenge. Each symbol is more or less equivalent to a concept rather than to a sound. For transliteration of Mandarin, the Pinyin system is used. Pinyin uses Latin letters to represent sounds in Standard Mandarin. This representation consists of syllables using either accented Latin letters or numbers to indicate the tonal pitch of a particular syllable:

zhéng huā wǔ <-> zheng2 hua1 wu3
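The two Pinyin notations can be converted into each other mechanically, because each tone corresponds to one diacritic: a macron for tone 1, an acute accent for tone 2, a caron for tone 3 and a grave accent for tone 4. The following Python sketch converts accented syllables to the numbered form. The function names are hypothetical, and the sketch assumes space-separated syllables and leaves neutral-tone syllables without a number.

```python
import unicodedata

# Combining marks used as Pinyin tone marks, mapped to tone numbers.
TONE_MARKS = {
    "\u0304": "1",  # macron
    "\u0301": "2",  # acute accent
    "\u030c": "3",  # caron
    "\u0300": "4",  # grave accent
}


def syllable_to_numbered(syllable):
    """Strip the tone mark from one syllable and append its tone number."""
    decomposed = unicodedata.normalize("NFD", syllable)
    tone = ""
    kept = []
    for ch in decomposed:
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            kept.append(ch)  # keeps the diaeresis of "ü"
    return unicodedata.normalize("NFC", "".join(kept)) + tone


def pinyin_to_numbered(text):
    """Convert space-separated accented Pinyin syllables to numbered form."""
    return " ".join(syllable_to_numbered(s) for s in text.split())


print(pinyin_to_numbered("zhéng huā wǔ"))  # zheng2 hua1 wu3
```

Normalizing both notations to one form before comparison means that records keyed in with tone marks and records keyed in with tone numbers can be matched as the same name.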

The ability to match names across scripts and notations in this way might prove to be of vital importance, for example when checking potential customers against sanction lists to prevent fraud, money laundering or the financing of criminal organizations.
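For a screening scenario like this, the letter equivalents and the balancing of omitted vowels described earlier can be sketched as a "consonant skeleton" comparison. The Python example below is an illustration only: the equivalence rules in `LETTER_EQUIVALENTS` are hypothetical and far from complete, and a real system would use much richer, language-specific rule sets.

```python
import re

# Hypothetical letter equivalents, applied in order. Real rule sets are
# language-specific and much larger.
LETTER_EQUIVALENTS = [
    ("gh", "k"),
    ("kh", "k"),
    ("q", "k"),
]


def consonant_skeleton(name):
    """Reduce a Latin-script name to a vowel-free, normalized skeleton."""
    tokens = []
    for token in name.lower().split():
        for source, target in LETTER_EQUIVALENTS:
            token = token.replace(source, target)
        token = re.sub(r"(.)\1+", r"\1", token)  # collapse doubled letters
        token = re.sub(r"[aeiouy]", "", token)   # balance out vowel omission
        tokens.append(token)
    return " ".join(tokens)


print(consonant_skeleton("Muammar Gheddafi"))  # mmr kdf
print(consonant_skeleton("Moammar Khadafy"))   # mmr kdf
```

Both spellings reduce to the same skeleton, so they would be flagged as candidates for the same person. Because such aggressive reduction also produces false positives, a skeleton match is better treated as a first filter that a finer similarity score then confirms or rejects.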

Would you like to know more about matching multilingual data? We invite you to download this complimentary whitepaper: “Processing Multilingual Data in Global Businesses”.