How I used Huffman coding to easily create a translator into a fictional language
Tolkien was one of the few people who created several fictional languages for the strange peoples in his stories, from the elaborate Elven tongues spoken by ancient nations to the relatively incomplete Black Speech, harsh-sounding and meant to be used by orcs and other evil beings. This is a cool feat that took so much effort that I would definitely never try to replicate it. However, with modern computers, the task can be made significantly less time-consuming. Actually, far less time-consuming than learning a language.
How I came to the idea
Back when I was a student, I listened to a teacher talking about context-free languages. One of the things he said was that they were used to build generators of inconspicuous-looking texts, fake personal information, spam e-mails and the like. I found it interesting and wrote a parser of context-free grammars that generated text from them. The grammar had to be written manually, but the form of the output text could be much better than in the case of a generator based on Markov chains. It was even quite possible to make it grammatically correct.
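A minimal sketch of such a generator, with a toy grammar I made up just for this illustration (not the grammar I actually used): every character that appears as a key in the grammar is a nonterminal and gets expanded recursively; everything else is copied verbatim.

#include <cstddef>
#include <iostream>
#include <map>
#include <random>
#include <string>
#include <vector>

// A toy grammar: each nonterminal expands into one of its
// alternatives, chosen at random.
const std::map<char, std::vector<std::string>> grammar = {
    {'S', {"The N V.", "The N V the N."}},
    {'N', {"orc", "warg", "tower"}},
    {'V', {"screams", "burns", "marches"}},
};

std::mt19937 rng{std::random_device{}()};

std::string expand(char symbol) {
    const std::vector<std::string>& alternatives = grammar.at(symbol);
    std::uniform_int_distribution<std::size_t> pick(0, alternatives.size() - 1);
    std::string result;
    for (char c : alternatives[pick(rng)]) {
        if (grammar.count(c))
            result += expand(c); // Nonterminal: expand recursively
        else
            result.push_back(c); // Terminal: copy verbatim
    }
    return result;
}

int main() {
    std::cout << expand('S') << std::endl; // e.g. "The warg burns the tower."
}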
First I used it to generate random esoteric New Age nonsense to see if spiritually oriented people would fall for it. They did. Later, after watching the Hobbit films, I had the idea to create a generator of texts in Black Speech. It wasn’t assembling random words into sentence templates; it was assembling words, syllable by syllable, and putting them into sentences of random length. Grohhyg y otḧzär dazvybvrÿÿrv hẅög! I wasn’t very inspired by the Black Speech from The Lord of the Rings; I put in anything that sounded or looked harsh enough. Not a single pleasant-sounding syllable.
I shared it with some people, and one of them suggested writing a translator into it. I gave it some thought and figured something out.
First approach
Initially, I created a dynamic dictionary that generated a new word for every unknown word inserted. I manually seeded it with some common words that should be short, like the, go, orc, destroy, conquer, kill, etc. The output perfectly followed English grammar, but that wasn’t a problem; I wasn’t trying to pretend it was a real language. It worked, but it required a shared database accessible to all translator instances if a message created by one instance was to be translated back by another. Furthermore, it had no pretense of word morphology whatsoever.
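A minimal sketch of this approach; the seeded entries and the placeholder word generator here are invented for the example, not taken from the original code.

#include <random>
#include <string>
#include <unordered_map>

std::mt19937 rng{std::random_device{}()};

// Placeholder generator: glues two to four random harsh syllables together
std::string makeOrcWord() {
    static const std::string syllables[] = {"gor", "zub", "nak", "urz", "dag"};
    std::uniform_int_distribution<int> length(2, 4);
    std::uniform_int_distribution<int> pick(0, 4);
    std::string word;
    for (int i = length(rng); i > 0; i--)
        word += syllables[pick(rng)];
    return word;
}

// The dynamic dictionary; in practice it had to live in a shared database
// so that another translator instance could translate a message back
std::unordered_map<std::string, std::string> dictionary = {
    {"the", "go"}, {"orc", "urk"}, {"kill", "dag"}, // manually seeded short words
};

const std::string& translateWord(const std::string& english) {
    auto found = dictionary.find(english);
    if (found == dictionary.end()) // Unknown word: invent a new orc word for it
        found = dictionary.emplace(english, makeOrcWord()).first;
    return found->second;
}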
Cipher approach
I came up with a more versatile approach. How about encoding all the information into the output text itself? An obvious idea was to replace every English letter with a syllable. The obvious problem was that the output would be way too long.
The excessively long words could be somewhat mitigated by using Huffman coding: the most frequent letters and letter groups get the shortest syllables, while rare ones can afford long syllables. I wrote a program that searched through a book and counted the occurrences of every substring of every word. Because I wrote it in C++, I could get away with the dumbest algorithm ever:
std::unordered_map<std::string, unsigned int> counts;
for (unsigned int i = 0; i < source.size(); i++) {
    std::string word;
    // Consume one word; the source text is assumed to be lowercased
    for ( ; i < source.size() && source[i] >= 'a' && source[i] <= 'z'; i++) {
        word.push_back(source[i]); // Assemble the word letter by letter
        // Count every substring ending at this letter, up to FRAGMENT_MAX letters
        for (unsigned int j = 1; j <= FRAGMENT_MAX && j <= word.size(); j++) {
            const std::string fragment = word.substr(word.size() - j);
            counts[fragment]++;
        }
    }
}
Its complexity was pretty bad, but it processed a 657 kiB book in a split second.
The result matched my observations (showing only the first lines because the full list is too long):
1: e 57612
2: o 32167
3: n 30015
4: t 27623
5: a 26427
6: h 25041
7: r 25002
8: i 23356
9: d 20332
10: s 19887
11: l 14955
12: u 12129
13: he 11310
There were thousands of substrings. Some triplets of letters were more common than some rare letters, which showed promise that the coding would shorten the result considerably. Because replacing a long substring shortens the result much more than replacing a rare letter does, I decided to adjust the weights by multiplying each substring’s count by its length.
1: e
2: o
3: n
4: t
5: a
6: h
7: r
8: i
9: he
10: d
11: s
12: nd
13: er
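The adjustment itself can be as simple as a sort by count times length; a sketch, assuming the counts map from the snippet above:

#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Rank fragments by the total number of letters their replacement saves:
// occurrence count multiplied by fragment length
std::vector<std::pair<std::string, unsigned int>> rankFragments(
        const std::unordered_map<std::string, unsigned int>& counts) {
    std::vector<std::pair<std::string, unsigned int>> ranked(counts.begin(), counts.end());
    std::sort(ranked.begin(), ranked.end(), [](const auto& a, const auto& b) {
        return a.second * a.first.size() > b.second * b.first.size();
    });
    return ranked;
}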
To avoid complicating things, I didn’t subtract the counts of multi-letter substrings from the counts of the shorter substrings they contain, because that might turn into an optimisation problem.
This greatly favoured substrings that were quite obviously parts of frequently used longer words, like omething. I should have replaced those with words that would be common in an orc language, and thus short.
Because it’s impossible to tell whether a single-letter word is written all in capitals or just capitalised, I matched the single letters that can be words on their own in English to single letters in the output. This led to a translation book like this:
1: e muz
2: o abg
3: n rup
4: t anr
5: a y
6: h zub
7: r hyd
8: i u
9: he zar
10: d urz
11: s daz
12: nd uth
13: er oth
The least common single letter, z, was at index 524, so I made up 524 orc syllables. I had to resort to diacritics to get that many unique syllables, so the ones at the bottom of the list looked rather weird:
466: rid brærgh
467: cti brærgḧ
468: stan zæhr
469: ond zæḧr
470: ser dhræm
The encoding and decoding algorithms were quite simple.
Encoding used pre-cached groups of all replaceable substrings, indexed by their first two letters. The small group of substrings starting with the next two letters was matched against the rest of the word, and if there was a match, the translation of that part of the word was written out. This meant quadratic complexity after the two-letter match, but there weren’t many long substrings.
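A sketch of the encoding, under the assumption that the longest matching fragment wins; the translation book here is a tiny excerpt, and the entries for l and llo are invented just for this example:

#include <map>
#include <string>
#include <vector>

// A tiny excerpt of the translation book (fragment -> orc syllable)
const std::map<std::string, std::string> book = {
    {"e", "muz"}, {"h", "zub"}, {"o", "abg"}, {"l", "gor"},
    {"he", "zar"}, {"llo", "dhrys"},
};

// Pre-cache multi-letter fragments, grouped by their first two letters,
// so each position in a word is only compared against a small group
std::map<std::string, std::vector<std::string>> buildPrefixIndex() {
    std::map<std::string, std::vector<std::string>> index;
    for (const auto& entry : book)
        if (entry.first.size() >= 2)
            index[entry.first.substr(0, 2)].push_back(entry.first);
    return index;
}

std::string encodeWord(const std::string& word) {
    static const auto byPrefix = buildPrefixIndex();
    std::string output;
    for (std::string::size_type i = 0; i < word.size(); ) {
        std::string best = word.substr(i, 1); // every single letter has a syllable
        const auto group = byPrefix.find(word.substr(i, 2));
        if (group != byPrefix.end())
            for (const std::string& fragment : group->second)
                // Keep the longest fragment matching the rest of the word
                if (fragment.size() > best.size()
                        && word.compare(i, fragment.size(), fragment) == 0)
                    best = fragment;
        output += book.at(best);
        i += best.size();
    }
    return output; // encodeWord("hello") == "zardhrys" ("he" + "llo")
}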
Decoding went letter by letter, trying to match everything read so far against an available syllable, and wrote out the translated fragment whenever it found a match. The complexity was bad again, but it didn’t matter; the syllables were short. It was quite prone to breaking, though, because it turned out that some orc syllables were prefixes of other orc syllables; a real Huffman code would be prefix-free by construction, but my hand-made syllables weren’t, so I fixed the offenders manually.
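A sketch of that decoding loop, with a few entries of the reverse book shown for illustration:

#include <map>
#include <string>

// Reverse translation book (orc syllable -> English fragment); this works
// reliably only because the syllables were fixed up to be prefix-free
const std::map<std::string, std::string> reverseBook = {
    {"muz", "e"}, {"zub", "h"}, {"abg", "o"}, {"zar", "he"}, {"uth", "nd"},
};

std::string decodeWord(const std::string& orcWord) {
    std::string output;
    std::string buffer;
    for (char c : orcWord) {
        buffer.push_back(c); // Read letter by letter...
        const auto match = reverseBook.find(buffer);
        if (match != reverseBook.end()) { // ...until a whole syllable matches
            output += match->second;
            buffer.clear();
        }
    }
    return output; // decodeWord("zarabg") == "heo"
}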
I am not sharing the algorithm because I’ve learned some things since then and I am not particularly proud of that code.
Result
Zarmaazabg, otḧzär näzgwrüẗḧrozb! Zörgḧ zubbhak müzduvz gzÿv zambbzürbzurmuz näzgøth yuth ohbrgärrḧyghzamb y ruzẅeüthuthbrav arzdrymhdüẅzryÿrv gorzborzzögḧkuz ärrz rupvögdaz.
Totally sounds like a plan to pillage a settlement full of peaceful people, kill a few of them for fun and terrorise the rest just to feel powerful. The orcs from The Lord of the Rings would definitely want to switch to this language, because it makes their own sound like a lullaby. Now, what does it mean? Here’s the original:
Hello, nice people! We have come to promote peace and develop a friendly relationship between our nations.
The villagers would definitely fail to understand the message of peace. However, that would be a good thing, because orcs would speak about peace only to scout the defences unopposed before their full invasion force arrived.
The source code is here. The compiled code can be run in the browser here, but the old version of Emscripten that Qt supports doesn’t seem to be very stable anymore (and doesn’t work at all in Safari).