Data-driven foreign-language learning: by the numbers

This is the story of a foreign language data mashup, and how thinking about study-time as an asset with returns can make your language-learning more efficient---in theory.

I am not a linguist, a computational linguist, or a language teacher, but I travel internationally a fair amount and have had reason to half-study a few languages.  In the course of that, I've compared many different approaches and methods as an interested learner.  I've found that with simple audio tapes (like FSI or Pimsleur) and 6 months of self-study, it's possible for a native English speaker to get to B1 or B2 conversational level in a European language---which isn't much: you can then order coffee and comment on newspaper headlines with some ease.  It takes diligence, but is doable.  Beyond roughly that level, however, you begin to plateau.  At that point, you've learned the grammar, you've mastered the common words, and you are confident that you can get around.  Before this, every single word and every single grammatical structure had comparatively large "returns," in the sense that each additional word or rudimentary grammar element came up all the time and remarkably improved your ability to understand.

The problem is, to get to the next level, you now need to memorize a large quantity of infrequent words, each of which seems to make little contribution to your daily comprehension by itself.  But while each individual word of this class has little chance of coming up in the course of conversation, the probability that at least one of them will come up in a given sentence is very high.  For example, when's the last time you used the word "plateau"?  You probably can't remember---but you're likely to run into hundreds of words that come up with about the same frequency.  Scanning the preceding paragraphs, you could see "asset, efficient, computational, internationally, diligence, doable (as opposed to possible), 'to master', comparatively, rudimentary(!), remarkably, plateau, comprehension (as opposed to understanding), frequency, scanning, preceding" in that category, and if you didn't know the whole bag, these paragraphs would be (even more) unintelligible.
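To make the "rare individually, common as a group" point concrete, here is a back-of-the-envelope sketch.  Every number in it is an assumption chosen for illustration (the band size, the per-word probability, the sentence length), and it treats the words in a sentence as independent, which real language is not.

```python
# Illustrative numbers only: none of these come from a real corpus.
n_rare_words = 5000        # assumed size of the band of infrequent words
p_each = 1 / 50_000        # assumed chance that a given token is one particular rare word
sentence_length = 20       # assumed number of tokens in a sentence

# Chance that a single token is *some* word from the band.
p_token_is_rare = n_rare_words * p_each

# Chance that at least one token in the sentence comes from the band,
# treating tokens as independent (a simplification).
p_sentence_has_rare = 1 - (1 - p_token_is_rare) ** sentence_length

print(f"per token: {p_token_is_rare:.2f}, per sentence: {p_sentence_has_rare:.2f}")
```

Under these made-up numbers, any one rare word is a 1-in-50,000 event, yet roughly 88% of sentences contain at least one word from the band.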

Unfortunately, while learning tables and paradigms, tricks and formulas, can get you up to this level, the only way you get beyond it is with a raw data dump---you need to memorize about 4,000-6,000 words.  You need to put them into your brain---and that requires ox-work.

For their bulk, grade, and intermediate location, I call these words "the middling words."  By the time I've studied them for a month, I start to call them "the devil's words."  By the time I've studied them for two months, I conclude by calling them "the words I am tired of learning."

While it remains the case that the only way to progress is to sit down and memorize them, it helps to have a smart process, and some processes are likely better than others, either for keeping you motivated or for prioritizing your time and maximizing your incremental improvement.  One way to organize the work is to approach the words topically (say, learn all the words related to farming, and then all the words related to dentistry), or to choose words from a book you would like to read.  But the approach that appeals to me is working through words in order of overall frequency, for two reasons:

  1. You are still prioritizing the most important words.  As you get farther down the list, the words become less "important" in your comprehension.  So going the same "distance" down this list gets you "farther" in overall comprehension than other methods.  In fact, in that sense, this method is optimal.
  2. It helps you keep score, which can keep up morale.  If you start with a comprehensive frequency list, you can always tally how far you are, and how far you have left to go, and know that the time you have left to spend is less important than the time investment you've already made.

I haven't given enough thought to hashing it out yet, but if the "returns" to each subsequent word are diminishing, you could fit a model to describe your language status as a single, importance-weighted number.  "I'm on word 7,217 so I'm an 83.2%" or something---if that kind of thing helps inspire or motivate you.  Whatever the math, working by frequency is a way of vocabulary-learning that optimizes for returns per unit of study time.
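One simple way to get such a number, offered as a sketch rather than a worked-out model, is cumulative coverage: the share of all word occurrences in a corpus accounted for by the words you already know.  The function name `coverage_by_rank` and the toy counts below are my own inventions for illustration; real counts would come from whatever frequency list you use.

```python
from itertools import accumulate

def coverage_by_rank(counts):
    """Return the cumulative share of all word occurrences covered after each rank.

    `counts` is a list of occurrence counts, already sorted most frequent first.
    """
    total = sum(counts)
    return [running / total for running in accumulate(counts)]

# Toy counts (invented for illustration), most frequent word first.
toy_counts = [500, 250, 125, 75, 50]
coverage = coverage_by_rank(toy_counts)

# If you know the top 3 words, your score is the coverage at rank 3.
words_known = 3
score = coverage[words_known - 1]
print(f"Knowing the top {words_known} words covers {score:.1%} of the text")
```

With the toy counts, the first three words cover 87.5% of all occurrences, and each later word adds less than the one before it, which is the diminishing-returns shape described above.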

And now you're back to fox-work.  (Well, kind of: there's no way around the fact that you still have to memorize an outrageous amount in grunt-fashion, but at least you'll feel as clever as possible while you do it.)

So, now to the data mashup part.

How do we obtain practical frequency lists that include translations of your target language into your native language?  There are some lists you can buy (I arrived at this idea partially by working through a list of Homeric vocabulary in undergraduate Greek), and to get to mastery you should definitely use something like these.  But I found it more difficult to obtain similar lists for other languages, and was looking for something free and public domain---there is an impressive publication industry in foreign language methods, populated by entrenched players with understandably little incentive to produce something free.

I confess the method I am about to propose, as well as the list it generates, are of very middle, little, middling quality.  But they still have value as a foundation and reference.  The idea is that a word-frequency list in a language is not too hard to obtain (Wiktionary has several); the difficult part lies in linking it to a translation to facilitate the data-dump into your brain.  We'll do this by using machine translation to produce a best-guess translation-ese.  Easy, cheap, and quick.

But first, I can think of two historic failures of this approach that we need to be mindful of:

  1. English as She is Spoke -- English as She is Spoke is a comical phrasebook written in the 19th century by Pedro Carolino.  Carolino wanted to create a Portuguese-to-English reference.  Unfortunately, he didn't speak any English.  What he had available was a Portuguese-to-French phrasebook and a French-to-English dictionary.  Hilarious idiomatic mistranslations followed, to the extent that Mark Twain said of it "it is perfect, it must and will stand alone: its immortality is secure."  It has recently been resurrected in print by McSweeney's.  It's off copyright, so you can find a full, free version here.
  2. Lenin's German -- Vladimir Lenin, seeking to further his study of 19th century German philosophy, and having 14 months of prison time to spend, undertook studying the German language in isolation from traditional methods and tutors.  He wrote excitedly to friends that with only a dictionary, he would "break the spine of the language" by taking the list of words he used most in Russian, and finding their corresponding German translations (going in the "wrong" direction)---and threw in a good amount of the opaque and already antiquated vocabulary of German Idealism to boot (i.e., have you ever used the Hegelian word "sublation"---a concept describing how a thing negated is preserved and uplifted---in a sentence before?  For your sake, I hope not).  Needless to say, after his release from prison and subsequent visit to Germany, German speakers found him amusingly unintelligible.  (He later improved.)

So, if you're forewarned of these risks and willing to proceed, you can get a quick cheat sheet by aggregating some of the language data freely available into comprehensive word lists, organized by frequency, and providing accompanying automatic English translations.  Here is the recipe for making such a list, in a quick and dirty way you can try yourself without violating your online translator's Terms of Service:

  1. Get a frequency list for your chosen language from Wiktionary.
  2. Reformat the frequency list into a more usable shape.  The one I chose was in plain text, and had to be imported into Excel.
  3. Export the list as CSV because that's a format online translation tools prefer.
  4. Upload the document.csv into your favorite online translation website and translate into English.  It will output a column of English words.
  5. Copy this list back into your original table.  You now have a keyed list, referencing foreign words sorted by frequency to their English translations.
  6. Study hard, but observe that many of the translations need some work, or are downright incorrect.
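If you would rather script steps 2, 3, and 5 than click through a spreadsheet, the following is a minimal sketch.  It assumes the frequency list arrives as whitespace-separated "rank word count" lines, which may not match the list you download, and the function names and toy counts are mine.  Step 4 still happens by hand on the translation website; the code never calls a translator.

```python
import csv
import io

def parse_frequency_list(text):
    """Step 2: parse lines like '1 a 1000' into (rank, word, count) tuples.

    The three-column, whitespace-separated layout is an assumption;
    adjust the parsing to whatever list you actually download.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank lines and anything with an unexpected shape
        rank, word, count = parts
        if not (rank.isdigit() and count.isdigit()):
            continue  # skip header rows such as 'rank word count'
        rows.append((int(rank), word, int(count)))
    return rows

def words_to_csv(rows):
    """Step 3: one foreign word per line, ready to upload to a translator."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for _rank, word, _count in rows:
        writer.writerow([word])
    return buffer.getvalue()

def merge_translations(rows, translated_text):
    """Step 5: pair each row with the translated line at the same position."""
    translations = translated_text.splitlines()
    if len(translations) != len(rows):
        raise ValueError("translated column does not line up with the word list")
    return [(rank, word, english.strip())
            for (rank, word, _count), english in zip(rows, translations)]

# Toy input: the ranks and counts here are invented for illustration.
raw = "rank word count\n1 a 1000\n2 az 800\n3 lett 90\n"
rows = parse_frequency_list(raw)
upload = words_to_csv(rows)

# Pretend this column came back from the translation website in step 4.
translated = "the\nthe\nLatvian\n"
keyed = merge_translations(rows, translated)
for rank, word, english in keyed:
    print(rank, word, english)
```

The length check in `merge_translations` is there because the whole scheme depends on the translated column lining up row for row with the original; if the translator merges or drops a line, it is better to fail loudly than to study a shifted list.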

When I applied the method above to Hungarian, I was scanning through my word list and came across the 78th most common Hungarian word: "lett."  This was machine-translated as "Latvian."  Now, this is the correct way to translate "Latvian" into Hungarian, but it's not owing to frequent discussions of the city politics in Riga or Baltic Sprat exports that this word earned the lofty place #78---it's instead because the primary meaning of "lett" in Hungarian is "it became" or "it had been"---a much more common definition.

So, be wary of this but let's also consider: how did the online translation get this so wrong?  It's likely by design!  That is---you're probably less likely to look up the primary definitions of common words than their uncommon definitions---or else you wouldn't need to look it up.  In English, you've probably never looked up "AM" as in "I AM!"---but you might have looked up "AM" because you didn't know how to spell "Ante Meridiem" or couldn't remember "Amplitude Modulation" Radio.  The same will likely be true for the common words of any language.

My advice then is that when you do the procedure above, throw out the first 1,000 words!  If you put in your 6 months as I described above, you know them already, and you will find the rest of the list very helpful in optimizing your study time.

Even better, there's certainly a way to improve the above for those more ambitious or more technically inclined:

  1. Get a frequency list from Wiktionary.
  2. Add each foreign word as a row in a database, in a table corresponding to that language.
  3. Loop through each word, and feed it into your favorite online translator.  I'd recommend feeding it to two or three, and having them aggregate each meaning.  You'll need to be clever in the algorithm, but you want to look for a means of collecting agreement, something like having them vote against each other, and earn points for being right.
  4. Record the multiple English translations of each word in the fields.
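As a starting point for the "voting" idea in step 3, here is a minimal sketch that counts how many sources proposed each translation.  The source names and their outputs are hypothetical, typed in by hand; no real translation service or API is called.  The "earn points for being right" part, which would weight sources by their track record, is not implemented.

```python
from collections import Counter

def aggregate_translations(candidates_by_source):
    """Rank candidate translations by how many sources proposed them.

    `candidates_by_source` maps a source name to the list of English
    glosses that source returned for one foreign word.
    """
    votes = Counter()
    for glosses in candidates_by_source.values():
        # Lowercase, trim, and de-duplicate so each source votes once per gloss.
        for gloss in {g.strip().lower() for g in glosses if g.strip()}:
            votes[gloss] += 1
    # Most votes first; ties broken alphabetically so the order is stable.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical outputs from three unnamed translators for one word.
lett = {
    "translator_a": ["Latvian"],
    "translator_b": ["became", "Latvian"],
    "translator_c": ["became", "has become"],
}
ranked = aggregate_translations(lett)
for gloss, count in ranked:
    print(count, gloss)
```

Keeping every gloss with its vote count, rather than only the winner, fits step 4: the list of multiple translations is what gets recorded for each word.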

Now, this still won't be any better than the collection of online databases you're using, but by comparing them against each other, you might get better translations.  By collecting multiple definitions, you might get better insights. And by putting it in a database, you gain the ability to sort, search, and compare reversed frequencies, which can be very helpful when "breaking the spine of the language" in your own way.
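For the database side, a small sketch using SQLite from Python's standard library shows the kind of sorting and searching I have in mind.  The table layout, column names, and helper functions are my own assumptions, and the rows are toy data (only the rank of "lett" comes from my actual list).

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hungarian (
        freq_rank INTEGER PRIMARY KEY,
        word      TEXT NOT NULL,
        english   TEXT
    )
""")
conn.executemany(
    "INSERT INTO hungarian (freq_rank, word, english) VALUES (?, ?, ?)",
    [(1, "a", "the"), (2, "az", "the"), (78, "lett", "Latvian")],
)

def words_from_rank(connection, start_rank):
    """Return rows at or below a given frequency rank, most frequent first."""
    cursor = connection.execute(
        "SELECT freq_rank, word, english FROM hungarian "
        "WHERE freq_rank >= ? ORDER BY freq_rank",
        (start_rank,),
    )
    return cursor.fetchall()

def search_english(connection, term):
    """Return every foreign word whose recorded translation equals `term`."""
    cursor = connection.execute(
        "SELECT freq_rank, word FROM hungarian WHERE english = ? ORDER BY freq_rank",
        (term,),
    )
    return cursor.fetchall()

print(words_from_rank(conn, 50))
print(search_english(conn, "the"))
```

A query like `words_from_rank` is also an easy way to skip the most common words you already know and start studying from a chosen rank.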

Right now, the engineering of the list is at "duct tape" level: you can do it yourself, but it springs a lot of leaks.  Because this data is generally privately owned, this may not yet be possible for anything other than private use.  But if online translators' Terms of Service allow, I'd like to bring this project up to the "moon-landing" level.  If you're a data professional or translator and would like to work together on this, please contact me.  I do believe that compiling a publicly owned, efficient frequency list is a great study aid and a beneficial pro-bono activity, and that if we present a sufficiently novel algorithm for comparing word definitions, we might even be able to use the data legally under the Terms of Service as a creative repurposing.

Until then, well---I'll be ordering coffee and diligently commenting on newspaper headlines "as they are spoke."

Very interesting and humorous essay. Have you seen the app for smart phones that allows you to speak into your phone and plays it back in another language?