What Language Is This? — About

I always get curious when I see a blog post in an unknown language. I mean not just a language that I don't speak, but one that I can't even identify. When I started working on this language identifier during winter 2006, another reason was that I wanted to find out how hard it would be to implement. I thought it might be prohibitively hard to identify a language with high enough certainty without googlishly huge databases and a powerful computer. In the end it turned out to be quite doable: it took just hundreds of hours of programming and probably a lot of luck, since my initial hunches on how to tune the algorithms proved to be pretty much spot on.

The data is gathered from the different language versions of Wikipedia, and the choice of which languages to include is based on how extensive the Wikipedia for each language is. It is hard to come by a good source of texts from which to gather the statistical data needed to identify a language, and Wikipedia seems to be the best reliable source. Unfortunately, since Wikipedia is written in a certain style, the language identifier is also biased toward that style of writing. For instance, slang or “chat” writing might be difficult for it to identify correctly.
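To make the idea of "statistical data" concrete, here is a minimal sketch of one common approach: character trigram frequencies compared by cosine similarity. This is not necessarily how the identifier itself works. The function names and the two tiny sample texts are made up for illustration, standing in for profiles that would really be built from large amounts of Wikipedia text.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping 3-character sequences in lowercased, space-normalized text."""
    cleaned = " ".join(text.lower().split())
    padded = f" {cleaned} "  # pad so word-initial and word-final trigrams are counted
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine_similarity(a, b):
    """Cosine similarity between two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy training texts (hypothetical); real profiles would need far more text.
TOY_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog sleeps in the sun",
    "fi": "nopea ruskea kettu hyppää laiskan koiran yli ja sitten koira nukkuu auringossa",
}
PROFILES = {code: trigram_profile(text) for code, text in TOY_SAMPLES.items()}


def identify(text):
    """Return the language code whose profile is most similar to the text."""
    probe = trigram_profile(text)
    return max(PROFILES, key=lambda code: cosine_similarity(probe, PROFILES[code]))


print(identify("the dog jumps over the fox"))
print(identify("koira nukkuu"))
```

A sketch like this also shows where the style bias comes from: a profile only knows the character sequences that appeared in its training text, so writing that looks very different from it scores poorly.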

Another thing that is not supported is transliterated text, that is, text written in characters not usually used to write that language. For instance, Asian languages written in Latin characters, such as Japanese romaji and Chinese pinyin, won't be recognized as Japanese or Chinese, because that kind of statistical data is not included in the language identifier. The simple reason is that there is no good source of such texts to analyze, since it's a very rarely used way of writing those languages. The only language that is supported in different scripts is Serbian, which is commonly written in both the Cyrillic and Latin alphabets.
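As an illustration of why the script matters, the sketch below guesses the dominant script of a text from Unicode character names, using Python's standard `unicodedata` module. The function name is hypothetical, and this is an assumption about how one might approach it, not the identifier's actual code.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script among the letters in text, or None if there are no letters.

    The script is approximated by the first word of each character's Unicode
    name, e.g. "CYRILLIC" for "CYRILLIC CAPITAL LETTER DE".
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")  # "" for characters without a name
            if name:
                counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


print(dominant_script("Добар дан"))  # CYRILLIC
print(dominant_script("Dobar dan"))  # LATIN
```

The same Serbian greeting lands in two different scripts, so it shares no characters between the two spellings. That is why a language written in two scripts needs statistics gathered separately for each.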

The text that you enter is not transmitted over the Internet and is not stored anywhere. The analysis of your text takes place in your web browser, using JavaScript. Only the resulting language code is sent to the server, only once per language identified, and only for the purpose of showing the top 5 list on the home page. I hope you don't mind. I think it's interesting to see what languages people enter into the language identifier.

So please give it a try and see how it works. It's pretty fun to copy-paste any piece of text you can find on the Internet into it, or to type something in a language you know yourself and see if it gets it right.

>> whatlanguageisthis.com