An efficient language detection model using Naive Bayes
A simple language detection model in Python

Language detection (or identification) is a fascinating branch of Natural Language Processing. Its goal is to create a model that is able to detect the language a text is written in.
Data Scientists usually employ neural network models to accomplish such a goal. In this article, I show how to create a simple language detection model in Python using a Naive Bayes model.
The problem
The problem we are going to face is to create a model that, once fed with a text, is able to detect its language. The text can be a sentence, a word, a more complicated text, and so on. The output variable can be, for example, the language code (like “en” for English).
A good idea would be to have a model that detects the language of a text even if this text contains words that the model has not seen in the training phase. We want a model that can generalize the underlying structure of a language in a way that makes it detect it properly.
Let’s see how to create such a model.
The data
I’m going to train this model to detect three languages: Italian, English and German. For the Italian language, I’m going to train the model using the text of a short horror tale of mine, L’isola del male. For the English language, I’ll use the English translation of this tale, The island of evil. For the German language, I’ll use the text of Also sprach Zarathustra of Friedrich Nietzsche found here: http://www.nietzschesource.org/#eKGWB/Za-I. I’ll split these documents into sentences and the final dataset of sentences will be split into training and test sets.
Vectorization using char bigrams
So, we are talking about a classification problem with a 3-valued class, that are: it, en, de. This is the target variable.
Now, let’s talk about the features. The features I’m going to use are char bigrams, that are sets of 2 consecutive characters in a sentence. Consider, for example, the word “planet”. In this case, the char bigrams are “pl”, “la”, “an”, “ne”, “et”.
Why should we use char bigrams? Because it will reduce dimensionality. The Latin alphabet is made of 26 letters. Adding the 10 digits and some other particular character or symbol, we reach roughly 50 distinct symbols. The highest number of char bigrams we can have is then 50*50 = 2500.
If we used the classical word-driven vectorization, we’ll have a higher dimensionality, because each language may have hundreds of thousands of words and we may need to vectorize them all, creating a vector of thousands of features and suffering the curse of dimensionality. Using the char bigrams we get a maximum of 2500 components and they are shared among all the languages using the Latin alphabet. It’s pretty impressive because we can use this corpus as a general feature set for every language using such an alphabet. Moreover, we are not bound with a pre-defined corpus of words, so our model will work even with words it has never been trained on, as long as they are made by the char bigrams of the original training corpus.
In our particular case, the complete word vocabulary of the three documents is made by 7062 distinct words. If we work with chars char bigrams, we get 842 features. We have reduced the dimensionality of our problem almost by 90%!
So, these are the features. For each sentence, we calculate the char bigrams and count how many times a particular bigram appears in the sentence. This number will fill the corresponding vector component representing the sentence.
The model
The model I’m going to use is a Multinomial Naive Bayes. It’s a very simple model and Naive Bayes is, generally speaking, very powerful when it comes to talking about Natural Language Processing. It almost has no hyperparameters, so we can focus on the pre-processing phase, which is the most critical.
According to scikit-learn docs, Multinomial Naive Bayes can take count vectors as input features, which is exactly what we need.
The code
Let’s import some libraries:









