Other ML Jargons: Sparse and Dense Representations of Texts for Machine Learning
A Brief Introduction to Vectorization and its Importance in the Context of NLP
Introduction
Machine Learning (ML) algorithms learn patterns and make predictions from information quantified as matrices and vectors. To apply these techniques to textual data, numeric representations of the texts are engineered into matrices that hold the relevant information from those texts. The concepts of “Sparsity” and “Density” help in designing and constructing these matrices efficiently for high-dimensional data processing use-cases in the world of Artificial Intelligence.
Significance of Vector Representations for NLP
Representing text data as vectors is necessary for applying Machine Learning techniques to make predictions, recommendations, or clusters. In NLP, the idea that “similar words occur in similar contexts” is fundamental. Let’s see how:
- In Text Classification use-cases like categorizing support tickets, spam detection, fake news detection, and feedback sentiment analysis, texts having similar words are classified into a particular category.
- In Recommendation Systems, people with similar profile details, browsing history, and past orders tend to have similar choices or tastes in products. This information is used to make recommendations.
- Unsupervised Clustering looks for patterns and similar words in the texts to group documents and articles. Typical applications include segregating news articles, trend analysis, and customer segmentation.
- In Information Retrieval systems, indexed documents are matched with queries, sometimes in a “fuzzy” way, and the collection of matched documents is returned to the user. In addition, a measure of similarity is used to rank the search results.
Hence, capturing similarities in these vectors is a primary research area in the NLP domain. These vectors are projected into an N-dimensional space, and patterns among them in that space are extracted to categorize the texts. Sometimes, dimensionality reduction techniques such as PCA or t-SNE are applied. The design of the vectors controls the overall performance of text-based ML models and is, hence, crucial.
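To make "capturing similarities in vectors" concrete, here is a minimal sketch using cosine similarity, which is one common similarity measure and is chosen here purely for illustration. The toy vocabulary and the word counts are made up.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical word counts over the toy vocabulary
# ["goal", "match", "election", "vote"]
sports_a = [3, 2, 0, 0]
sports_b = [2, 3, 0, 0]
politics = [0, 0, 4, 1]

sim_sports = cosine_similarity(sports_a, sports_b)  # shares words: close to 1
sim_mixed = cosine_similarity(sports_a, politics)   # no shared words: 0

print(round(sim_sports, 3), sim_mixed)
```

The two sports vectors share their non-zero dimensions and score close to 1, while the sports and politics vectors have no word in common and score 0.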
The vector designs are broadly classified as “Sparse” (meaning scarcely populated) and “Dense” (meaning densely populated) vectors. In this article, I recall the concepts of matrices and vectors from a mathematical perspective, and then discuss these two classes of vectorization techniques, sparse vector representations and dense vector representations, including demos using Scikit Learn and Gensim, respectively. I conclude the article with an overview of the applications and usability of these representations.
A Primer to Matrices and Vectors
Mathematically, a matrix is defined as a 2-dimensional rectangular array of numbers. If the array has m rows and n columns, then it is a matrix of size m × n.
If a matrix has only one row or only one column, it is called a vector. A 1 × n matrix is a row vector (n columns but only 1 row), and an m × 1 matrix is a column vector (m rows but only 1 column). Here’s an image that clearly demonstrates this:

Here’s a primer for scalars, vectors, and matrices and an Introduction to Vectors and Matrices using Python for Data Science.
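The shapes described above can also be checked in code. The following is a small sketch using NumPy (an assumption, since this section does not name a library); the values in the arrays are arbitrary.

```python
import numpy as np

# A matrix with m = 2 rows and n = 3 columns
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])

# A row vector is a 1 x n matrix
row_vector = np.array([[1, 2, 3]])

# A column vector is an m x 1 matrix
column_vector = np.array([[1], [2], [3]])

print(matrix.shape)         # (2, 3)
print(row_vector.shape)     # (1, 3)
print(column_vector.shape)  # (3, 1)

# Transposing a row vector gives a column vector
print(row_vector.T.shape)   # (3, 1)
```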
Sparse Representations | Matrices | Vectors
In almost all real-world cases, the count-based numeric representation of information is sparse in nature. In other words, only a small fraction of the numeric representation is useful to you in an ocean of numbers.
Intuitively, this is because in a collection of documents, only function words such as articles, prepositions, conjunctions, and pronouns are used heavily and therefore have a high frequency of occurrence. In contrast, in a collection of sports news articles, terms like ‘soccer’ or ‘basketball’, whose occurrences would help us determine which sport an article is about, occur only a few times.
Now, suppose we construct a vector per news article with one element per word position, where an element is 1 if the word at that position is ‘soccer’ and 0 otherwise. Assuming there are 50 words per article, the word ‘soccer’ would occur about 5 times. Hence, 45 of the 50 elements in the vector will be zero, indicating the absence of the word we are focusing on. Therefore, 90% of the vector of length 50 is redundant. This is the idea behind one-hot vector generation.
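The 50-word arithmetic above can be reproduced with a short sketch. The article here is a placeholder: the filler words and the five positions of ‘soccer’ are made up to match the assumed counts.

```python
# Hypothetical 50-word article; 'soccer' is placed at 5 made-up positions.
words = ["filler"] * 50
for position in (3, 12, 20, 33, 47):
    words[position] = "soccer"

# 1 where the word is 'soccer', 0 everywhere else
indicator = [1 if word == "soccer" else 0 for word in words]

zeros = indicator.count(0)
sparsity = zeros / len(indicator)  # fraction of elements that are zero

print(zeros, sparsity)  # 45 0.9
```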
Another typical example of sparse matrix generation is the Count Vectorizer, which counts how many times each word occurs in a document. It generates one “count vector” per document, and together these form a matrix of size d × v, where d is the number of documents and v is the number of unique words (the vocabulary size) in the collection of documents.
Here’s a demonstration of how a Count Vectorizer works:
Below are four different meanings of the word ‘demo’, each of which represents one document:
Document 1: a demonstration of a product or technique
Document 2: a public meeting or march protesting against something or expressing views on a political issue
Document 3: record a song or piece of music to demonstrate the capabilities of a musical group or performer or as preparation for a full recording
Document 4: demonstrate the capabilities of software or another product
I used Scikit Learn’s CountVectorizer implementation to generate this sparse matrix for these four “documents”. Below is the code I have used 👩💻







