Applications of NLP
Using keyword extraction for unsupervised text classification in NLP
A hybrid approach to an unsupervised classification task
Text classification is a common task in Natural Language Processing. The main approach is to represent the text in a meaningful way — whether through TF-IDF, Word2Vec, or more advanced models like BERT — and train models on those representations as labelled inputs. Sometimes, however, labelling the data is impractical, or there is simply not enough labelled data to build an effective multiclass classification model. In those cases we are forced to leverage unsupervised learning methods in order to accomplish the classification task.
In this article, I’ll be outlining the process I took to build an unsupervised text classifier for the dataset of interview questions at Interview Query, a data science interview/career prep website. This would be greatly beneficial to them for several reasons. Interview Query wants to be able to offer more insightful information for users about the companies that they are applying to, as well as the functionality to practice only certain question types. Most importantly, it would enable them to “characterize” different companies by the types of questions that they ask.
Our task is to classify a given interview question as relating to machine learning, statistics, probability, Python, product management, SQL, A/B testing, algorithms, or take-home. I decided the most practical approach would be to first extract as many relevant keywords as possible from the corpus, and then manually assign the resulting keywords into “bins” corresponding to our desired classifications. Finally, I’d iterate through each interview question in the dataset and compare the total counts of keywords in each bin in order to classify it. I also considered Latent Dirichlet Allocation, to generate topic models and retrieve relevant keywords for each topic without having to assign them manually, as well as K-means clustering. Both proved more difficult and less effective than simply counting keywords, given the wide and disparate range of our classifications.
First, the data had to be cleaned and preprocessed. I used spaCy to tokenize, lemmatize, lowercase, and remove stop words from the text.
import pandas as pd
import spacy
from tqdm import tqdm

nlp = spacy.load("en_core_web_sm")

def create_tokens(dataframe):
    # Lemmatize, lowercase, and strip punctuation, whitespace, and stop words.
    tokens = []
    for doc in tqdm(nlp.pipe(dataframe.astype('unicode').values), total=dataframe.size):
        if doc.is_parsed:
            tokens.append([n.lemma_.lower() for n in doc
                           if not n.is_punct and not n.is_space and not n.is_stop])
        else:
            # Keep the output aligned with the input even if parsing fails.
            tokens.append("")
    return tokens

# Read the questions as a Series (assuming the CSV holds a single column of question text).
raw = pd.read_csv("topics_raw.csv").iloc[:, 0]
tokens = create_tokens(raw)
After this came the problem of choosing a way to extract keywords from the corpus. Since my corpus consisted of a massive number of small “documents,” each one a different interview question, I decided to extract keywords from each document separately rather than combining any of the data, and then sort the unique keywords from the resulting list by frequency.
Then, testing began. I tried various methods, such as TF-IDF and RAKE, as well as some more recent, state-of-the-art ones such as SGRank, YAKE, and TextRank. I was also curious enough to try Amazon Comprehend, an AutoML solution, to see how competent it was; unfortunately, the results were unsatisfactory, as its high level of abstraction was a poor match for the granularity this NLP task required. In the end, after comparing the keywords produced by each method, I found that SGRank gave the best results (the highest number of relevant keywords).
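For reference, textacy exposes several of these extractors behind a near-identical interface, which made side-by-side comparison straightforward. A minimal sketch of that kind of comparison, assuming a textacy version that still ships the textacy.ke module and using a made-up sample question:

import spacy
import textacy.ke

nlp = spacy.load("en_core_web_sm")

# A made-up interview question, purely for illustration.
sample = "How would you design an A/B test to measure the impact of a new ranking model?"
doc = nlp(sample)

# Each extractor returns a list of (term, score) pairs.
print("TextRank:", textacy.ke.textrank(doc, topn=3))
print("YAKE:    ", textacy.ke.yake(doc, topn=3))
print("SGRank:  ", textacy.ke.sgrank(doc, topn=3))

The SGRank pass over the full set of tokenized questions is below.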
import textacy
import textacy.ke

text = " ".join(raw.tolist())

nlp = spacy.load('en_core_web_sm')
nlp.max_length = len(text)

keywords = []
for tokenlist in tqdm(tokens):
    doc = nlp(" ".join(tokenlist))
    # Extract the top 2 single-word noun/proper-noun keyterms from each question.
    extract = textacy.ke.sgrank(doc, ngrams=(1,), window_size=2, normalize=None,
                                topn=2, include_pos=['NOUN', 'PROPN'])
    for term, score in extract:
        keywords.append(term)
Finally, I sorted unique keywords by frequency in order to get the most salient ones.
res = sorted(set(keywords), key = lambda x: keywords.count(x), reverse=True)
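An equivalent ranking can be computed with collections.Counter, which scans the keyword list once instead of re-counting it for every unique word:

from collections import Counter

# most_common() returns (keyword, count) pairs sorted by descending frequency.
res = [keyword for keyword, count in Counter(keywords).most_common()]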
The result was around 1,900 words, which I then went through manually, assigning the 200 most relevant ones to our bins.
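In code, the bins can be represented as a simple mapping from category to keyword set. The keywords below are invented placeholders, shown only to illustrate the shape of the structure; the real lists came from the manual pass described above.

# Hypothetical keywords, shown only to illustrate the structure of the bins.
bins = {
    "machine learning": {"model", "regression", "overfitting", "feature"},
    "statistics":       {"distribution", "variance", "hypothesis"},
    "probability":      {"dice", "coin", "bayes"},
    "python":           {"list", "dictionary", "pandas"},
    "product":          {"metric", "launch", "user"},
    "sql":              {"join", "query", "table"},
    "a/b testing":      {"experiment", "control", "significance"},
    "algorithms":       {"sort", "complexity", "recursion"},
    "take-home":        {"dataset", "assignment", "deliverable"},
}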
With the final list of categorized keywords, it is possible to classify each interview question as one of the types above by counting the appearances of keywords from each bin in the question. Furthermore, we can generate “personality” profiles for different companies, which are displayed on the website.
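Here is a minimal sketch of that counting step, reusing the hypothetical bins above (the classify function and its fallback behaviour are my own illustration, not Interview Query’s production code): each question’s tokens are tallied against every bin, and the bin with the most matches wins.

def classify(tokenlist, bins):
    # Tally how many of the question's tokens fall into each bin.
    hits = {label: sum(token in keywords for token in tokenlist)
            for label, keywords in bins.items()}
    best = max(hits, key=hits.get)
    # If nothing matched, leave the question unlabelled rather than guessing.
    return best if hits[best] > 0 else None

labels = [classify(tokenlist, bins) for tokenlist in tokens]

# A company "personality" profile is then just the distribution of its question
# types. Given a (hypothetical) parallel list of company names, for example:
# from collections import Counter
# profile = Counter(label for company, label in zip(companies, labels)
#                   if company == "ExampleCo" and label is not None)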
In conclusion, I found that for this specific problem it was best to opt for a hybrid approach to the unsupervised classification task, one that involved both machine learning and manual work.
Generally, working without labels in unsupervised Natural Language Processing leaves quite some distance between the analysis of the data and the practical application of the results, forcing alternative approaches like the one described in this article. This is, in my opinion, a deficiency that is more severe in NLP than in fields such as computer vision or generative modelling. Of course, I anticipate that future advances in more insightful models and other research will bring marked improvements in this regard.
Anyway, thanks for reading this article! I hope you learned something.