The State of NLP Literature: Part II
Areas of Research (Examining Terms in Paper Titles)
This series of posts presents a diachronic analysis of the ACL Anthology or, as I like to think of it, making sense of NLP literature through pictures.

(Thanks for your interest in this work. Here is Part I (Size and Demographics) in case you have not already seen it.)
Natural Language Processing addresses a wide range of research questions and tasks pertaining to language and computing. It encompasses many areas of research that have seen an ebb and flow of interest over the years. In this post, we examine the terms that have been used in the titles of ACL Anthology (AA) papers. The terms in a title are particularly informative because they are used to clearly and precisely convey what the paper is about. Some journals ask authors to separately include keywords in the paper or in the meta-information, but AA papers are largely devoid of this information. Thus titles are an especially useful source of keywords for papers — keywords that are often indicative of the area of research. Keywords could also be extracted from abstracts and papers; we leave that for future work.
Further work is also planned on inferring areas of research using word embeddings, techniques from topic modelling, and clustering. There are clear benefits to performing analyses using that information. However, those approaches can be sensitive to the parameters used. In this post, we keep things simple and explore counts of terms in paper titles. Thus the results are easily reproducible and verifiable.
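As a rough illustration of what "counts of terms in paper titles" can look like in practice, here is a minimal Python sketch. It is not the code used to produce the figures in this post. The function names, the tokenization rule, and the tiny function-word list are assumptions made for the example; the post does not specify which function-word list was used.

```python
from collections import Counter
import re

# Hypothetical, deliberately small stop list (an assumption for this sketch).
FUNCTION_WORDS = {
    "a", "an", "the", "of", "for", "in", "on", "to",
    "and", "with", "from", "by", "as", "is", "at",
}

def tokenize(title):
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())

def title_term_counts(titles):
    """Count unigrams and bigrams over a list of titles.

    Function words are skipped for unigrams. A bigram is counted only
    when its two words are adjacent in the title and neither is a
    function word (one possible convention, assumed here).
    """
    unigrams, bigrams = Counter(), Counter()
    for title in titles:
        tokens = tokenize(title)
        unigrams.update(t for t in tokens if t not in FUNCTION_WORDS)
        bigrams.update(
            (a, b)
            for a, b in zip(tokens, tokens[1:])
            if a not in FUNCTION_WORDS and b not in FUNCTION_WORDS
        )
    return unigrams, bigrams

# Usage with two made-up example titles:
example_titles = [
    "Neural Machine Translation of Rare Words",
    "A Survey of Machine Translation",
]
unigram_counts, bigram_counts = title_term_counts(example_titles)
print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Because the whole procedure is a few counters over title strings, anyone with the list of titles can rerun it and check the numbers, which is the reproducibility point made above.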
Papers (most pertinent to this post):
- NLP Scholar: A Dataset for Examining the State of NLP Research. Saif M. Mohammad. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC-2020). May 2020. Marseille, France.
- The State of NLP Literature: A Diachronic Analysis of the ACL Anthology. Saif M. Mohammad. arXiv preprint arXiv:1911.03562. November 2019.
See full list of associated papers in the About Page.
Title Terms
The title has a privileged position in a paper. It serves many functions, and here are three key ones (from an article by Sneha Kulkarni):
“A good research paper title:
1. Condenses the paper’s content in a few words
2. Captures the readers’ attention
3. Differentiates the paper from other papers of the same subject area”
If we examine the titles of papers in the ACL Anthology, we would expect that because of Function 1 many of the most common terms will be associated with the dominant areas of research. Function 2 (attempting to have a catchy title), on the other hand, arguably leads to more unique and less frequent title terms. Function 3 seems crucial to the effectiveness of a title. At first glance it may seem that this will lead to unique title terms, but often one needs to establish a connection with something familiar in order to convey how the work being presented is new or different.
It is also worth noting that a catchy term today will likely not be catchy tomorrow. Similarly, a distinctive term today may not be distinctive tomorrow. For example, early papers used neural in the title to distinguish themselves from non-neural approaches, but these days neural is not particularly discriminative as far as NLP papers go.
Thus, competing and complex interactions are involved in the making of titles. Nonetheless, an arguable hypothesis is that:
broad trends in interest towards an area of research will be reflected, to some degree, in the frequencies of title terms associated with that area over time.
However, even if one does not believe in that hypothesis, it is worth examining the terms in the titles of tens of thousands of papers in the ACL Anthology — spread across many decades.
Q1. What terms are used most commonly in the titles of the AA papers? How has that changed with time?
A. Below are the most common unigrams (single words) and bigrams (two-word sequences) in the titles of papers published from 1980 to 2019, ignoring function words. The timeline graph at the bottom shows the percentage of occurrences of the unigrams over the years (the colors of the unigrams in the Timeline match those in the Title Unigram list).


Discussion: Appropriately enough, the most common term in the titles of NLP papers is language. The presence of several high-ranking terms pertaining to machine translation suggests that it is an area of research that has received considerable attention.
Other areas associated with the high-frequency title terms include lexical semantics, named entity recognition, question answering, word sense disambiguation, and sentiment analysis. In fact, the common bigrams in the titles often correspond to names of NLP research areas. Some of the bigrams, like shared task and large scale, are not areas of research, but rather mechanisms or trends of research that apply broadly to many areas. The unigrams also provide additional insights, such as the interest of the community in the Chinese language, and in areas such as speech and parsing.
The Timeline graph is crowded in this view, but clicking on a term from the unigram list will filter out all other lines from the timeline. This is especially useful for determining whether the popularity of a term is growing or declining. (One can already see from above that neural has broken away from the pack in recent years.) Since there are many lines in the Timeline graph, Tableau labels only some (you can see neural and machine). However, hovering over a line in the eventual interactive visualization will display the corresponding term, as shown in the screenshot below:

Despite being busy, the graph sheds light on the relative dominance of the most frequent terms and how that has changed with time. The vocabulary of title words is smaller for papers from the 1980s than for those from recent years. (This is to be expected, since far fewer papers were published then.) Further, dominant terms such as language and translation accounted for a higher percentage of occurrences then than in recent years, when there is a much larger diversity of topics and the dominant research areas are not as dominant as they once were.
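For readers who want to reproduce a timeline of this kind, the per-year percentages could be computed roughly as in the Python sketch below. This is an illustration under stated assumptions, not the code behind the Tableau visualization. In particular, it assumes the percentage for a term in a given year is its count divided by the total number of non-function-word unigram occurrences in that year's titles; the post does not spell out the denominator. The function name and the small function-word list are likewise hypothetical.

```python
from collections import Counter, defaultdict
import re

# Hypothetical, deliberately small stop list (an assumption for this sketch).
FUNCTION_WORDS = {
    "a", "an", "the", "of", "for", "in", "on", "to",
    "and", "with", "from", "by", "as", "is", "at",
}

def yearly_term_percentages(papers):
    """Return {year: {term: percentage}} for title unigrams.

    `papers` is an iterable of (year, title) pairs. The percentage is
    the term's share of all non-function-word unigram occurrences in
    the titles of that year (assumed denominator).
    """
    counts = defaultdict(Counter)
    for year, title in papers:
        tokens = re.findall(r"[a-z0-9]+", title.lower())
        counts[year].update(t for t in tokens if t not in FUNCTION_WORDS)

    percentages = {}
    for year, year_counts in counts.items():
        total = sum(year_counts.values())
        percentages[year] = {
            term: 100.0 * n / total for term, n in year_counts.items()
        }
    return percentages

# Usage with made-up example titles:
example_papers = [
    (1985, "Parsing Natural Language"),
    (1985, "Language Generation"),
    (2018, "Neural Machine Translation"),
]
shares = yearly_term_percentages(example_papers)
print(shares[1985]["language"])
```

Normalizing within each year is what makes early and recent years comparable despite the large growth in the number of papers. It also explains why a term's share can fall even while its raw count rises.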
A blow-up of the timeline (2000–2019 portion):

Q2. What are the most frequent unigrams and bigrams in the titles of recent papers?
A. Below are the most frequent unigrams and bigrams in the titles of papers published from January 2016 to June 2019 (the time of data collection):

Discussion: Some of the terms that have made notable gains in the top 20 unigrams and bigrams lists in recent years include: neural machine (presumably largely due to the phrase neural machine translation), neural network(s), word embeddings, recurrent neural, deep learning and the corresponding unigrams (neural, networks, etc.). We also see gains for terms related to shared tasks such as SemEval and task.
See the keynote slides embedded below for lists from various time spans. (Click on the navigation button on the center right of the image to change time span, or better yet, first click on the icon at the bottom right of the image to go full screen. Use right and left arrow keys to navigate.)









