Summary

This article is a tutorial on extracting keywords from text using spaCy in Python. The keywords can be used to generate hashtags or to score the importance of a sentence.

Abstract

The article is a guide to using spaCy, a natural language processing (NLP) library, to identify and extract keywords from text. It begins with setup instructions: installing the library and downloading a language model. It then moves to the implementation, showing how to write a Python function that tokenizes text, filters out stopwords and punctuation, and keeps the tokens whose part-of-speech tags match a chosen list. The function's output can be turned into hashtags or sorted by frequency for further analysis. The article concludes by recapping the steps taken.

Opinions

The author recommends using spaCy for its industrial strength and effectiveness in natural language processing tasks.
It is suggested that readers use a virtual environment when installing spaCy to avoid potential conflicts with other Python packages.
The author emphasizes the importance of choosing the right size for the language model based on the user's needs, offering options ranging from small to large models.
The tutorial advocates for the practical use of the extracted keywords, such as generating hashtags, which can be particularly useful for social media content creation.
The author provides a personal touch by sharing their satisfaction with the results obtained from the keyword extraction process.
The author endorses the most_common method of Counter for sorting keywords by frequency, highlighting its utility in NLP applications.
The article closes by recapping the steps and inviting readers to follow the author's next piece.

Extract Keywords Using spaCy in Python

Find the top keywords from an article and generate hashtags

In this piece, you’ll learn how to extract the most important keywords from a chunk of text, whether it’s an article, an academic paper, or even a short tweet. You can use the keywords to generate hashtags, score the importance of a sentence, and so on.

I’ll be using spaCy, an industrial-strength natural language processing library, for this tutorial. I previously wrote a tutorial on similarity matching with spaCy; feel free to check it out. This tutorial has three sections:

Setup
Implementation
Conclusion

1. Setup

We’ll install spaCy with pip. In spaCy v2, downloading a language model also creates a shortcut symlink, which can require administrative privileges, so you may need to open the terminal in administrator mode; spaCy v3 no longer creates these links. It’s highly recommended to create a virtual environment before you run the following command:

pip install -U spacy

The next step is to download the language model of your choice. I will be using the large English model for this tutorial. Feel free to check the official website for the complete list of available models.

en_core_web_lg (large)

python -m spacy download en_core_web_lg

The model is about 800MB (exact sizes vary between spaCy releases). If you just want to try things out, download one of the smaller models instead.

en_core_web_md (medium)

The medium model is much smaller at just 100MB.

python -m spacy download en_core_web_md

en_core_web_sm (small)

The smallest English language model should take only a moment to download as it’s around 11MB.

python -m spacy download en_core_web_sm

When you’re done, run the following command to check whether spaCy is working properly. It also indicates the models that have been installed.

python -m spacy validate
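
If you would rather check from Python, here is a minimal sketch. It relies on each model being installed as a regular Python package, so it only asks the standard library whether a package of that name can be found. The helper name installed_models is my own and is not part of spaCy:

```python
from importlib.util import find_spec

# The three English models mentioned in this tutorial.
MODEL_NAMES = ("en_core_web_sm", "en_core_web_md", "en_core_web_lg")


def installed_models(names=MODEL_NAMES):
    """Map each model name to True if a package with that name can be found."""
    return {name: find_spec(name) is not None for name in names}


print(installed_models())
```

This only tells you whether the package is present. It does not check that the model version is compatible with your spaCy version, which is what the validate command is for.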

Let’s move to the next section and start writing some code in Python.

2. Implementation

Import

First, add the spaCy import statement to the top of the file.

import spacy

Besides spaCy, we need the following imports. Counter counts the keywords so they can be sorted by frequency, while punctuation is a string containing the ASCII punctuation characters.

from collections import Counter
from string import punctuation

Load spaCy model

We can load the model that we have just installed with the following statement. Change the string to match the name of the model you’ve installed.

nlp = spacy.load("en_core_web_lg")

If the model fails to load even though it’s installed, you can load it in a different way: import the model package directly and call its load function.

import en_core_web_lg

nlp = en_core_web_lg.load()
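
If you are not sure which model size is installed, one option is to try them in order of preference. The following is only a sketch: load_first_available and its loader parameter are hypothetical names of mine, and it assumes that spacy.load raises OSError when a model cannot be found, which is how current spaCy releases behave:

```python
def load_first_available(
    names=("en_core_web_lg", "en_core_web_md", "en_core_web_sm"),
    loader=None,
):
    """Return the first model in `names` that loads successfully."""
    if loader is None:
        # Imported lazily so the helper can be defined before spaCy is installed.
        import spacy

        loader = spacy.load
    for name in names:
        try:
            return loader(name)
        except OSError:
            # Model not found; fall through to the next candidate.
            continue
    raise OSError(f"None of these models could be loaded: {', '.join(names)}")


# Usage, once spaCy and at least one model are installed:
# nlp = load_first_available()
```

The loader parameter exists only so that the function can be exercised without spaCy; in normal use you would leave it out.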

Hotword function

We’ll write the keyword extraction code inside a function so that we can call it whenever we need to extract keywords from a big chunk of text. It accepts a string as an input parameter. The numbered notes below describe each step of the function.

#1 A list containing the part-of-speech tags that we would like to extract. I will be using just PROPN (proper noun), ADJ (adjective) and NOUN (noun) for this tutorial. If you would like to extract other parts of speech, such as verbs, extend the list based on your requirements.

#2 Convert the input text to lowercase and tokenize it with the spaCy model that we loaded earlier. A processed Doc object is returned. It contains the Token objects produced by tokenization.

#3 Loop over each token and check whether its text is a stopword or punctuation. If it is, ignore the token and move on to the next one.

#4 Store the token’s text if its part-of-speech tag is one of the tags we specified in step 1.

#5 Return the result as a list of strings.
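
Putting the five steps together, the function might look like the following. Treat it as a sketch based on the description above rather than a definitive listing. The pipeline parameter is a hypothetical addition of mine so the function can be tried in isolation, and I check stopwords through the token's is_stop attribute, which is one of several ways to do it:

```python
from string import punctuation

POS_TAGS = ("PROPN", "ADJ", "NOUN")  # 1: part-of-speech tags to keep


def get_hotwords(text, pipeline=None):
    # `pipeline` is a hypothetical extra parameter; by default the function
    # falls back to the module-level `nlp` object loaded earlier.
    if pipeline is None:
        pipeline = nlp
    result = []
    doc = pipeline(text.lower())  # 2: lowercase, then tokenize and tag
    for token in doc:
        # 3: skip stopwords and punctuation
        if token.is_stop or token.text in punctuation:
            continue
        # 4: keep tokens whose tag is one of POS_TAGS
        if token.pos_ in POS_TAGS:
            result.append(token.text)
    return result  # 5: a list of keyword strings
```

Because the text is lowercased before tagging, proper nouns lose their capitalization, which may change the tags the model assigns. The exact keywords you get can therefore differ between models and spaCy versions.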

Let’s test it out by using a simple text of your choice. I’m using the following input text:

output = get_hotwords('''Welcome to Medium! Medium is a publishing platform where people can read important, insightful stories on the topics that matter most to them and share ideas with the world.''')

print(output)

I obtained the following result after running the function.

['welcome', 'medium', 'medium', 'publishing', 'platform', 'people', 'important', 'insightful', 'stories', 'topics', 'ideas', 'world']

Remove duplicate items

Note that the list returned by the function contains duplicates when the same keyword appears more than once in the input text. Here, the keyword medium appears twice. You can remove duplicates with the built-in set function:

output = set(get_hotwords('''Welcome to Medium! Medium is a publishing platform where people can read important, insightful stories on the topics that matter most to them and share ideas with the world.'''))

print(output)

You should get output like the following (a set is unordered, so the order may differ):

{'medium', 'ideas', 'publishing', 'important', 'stories', 'people', 'insightful', 'platform', 'world', 'topics', 'welcome'}
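
If you want to drop duplicates but keep the keywords in the order they first appeared, a common alternative is dict.fromkeys, since dictionaries preserve insertion order in Python 3.7 and later. Here is a small sketch that reuses the keyword list printed earlier, so it runs without spaCy:

```python
keywords = ['welcome', 'medium', 'medium', 'publishing', 'platform', 'people',
            'important', 'insightful', 'stories', 'topics', 'ideas', 'world']

# dict keys are unique and keep insertion order, so this removes
# duplicates while preserving the first occurrence of each keyword.
unique_keywords = list(dict.fromkeys(keywords))

print(unique_keywords)
# ['welcome', 'medium', 'publishing', 'platform', 'people', 'important',
#  'insightful', 'stories', 'topics', 'ideas', 'world']
```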

Generate hashtags from keywords

You can generate hashtags from keywords by prepending the hash symbol to every keyword. The easiest way to do this is with a list comprehension. Join the resulting list with spaces to produce a hashtag string:

output = set(get_hotwords('''Welcome to Medium! Medium is a publishing platform where people can read important, insightful stories on the topics that matter most to them and share ideas with the world.'''))

hashtags = [('#' + x) for x in output]
print(' '.join(hashtags))

Running it prints something like the following (the order may differ because a set is unordered):

#medium #ideas #publishing #important #stories #people #insightful #platform #world #topics #welcome

Sort by frequency

In some cases you may want the keywords ordered by frequency. To do that, use the Counter class to count the keywords and retrieve the most frequent ones. Counter has a most_common method that takes an integer n and returns the n most common items along with their counts. Remember to drop the set call so that the frequency of each keyword is retained.

output = get_hotwords('''Welcome to Medium! Medium is a publishing platform where people can read important, insightful stories on the topics that matter most to them and share ideas with the world.''')

hashtags = [('#' + x[0]) for x in Counter(output).most_common(5)]

print(' '.join(hashtags))

In this case, the five most common hashtags are as follows:

#medium #welcome #publishing #platform #people
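
To see the counts behind that ordering, you can inspect what most_common returns. The sketch below reuses the keyword list printed earlier so that it runs without spaCy. Keywords with equal counts come back in the order they were first seen (Python 3.7 and later), which is why welcome follows medium:

```python
from collections import Counter

keywords = ['welcome', 'medium', 'medium', 'publishing', 'platform', 'people',
            'important', 'insightful', 'stories', 'topics', 'ideas', 'world']

counts = Counter(keywords)

# most_common(n) returns a list of (keyword, count) pairs.
top_three = counts.most_common(3)
print(top_three)
# [('medium', 2), ('welcome', 1), ('publishing', 1)]

# Unpack each pair to build the hashtag string.
print(' '.join('#' + word for word, _ in top_three))
# #medium #welcome #publishing
```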

3. Conclusion

Let’s recap what we’ve learned today. We started by installing spaCy with pip install. Then we downloaded a pre-trained language model; in this case, I used the large English model.

Next, we wrote some simple code to implement our own keyword extractor. We defined a hotword function that accepts an input string and returns a list of keywords. We used Python’s built-in set function to remove duplicates from the result. A list comprehension made it easy to prepend the hash symbol to each keyword and build a hashtag string. Finally, we used the most_common method of the Counter class to sort the keywords by frequency.

Thanks for reading and I hope to see you in the next piece!