Language Analysis of a Medium Story
How can you perform language analysis on a Medium story — and why should you care?
Have you ever wondered why your stories perform the way they do?
Let’s create a tool that analyzes our Medium stories!

What metrics can we derive from just the text of a story?
From simple to advanced metrics:
- Word Frequency: What are the most common words in a text?
- Text Structure: How long are the sentences and words in a post?
- Readability: How readable is the text?
- Sentiment: In what tone is the story written? Is it positive, negative, or somewhere in between?
- Keywords: What are the keywords that summarize the story?
- Topics: Can the story be categorized?
Let’s get started!
Open the Text
To begin, we open the text:
with open("stories/text.txt", "r") as file:
    text = file.read()

Word Frequency
We calculate the word frequency using a regex that extracts all the words from the post:
from collections import Counter
import re
words = re.findall(r'\b\w+\b', text.lower())
word_counts = Counter(words)
most_common_words = word_counts.most_common(10)
print(most_common_words)
# [('i', 11), ('this', 8), ('my', 7), ('to', 7), ('it', 7), ('a', 6), ('is', 6), ('external', 4), ('traffic', 4), ('and', 4)]

Text Structure
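Before moving on, note that the word counts above are dominated by function words such as "i", "this", and "to", which say little about the story's content. A minimal sketch of one way to filter them out follows. The small `STOP_WORDS` set and the `content_word_counts` helper are illustrative assumptions, not part of the original tool, and a fuller stop-word list would likely work better.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list for illustration only
STOP_WORDS = {"i", "this", "my", "to", "it", "a", "is", "and", "the", "of", "in"}

def content_word_counts(text, top_n=10):
    """Count words with the same regex as above, skipping anything in STOP_WORDS."""
    words = re.findall(r"\b\w+\b", text.lower())
    filtered = [word for word in words if word not in STOP_WORDS]
    return Counter(filtered).most_common(top_n)

sample = "This is my story. It is a story about external traffic and traffic sources."
print(content_word_counts(sample, top_n=3))
```

With the stop words removed, content words such as "story" and "traffic" rise to the top of the list.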
We use regex to extract the paragraphs and the sentences from the text, which we then use to calculate the average length:
import re

# Paragraphs are separated by one or more blank lines
paragraphs = re.split(r'\n\s*\n', text)
num_paragraphs = len(paragraphs)
paragraph_lengths = [len(paragraph.split()) for paragraph in paragraphs]

# Split after "." or "?" followed by whitespace, skipping abbreviations such as "e.g." or "Mr."
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
sentence_lengths = [len(sentence.split()) for sentence in sentences]

avg_paragraph_length = sum(paragraph_lengths) / num_paragraphs
avg_sentence_length = sum(sentence_lengths) / len(sentence_lengths)
print(avg_paragraph_length, avg_sentence_length)
# 11.8 14.75

Readability
There are many measures of readability. Let’s consider:
- Grade Level: The grade level in the U.S. education system required to understand the text.
- Reading Ease: How easy the text is to read, expressed as a score that typically falls between 0 and 100 (higher means easier).
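For intuition, both scores are built from two ratios: average words per sentence and average syllables per word. The sketch below applies the standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. The `count_syllables` heuristic (one syllable per group of vowels) and the `flesch_scores` helper are rough assumptions for illustration, so the results will differ somewhat from those of a dedicated library.

```python
import re

def count_syllables(word):
    """Rough heuristic (an assumption): one syllable per group of vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text):
    """Return (reading_ease, grade_level) using the standard Flesch formulas.

    Assumes the text contains at least one sentence and one word.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(word) for word in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

ease, grade = flesch_scores("The cat sat on the mat. The dog ran to the park.")
print(round(ease, 2), round(grade, 2))
```

Very simple text, like this sample of short one-syllable words, can score above 100 for reading ease and below grade 0. In practice, a library such as textstat handles syllable counting more carefully.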
import textstat
reading_ease = textstat.flesch_reading_ease(text)
grade_level = textstat.flesch_kincaid_grade(text)
print(grade_level)
print(reading_ease)
# 5.0
# 77.74

Sentiment
Let’s calculate the sentiment of the story:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# The VADER lexicon has to be downloaded once before first use
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()
sentiment_scores = sia.polarity_scores(text)
print(sentiment_scores)
# {'neg': 0.119, 'neu': 0.721, 'pos': 0.16, 'compound': 0.936}

Keywords
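Before moving on, a note on reading the sentiment output above: `neg`, `neu`, and `pos` are the proportions of the text that fall into each category, while `compound` is a single normalized score between -1 and 1. A minimal sketch of turning that score into a label follows. The cutoffs of 0.05 and -0.05 are the thresholds commonly suggested for VADER, and the `label_sentiment` helper is a hypothetical name introduced here.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER-style score dict to a coarse label using its compound value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output shown above
example_scores = {"neg": 0.119, "neu": 0.721, "pos": 0.16, "compound": 0.936}
print(label_sentiment(example_scores))  # positive
```

A compound score of 0.936 is close to the maximum, so the story reads as clearly positive overall even though most of its text is neutral.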
Now, we extract the keywords:
from transformers import T5Tokenizer, T5ForConditionalGeneration
keyword_model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
keyword_tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")
input_sequences = ["Keywords: " + text]
input_ids = keyword_tokenizer(input_sequences, return_tensors="pt", truncation=False).input_ids
output = keyword_model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)
predicted_keywords = keyword_tokenizer.decode(output[0], skip_special_tokens=True)
print(predicted_keywords)
# external traffic, read-to-view ratio

Topics
Moving on to the topics:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from scipy.special import expit

topic_tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/tweet-topic-21-multi")
topic_model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/tweet-topic-21-multi")
class_mapping = topic_model.config.id2label
all_scores = []

# Split the text into chunks, since the model only accepts a limited input length.
# That limit is counted in tokens rather than characters, so 100-character chunks
# stay comfortably within it for ordinary text.
input_length = 100
chunks = [text[i:i+input_length] for i in range(0, len(text), input_length)]

# Process each chunk of text
for chunk in chunks:
    tokens = topic_tokenizer(chunk, return_tensors="pt")
    output = topic_model(**tokens)
    scores = output.logits
    # The sigmoid turns the logits into independent per-topic scores (multi-label)
    scores = expit(scores.detach().numpy())
    for score in scores:
        scores_with_topics = []
        for i, topic_score in enumerate(score):
            topic = class_mapping[i]
            scores_with_topics.append((topic, topic_score))
        all_scores.append(scores_with_topics)

# Sum the scores per topic across all chunks
average_scores = [0] * len(class_mapping)
for scores_with_topics in all_scores:
    for i, (_, score) in enumerate(scores_with_topics):
        average_scores[i] += score

# Divide by the number of chunks to get the average
average_scores = [score / len(all_scores) for score in average_scores]

# Sort topics based on scores
sorted_topics = sorted(zip(class_mapping.values(), average_scores), key=lambda x: x[1], reverse=True)

# Print sorted topics
for topic, score in sorted_topics:
    print(f"{topic.replace('_', ' ')}: {score:.4f}")
# diaries & daily life: 0.4806
# news & social concern: 0.2519
# other hobbies: 0.1693
# ...

That’s Everything!
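One caveat about the topic code above: slicing the text every 100 characters can cut words in half, which may change how they are tokenized. A minimal sketch of an alternative that breaks only at whitespace follows. `chunk_by_words` and its `max_chars` parameter are hypothetical names introduced here, and the character budget is still only a rough proxy for the model's token limit.

```python
def chunk_by_words(text, max_chars=100):
    """Group whole words into chunks of at most max_chars characters.

    A single word longer than max_chars becomes its own (oversized) chunk.
    Runs of whitespace, including newlines, collapse to single spaces.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks

sample = "external traffic can change the read ratio of a story"
for chunk in chunk_by_words(sample, max_chars=20):
    print(repr(chunk))
```

The returned list could stand in for `chunks` in the loop above, so that every chunk the model sees consists of complete words.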
Well done! Now, you’re able to perform a text analysis using Python!
Please let me know if there are any metrics that I’ve missed!
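One candidate is word length, which appears in the metric list at the top but is not computed by the code above. A minimal sketch follows, reusing the same word regex as the word-frequency step. The `word_length_stats` helper is a hypothetical name introduced here.

```python
import re

def word_length_stats(text):
    """Return the average word length and the longest word in the text."""
    words = re.findall(r"\b\w+\b", text.lower())
    if not words:
        return 0.0, ""
    avg_length = sum(len(word) for word in words) / len(words)
    longest = max(words, key=len)
    return avg_length, longest

avg_length, longest = word_length_stats("External traffic matters.")
print(avg_length, longest)
```

Calling `word_length_stats(text)` on the story would give a figure to set alongside the average sentence and paragraph lengths.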
And if you enjoyed this tutorial:
- 👏 Give the story 50 claps!
- ✅ Follow Oliver Lövström
- ☕ Support me by buying me a coffee
Day 26 out of 30: Want to check out how I’ve used this tool?
Links
- Regex Tutorial: A Simple Cheatsheet by Examples
- ChatGPT for regexes
- Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer — Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk, ACIIDS 2022
- Tweet Topic 21 Multi — Dimosthenis Antypas, Asahi Ushio, Jose Camacho-Collados, Vitor Silva, Leonardo Neves, and Francesco Barbieri. 2022. Twitter Topic Classification. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3386–3400, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Thank you again for reaching all the way to the end! 🙏
