Oliver Lövström



Language Analysis of a Medium Story

How can you perform language analysis on a Medium story — and why should you care?

Have you ever wondered why your stories perform the way they do?

Let’s create a tool that analyzes our Medium stories!

Photo by Gilly on Unsplash

What metrics can we derive from just the text of a story?

From simple to advanced metrics:

  1. Word Frequency: What are the most common words in a text?
  2. Text structure: What is the sentence length and word length in a post?
  3. Readability: How readable is the text?
  4. Sentiment: In what tone is the story written? Is it positive, negative, or somewhere in between?
  5. Keywords: What are the keywords that summarize the story?
  6. Topics: Can the story be categorized?

Let’s get started!

Open the Text

To begin, we open the text:

with open("stories/text.txt", "r", encoding="utf-8") as file:
    text = file.read()

Word Frequency

We calculate the word frequency using a regex that extracts all the words from the post:

from collections import Counter
import re

words = re.findall(r'\b\w+\b', text.lower())
word_counts = Counter(words)

most_common_words = word_counts.most_common(10)
print(most_common_words)

# [('i', 11), ('this', 8), ('my', 7), ('to', 7), ('it', 7), ('a', 6), ('is', 6), ('external', 4), ('traffic', 4), ('and', 4)]
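
Most of the top words above are filler like "i" and "to". One way to surface more meaningful words is to skip common stop words before counting. This is only a sketch: `STOP_WORDS` and `top_content_words` are names made up for this example, and the stop-word list is deliberately tiny.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list; extend it for real use
STOP_WORDS = {"i", "this", "my", "to", "it", "a", "is", "and", "the", "of", "in"}


def top_content_words(text, n=10):
    """Return the n most common words that are not in STOP_WORDS."""
    words = re.findall(r"\b\w+\b", text.lower())
    content_words = [word for word in words if word not in STOP_WORDS]
    return Counter(content_words).most_common(n)


sample = "External traffic is my goal. This external traffic is a mystery to me."
print(top_content_words(sample, 3))
```

A larger, curated list (for example the one that ships with NLTK) would likely give better results than this hand-written set.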

Text Structure

We use regex to extract the paragraphs and the sentences from the text, which we then use to calculate the average length:

import re

paragraphs = re.split(r'\n\s*\n', text)
print(paragraphs)

num_paragraphs = len(paragraphs)
paragraph_lengths = [len(paragraph.split()) for paragraph in paragraphs]

sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
sentence_lengths = [len(sentence.split()) for sentence in sentences]

avg_paragraph_length = sum(paragraph_lengths) / num_paragraphs
avg_sentence_length = sum(sentence_lengths) / len(sentence_lengths)

print(avg_paragraph_length, avg_sentence_length)

# 11.8 14.75
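
The code above covers paragraphs and sentences. Average word length can be sketched the same way, reusing the word regex from the word frequency step. The helper name `average_word_length` is an assumption made for this example.

```python
import re


def average_word_length(text):
    """Average number of characters per word, or 0.0 for text without words."""
    words = re.findall(r"\b\w+\b", text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


print(average_word_length("We analyze our Medium stories"))

# 5.0
```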

Readability

There are many measures of readability. Let’s consider:

  • Grade Level: The grade level in the U.S. education system required to understand the text (the Flesch-Kincaid grade).
  • Reading Ease: How easy the text is to read, as a Flesch reading-ease score that typically falls between 0 and 100. Higher is easier.

import textstat

reading_ease = textstat.flesch_reading_ease(text)
grade_level = textstat.flesch_kincaid_grade(text)
print(grade_level)
print(reading_ease)

# 5.0
# 77.74
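
To see where these two numbers come from, here is a sketch of the standard Flesch formulas computed from raw counts. The function names are made up for this example, and the counts are invented. textstat counts words, sentences, and syllables with its own rules, so its results may differ slightly from a hand calculation.

```python
def reading_ease_from_counts(words, sentences, syllables):
    """Flesch reading ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def grade_level_from_counts(words, sentences, syllables):
    """Flesch-Kincaid grade level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Invented counts: 100 words, 5 sentences, 150 syllables
print(round(reading_ease_from_counts(100, 5, 150), 2))
print(round(grade_level_from_counts(100, 5, 150), 2))

# 59.63
# 9.91
```

Both formulas use the same two ratios: words per sentence and syllables per word. Shorter sentences and shorter words push the reading ease up and the grade level down.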

Sentiment

Let’s calculate the sentiment of the story:

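import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# The analyzer needs the VADER lexicon; download it once if it is missing
nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()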

sentiment_scores = sia.polarity_scores(text)
print(sentiment_scores)

# {'neg': 0.119, 'neu': 0.721, 'pos': 0.16, 'compound': 0.936}
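
The `compound` value is a single score between -1 and 1 that summarizes the other three. A common convention with VADER is to treat scores of 0.05 and above as positive and -0.05 and below as negative. The sketch below assumes that convention; the helper name `label_sentiment` and the adjustable `threshold` parameter are made up for this example.

```python
def label_sentiment(compound, threshold=0.05):
    """Turn a compound score into a label using a symmetric threshold."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_sentiment(0.936))

# positive
```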

Keywords

Now, we extract the keywords:

from transformers import T5Tokenizer, T5ForConditionalGeneration

keyword_model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
keyword_tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

input_sequences = ["Keywords: " + text]

input_ids = keyword_tokenizer(input_sequences, return_tensors="pt", truncation=False).input_ids
output = keyword_model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)

predicted_keywords = keyword_tokenizer.decode(output[0], skip_special_tokens=True)
print(predicted_keywords)

# external traffic, read-to-view ratio
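
The model returns the keywords as one comma-separated string. If you want to work with them further, a small helper can turn that string into a clean list. This is a sketch, and `parse_keywords` is a name made up for this example.

```python
def parse_keywords(predicted_keywords):
    """Split a comma-separated keyword string into a de-duplicated list."""
    seen = set()
    keywords = []
    for keyword in predicted_keywords.split(","):
        keyword = keyword.strip().lower()
        # Skip empty entries and keywords we have already kept
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords


print(parse_keywords("external traffic, read-to-view ratio"))

# ['external traffic', 'read-to-view ratio']
```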

Topics

Moving on to the topics:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from scipy.special import expit

topic_tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/tweet-topic-21-multi")
topic_model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/tweet-topic-21-multi")
class_mapping = topic_model.config.id2label

all_scores = []
all_topics = []

# Split the text into 100-character chunks: the model's input length is limited
# (measured in tokens, not characters), so short chunks stay well below the limit
input_length = 100
chunks = [text[i:i+input_length] for i in range(0, len(text), input_length)]

# Process each chunk of text
for chunk in chunks:
    tokens = topic_tokenizer(chunk, return_tensors="pt")
    output = topic_model(**tokens)

    scores = output.logits
    scores = expit(scores.detach().numpy())

    for score in scores:
        scores_with_topics = []
        for i, topic_score in enumerate(score):
            topic = class_mapping[i]
            scores_with_topics.append((topic, topic_score))
        all_scores.append(scores_with_topics)

# Compute average scores
average_scores = [0] * len(class_mapping)
for scores_with_topics in all_scores:
    for i, (_, score) in enumerate(scores_with_topics):
        average_scores[i] += score

# Normalize average scores
average_scores = [score / len(all_scores) for score in average_scores]

# Sort topics based on scores
sorted_topics = sorted(zip(class_mapping.values(), average_scores), key=lambda x: x[1], reverse=True)

# Print sorted topics
for topic, score in sorted_topics:
    print(f"{topic.replace('_', ' ')}: {score:.4f}")

# diaries & daily life: 0.4806
# news & social concern: 0.2519
# other hobbies: 0.1693
# ...
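
Slicing the text every 100 characters can cut words in half, which may confuse the tokenizer. One alternative is to split on word boundaries with the standard library's `textwrap` module. This is a sketch of a possible swap for the `chunks` line, not something I have benchmarked against the original; the helper name `chunk_on_word_boundaries` is made up for this example.

```python
import textwrap


def chunk_on_word_boundaries(text, max_chars=100):
    """Split text into chunks of at most max_chars without cutting words.

    Whitespace (including newlines) is collapsed to single spaces. Words
    longer than max_chars are still broken, since they cannot fit otherwise.
    """
    return textwrap.wrap(text, width=max_chars, break_on_hyphens=False)


sample = "Have you ever wondered why your stories perform the way they do? " * 5
chunks = chunk_on_word_boundaries(sample)
print(len(chunks), max(len(chunk) for chunk in chunks))
```

The rest of the topic code can stay the same, since it only loops over `chunks`.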

That’s Everything!

Well done! Now, you’re able to perform a text analysis using Python!

Please let me know if there are any metrics that I’ve missed!

Day 26 out of 30: Want to check out how I’ve used this tool? See “What Makes a Story Great?” (https://readmedium.com/what-makes-a-story-great-d9a9a94883cf).

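Links

  • Regex Tutorial: A Simple Cheatsheet by Examples (https://readmedium.com/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)
  • ChatGPT for regexes (https://chat.openai.com/)
  • Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer (https://huggingface.co/Voicelab/vlt5-base-keywords). Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk, ACIIDS 2022 (https://arxiv.org/abs/2209.14008)
  • Tweet Topic 21 Multi (https://huggingface.co/cardiffnlp/tweet-topic-21-multi). Dimosthenis Antypas, Asahi Ushio, Jose Camacho-Collados, Vitor Silva, Leonardo Neves, and Francesco Barbieri. 2022. Twitter Topic Classification (https://aclanthology.org/2022.coling-1.299). In Proceedings of the 29th International Conference on Computational Linguistics, pages 3386–3400, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.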

Thank you again for reaching all the way to the end! 🙏
