Language Analysis of a Medium Story

How can you perform language analysis on a Medium story — and why should you care?

Have you ever wondered why your stories perform the way they do?

Let’s create a tool that analyzes our Medium stories!

What metrics can we derive from just the text of a story?

From simple to advanced metrics:

Word Frequency: What are the most common words in a text?
Text Structure: How long are the sentences and words in a post?
Readability: How readable is the text?
Sentiment: In what tone is the story written? Is it positive, negative, or somewhere in between?
Keywords: What are the keywords that summarize the story?
Topics: Can the story be categorized?

Let’s get started!

Open the Text

To begin, we open the text:

with open("stories/text.txt", "r", encoding="utf-8") as file:
    text = file.read()

Word Frequency

We calculate the word frequency using a regex that extracts all the words from the post:

from collections import Counter
import re

words = re.findall(r'\b\w+\b', text.lower())
word_counts = Counter(words)

most_common_words = word_counts.most_common(10)
print(most_common_words)

# [('i', 11), ('this', 8), ('my', 7), ('to', 7), ('it', 7), ('a', 6), ('is', 6), ('external', 4), ('traffic', 4), ('and', 4)]
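
Most of the top ten are function words such as "i", "this", and "to", which say little about the story itself. Here is a minimal sketch of filtering them out before counting. It assumes a small hand-picked stop-word set; `STOP_WORDS` and `content_word_counts` are hypothetical names for illustration, not part of the original tool, and a fuller stop-word list could be swapped in.

```python
from collections import Counter
import re

# Hypothetical, deliberately small stop-word set (an assumption for this sketch).
STOP_WORDS = {"i", "this", "my", "to", "it", "a", "is", "and", "the", "of", "in"}


def content_word_counts(text, stop_words=STOP_WORDS):
    """Count the words in `text`, skipping any word found in `stop_words`."""
    words = re.findall(r"\b\w+\b", text.lower())
    return Counter(word for word in words if word not in stop_words)


sample = "This is my story. External traffic is a mystery, and external traffic matters to me."
print(content_word_counts(sample).most_common(2))
```

With the function words gone, the remaining top entries are much closer to what the story is about.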

Text Structure

We use regexes to split the text into paragraphs and sentences, then compute the average length of each in words:

import re

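# Paragraphs are separated by one or more blank lines
paragraphs = re.split(r'\n\s*\n', text)

num_paragraphs = len(paragraphs)
paragraph_lengths = [len(paragraph.split()) for paragraph in paragraphs]

# Split on whitespace that follows "." or "?", skipping abbreviations such as "e.g." or "Dr."
# Note that "!" is not treated as a sentence ending here
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
sentence_lengths = [len(sentence.split()) for sentence in sentences]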

avg_paragraph_length = sum(paragraph_lengths) / num_paragraphs
avg_sentence_length = sum(sentence_lengths) / len(sentence_lengths)

print(avg_paragraph_length, avg_sentence_length)

# 11.8 14.75
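
The metric list also mentions word length, which the snippet above does not compute. Here is a minimal sketch of that calculation, reusing the word regex from the Word Frequency section. `average_word_length` is a hypothetical helper name, and the sample sentence is made up for illustration.

```python
import re


def average_word_length(text):
    """Return the mean number of characters per word, or 0.0 for empty text."""
    words = re.findall(r"\b\w+\b", text)
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)


sample = "External traffic is a mystery."
print(average_word_length(sample))

# 5.0
```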

Readability

There are many measures of readability. Let’s consider two:

Grade Level: The Flesch-Kincaid grade, an estimate of the U.S. school grade level required to understand the text.
Reading Ease: The Flesch Reading Ease score, which usually falls between 0 and 100. Higher scores mean the text is easier to read.

import textstat

reading_ease = textstat.flesch_reading_ease(text)
grade_level = textstat.flesch_kincaid_grade(text)
print(grade_level)
print(reading_ease)

# 5.0
# 77.74
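
To see where these two numbers come from, here is a sketch of the standard Flesch formulas applied to raw counts. The function names and the counts in the example are assumptions for illustration. textstat counts words, sentences, and syllables with its own rules, so its results may differ slightly from a hand calculation.

```python
def flesch_reading_ease(total_words, total_sentences, total_syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def flesch_kincaid_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid grade: an approximate U.S. school grade level."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59


# Made-up counts: 100 words, 5 sentences, 130 syllables
print(round(flesch_reading_ease(100, 5, 130), 2))
print(round(flesch_kincaid_grade(100, 5, 130), 2))
```

Both formulas use the same two ratios: longer sentences and longer words lower the reading ease and raise the grade level.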

Sentiment

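Let’s calculate the sentiment of the story with NLTK’s VADER analyzer:

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# The analyzer needs the VADER lexicon; this downloads it if it is missing
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()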

sentiment_scores = sia.polarity_scores(text)
print(sentiment_scores)

# {'neg': 0.119, 'neu': 0.721, 'pos': 0.16, 'compound': 0.936}
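
The `compound` value summarizes the other three scores as a single number between -1 and 1. Here is a minimal sketch of turning it into a label, assuming the commonly used VADER cutoffs of ±0.05. `label_sentiment` and its `threshold` parameter are hypothetical names, and the cutoff is a convention you may want to tune.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_sentiment(0.936))

# positive
```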

Keywords

Now, we extract the keywords:

from transformers import T5Tokenizer, T5ForConditionalGeneration

keyword_model = T5ForConditionalGeneration.from_pretrained("Voicelab/vlt5-base-keywords")
keyword_tokenizer = T5Tokenizer.from_pretrained("Voicelab/vlt5-base-keywords")

input_sequences = ["Keywords: " + text]

input_ids = keyword_tokenizer(input_sequences, return_tensors="pt", truncation=False).input_ids
output = keyword_model.generate(input_ids, no_repeat_ngram_size=3, num_beams=4)

predicted_keywords = keyword_tokenizer.decode(output[0], skip_special_tokens=True)
print(predicted_keywords)

# external traffic, read-to-view ratio
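
The model returns the keywords as one comma-separated string. Here is a small sketch of turning that string into a clean list for further processing. `parse_keywords` is a hypothetical helper, and the lowercasing and deduplication steps are assumptions about what a downstream tool might want.

```python
def parse_keywords(predicted_keywords):
    """Split a comma-separated keyword string into a deduplicated list."""
    seen = set()
    keywords = []
    for raw in predicted_keywords.split(","):
        keyword = raw.strip().lower()
        # Skip empty fragments and keywords we have already collected
        if keyword and keyword not in seen:
            seen.add(keyword)
            keywords.append(keyword)
    return keywords


print(parse_keywords("external traffic, read-to-view ratio"))

# ['external traffic', 'read-to-view ratio']
```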

Topics

Moving on to the topics:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from scipy.special import expit

topic_tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/tweet-topic-21-multi")
topic_model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/tweet-topic-21-multi")
class_mapping = topic_model.config.id2label

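all_scores = []

# Split the text into chunks, since the model accepts only a limited number of
# tokens per input (the limit is counted in tokens, not characters).
# 100-character chunks stay well below that limit, but they may cut words in half.
input_length = 100
chunks = [text[i:i+input_length] for i in range(0, len(text), input_length)]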

# Process each chunk of text
for chunk in chunks:
    tokens = topic_tokenizer(chunk, return_tensors="pt")
    output = topic_model(**tokens)

    scores = output.logits
    scores = expit(scores.detach().numpy())

    for score in scores:
        scores_with_topics = []
        for i, topic_score in enumerate(score):
            topic = class_mapping[i]
            scores_with_topics.append((topic, topic_score))
        all_scores.append(scores_with_topics)

# Compute average scores
average_scores = [0] * len(class_mapping)
for scores_with_topics in all_scores:
    for i, (_, score) in enumerate(scores_with_topics):
        average_scores[i] += score

# Normalize average scores
average_scores = [score / len(all_scores) for score in average_scores]

# Sort topics based on scores
sorted_topics = sorted(zip(class_mapping.values(), average_scores), key=lambda x: x[1], reverse=True)

# Print sorted topics
for topic, score in sorted_topics:
    print(f"{topic.replace('_', ' ')}: {score:.4f}")

# diaries & daily life: 0.4806
# news & social concern: 0.2519
# other hobbies: 0.1693
# ...
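
The fixed 100-character slices used above can split a word across two chunks. Here is a sketch of an alternative that breaks on whitespace instead. `chunk_by_words` and `max_chars` are hypothetical names, and this is one possible approach rather than the method used to produce the scores above.

```python
def chunk_by_words(text, max_chars=100):
    """Group whole words into chunks of at most `max_chars` characters.

    A single word longer than `max_chars` is kept whole in its own chunk.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The next word would overflow the chunk, so start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(chunk_by_words("one two three four", max_chars=9))

# ['one two', 'three', 'four']
```

The returned list could replace `chunks` in the loop above. Because the chunks then differ from the character-based ones, the averaged scores would likely change as well.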

That’s Everything!

Well done! Now, you’re able to perform a text analysis using Python!

Please let me know if there are any metrics that I’ve missed!

And if you enjoyed this tutorial:

👏 Give the story 50 claps!
✅ Follow Oliver Lövström
☕ Support me by buying me a coffee

Day 26 out of 30: Want to check out how I’ve used this tool?

What Makes a Story Great?

Writing a one in a 1,000,000 piece

https://readmedium.com/what-makes-a-story-great-d9a9a94883cf

Links

Regex Tutorial: A Simple Cheatsheet by Examples (https://readmedium.com/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285)
ChatGPT for regexes (https://chat.openai.com/)
Keyword Extraction from Short Texts with a Text-To-Text Transfer Transformer (https://huggingface.co/Voicelab/vlt5-base-keywords). Piotr Pęzik, Agnieszka Mikołajczyk-Bareła, Adam Wawrzyński, Bartłomiej Nitoń, Maciej Ogrodniczuk, ACIIDS 2022 (https://arxiv.org/abs/2209.14008)
Tweet Topic 21 Multi (https://huggingface.co/cardiffnlp/tweet-topic-21-multi). Dimosthenis Antypas, Asahi Ushio, Jose Camacho-Collados, Vitor Silva, Leonardo Neves, and Francesco Barbieri. 2022. Twitter Topic Classification (https://aclanthology.org/2022.coling-1.299). In Proceedings of the 29th International Conference on Computational Linguistics, pages 3386–3400, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.

Thank you again for reading all the way to the end! 🙏
