Marvin Lanhenke



#30DaysOfNLP

NLP-Day 9: Performing Latent Semantic Analysis With PCA

How to create and make sense of topic vectors?

LSA with PCA #30DaysOfNLP [Image by Author]

In the previous article, we introduced the theoretical concept of Latent Semantic Analysis and briefly got to know its relatives, LDA and LDiA.

Now, we get practical.

In the following sections, we’re going to perform LSA using sklearn’s implementation of PCA. We will learn how to load and preprocess a text file containing 5,574 SMS, how to create a TF-IDF matrix, and how to create and make sense of topic vectors.

So take a seat, don’t go anywhere, and make sure to follow #30DaysOfNLP: Performing Latent Semantic Analysis With PCA.

No data. No topics

In order to get started, we need something to work with. We need a dataset.

Fortunately, the UCI Machine Learning Repository has got us covered. From the repository, we can download the SMS Spam Collection dataset that contains 5,574 SMS, labeled either “ham” or “spam”.

However, we need to do a little preprocessing to make this dataset suitable for our purposes.

First of all, we import all necessary libraries.

Next, we define a little helper function. Inside the function, we simply read the file and create a data frame. Each row contains an SMS together with its label, but we want to separate the label from the message body. Thus, we apply the split() function again, drop the unnecessary column, and return the preprocessed data frame.
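The original gist is not reproduced here, so the following is only a minimal sketch of such a helper. It assumes the file is tab-separated with the label first (as in the UCI download); the function name `load_sms` and the column names `raw`, `spam`, and `text` are made up for illustration.

```python
import io

import pandas as pd


def load_sms(source):
    """Turn lines of the form 'label<TAB>message' into a data frame."""
    # Read every non-empty line into a single-column data frame.
    lines = [line.rstrip("\n") for line in source if line.strip()]
    df = pd.DataFrame({"raw": lines})
    # Split once on the first tab: label on the left, message on the right.
    parts = df["raw"].str.split("\t", n=1, expand=True)
    df["spam"] = (parts[0] == "spam").astype(int)
    df["text"] = parts[1]
    # Drop the combined column that is no longer needed.
    return df.drop(columns="raw")


# Tiny in-memory stand-in for the real file (hypothetical sample rows).
sample = io.StringIO("ham\tSee you at 5?\nspam\tWIN a free prize now!\n")
sms = load_sms(sample)
print(sms)
```

With the real dataset, the same function could be called on an open file handle, for example `with open("SMSSpamCollection", encoding="utf-8") as f: sms = load_sms(f)`, where the file name is an assumption about how the download is stored locally.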

Now, we’re all set and ready to move on.

An old friend. TF-IDF

This step should be routine by now since we’ve used the TfidfVectorizer in the previous episodes. We just instantiate the TfidfVectorizer class and apply the fit_transform() function to our data frame. One thing to notice is that we pass in the casual tokenizer from the NLTK library. This makes sense since SMS contain a wide variety of colloquial terms.

We convert the sparse TF-IDF matrix to a dense array and put it inside a data frame to simplify further processing. Our TF-IDF matrix is now finished and contains 5,574 TF-IDF vectors (one for each SMS). Each vector carries 9,270 TF-IDF scores.
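As a rough illustration of this step, here is a sketch on a handful of made-up messages rather than the full dataset. The sample texts and variable names are assumptions, and the regex fallback only exists so the snippet still runs when NLTK is not installed.

```python
import re

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

try:
    from nltk.tokenize import casual_tokenize  # tokenizer aimed at informal text
except ImportError:
    # Fallback for this sketch only: a crude regex tokenizer.
    def casual_tokenize(text):
        return re.findall(r"\w+|[^\w\s]", text)

# Hypothetical mini corpus standing in for the 5,574 SMS.
texts = [
    "WIN a free prize now!!!",
    "r u coming home 2nite? :)",
    "Call me when you are home",
]

# token_pattern=None silences the warning that the default pattern is unused.
vectorizer = TfidfVectorizer(tokenizer=casual_tokenize, token_pattern=None)
tfidf_sparse = vectorizer.fit_transform(texts)

# Densify the sparse matrix and wrap it in a data frame (one row per SMS).
tfidf_df = pd.DataFrame(tfidf_sparse.toarray())
print(tfidf_df.shape)
```

On the full dataset, the resulting shape would correspond to the 5,574 rows and 9,270 columns mentioned above.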

It’s time for the final step. Creating our topic vectors.

Making sense of LSA

And this is it. Thanks to the sklearn library, we can perform LSA pretty much effortlessly. We basically just instantiate the PCA class and apply the fit_transform() function to our TF-IDF matrix from before. Next, we store the resulting matrix inside a data frame, allowing us to visualize the result more easily.
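A minimal sketch of this step could look as follows. It runs on random numbers instead of real TF-IDF scores; the helper name `make_topic_vectors`, the `topic0` to `topic15` column labels, and the fixed seed are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA


def make_topic_vectors(tfidf_df, n_topics=16, seed=0):
    """Project a dense TF-IDF data frame onto n_topics principal components."""
    pca = PCA(n_components=n_topics, random_state=seed)
    # fit_transform centers the data and returns one topic vector per row.
    topic_matrix = pca.fit_transform(tfidf_df.to_numpy())
    columns = [f"topic{i}" for i in range(n_topics)]
    topic_df = pd.DataFrame(topic_matrix, columns=columns, index=tfidf_df.index)
    return pca, topic_df


# Toy stand-in for the TF-IDF data frame: 40 "messages", 60 "terms".
rng = np.random.default_rng(0)
toy_tfidf = pd.DataFrame(rng.random((40, 60)))

pca, topic_df = make_topic_vectors(toy_tfidf)
print(topic_df.shape)
```

Note that PCA centers the data before decomposing it. sklearn also offers TruncatedSVD, which skips the centering and can work directly on sparse input.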

Let’s inspect our data frame containing 16 topic vectors.

Data frame with topic vectors [Screenshot by Author]

So far so good.

Our topic-vector matrix contains 5,574 rows. Each row represents one SMS as a vector with 16 dimensions, telling us how strongly that SMS belongs to each topic.

However, making sense of a topic, understanding what it stands for, isn’t straightforward at all.

Nonetheless, let’s give it a try.

First of all, we have to rearrange the vocabulary since the TfidfVectorizer stores the tokens inside a dictionary where the keys are the tokens and the values represent indices. We, however, want to sort by the indices, allowing us to use the terms as columns in our data frame.

Once we have extracted the terms in the correct order, we create our data frame. We pull the weights from the components_ attribute of the fitted PCA object, which tell us how much each term contributes to a certain topic, and store them inside our data frame.

Next, we create a list of keywords that sketch out a spammy topic.

If we filter the data frame by our keywords and add up the remaining weights for each topic, we get the following result.

Filtered topics [Screenshot by Author]

Topics 4 and 15 seem to be particularly spammy, whereas topics 3 and 12 have nothing to do with our list of keywords.
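The steps above can be sketched on a toy corpus. The messages, the keyword list, and the choice of three topics are all made up for illustration, so the numbers will not match the screenshot.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus: three spammy and three ordinary messages.
texts = [
    "win a free prize now",
    "free cash prize call now",
    "are you coming home tonight",
    "call me when you are home",
    "see you at lunch tomorrow",
    "claim your free cash now",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts).toarray()
pca = PCA(n_components=3, random_state=0).fit(tfidf)

# vocabulary_ maps token -> column index; sort by index to get column order.
terms = [
    term
    for term, idx in sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1])
]

# One row per topic, one column per term: the weight of each term in a topic.
topic_labels = [f"topic{i}" for i in range(pca.n_components_)]
weights = pd.DataFrame(pca.components_, columns=terms, index=topic_labels)

# Hypothetical keywords that sketch out a spammy topic.
keywords = ["free", "prize", "cash", "win"]

# Keep only the keyword columns and sum them up per topic.
spam_score = weights[keywords].sum(axis=1)
print(spam_score.round(3))
```

A topic with a large positive sum leans towards the keywords, while a sum close to zero suggests the topic has little to do with them.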

And this concludes our small adventure in the realm of Latent Semantic Analysis.

Conclusion

In this article, we got down to business.

We used our theoretical knowledge about Latent Semantic Analysis and applied PCA to a real-world dataset, resulting in a matrix of topic vectors.

However, making sense of this matrix and of the computed topics isn’t straightforward at all. We used the weights to get an idea of how much a term contributes to a certain topic. Equipped with this knowledge, we can merely guess what a topic might be about.

Until now, we have ignored the nearby context of a word: the surrounding words and the relationships between neighbors.

In the following articles, we will tackle this problem.

Starting with word vectors.

So take a seat, clean your glasses, make sure to follow, and never miss a single day of the ongoing series #30DaysOfNLP.


References / Further Material:

  • Dataset: SMS Spam Collection Data Set (https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection#). License: CC BY 4.0. Dua, D. and Graff, C. (2019). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.
  • Hobson Lane, Cole Howard, Hannes Max Hapke. Natural Language Processing in Action. New York: Manning, 2019.