package and assign both <i>Independent</i> senators, Bernie Sanders and Angus S. King, Jr., to the Democrat class because they both <a href="https://www.democrats.senate.gov/about-senate-dems/our-caucus">caucus</a> with Democrats.</p> <figure id="fce3"> <div> <div>

            <iframe class="gist-iframe" src="/gist/m-newhauser/77c4ca5a0d310c6e171c2087f58cf62e.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="911a">We want to confirm that we only have our two target classes in the dataset and create a label mapping to later convert them to integers.</p>
    <figure id="105b">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/63815402a1a991074f7a9e240fd18313.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="3d2b">The DistilBERT classification model expects its targets in a column named <code>labels</code>, so we’ll add another column with our mapped labels.</p>
    <figure id="b5fe">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/ca0693a85c06c65317fb59ea1244d59a.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="fb75">Next, we create a <code>Dataset</code> from the data frame and split it into a <code>test</code> and <code>train</code> set.</p>
    <figure id="90a3">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/cd74d356519f8ee24d1da71c0fed933f.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="75f4">Then, we cast the <code>labels</code> column as a class-encoded column, which tells the dataset where to find the labels for the data and automatically formats it for training.</p>
    <figure id="a7c1">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/6fb10f1fb24826e3e3cfe292da52f36e.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="b8dc">Finally, the <code>Dataset</code> is ready for tokenizing!</p><h2 id="cc09">Tokenize dataset</h2><p id="ee8d">We start by loading the Transformers <code>AutoTokenizer</code> for DistilBERT.</p>
    <figure id="4d22">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/ffa7312d167e22d36a0cd2c949f4dfa1.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="82db">Next, we make a list of columns to remove from the dataset upon tokenization and tokenize the tweets in the <code>text</code> column.</p>
    <figure id="b1ae">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/daf79de88e03cd0381cec392a38279fe.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="a4c6">In the <code>tokenize()</code> function, we set <code>padding=True</code> to pad each tweet to the maximum length tweet in the batch. We also set <code>truncation=True</code> and <code>max_length=512</code> to truncate each tweet to the maximum sequence length for DistilBERT.</p><p id="7007">After tokenization is complete, we prepare the dataset to be fed to the model by setting its format to <code>"torch"</code> and creating data loads to correctly reshape the dataset.</p>
    <figure id="daa0">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/5cdf028e103c071d3edf8386267ede7d.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="6495">Now that the encoded dataset contains all the necessary inputs for DistilBERT, we instantiate a <code>DataLoader</code> for each split of the dataset to feed it to the model.</p>
    <figure id="95fc">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/3ed262d5952a54b8a72baa5f19a6ffec.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><h2 id="d4c2">Load the model</h2><p id="99d4">Before training the model, we need to load it from the HuggingFace Model Hub and specify its configuration and hyperparameters.</p><p id="ce2c">Besides providing the model <code>checkpoint</code>, the only other configuration we need to specify is <code>num_labels</code>. I programmatically define the number of labels by setting it equal to <code>num_classes</code> in the <code>Dataset</code>.</p>
    <figure id="a78f">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/ac565fe564b7475aeb69e906ab5fb6a9.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><h2 id="91d4">Define hyperparameters, optimizer, and learning rate scheduler</h2><p id="bf30">The <code>learning_rate</code> and <code>num_epochs</code> are the only hyperparameters we will define for this task.</p><ul><li><b>Epochs</b> are the number of full passes the model makes over the training dataset. In the past, I’ve had success with <code>num_epochs=5</code>.</li><li><b>Learning rate </b>specifies how much the model parameters are adjusted at each optimization step. It’s typical to start with <code>learning_rate=5e-5</code> and adjust it in subsequent training sessions, if necessary.</li></ul><p id="3e99">Optimization is the process in which model parameters are adjusted after each training batch to iteratively reduce model error. The <b>optimizer </b>is the specific algorithm used in the process. For our task, we follow HuggingFace’s <a href="https://huggingface.co/course/chapter3/4?fw=pt">lead</a> and use the <code>AdamW()</code> optimizer, to which we pass our selected <code>learning_rate</code>.</p>
    <figure id="f995">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/8afdce62ed240fbcb1ff9a415bbcf657.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="108c">Next, we move the model to the GPU, since we are using Google Colab.</p>
    <figure id="d47e">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/c0f6b8c48ab62c0239c73c370246003e.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><h2 id="f7db">Train model</h2><p id="8e81">Finally, we are ready to train! I copied this code, and the following evaluation loop code, from <a href="https://huggingface.co/course/chapter3/4?fw=pt">Chapter 3</a> of HuggingFace’s Course. I highly recommend the course for both learning and as a reference.</p>
    <figure id="5c69">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/aaabbfca2198e10113de66040925eef4.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><h2 id="a160">Save model</h2><p id="122c">I usually step away from the computer while my models are training, so I always execute a code cell that saves the model locally in case the notebook times out.</p>
    <figure id="5379">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/183851d6892506f4e75290b223b89017.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="2c3e">Because Google Drive has limited storage space and model files tend to be quite large, pushing the model to Huggingface’s Model Hub is a great option to both store the model and share it with others. There are several ways to upload a model to the Hub, but I found the easiest and quickest way was to use <code>huggingface_hub</code> within the training notebook.</p><p id="5640">We start by importing the module and installing <code>git-lfs</code>, an <a href="https://git-lfs.github.com/">open-source</a> Git extension for versioning large files.</p>
    <figure id="6298">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/3e25b25a4b701ebf7ba31fb1646dc653.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="efff">Next, we log in to our Huggingface account (create an account <a href="https://huggingface.co/join">here</a>, if you haven’t already).</p>
    <figure id="3819">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/a778bf6e0067a1e362e4ee076199291a.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="e107">The last step is to configure your Git settings and push the model.</p>
    <figure id="24bd">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/e9d61579c816eae3f96f86e85768b236.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="a3b1">Our model is saved and can be accessed <a href="https://huggingface.co/m-newhauser/distilbert-political-tweets">here</a>!</p><h2 id="6adf">Evaluate model</h2><p id="0b12">Now that we’ve saved the model, we need to assess its performance. The 🤗 <code>datasets</code> library offers several NLP performance metrics that we can easily import to evaluate the model.</p><p id="6db1">Specifying <code>metric = load_metric("glue", "mrpc")</code> will instantiate a <code>metric</code> object from the HuggingFace <a href="https://huggingface.co/metrics">metrics repository</a> that calculates the accuracy and F1 score of the model.</p><p id="c9c2">If you need a refresher on the confusion matrix and binary performance metrics, I highly recommend checking out this <a href="https://machinelearningmastery.com/confusion-matrix-machine-learning/#:~:text=A%20confusion%20matrix%20is%20a%20summary%20of%20prediction%20results%20on,key%20to%20the%20confusion%20matrix.">article</a>.</p>
    <figure id="f0a0"> <div> <div>

            <iframe class="gist-iframe" src="/gist/m-newhauser/0974aaeb46cb2f9dbff251ef32383422.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="2452">Both the accuracy and F1 score are looking good! Let’s inference the model with some more recent tweets that weren’t in the training or test dataset.</p><p id="2604">Here’s a <a href="https://twitter.com/SenTedCruz/status/1501309601605734408">tweet</a> from Texas Senator Ted Cruz, a well-known Republican:</p>
    <figure id="ba89">
        <div>
          <div>
            <img class="ratio" src="http://placehold.it/16x9">
            <iframe class="" src="https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;schema=twitter&amp;url=https%3A//twitter.com/sentedcruz/status/1501309601605734408&amp;image=https%3A//i.embed.ly/1/image%3Furl%3Dhttps%253A%252F%252Fabs.twimg.com%252Ferrors%252Flogo46x38.png%26key%3Da19fcc184b9711e1b4764040d3dc5c07" allowfullscreen="" frameborder="0" height="281" width="500">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="69f0">After sending the tweet through the tokenizer and feeding it to the model, we are given the logits.</p>
    <figure id="363a">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/707a0cc68a4024cb2839db10873a3372.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="a03a">Because the raw outputs of the model, called logits, are hard to interpret, we apply a softmax function, which re-scales the values into probabilities in the range [0, 1] that sum to 1. Because our model is binary, we get a probability for each of the two classes. Recall that during data preprocessing, we assigned the tweets to numeric classes: <code>{'Republican': '0', 'Democrat': '1'}</code>. This means that the first value in the resulting tensor is the probability that the tweet belongs to the Republican class. Below, you can see that <i>the model predicts Senator Cruz’s tweet as Republican with a probability of 99.999%!</i></p>
    <figure id="89ea">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/6830d26d23f91c4530cb9664e8657b44.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="e3df">Now let’s try a tweet from Massachusetts Senator Elizabeth Warren, a Democrat:</p>
    <figure id="e84f">
        <div>
          <div>
            <img class="ratio" src="http://placehold.it/16x9">
            <iframe class="" src="https://cdn.embedly.com/widgets/media.html?type=text%2Fhtml&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;schema=twitter&amp;url=https%3A//twitter.com/ewarren/status/1498410222683148291&amp;image=https%3A//i.embed.ly/1/image%3Furl%3Dhttps%253A%252F%252Fabs.twimg.com%252Ferrors%252Flogo46x38.png%26key%3Da19fcc184b9711e1b4764040d3dc5c07" allowfullscreen="" frameborder="0" height="281" width="500">
          </div>
        </div>
    </figure></iframe></div></div></figure><p id="b315"><i>The model assigns Senator Warren’s tweet to the Democrat class with a probability of 100%!</i></p>
    <figure id="49e5">
        <div>
          <div>
            
            <iframe class="gist-iframe" src="/gist/m-newhauser/818a6e194ae028e3113f2be89ed9f4d0.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
          </div>
        </div>
    </figure></iframe></div></div></figure><h1 id="beb2">Conclusion</h1><p id="7070">In this article, I illustrate how I created a custom dataset of US Senator tweets and used it to fine-tune an accurate DistilBERT text classification model that predicts the political party affiliation of a senator based on a tweet they have posted.</p><p id="142c">Here’s a brief summary of all the steps involved:</p><p id="9a17">1. Scrape historical tweets from Twitter using <code>snscrape</code>.</p><p id="5fa3">2. Clean and store tweets in a local SQLite database.</p><p id="43da">3. Tokenize the dataset and fine-tune a DistilBERT model using 🤗 <code>transformers</code> and PyTorch in Google Colab.</p><p id="e6fb">4. Push the model to the 🤗 Model Hub.</p><p id="e9df">5. Evaluate model performance on the test set and inference the model on two unseen tweets.</p><h2 id="9faf">Next steps</h2><p id="83a5">Both the <a href="https://huggingface.co/datasets/m-newhauser/senator-tweets">senator-tweets</a> dataset and <a href="https://huggingface.co/m-newhauser/distilbert-political-tweets">distilbert-political-tweets</a> model have many potential applications. A model that predicts the “partisan-ness” of senators in real-time based off of their tweets could be an interesting tool for prospective voters. A semantic search or question answering application to determine the political positions that senators take on specific issues could be similarly enlightening for voters.</p><h2 id="37bc">Final thoughts</h2><p id="2dad">I completed this project without any fancy API keys, EC2 instances, or S3 buckets. In fact, I only used two Google Colab (free edition) notebooks to create the dataset and fine-tune the model. In other words, this project is accessible! 
I hope this article inspires you to get out there and build something you’re interested in, something original, with all of the phenomenal open-source resources available to us during this “golden age” of NLP.</p><h2 id="9ae6">Project links</h2><ul><li><a href="https://github.com/m-newhauser/distilbert-senator-tweets">Github repository</a> 🐍</li><li><a href="https://huggingface.co/m-newhauser/distilbert-political-tweets">Final version of fine-tuned model</a> ⚙️</li><li><a href="https://huggingface.co/datasets/m-newhauser/senator-tweets">Senator tweets dataset</a> 💾</li><li>Streamlit app (coming soon!) 🎈</li></ul><p id="c81e"><i>If you’d like to stay up-to-date on the latest data science trends, technologies, and packages, consider becoming a Medium member. You’ll get unlimited access to articles and blogs like Towards Data Science and you’ll be supporting my writing. (I earn a small commission for each membership).</i></p><div id="f125" class="link-block">
      <a href="https://medium.com/@mary.newhauser/membership">
        <div>
          <div>
            <h2>Join Medium with my referral link - Mary Newhauser</h2>
            <div><h3>Read every story from Mary Newhauser (and thousands of other writers on Medium). Your membership fee directly supports…</h3></div>
            <div><p>medium.com</p></div>
          </div>
          <div>
            <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*BkNTxVzfBR-P37Op)"></div>
          </div>
        </div>
      </a>
    </div><h2 id="cc6a">Want to connect?</h2><ul><li>📖 Follow me on <a href="https://medium.com/@mary.newhauser">Medium</a></li><li>💌 <a href="https://medium.com/@mary.newhauser/subscribe">Subscribe</a> to get an email whenever I publish</li><li>🖌️ Check out my generative AI <a href="https://www.gptechblog.com/">blog</a></li><li>🔗 Take a look at my <a href="https://www.datascienceportfol.io/marynewhauser">portfolio</a></li><li>👩‍🏫 I’m also a data science <a href="https://www.datajump.co/">coach</a>!</li></ul><h2 id="30e9">I’ve also written</h2><div id="f3de" class="link-block">
      <a href="https://towardsdatascience.com/the-ultimate-reference-for-clean-pandas-code-413df676e63c">
        <div>
          <div>
            <h2>The ultimate reference for clean Pandas code</h2>
            <div><h3>A clean way to clean text data</h3></div>
            <div><p>towardsdatascience.com</p></div>
          </div>
          <div>
            <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/0*4g3-zF_62EoNDO4s)"></div>
          </div>
        </div>
      </a>
    </div><h2 id="8c56">References</h2><p id="b7e7">[1] Kaggle, <a href="https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge">Toxic Comment Classification Challenge</a> (2018), Kaggle</p><p id="0913">[2] S. Bollinger, <a href="https://ucsd.libguides.com/congress_twitter">Congressional Twitter Accounts</a> (2022), UC San Diego Library</p><p id="ddfb">[3] T. Augspurger, <a href="https://tomaugspurger.github.io/modern-1-intro.html">Modern Pandas (Part 1)</a> (2022), Tom Augspurger</p><p id="7125">[4] T. Peng, <a href="https://syncedreview.com/2019/06/27/the-staggering-cost-of-training-sota-ai-models/#:~:text=Google%20BERT%20%E2%80%94%20estimated%20total%20training%20cost:%20US$6,912&amp;text=Each%20pretraining%20took%204%20days,per%20hour)%20=%20US$6,912.">The Staggering Cost of Training SOTA Models</a> (2019), Synced</p><p id="65ec">[5] S. Cellat, <a href="https://ymeadows.com/en-articles/fine-tuning-transformer-based-language-models">Fine-Tuning Transformer-Based Language Models</a> (2021), Y Meadows</p><p id="6b9d">[6] V. Sanh, <a href="https://readmedium.com/distilbert-8cf3380435b5">Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT</a> (2019), Huggingface</p><p id="fb68">[7] V. Sanh, L. Debut, J. Chaumond and T. Wolf, <a href="https://arxiv.org/pdf/1910.01108.pdf">DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</a> (2020), Huggingface</p><p id="e150">[8] Senate Democrats, <a href="https://www.democrats.senate.gov/about-senate-dems/our-caucus">Our Caucus</a> (2022), Senate Democrats</p><p id="f2b8">[9] Huggingface, <a href="https://huggingface.co/course/chapter3/4?fw=pt">A full training</a> (2022), Huggingface</p><p id="b5dd">[10] Huggingface, <a href="https://huggingface.co/metrics">Metrics</a> (2022), Huggingface</p><p id="e8e3">[11] J. Brownlee, <a href="https://machinelearningmastery.com/confusion-matrix-machine-learning/#:~:text=A%20confusion%20matrix%20is%20a%20summary%20of%20prediction%20results%20on,key%20to%20the%20confusion%20matrix.">What is a Confusion Matrix in Machine Learning</a> (2020), Machine Learning Mastery</p><p id="c1eb">[12] T. Cruz, <a href="https://twitter.com/SenTedCruz/status/1501309601605734408">https://twitter.com/SenTedCruz/status/1501309601605734408</a> (2022), Twitter</p><p id="acb2">[13] E. Warren, <a href="https://twitter.com/ewarren/status/1498410222683148291">https://twitter.com/ewarren/status/1498410222683148291</a> (2022), Twitter</p><h2 id="305d">Resources</h2><ul><li><a href="https://huggingface.co/course/chapter3/4?fw=pt">Huggingface Course — Write your training loop in PyTorch</a> (Article)</li><li><a href="https://colab.research.google.com/github/huggingface/notebooks/blob/master/course/chapter3/section4.ipynb#scrollTo=WARodF9Sa6Yq">Huggingface Course — A full training</a> (Notebook)</li><li><a href="https://huggingface.co/course/chapter4/3?fw=pt">Huggingface Course — Sharing pretrained models</a> (Article)</li><li><a href="https://huggingface.co/docs/transformers/model_sharing">Huggingface — Share a model</a></li></ul><h2 id="0f06">Want to connect?</h2><p id="c074">Contact me on <a href="https://www.linkedin.com/in/mary-newhauser-02273551/">LinkedIn</a> ✍️.</p></article></body>

Fine-tuning DistilBERT on senator tweets

A guide to fine-tuning DistilBERT on the tweets of American Senators with snscrape, SQLite, and Transformers (PyTorch) on Google Colab.

Introduction

Tweets are short bits of text that can (sometimes) be packed with valuable data. In the case of United States Senators, official Twitter accounts contain their opinions on a wide range of political issues and are (generally) in good grammatical form. Kaggle’s Toxic Comment Classification Challenge famously demonstrated the power of the transformer in classifying short online comments. But can a transformer perform as well on a binary task when it is trained on a much smaller dataset of more diplomatic, veiled language?

To find out, I fine-tuned the DistilBERT transformer model on a custom dataset of all 2021 tweets from US Senators. The result is a powerful text classification model that can determine a senator’s political party based on a single tweet with 90.8% accuracy.

In this article, I take you through the data procurement and fine-tuning process, demonstrating how I:

Scrape all official Twitter accounts of US Senators with snscrape, clean them with Preprocessor and store them in a local sqlite database.
Transform tweets into a HuggingFace Dataset and fine-tune the DistilBERT base model with PyTorch.
Evaluate the resulting classification model and push it to the Huggingface model hub.

Part 1: Creating the dataset

I decided to create my own dataset for this project simply because I couldn’t find any previously curated Twitter datasets that included tweets from all senators in the most recent session of Congress over a long enough period of time. This code lives in its own notebook here.

Here’s the outline for creating the dataset:

Scrape all tweets from 2021 from individual senator Twitter accounts using snscrape.
Preprocess the tweets with preprocessor and pandas (using chaining methods).
Store the tweets in a local sqlite database that I can query later on to train the model.

Scrape (historical) tweets

Tweepy is one of the most popular and easy-to-use packages for scraping Twitter. Unfortunately, it only returns tweets posted within the last seven days. This drawback, along with the fact that we only have 100 accounts to scrape from (50 states x 2 senators each), makes it impossible to train a model that will generalize and produce accurate results (trust me, I tried it).

We need to have a well-rounded sample of tweets that captures the senators’ opinions on a broad range of social, economic, and foreign policy issues. Furthermore, to make the most up-to-date model possible, we can only use tweets from the current session of Congress, which started on January 3, 2021.

First, let’s take a look at the list of senator usernames that I created based on data gathered from the UC San Diego Library:

Next, let’s practice just scraping a single senator’s profile:
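The original code cell is not reproduced here, so the following is only a minimal sketch of how a search query for one account and date window could be assembled. The `build_query` helper, the end date, and the `from:`/`since:`/`until:` search operators are assumptions for illustration, not the notebook's actual code.

```python
from datetime import date


def build_query(username: str, since: date, until: date) -> str:
    """Build a Twitter-style search query for one account and a date window.

    Hypothetical helper: the operator syntax is an assumption about the
    search backend, not something taken from the original notebook.
    """
    return f"from:{username} since:{since.isoformat()} until:{until.isoformat()}"


# The 117th Congress started on January 3, 2021; the end date is an assumption.
query = build_query("SenTedCruz", date(2021, 1, 3), date(2022, 1, 1))
print(query)

# With snscrape installed, a string like this could be handed to
# snscrape.modules.twitter.TwitterSearchScraper(query).get_items()
# to iterate over the matching tweets.
```

Keeping the query construction in its own function makes it easy to reuse the same date window for every senator later on.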

This is a total mess! We need to translate the data from the API call to something interpretable.

Pre-process tweets

I created two simple functions to package the data up into a neat and tidy data frame.

This looks much better, but we’re still missing the full names and party affiliations for all the tweets. Next, we join the extracted tweet data with senators_usernames.csv, convert the timestamps to readable formats, and put everything in a data frame. To make the code more readable and efficient, we use modern Pandas methods, as detailed here by Tom Augspurger.

Store tweets

The code above shows my workflow for scraping, cleaning, and storing the tweets for just a single scraped Twitter profile. To gather data for all 100 senators of the 117th United States Congress, I place most of the code in a loop.

First, we must connect to the sqlite engine.

Now, we loop over all of the senator usernames and append the resulting tweet data to an empty list.

After scraping the data, we loop over the list of data frames and write each one to the SQLite database.
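As a rough illustration of this loop, here is a sketch that writes one batch of rows per senator to SQLite. It uses the standard-library `sqlite3` module and an in-memory database rather than the sqlalchemy and pandas setup the article mentions, and the table layout and placeholder rows are assumptions.

```python
import sqlite3

# Placeholder rows standing in for the per-senator data frames (not real tweets).
tweets_by_senator = [
    [("SenTedCruz", "2021-03-01", "placeholder tweet text A")],
    [
        ("ewarren", "2021-03-02", "placeholder tweet text B"),
        ("ewarren", "2021-03-03", "placeholder tweet text C"),
    ],
]

# An in-memory database keeps the sketch self-contained; a file path would persist it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS tweets (username TEXT, date TEXT, text TEXT)"
)

# Append each senator's rows to the same table.
for rows in tweets_by_senator:
    conn.executemany("INSERT INTO tweets VALUES (?, ?, ?)", rows)
conn.commit()

# Query the table back to check how many tweets were stored per account.
counts = conn.execute(
    "SELECT username, COUNT(*) FROM tweets GROUP BY username ORDER BY username"
).fetchall()
print(counts)
```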

Store tweets

I chose to store these tweets in a SQLite database for a few reasons:

I want to update the database with new tweets periodically, to keep the model up-to-date.
I’d like to store the data in a central location with options to query subsets of it for future projects.
I may try using other tokenizers in the future.
VSCode has wonderful SQLite integration and sqlalchemy makes it easy to access with pandas.

Our dataset is now complete and can be accessed here!

Part 2: Fine-tune DistilBERT

We’ve created the dataset and now it’s time to prepare the dataset for fine-tuning on our downstream task.

Transformer models are pre-trained on immense text corpora for many days and to the tune of thousands of dollars. Luckily, the process of fine-tuning allows mere mortals to harness the power of these models, while only requiring small datasets of task-specific data (i.e. our tweets dataset). (If you’re not familiar with fine-tuning I suggest you start here.)

I chose to fine-tune DistilBERT, a “smaller, faster, cheaper, lighter” version of BERT because it has a history of performing well on text classification tasks and retains 97% of the language understanding capabilities of the original BERT model. This makes it ideal to train on Google Colab, using its free tier.

Here’s the outline for fine-tuning:

Set up Google Colab.
Do some last-minute pre-processing.
Tokenize the dataset.
Specify model configuration and hyperparameters.
Train!
Evaluate the model.

The full fine-tuning notebook can be found here.

Set up Google Colab

First, we install all the necessary packages. Make sure to specify datasets==1.18.3 and transformers[sentencepiece]==4.16.2 if you want to reproduce the code in this notebook.

Mount your Google Drive and define a path to your Colab data to ensure your model saves before the runtime session times out.

Last-minute pre-processing

First, we manually upload TWEETS.db to Google Drive, connect to the database, and read in the tweets.

We clean up the tweets using the preprocessor package and assign both Independent senators, Bernie Sanders and Angus S. King, Jr., to the Democrat class because they both caucus with Democrats.

We want to confirm that we only have our two target classes in the dataset and create a label mapping to later convert them to integers.
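The two steps above (folding Independents into the Democrat class, then checking the classes and building a label mapping) can be sketched in plain Python. The `party` field name and the sample rows are assumptions; the mapping follows the one quoted later in the article, written here with integer values.

```python
# Toy rows standing in for the tweets table; only the party field matters here.
rows = [
    {"party": "Republican"},
    {"party": "Democrat"},
    {"party": "Independent"},
]

# Independent senators caucus with Democrats, so fold them into that class.
for row in rows:
    if row["party"] == "Independent":
        row["party"] = "Democrat"

# Confirm that only the two target classes remain.
classes = sorted({row["party"] for row in rows})
print(classes)

# Map each class name to an integer and store it in a new `labels` field.
label2id = {"Republican": 0, "Democrat": 1}
for row in rows:
    row["labels"] = label2id[row["party"]]

labels = [row["labels"] for row in rows]
print(labels)
```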

The DistilBERT classification model expects its targets in a column named labels, so we’ll add another column with our mapped labels.

Next, we create a Dataset from the data frame and split it into a test and train set.
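To make the split concrete, here is a small standard-library sketch of a shuffled train/test split. The 🤗 `Dataset` object offers its own `train_test_split` method, so this hand-rolled function is only illustrative; the 20% test size and the seed are assumptions, not values from the notebook.

```python
import random


def train_test_split(examples, test_size=0.2, seed=42):
    """Shuffle the examples and split them into train and test lists."""
    shuffled = list(examples)
    # A seeded generator makes the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


# Integers stand in for individual tweets.
splits = train_test_split(range(10), test_size=0.2)
print(len(splits["train"]), len(splits["test"]))
```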

Then, we cast the labels column as a class-encoded column, which tells the dataset where to find the labels for the data and automatically formats it for training.

Finally, the Dataset is ready for tokenizing!

Tokenize dataset

We start by loading the Transformers AutoTokenizer for DistilBERT.

Next, we make a list of columns to remove from the dataset upon tokenization and tokenize the tweets in the text column.

In the tokenize() function, we set padding=True to pad each tweet to the maximum length tweet in the batch. We also set truncation=True and max_length=512 to truncate each tweet to the maximum sequence length for DistilBERT.
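The effect of these settings can be illustrated without the tokenizer itself. The sketch below pads a batch of token-id lists to the longest sequence in the batch and cuts anything beyond `max_length`. The ids are arbitrary, `pad_id=0` is an assumption, and a real tokenizer also keeps the special end token when truncating, which this simple slice does not.

```python
def pad_and_truncate(batch, max_length=512, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest in the batch."""
    truncated = [ids[:max_length] for ids in batch]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # The attention mask marks real tokens with 1 and padding with 0.
    attention_mask = [
        [1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Arbitrary token ids; a small max_length makes the truncation visible.
batch = [[101, 7592, 102], [101, 7592, 2088, 999, 102]]
encoded = pad_and_truncate(batch, max_length=4)
print(encoded)
```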

After tokenization is complete, we prepare the dataset to be fed to the model by setting its format to "torch" and creating data loads to correctly reshape the dataset.

Now that the encoded dataset contains all the necessary inputs for DistilBERT, we instantiate a DataLoader for each split of the dataset to feed it to the model.

Load the model

Before training the model, we need to load it from the HuggingFace Model Hub and specify its configuration and hyperparameters.

Besides providing the model checkpoint, the only other configuration we need to specify is num_labels. I programmatically define the number of labels by setting it equal to num_classes in the Dataset.

Define hyperparameters, optimizer, and learning rate scheduler

The learning_rate and num_epochs are the only hyperparameters we will define for this task.

Epochs are the number of full passes the model makes over the training dataset. In the past, I’ve had success with num_epochs=5.
Learning rate specifies how much the model parameters are adjusted at each optimization step. It’s typical to start with learning_rate=5e-5 and adjust it in subsequent training sessions, if necessary.

Optimization is the process in which model parameters are adjusted after each training batch to iteratively reduce model error. The optimizer is the specific algorithm used in the process. For our task, we follow HuggingFace’s lead and use the AdamW() optimizer, to which we pass our selected learning_rate.
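The heading also mentions a learning rate scheduler. As a hedged sketch, assuming a linear decay to zero with no warmup (the kind of schedule the Hugging Face course uses, as far as I recall), the learning rate at each step could be computed as follows. The number of batches per epoch is a made-up value.

```python
def linear_lr(step, num_training_steps, base_lr=5e-5):
    """Learning rate after `step` updates under a linear decay to zero."""
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / num_training_steps


num_epochs = 5
batches_per_epoch = 100  # hypothetical; in practice this is len(train_dataloader)
num_training_steps = num_epochs * batches_per_epoch

# Print the learning rate at the start, the midpoint, and the end of training.
for step in (0, num_training_steps // 2, num_training_steps):
    print(step, linear_lr(step, num_training_steps))
```

The total number of steps is the number of epochs times the number of batches per epoch, which is why the scheduler is created after the data loaders.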

Next, we move the model to the GPU, since we are using Google Colab.

Train model

Finally, we are ready to train! I copied this code, and the following evaluation loop code, from Chapter 3 of HuggingFace’s Course. I highly recommend the course for both learning and as a reference.

Save model

I usually step away from the computer while my models are training, so I always execute a code cell that saves the model locally in case the notebook times out.

Because Google Drive has limited storage space and model files tend to be quite large, pushing the model to Huggingface’s Model Hub is a great option to both store the model and share it with others. There are several ways to upload a model to the Hub, but I found the easiest and quickest way was to use huggingface_hub within the training notebook.

We start by importing the module and installing git-lfs, an open-source Git extension for versioning large files.

Next, we log in to our Huggingface account (create an account here, if you haven’t already).

The last step is to configure your Git settings and push the model.

Our model is saved and can be accessed here!

Evaluate model

Now that we’ve saved the model, we need to assess its performance. The 🤗 datasets library offers several NLP performance metrics that we can easily import to evaluate the model.

Specifying metric = load_metric("glue", "mrpc") will instantiate a metric object from the HuggingFace metrics repository that calculates the accuracy and F1 score of the model.

If you need a refresher on the confusion matrix and binary performance metrics, I highly recommend checking out this article.
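For readers who want to see what those two numbers mean, here is a small sketch that computes accuracy and F1 by hand from the confusion-matrix counts. It is not the 🤗 metric implementation, and the predictions and references are made-up values.

```python
def accuracy_and_f1(predictions, references, positive=1):
    """Compute accuracy and F1 for binary labels from confusion-matrix counts."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    correct = sum(p == r for p, r in pairs)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1}


# Made-up labels: 1 = Democrat, 0 = Republican.
preds = [1, 0, 1, 1, 0, 0]
refs = [1, 0, 0, 1, 1, 0]
scores = accuracy_and_f1(preds, refs)
print(scores)
```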

Both the accuracy and F1 score are looking good! Let’s run inference on some more recent tweets that weren’t in the training or test dataset.

Here’s a tweet from Texas Senator Ted Cruz, a well-known Republican:

After sending the tweet through the tokenizer and feeding it to the model, we are given the logits.

Because the raw outputs of the model, called logits, are hard to interpret, we apply a softmax function, which re-scales the values into probabilities in the range [0, 1] that sum to 1. Because our model is binary, we get a probability for each of the two classes. Recall that during data preprocessing, we assigned the tweets to numeric classes: {'Republican': '0', 'Democrat': '1'}. This means that the first value in the resulting tensor is the probability that the tweet belongs to the Republican class. Below, you can see that the model predicts Senator Cruz’s tweet as Republican with a probability of 99.999%!
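The softmax step can be written in a few lines of plain Python. This is a sketch of the function only: the logits below are made up rather than the model's real output, and in the notebook the same operation would more likely be applied to a PyTorch tensor.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum first avoids overflow in exp() for large logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [Republican, Democrat].
probs = softmax([2.0, -1.0])
print(probs)
```

With two classes, the first probability is the model's confidence in the Republican class and the second is its confidence in the Democrat class.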

Now let’s try a tweet from Massachusetts Senator Elizabeth Warren, a Democrat:

The model assigns Senator Warren’s tweet to the Democrat class with a probability of 100%!

Conclusion

In this article, I illustrate how I created a custom dataset of US Senator tweets and used it to fine-tune an accurate DistilBERT text classification model that predicts the political party affiliation of a senator based on a tweet they have posted.

Here’s a brief summary of all the steps involved:

1. Scrape historical tweets from Twitter using snscrape.

2. Clean and store tweets in a local SQLite database.

3. Tokenize the dataset and fine-tune a DistilBERT model using 🤗 transformers and PyTorch in Google Colab.

4. Push the model to the 🤗 Model Hub.

5. Evaluate model performance on the test set and inference the model on two unseen tweets.

Next steps

Both the senator-tweets dataset and distilbert-political-tweets model have many potential applications. A model that predicts the “partisan-ness” of senators in real-time based off of their tweets could be an interesting tool for prospective voters. A semantic search or question answering application to determine the political positions that senators take on specific issues could be similarly enlightening for voters.

Final thoughts

I completed this project without any fancy API keys, EC2 instances, or S3 buckets. In fact, I only used two Google Colab (free edition) notebooks to create the dataset and fine-tune the model. In other words, this project is accessible! I hope this article inspires you to get out there and build something you’re interested in, something original, with all of the phenomenal open-source resources available to us during this “golden age” of NLP.

Project links

Github repository 🐍
Final version of fine-tuned model ⚙️
Senator tweets dataset 💾
Streamlit app (coming soon!) 🎈

If you’d like to stay up-to-date on the latest data science trends, technologies, and packages, consider becoming a Medium member. You’ll get unlimited access to articles and blogs like Towards Data Science and you’ll be supporting my writing. (I earn a small commission for each membership).