How to Scrape Tweets From Twitter
An up-to-date guide on scraping tweets from Twitter using Twitter’s API
Overview
Originally I wrote an article back in 2020 that covered how to scrape tweets from Twitter. A lot has changed since then: major changes to Twitter’s public search API broke many open source scrapers, Twitter released Twitter API v2 back in November 2021, and more recently most of the free scraping APIs were removed. This follow-up guide provides updated ways of scraping tweets and answers common questions about using Twitter API v2 to scrape data.
This guide is meant to be a quick, straightforward introduction to scraping tweets from Twitter using Twitter API v2. I’ll cover setup and two common use cases for the API.
Why Should I Scrape Tweets?
Social media can provide insights that normally would not be available via traditional methods such as surveys, census data, or studies, thanks to its access to people’s unfiltered opinions. This is due to the nature of how social media is used. You’re able to get answers to questions that normally wouldn’t be so easily accessible at such a scale.
Setup
Before we can get started, we’ll need to set up our tools first!
Setting up Tweepy
We’re using Python to interact with Twitter’s official API. Luckily there’s a Python library called Tweepy that makes this process as seamless as possible. However, to use the official API you’ll also need to set up a Twitter Developer account. We’ll go over that first then hop into setting up Tweepy.
Setup Twitter Developer
Before you can move forward it’s important to note that you will need to create a Twitter account or use your current one!
To set up your Twitter Developer account, head over to the Twitter Developer Portal Projects & Apps page, where you’ll be prompted about the app you’re setting up. It will bring you to a page where you must fill out information about the app you’re hoping to build.
Eventually, you’ll be asked to accept the terms and conditions, which will trigger a verification email.
After that, the application is sent off for review, and at this point, it’s just a waiting game. You may be requested to fill out more information regarding your app and use case.
Approval can take a couple of days up to potentially a week. It will take time, and developer support should reach out to you if there are any questions about the application you submitted.
Once approved, you'll need to grab your tokens in order to interact with the API, but first you’ll need to set up a project and app to get them. Within the Developer Portal, navigate to Projects & Apps and create a new project; this will lead you through a prompt detailing your use case.
You’ll then need to add an existing app you have or create a new app for your project.
After your app is created you should then be able to finally get your keys and tokens!
If you already have a project and app, go to Developer Portal > Projects & Apps > Overview > {App Name} > Keys and Tokens, and regenerate the keys if you don’t have access to them. For this article, you’ll need to generate and use a Bearer Token.
Now that you’ve got your tokens ready we can move on to setting up Tweepy!
Scraping with Tweepy
Setup Tweepy
Tweepy is a Python library for accessing the Twitter API. There are several different levels of API access that Tweepy supports, as shown here, but those are for very specific use cases. Tweepy can accomplish various tasks beyond scraping tweets; however, this article will focus only on using Twitter’s API to scrape data.
After having grabbed your Bearer Token, working with Tweepy from this point forward is pretty straightforward.
Tweepy is available for Python versions 3.7 and later. This article won’t cover installing Python, as that has been covered extensively and is a Google search away.
Installing Tweepy itself is a single pip command:
pip install tweepy
Also note that I’ll be using the Pandas library for storing and modifying tweet data.
Setting up Tweepy Credentials
import tweepy
bearer_token = "XXXXXXXXX"
client = tweepy.Client(bearer_token)
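Hard-coding the token works for a quick script, but a common pattern is to read it from an environment variable so it never lands in version control. A minimal sketch (the helper name and the TWITTER_BEARER_TOKEN variable are just conventions I’ve chosen here, not part of Tweepy):

```python
import os

def load_bearer_token(var_name="TWITTER_BEARER_TOKEN"):
    """Read the Bearer Token from an environment variable."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return token

# Then construct the client with it:
# client = tweepy.Client(load_bearer_token())
```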
Scraping a specific Twitter user’s Tweets:
import tweepy
import pandas as pd

# Client from the credentials setup above
bearer_token = "XXXXXXXXX"
client = tweepy.Client(bearer_token)

# Input username to scrape tweets from and name the csv file
username = 'BillGates'
count = 10

try:
    # Grab the user id from the username
    user_id = client.get_user(username=username).data.id

    # Creation of query method using parameters
    tweets = tweepy.Paginator(client.get_users_tweets, user_id,
                              tweet_fields=["author_id", "created_at", "lang", "public_metrics"],
                              expansions=["author_id"], max_results=100).flatten(limit=count)

    # Pulling information from the tweets generator
    tweets_list = [[tweet.created_at, tweet.id, tweet.text,
                    tweet.public_metrics["retweet_count"],
                    tweet.public_metrics["like_count"]] for tweet in tweets]

    # Creation of dataframe from tweets list
    tweets_df = pd.DataFrame(tweets_list, columns=["Created At", "Tweet Id", "Text", "Retweet Count", "Like Count"])

    # Converting dataframe to CSV
    tweets_df.to_csv("{}-tweets.csv".format(username), sep=",", index=False)
    print("Completed Scrape!")
except BaseException as e:
    print("failed on_status,", str(e))
Scraping Tweets Using Keyword Search:
import tweepy
import pandas as pd

# Client from the credentials setup above
bearer_token = "XXXXXXXXX"
client = tweepy.Client(bearer_token)

# Input search query to scrape tweets and name the csv file
keyword_search = 'Dogs'
count = 10

try:
    # Creation of query method using parameters
    tweets = tweepy.Paginator(client.search_recent_tweets, keyword_search,
                              tweet_fields=["author_id", "created_at", "lang", "public_metrics"],
                              user_fields=["username"]).flatten(limit=count)

    # Pulling information from the tweets generator
    tweets_list = [[tweet.created_at, tweet.id, tweet.text,
                    tweet.public_metrics["retweet_count"],
                    tweet.public_metrics["like_count"]] for tweet in tweets]

    # Creation of dataframe from tweets list
    tweets_df = pd.DataFrame(tweets_list, columns=["Created At", "Tweet Id", "Text", "Retweet Count", "Like Count"])

    # Converting dataframe to CSV
    tweets_df.to_csv("{}-tweets.csv".format(keyword_search), sep=",", index=False)
    print("Completed Scrape!")
except BaseException as e:
    print("failed on_status,", str(e))
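A bare keyword like 'Dogs' also matches retweets and tweets in every language. The v2 search endpoint supports query operators such as -is:retweet and lang: (documented in Twitter’s query-building guide), so you can tighten the query string before handing it to search_recent_tweets. A small sketch, where build_query is a hypothetical helper of my own, not a Tweepy function:

```python
def build_query(keyword, exclude_retweets=True, lang=None):
    """Compose a v2 search query string from a keyword and common filters."""
    parts = [keyword]
    if exclude_retweets:
        parts.append("-is:retweet")   # drop retweets from the results
    if lang:
        parts.append(f"lang:{lang}")  # restrict to one language
    return " ".join(parts)

query = build_query("Dogs", lang="en")
print(query)  # Dogs -is:retweet lang:en
```

The resulting string can then be passed as the query argument in the Paginator call above.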
How Can I Access Other Tweet Information?
For the most part, the code samples above give you access to the tweet information people most commonly use. However, you may want other data that tweets make available.
By default, the tweet object returned by the V2 API will only provide id and text fields. Everything else must be specified via either a field parameter or an expansion. If you’d like to pull other tweet information available as shown in the data dictionary here, you’ll need to include that in the tweet_fields section when making the API call.
Tweet Fields
For example, if I wanted to pull the language of a tweet, I can modify tweet_fields to include that as shown below.
tweets = tweepy.Paginator(client.search_recent_tweets, "dogs", tweet_fields=["lang"], user_fields=["username"]).flatten(limit=count)
Expansions
Not all tweet data is available in the tweet fields. There is additional information available through tweet expansions.
Similar to tweet_fields, you can add expansions to the Paginator call in order to pull more information in.
tweets = tweepy.Paginator(client.search_recent_tweets, "dogs", expansions=["author_id"], max_results=100).flatten(limit=count)
This pulls in additional related objects (here, the tweet authors) for you to query.
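One thing to keep in mind: expanded objects don’t land on the tweet itself. They arrive in a separate includes section of the response, keyed by object type, so you typically build a lookup from id to the expanded object. The sketch below uses plain dicts shaped like a v2 response to illustrate the idea; with Tweepy you would read the includes attribute of a client.search_recent_tweets response, since Paginator(...).flatten() yields only the tweets. build_user_lookup is a hypothetical helper of my own:

```python
def build_user_lookup(includes):
    """Map author_id -> username from a response's expanded user objects."""
    return {user["id"]: user["username"] for user in includes.get("users", [])}

# Illustrative payload shaped like the "includes" section of a v2 response
includes = {"users": [{"id": "50393960", "username": "BillGates"}]}
lookup = build_user_lookup(includes)
print(lookup["50393960"])  # BillGates
```

With a lookup like this, each tweet's author_id can be resolved to a username while iterating over results.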
FAQs
Is This Method Legal for Scraping Tweets?
Yes, we are using Tweepy which leverages Twitter’s official API for searching tweets and pulling that data. This is supported by Twitter’s TOS as shown by the following excerpt pulled from Twitter’s Terms of Service as of August 10th, 2023:
“… search or attempt to access or search the Services by any means (automated or otherwise) other than through our currently available, published interfaces that are provided by us (and only pursuant to the applicable terms and conditions) …”
How Many Tweets Can I Scrape?
With the Basic tier, you can scrape up to 10,000 tweets a month for $100/month. If you need more than that, the next tier is $5,000/month for scraping up to 1 million tweets. If neither of these is sufficient for your needs, you can request Enterprise-level access.
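Based on the tiers above, it’s worth noting that the per-tweet cost drops at the higher tier, which matters when estimating cost against your expected volume:

```python
# Per-tweet cost at each tier mentioned above
basic_cost_per_tweet = 100 / 10_000     # $100/month for 10,000 tweets
pro_cost_per_tweet = 5_000 / 1_000_000  # $5,000/month for 1,000,000 tweets

print(basic_cost_per_tweet)  # 0.01  -> one cent per tweet
print(pro_cost_per_tweet)    # 0.005 -> half a cent per tweet
```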
How Can I Scrape Tweets Without Coding?
There are a couple of solutions, such as Scrape Hero, Stevesie, or web scraping automation tools like Octoparse, though each requires learning the tool first. However, with Twitter updating their API access, many of these tools have been impacted in how they pull data and how much they can scrape.
References
GitHub containing this tutorial’s scraping files: https://github.com/MartinKBeck/TwitterScraper/tree/master/ScraperV4
Tweepy documentation: https://www.tweepy.org/
Twitter API v2 with Tweepy in Python Guide: https://dev.to/twitterdev/a-comprehensive-guide-for-using-the-twitter-api-v2-using-tweepy-in-python-15d9