How to Deploy Python Scrapy Spiders for Free On Cloud
Tips to deploy and periodically run your Scrapy crawler on the cloud for free
Running Scrapy spiders on your local machine is convenient during the early stages of development, but eventually we need to deploy our spiders and run them in the cloud periodically.
While looking for free options to run periodic crawling jobs, I came across a few services such as Zyte Scrapy Cloud, where you can easily deploy, run and scale your web crawlers. However, if you intend to run multiple spiders periodically, you would have to start paying.
Goal
By the end of this article, you should be able to deploy your Scrapy project and schedule a periodic job to crawl data from your target site for free.
Note that this might not be an ideal solution for large-scale crawler applications where you would need to manage hundreds of spiders.
In that case, a paid service that scales and manages your spiders (with distributed crawling) should be part of your considerations.
Prerequisite
To follow along, you should have a Scrapy project that is ready to be deployed, and the scrapy crawl examplespider command should work.
Your project root directory should contain at least one of these files: requirements.txt, Pipfile or setup.py.
If none of these files is present in your app’s root directory, the Python buildpack will fail to identify your application correctly. (source)
Enter Heroku
1. Heroku account registration
First and foremost, you need to register an account with Heroku. It’s free!
2. Installing Heroku CLI
After your account registration, let’s use Heroku CLI to create and manage our project. You may check out the installation steps for other OS here.
# For macOS:
$ brew tap heroku/brew && brew install heroku
3. Login
To log in using Heroku’s CLI, simply cd to your project folder and run heroku login.
$ cd scrapy_example_project/
$ heroku login
heroku: Press any key to open up the browser to login or q to exit:
Opening browser to https://cli-auth.heroku.com/auth/cli/browser/xxxx-xxxx-xxxx-xxxx-xxxx?requestor=xxxx.xxxx.xxxx
Logging in... done
Logged in as [email protected]
...
4. Git
At this stage, you should already have set up your project as a Git repository with git init and git commit. The next steps are to create a Heroku application, add it as a remote to your repository and push your code to it.
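# i. Create a Heroku application (app names may only contain lowercase letters, digits and dashes):
$ heroku apps:create scrapy-example-project
# ii. Add a remote to your local repository:
$ heroku git:remote -a scrapy-example-project
# iii. Deploy your code (push "main" instead if that is your branch name):
$ git push heroku master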
Once we’re done here, let’s open the Heroku dashboard page for the newly created Heroku application.
Optional: Set up free tier managed PostgreSQL service on Heroku
You would only need this step if your Scrapy project has a pipeline that stores scraped items in a PostgreSQL database.
Note that the free tier is limited to 10,000 rows at the time of writing.
This step is fairly simple: go to the ‘Resources’ tab on your Heroku dashboard, look for ‘Heroku Postgres’ and select the free tier (or whichever you deem fit).
To look for your database credentials, simply click on your ‘Heroku Postgres’ add-on → select ‘Settings’ → ‘View Credentials’
Finally, configure your Scrapy project’s connection to your database accordingly.
# settings.py
import os

# This is just an example; you might be using a different variable name
DATABASE_CONNECTION_STRING = '{drivername}://{user}:{password}@{host}:{port}/{db_name}'.format(
    drivername='postgresql',
    user=os.environ.get('PG_USERNAME', 'postgres'),
    password=os.environ.get('PG_PASSWORD'),
    host=os.environ.get('PG_HOST', 'localhost'),
    port=os.environ.get('PG_PORT', '5432'),
    db_name=os.environ.get('PG_DATABASE', 'burplist'),
)
# Or alternatively:
DATABASE_CONNECTION_STRING = 'postgres://xxxx:[email protected]1.amazonaws.com:5432/xxxxxx'
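Instead of copying credentials by hand, you could also read the DATABASE_URL config var that the Heroku Postgres add-on sets for your app. The snippet below is only a sketch under a few assumptions: that your pipeline connects through SQLAlchemy (which, from version 1.4 onwards, expects the postgresql:// scheme while Heroku's URL starts with postgres://), and that a local fallback database exists. The helper name normalize_database_url and the fallback URL are made up for illustration.

```python
# settings.py
import os


def normalize_database_url(url):
    """Rewrite a 'postgres://' scheme to 'postgresql://' and leave other URLs alone."""
    prefix = 'postgres://'
    if url.startswith(prefix):
        return 'postgresql://' + url[len(prefix):]
    return url


# DATABASE_URL is provided by the Heroku Postgres add-on. The fallback is a
# hypothetical local database, used only when the variable is not set.
DATABASE_CONNECTION_STRING = normalize_database_url(
    os.environ.get('DATABASE_URL', 'postgresql://postgres@localhost:5432/burplist')
)
```

With this in place the same settings file works locally and on Heroku, and no credentials end up in your repository.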
5. End Results
By now, the activity feed on your Heroku dashboard should list your application, with the latest activity saying that your app has been deployed.
If you run heroku run scrapy crawl examplespider in your local terminal, you should see it attempt to run the crawler on your Heroku server.
Heroku Scheduler
This section of the article shows you how you can run your crawlers/spiders periodically.
Though Heroku offers several different schedulers that could run your application periodically, I personally prefer ‘Heroku Scheduler’ as it has a free tier and it is super simple to use.
To use the free tier of this add-on, Heroku requires you to add a payment method to your account.
1. Configuration
Inside your newly added ‘Heroku Scheduler’ add-on, simply select ‘Add Job’ on the top right corner and you should see the screen as shown in the picture beside.
To run the scrapy crawl examplespider command periodically, enter it as the job's command, select a time interval and save the job.
2. How do I run all my spiders in one single command?
Locally:
$ scrapy list | xargs -n 1 scrapy crawl
On Heroku from your local terminal:
$ heroku run scrapy list | xargs -n 1 heroku run scrapy crawl
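If you would rather have a single command that you can paste into a Heroku Scheduler job, the same pipeline can be sketched as a small Python script. This is an illustrative example rather than a Scrapy feature: the file name run_all_spiders.py and both function names are hypothetical, and it assumes the script is run from the project root (next to scrapy.cfg) so that scrapy list can find your spiders.

```python
# run_all_spiders.py (hypothetical file name, placed next to scrapy.cfg)
import subprocess


def parse_spider_names(scrapy_list_output):
    """Turn the output of `scrapy list` into a list of spider names."""
    return [line.strip() for line in scrapy_list_output.splitlines() if line.strip()]


def crawl_all(run=subprocess.run):
    """Run `scrapy crawl <name>` for every spider, one after another."""
    listing = run(['scrapy', 'list'], capture_output=True, text=True, check=True)
    names = parse_spider_names(listing.stdout)
    for name in names:
        # check=False: a failing spider should not stop the remaining ones,
        # which mirrors how xargs keeps going after a failed command.
        run(['scrapy', 'crawl', name], check=False)
    return names
```

To use it, add a call to crawl_all() at the bottom of the file and run python run_all_spiders.py, either locally or through heroku run. The run parameter only exists so the function can be exercised without actually starting a crawl.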
3. How do I schedule a weekly, monthly job?
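Heroku Scheduler only offers fixed intervals (every 10 minutes, every hour or every day), so configure it to run a bash script once a day at a fixed time. The script checks whether today is the intended day of the week (or day of the month) and only then runs your actual scrapy crawl examplespider commands.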
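The day check itself is only a few lines. The sketch below does it in Python rather than bash, so that it can live inside the Scrapy project; treat it as one possible shape, not the only one. The file name run_weekly.py, the function names and the choice of Monday are all assumptions made for this example.

```python
# run_weekly.py (hypothetical file name)
import datetime
import subprocess

RUN_ON_WEEKDAY = 0  # assumption: Monday (date.weekday() gives Monday=0 ... Sunday=6)


def should_run(today, weekday=RUN_ON_WEEKDAY):
    """Return True when `today` falls on the chosen day of the week."""
    return today.weekday() == weekday


def main(today=None, run=subprocess.run):
    """Crawl only on the scheduled weekday; return whether a crawl was started."""
    today = today or datetime.date.today()
    if not should_run(today):
        print(f'{today}: not the scheduled day, skipping crawl')
        return False
    run(['scrapy', 'crawl', 'examplespider'], check=True)
    return True
```

Add a call to main() at the bottom of the file and set the daily Scheduler job's command to python run_weekly.py. For a monthly job, the same idea applies with a check such as today.day == 1 in place of the weekday comparison.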