Anna Geller

Summary

The website content outlines how to set up scheduled data pipelines using Prefect and GitHub Actions within five minutes, providing a guide for running serverless workflows in Python.

Abstract

The article discusses the importance of scheduling in data platforms and introduces a streamlined approach to scheduling serverless workflows using Prefect and GitHub Actions. It provides a step-by-step guide to set up Prefect Cloud, configure a GitHub repository, and manage workflow triggers and secrets. The process involves creating an API key, adding repository secrets, and using a GitHub template for workflow scheduling. The article also explains how Prefect recognizes triggers from various sources, distinguishes between manual and scheduled runs, and allows for custom package dependencies and running multiple flows simultaneously. While the method is suitable for getting started, the article suggests considering other serverless options like AWS ECS integration or AWS Lambda for more critical deployments and provides resources for Azure and GCP users, as well as Prefect's documentation and community support channels for further assistance.

Opinions

  • The author emphasizes the ease and speed of setting up scheduled workflows with Prefect and GitHub Actions.
  • Prefect is presented as a versatile tool that can integrate with various schedulers and APIs, enhancing workflow observability, reliability, and maintainability.
  • The article suggests that while GitHub Actions is a good starting point for scheduling, other serverless options may be more suitable for mission-critical tasks.
  • The author advocates for the use of Prefect's features such as retries, caching, notifications, and secure configuration of building blocks to improve workflow management.
  • Community engagement and support are highlighted as important resources for users getting started with Prefect.

Scheduled Data Pipelines in 5 Minutes with Prefect and GitHub Actions

The easiest way to get started with scheduled serverless workflows built in Python

Prefect as a bot in space: scheduling, coordinating, and observing dataflow in the galaxy

Scheduling is a critical component of any data platform. Whether you are running nightly ETL jobs, Sunday-night maintenance scripts, or high-velocity workflows triggered every couple of minutes, some data workloads need to be executed at the right time. This holds true even when moving to (near) real-time data ingestion. Because of its importance, scheduling has become table stakes. There are countless enterprise tools and open-source frameworks that let you run work on a schedule, but they are usually difficult to use and maintain. In contrast, Prefect flows can be triggered from anywhere. If you are not ready to fully migrate to scheduling your workflows using Prefect deployments, you don’t have to.

In this post, we’ll demonstrate the easiest way to run scheduled serverless workflows. Using the combination of Prefect and GitHub Actions, you’ll be able to schedule your first Python-based workflows running in the cloud in under 5 minutes.

Getting started with scheduled workflows

This demo follows a top-down approach. First, we’ll build something (just five minutes, as promised!). Then, we’ll explain how it works under the hood.

Prefect Cloud setup

To get started, sign up for the free tier of Prefect Cloud. Once logged in, create a workspace and an API key.

Creating an API key in Prefect Cloud — image by author

GitHub repository setup

Use the following GitHub template to create your own GitHub repository.

Adding secrets

Add the previously created API key and the name of your workspace as repository secrets, as shown in this image:

Configuring GitHub Actions Secret — image by author
  1. The PREFECT_API_KEY is the API key you created before (should start with pnu_).
  2. The PREFECT_WORKSPACE combines your account and workspace names in the form account/workspace, e.g. annaprefect/development. You can also retrieve it from your workspace settings:
Retrieving workspace name in Prefect Cloud — image by author
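
Before saving the two secrets, it can help to sanity-check their shape. The snippet below is a minimal sketch, not part of the repository template: check_prefect_secrets is a hypothetical helper, and the rules it applies (a pnu_ prefix and an account/workspace pair) are simply the ones stated in the list above.

```python
import os


def check_prefect_secrets(env=None):
    """Return a list of problems found with the two secrets (empty if none).

    Hypothetical helper: it only checks the shape described in this post,
    it does not contact Prefect Cloud.
    """
    env = os.environ if env is None else env
    problems = []

    api_key = env.get("PREFECT_API_KEY", "")
    if not api_key:
        problems.append("PREFECT_API_KEY is not set")
    elif not api_key.startswith("pnu_"):
        problems.append("PREFECT_API_KEY does not start with 'pnu_'")

    # The workspace secret is expected to look like "account/workspace".
    parts = env.get("PREFECT_WORKSPACE", "").split("/")
    if len(parts) != 2 or not all(parts):
        problems.append("PREFECT_WORKSPACE should look like 'account/workspace'")

    return problems
```

Calling check_prefect_secrets() with no arguments inspects the current environment, so the same function could be run locally after exporting both variables.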

That’s it, you’re all set! From now on, all flows in the flows directory will run on schedule.

How can I trigger the workflow manually?

To manually trigger any workflow, go to Actions (#1), select the “Run flow” workflow (#2), then choose the flow you want to run from the dropdown menu and click on “Run workflow” (#3):

Running GitHub Actions workflow — image by author

The image below shows that a run of the healthcheck.py flow has been triggered.

Observing GitHub Actions workflow — image by author

You can see the same workflow run in the Prefect Cloud UI:

Flow run page in Prefect Cloud — image by author

What happened under the hood?

How did Prefect “know” that GitHub Actions triggered the flow run?

Prefect does not need to know anything about the scheduler. When the flow code runs, it reports the run to the API configured in your Prefect settings. As long as those settings point to the API URL of your Prefect Cloud workspace, you can schedule your flows from GitHub Actions, a legacy scheduler, serverless functions, or even from your company’s internal APIs. Prefect will make those workflows observable and improve their reliability and maintainability with retries, caching, notifications, and secure configuration of your building blocks.
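
To make "settings that point to your workspace" more concrete, here is a small sketch in Python. It assumes the PREFECT_API_URL and PREFECT_API_KEY environment variables are the settings in question and that the Prefect Cloud API URL follows the accounts/.../workspaces/... pattern shown in the code. The helper names and the IDs in the usage note are placeholders, not values from the template.

```python
import os

# Assumed base URL of the Prefect Cloud API.
CLOUD_API_BASE = "https://api.prefect.cloud/api"


def build_cloud_api_url(account_id: str, workspace_id: str) -> str:
    """Assemble a workspace API URL from (placeholder) account and workspace IDs."""
    return f"{CLOUD_API_BASE}/accounts/{account_id}/workspaces/{workspace_id}"


def reports_to_prefect_cloud(env=None) -> bool:
    """Check whether the environment looks configured for a Cloud workspace.

    This only inspects the two variables; it does not call the API.
    """
    env = os.environ if env is None else env
    api_url = env.get("PREFECT_API_URL", "")
    has_key = bool(env.get("PREFECT_API_KEY"))
    return api_url.startswith(CLOUD_API_BASE + "/accounts/") and has_key
```

For example, build_cloud_api_url("my-account-id", "my-workspace-id") produces a URL that could be exported as PREFECT_API_URL on whichever machine or runner executes the flow.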

Scheduled vs. manually triggered runs — how to tell the difference?

The image below shows how you can distinguish between manually triggered workflows and those that ran on schedule.

Manual vs. scheduled GitHub Actions workflow — image by author
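
The same distinction can also be made from inside the running code. GitHub Actions exposes the triggering event in the GITHUB_EVENT_NAME environment variable, which is set to schedule for cron-triggered runs and workflow_dispatch for manual ones. The function below is a minimal sketch of that check; describe_trigger is a hypothetical name and is not part of the repository template.

```python
import os


def describe_trigger(env=None) -> str:
    """Classify how the current GitHub Actions run was started."""
    env = os.environ if env is None else env
    event = env.get("GITHUB_EVENT_NAME")

    if event is None:
        # The variable is absent, e.g. when running on a laptop.
        return "not running in GitHub Actions"
    if event == "schedule":
        return "scheduled"
    if event == "workflow_dispatch":
        return "manual"
    return f"other ({event})"
```

A flow could log this value at startup so that the trigger type is also visible in the flow run logs.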

How can I modify the schedule of each workflow?

The repository template contains a separate YAML file for each scheduled Prefect flow. To modify the schedule, you can adjust the CRON string:

name: healthcheck
on:
  schedule:
    - cron: '42 * * * *'
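
The CRON string above has five fields (minute, hour, day of month, month, day of week), so '42 * * * *' means "at minute 42 of every hour". GitHub Actions evaluates schedules in UTC. As a rough illustration, the sketch below computes the next matching time for this one pattern only, a fixed minute with every other field left as a wildcard. It is not a general cron parser, and next_hourly_run is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def next_hourly_run(minute: int, now: datetime) -> datetime:
    """Next time matching '<minute> * * * *' that is strictly after `now`."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's slot has already passed, so move to the next hour.
        candidate += timedelta(hours=1)
    return candidate


# Example: at 10:50 UTC the next run of '42 * * * *' is at 11:42 UTC.
now = datetime(2023, 1, 1, 10, 50, tzinfo=timezone.utc)
upcoming = next_hourly_run(42, now)
```

Changing the first field to 0, for instance, would move the run to the top of every hour.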

How can I add new flows?

Add your flow to the flows directory and add a new GitHub Actions workflow (akin to healthcheck.yaml) to configure your schedule for that new flow. Once you push the changes to the default branch, you’re all set.

What if I have custom package dependencies?

List third-party packages in requirements.txt and keep your own modules in the repository code. Then add the pip install . command to your scheduled workflows as shown here.

How can I run all flows at once?

To trigger all flows from the flows directory, you can use this GitHub Actions workflow. It will:

  • detect all flows in the flows directory, and
  • start runs for all those flows in parallel.
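
The two steps above can be approximated locally in plain Python. The sketch below is not the GitHub Actions workflow from the template. It is a rough analogue that assumes every .py file in the flows directory runs its flow when executed as a script, and the function names are hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def find_flow_scripts(flows_dir="flows"):
    """Step 1: detect all flow scripts in the directory."""
    return sorted(
        p for p in Path(flows_dir).glob("*.py") if p.name != "__init__.py"
    )


def run_script(path):
    """Run one script in its own Python process and return (name, exit code)."""
    result = subprocess.run(
        [sys.executable, str(path)], capture_output=True, text=True
    )
    return path.name, result.returncode


def run_all_flows(flows_dir="flows"):
    """Step 2: start all detected scripts in parallel, one thread per script."""
    scripts = find_flow_scripts(flows_dir)
    if not scripts:
        return {}
    with ThreadPoolExecutor(max_workers=len(scripts)) as pool:
        return dict(pool.map(run_script, scripts))
```

run_all_flows() returns a dictionary that maps each file name to its exit code, so a non-zero value points to a script that failed.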

What are the limitations of scheduling flows using GitHub Actions?

The approach demonstrated here is best suited to get started. For mission-critical scheduled serverless deployments, consider the alternatives mentioned below.

Other serverless options to schedule Prefect flows

If you are on AWS, there are two convenient ways to get started:

1) Prefect AWS ECS integration with the ECSTask infrastructure block — the following repository template provides automation to deploy a Prefect agent to ECS and run serverless flow run containers on AWS.

2) AWS Lambda running Prefect flows.

If you are on Azure or GCP, the DataflowOps Prefect Discourse category provides plenty of resources to help you get started.

Lastly, check our documentation, including the catalog of Prefect Recipes.

Next steps

Thanks for reading. If you have any questions about getting started with Prefect and scheduled workflows, you can reach me via Prefect Community Slack or Prefect Discourse.

Happy Engineering!
