Building a Data Warehouse in the Cloud using BigQuery [Part 1]. Sync your AWS S3 data with Google Storage in 20 minutes.

Have you heard about “Big Data” analytics, ETL and Data Warehouses and been confused about how to use them?

Have you ever wondered how to keep your “AWS S3” files synchronized with “Google Cloud Storage” in the background?

In this 20-minute tutorial, we’ll build a Node.js Lambda function in AWS that transfers data between accounts. The function listens for new files added to the source S3 bucket and copies them to Google Storage.

Part 2 of this series is a tutorial on building a Data Warehouse in the Cloud using Google BigQuery. It is a step-by-step guide to building a Big Data Warehouse solution, including load monitoring and error handling.

Project layout:

You are here >>> Part 1: Sync only new/changed files from AWS S3 to GCP cloud storage.

Part 2: Loading data into BigQuery. Create tables using schemas. We’ll use YAML to create a config file.

Part 3: Create a streaming ETL pipeline with load monitoring and error handling.

Prerequisites:

AWS account. If you don’t have one, create it now. It’s free. See “How to create AWS account”.
AWS CLI (Command Line Interface). You will need Python first. Then follow this article to install the CLI if you don’t have it yet.
Google developer account.
Google Command Line Interface. Follow this article to install it: https://cloud.google.com/sdk/.
Finally, we’ll need Node.js v8.6 or above to run our core application.

If you don’t have any of these, just create or install them. It’s free.

Before starting, let’s think about what we need to do here. We will create a Lambda function that transfers files from AWS S3 to Google Storage. It works as follows:

A new object (file) is created or updated in AWS S3. That triggers the Lambda.
The Lambda gets the file location (key) from that event.
The Lambda authenticates with Google Storage.
The Lambda saves the new file to the destination bucket in Google Storage.
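
The second step, reading the file location from the event, could look roughly like the sketch below. It assumes the usual S3 event notification shape (Records[].s3.bucket.name and Records[].s3.object.key). The helper name extractS3Objects is hypothetical and is not part of the starter project. S3 URL-encodes object keys in notifications, so the key is decoded before use.

```javascript
// Hypothetical helper: pull bucket names and object keys out of an S3 event.
function extractS3Objects(event) {
  const records = (event && event.Records) || [];
  return records
    .filter((record) => record.s3 && record.s3.bucket && record.s3.object)
    .map((record) => ({
      bucket: record.s3.bucket.name,
      // Keys arrive URL-encoded, with spaces encoded as "+".
      key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
    }));
}

// Sample event with placeholder names, trimmed to the fields used above.
const sampleEvent = {
  Records: [
    {
      s3: {
        bucket: { name: 'your-source-bucket' },
        object: { key: 'daily+table1-2019-10-01.json' },
      },
    },
  ],
};

console.log(extractS3Objects(sampleEvent));
```

Records that lack the s3 section are skipped rather than raising an error, which keeps the helper safe to call with an empty or unrelated event.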

Sounds simple.

How do we do it

We will be using AWS Node.js SDK in order to copy files from S3. You can find usage guides for S3 SDK here: docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html.

We will also need to create a Google service account in order to use Google account credentials. This page contains the official instructions for Google managed accounts: How to create service account. After this you will be able to download your credentials and use the Google API programmatically with Node.js. Your credentials file will have a name like this (example):

./your-google-project-12345.json
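
A downloaded service account key is a JSON document. The sketch below checks that a parsed key contains the fields a client library typically needs before you upload it anywhere. The helper name missingKeyFields and all sample values are hypothetical, and the field list (type, project_id, private_key, client_email) is an assumption based on the usual layout of service account keys, so compare it with your own file.

```javascript
// Fields assumed to be present in a service account key file.
const REQUIRED_KEY_FIELDS = ['type', 'project_id', 'private_key', 'client_email'];

// Hypothetical helper: list the required fields that are absent or empty.
function missingKeyFields(key) {
  return REQUIRED_KEY_FIELDS.filter(
    (field) => typeof key[field] !== 'string' || key[field].length === 0
  );
}

// Placeholder values only; a real key file contains a real private key.
const exampleKey = {
  type: 'service_account',
  project_id: 'your-google-project',
  private_key: '-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n',
  client_email: 'transfer@your-google-project.iam.gserviceaccount.com',
};

console.log('missing fields:', missingKeyFields(exampleKey));
```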

Clone Starter Repository

Run the following commands to clone the starter project and install the dependencies:

The starter branch includes a base directory template, with dependencies declared in package.json.

Your lambda’s package.json file should look like this:

After you run the npm install command, all required dependencies will be installed.

Getting Started With Lambda function.

Let’s assume we have files in different S3 buckets and we’d like new files to be copied to Google Storage as soon as they arrive.

Let’s create our lambda function file called:

./index.js

Have a look at the getGoogleCredentials function inside. We keep our Google credentials file your-google-project-12345.json in an S3 bucket. It’s a good security practice to do so.

We have also declared let config = require('./config.json'); where we will keep our settings.
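
As a rough illustration of that flow, here is a minimal sketch and not the project’s actual index.js. It assumes an AWS SDK v2 style S3 client (getObject(...).promise()) and a @google-cloud/storage style bucket object (file(name).save(data)). Both are passed in as parameters so the logic can be tried without network access. The names copyToGoogleStorage, makeHandler and createGcsBucket are hypothetical.

```javascript
// Read the Google credentials JSON that is kept in an S3 bucket.
// `s3` is assumed to behave like an AWS SDK v2 S3 client.
async function getGoogleCredentials(s3, credBucket, credKey) {
  const res = await s3.getObject({ Bucket: credBucket, Key: credKey }).promise();
  return JSON.parse(res.Body.toString('utf-8'));
}

// Copy one S3 object to the destination bucket under the same key.
// `gcsBucket` is assumed to behave like a @google-cloud/storage Bucket.
async function copyToGoogleStorage(s3, gcsBucket, srcBucket, key) {
  const res = await s3.getObject({ Bucket: srcBucket, Key: key }).promise();
  await gcsBucket.file(key).save(res.Body);
  return key;
}

// Build a Lambda handler. `createGcsBucket(credentials)` is a hypothetical
// factory that returns the destination bucket object.
function makeHandler({ s3, credBucket, credKey, createGcsBucket }) {
  return async function handler(event) {
    const credentials = await getGoogleCredentials(s3, credBucket, credKey);
    const gcsBucket = createGcsBucket(credentials);
    const copied = [];
    for (const record of event.Records || []) {
      // S3 URL-encodes keys in notifications, with spaces sent as "+".
      const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));
      copied.push(await copyToGoogleStorage(s3, gcsBucket, record.s3.bucket.name, key));
    }
    return copied;
  };
}

// Possible wiring in index.js (not executed here; assumes the aws-sdk v2 and
// @google-cloud/storage packages are installed, bucket names are placeholders):
//   const AWS = require('aws-sdk');
//   const { Storage } = require('@google-cloud/storage');
//   exports.handler = makeHandler({
//     s3: new AWS.S3(),
//     credBucket: 'your-credentials-bucket',
//     credKey: 'your-google-project-12345.json',
//     createGcsBucket: (credentials) =>
//       new Storage({ projectId: credentials.project_id, credentials })
//         .bucket('your-destination-bucket'),
//   });
```

Passing the clients in, instead of creating them inside the handler, is a design choice for testability. The credentials are re-read on every invocation here, and caching them between invocations would be a natural refinement.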

For example, my settings look like this:

You can see table names there. Essentially these are file name prefixes/suffixes, so you will need to adjust your test files accordingly. For example, a file name in your source bucket must contain one of the table names from the config, e.g. s3://your-source-bucket/daily_table1-2019-10-01.json

You’ll also need to change the “cred_bucket” entry to match your S3 bucket where you store your credentials. Change other variables to reflect your project too.
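
To make the file name rule concrete, here is a minimal sketch of how a key could be matched against the configured table names. Only the cred_bucket entry and the idea of a list of table names come from the text above. The config shape, the gcs_bucket key and the matchTable helper are assumptions for illustration and may differ from the real config.json.

```javascript
// Hypothetical config shape, written as a JS object for illustration.
const exampleConfig = {
  cred_bucket: 'your-credentials-bucket',
  gcs_bucket: 'your-destination-bucket', // assumed key name
  tables: ['table1', 'table2'], // assumed to be plain strings
};

// Return the first configured table name contained in the file name, or null.
function matchTable(key, tables) {
  const fileName = key.split('/').pop();
  const found = tables.find((table) => fileName.includes(table));
  return found === undefined ? null : found;
}

console.log(matchTable('daily_table1-2019-10-01.json', exampleConfig.tables));
console.log(matchTable('folder/unrelated.csv', exampleConfig.tables));
```

Because this is a plain substring match, table names that contain each other (for example table1 and table10) would be ambiguous. Stricter matching on a prefix or a pattern would avoid that.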

Test your lambda locally

Now let’s test our lambda locally. You might notice that there is a test.js file in our project.

./test.js:
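
A local runner of this kind can be sketched as follows. This is an assumption about what such a script might contain, not the project’s actual test.js. The helper name invokeLocally is hypothetical, and the commented wiring assumes that ./index.js exports a function named handler.

```javascript
// Hypothetical local runner: call a Lambda-style handler with a saved event.
// Works with async handlers (returning a promise) and callback-style ones.
function invokeLocally(handler, event, context = {}) {
  return new Promise((resolve, reject) => {
    const callback = (err, result) => (err ? reject(err) : resolve(result));
    try {
      const returned = handler(event, context, callback);
      if (returned && typeof returned.then === 'function') {
        returned.then(resolve, reject);
      }
    } catch (err) {
      reject(err);
    }
  });
}

// In test.js this might be wired up as:
//   const { handler } = require('./index');
//   const event = require('./event.json');
//   invokeLocally(handler, event).then(console.log).catch(console.error);

// Self-contained demonstration with a stand-in handler:
invokeLocally(async (event) => event.Records.length, { Records: [{}] })
  .then((count) => console.log('records handled:', count));
```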

You can see that we declared the ./event.json file to simulate an AWS S3 object-created event. This file should have these contents:

./event.json:

All we really care about here is the name of the bucket and the object key. So change them to reflect your S3 test files and folder structure.
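
If you want to generate such a file for your own bucket and key, a minimal sketch could look like this. The helper name makeS3Event is hypothetical. The event is trimmed to a few fields of the standard S3 notification shape, and real notifications carry many more fields and URL-encode the key.

```javascript
// Hypothetical helper: build a minimal S3 "object created" event for local tests.
function makeS3Event(bucket, key) {
  return {
    Records: [
      {
        eventSource: 'aws:s3',
        eventName: 'ObjectCreated:Put',
        s3: {
          bucket: { name: bucket },
          object: { key },
        },
      },
    ],
  };
}

// Placeholder bucket and file names; replace them with your own.
const localEvent = makeS3Event('your-source-bucket', 'daily_table1-2019-10-01.json');
console.log(JSON.stringify(localEvent, null, 2));
```

The printed JSON can be saved as ./event.json and edited by hand afterwards.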

Now run node test in your command line. It will trigger your local lambda, which will use the S3 file you specified in the event’s object.key.

Now if you go to Google Cloud console you should be able to see a new file in Storage.

Deploy the lambda

Let’s deploy the lambda to AWS and add S3 event trigger to invoke it.

We will use a shell script for this. Create a file ./deploy.sh.

./deploy.sh:

You can see that we are using a role arn:aws:iam::12345693471:role/gcs-transfer_lambda here.

AWS security rules require this role so that your lambda can access files in S3.

This article explains how to create a role: Execution role

Great! Now let’s run ./deploy.sh. The output will be JSON confirming that the lambda has been successfully deployed or updated.

Cool! Right now our Lambda is totally useless: it is supposed to copy files between accounts, but nothing invokes it yet. Let’s add a trigger.

Go to AWS Console -> Lambda -> Functions -> your function -> Configuration and add an S3 trigger with your source bucket name.

Now try uploading a file to your S3 source bucket. Our lambda will pick it up and copy it. You should see the same file in the Google Cloud destination bucket.

Part 1 is done!

We have just created a lambda to sync our AWS S3 cloud with Google Storage, tested it locally and created a shell script to deploy and update it. Time to add some ETL scripts to load data into BigQuery.

In Part 2 we will see how to load different file formats into BigQuery, add load monitoring and error handling, and create a few tables.

Thanks for reading!
