Sync your AWS S3 data with Google Storage in 20 minutes.
Have you heard about “Big Data” analytics, ETL and Data Warehouses and been confused about how to use them?
Have you ever wondered how to keep your “AWS S3” files synchronized with “Google Cloud Storage” in the background?
In this 20-minute tutorial, we’ll build a Node.js Lambda function in AWS that transfers data between accounts. The function listens for new files added to the source S3 bucket and then copies them to Google Storage.
Part 2 of this series provides a tutorial on building a Data Warehouse In the Cloud using Google BigQuery. It is a step-by-step guide to building a Big Data Warehouse solution including load monitoring and error handling.
Project layout:
You are here >>> Part 1: Sync only new/changed files from AWS S3 to GCP cloud storage.
Part 2: Loading data into BigQuery. Create tables using schemas. We’ll use YAML for the config file.
Part 3: Create streaming ETL pipeline with load monitoring and error handling.
To follow along you’ll need an AWS account and a Google Cloud account. If you don’t have one of these, just create it. It’s free.
Finally, we’ll need Node.js v8.6 or above to run our core application.
Before starting, let’s think about what we need to do here. We will create a Lambda function that transfers files from AWS S3 to Google Storage. It works as follows:
A new object (file) is created or updated in AWS S3. That triggers the Lambda.
The Lambda gets the file location (key) from that event.
The Lambda authenticates with Google Storage.
The Lambda saves the new file to the destination bucket in Google Storage.
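The second step can be sketched in a few lines. This is only an illustration: parseS3Event is a hypothetical helper name, and the sample event carries just the two fields we care about. Note that S3 URL-encodes object keys in event notifications and writes spaces as "+", so the key has to be decoded before use.

```javascript
// Sketch: pull the source bucket and object key out of an S3 event.
// parseS3Event is a hypothetical helper name, not taken from the starter repository.
function parseS3Event(event) {
  const record = event && event.Records && event.Records[0];
  if (!record || !record.s3) {
    throw new Error('Not an S3 event: no Records[0].s3 found');
  }
  return {
    bucket: record.s3.bucket.name,
    // S3 URL-encodes keys in event notifications and writes spaces as "+".
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
  };
}

// Example with the two fields the Lambda cares about (values are made up).
const sample = {
  Records: [
    {
      s3: {
        bucket: { name: 'your-source-bucket' },
        object: { key: 'data/daily+table1.json' },
      },
    },
  ],
};
console.log(parseS3Event(sample));
```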
We will also need to create a Google service account in order to use Google account credentials. This page contains the official instructions for Google managed accounts: How to create service account. After this you will be able to download your credentials and use Google API programmatically with Node.js. Your credentials file will look like this (example):
./your-google-project-12345.json
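A service account key file is a JSON document. As a rough sketch, the check below lists the fields a Google client library typically needs from it. The values are placeholders, missingKeyFields is a hypothetical helper, and a real key file contains a few more fields (such as auth_uri and token_uri) that are omitted here.

```javascript
// Fields of a service account key that the sketch treats as mandatory.
const REQUIRED_FIELDS = ['type', 'project_id', 'private_key', 'client_email'];

// Return the names of required fields that are missing or empty (hypothetical helper).
function missingKeyFields(key) {
  return REQUIRED_FIELDS.filter(
    (field) => typeof key[field] !== 'string' || key[field] === ''
  );
}

// Placeholder values only; never commit a real private key.
const exampleKey = {
  type: 'service_account',
  project_id: 'your-google-project',
  private_key_id: '<key id>',
  private_key: '-----BEGIN PRIVATE KEY-----\n<redacted>\n-----END PRIVATE KEY-----\n',
  client_email: 'your-service-account@your-google-project.iam.gserviceaccount.com',
  client_id: '<numeric id>',
};

console.log(missingKeyFields(exampleKey));
```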
Clone Starter Repository
Run the following commands to clone the starter project and install the dependencies:
The starter branch includes a base directory template, with dependencies declared in package.json.
Your Lambda’s package.json file should look like this:
After you run the npm install command, all required dependencies will be installed.
Getting Started With Lambda function.
Let’s assume we have files in different S3 buckets and, as soon as new files arrive, we’d like them copied to Google Storage.
Let’s create our lambda function file called:
./index.js
Have a look at the getGoogleCredentials function inside. We keep our Google credentials file your-google-project-12345.json in an S3 bucket rather than in the deployment package. That keeps the key out of the code repository, and access to it is governed by the bucket’s permissions.
We also load let config = require('./config.json'); which is the file where we keep our settings.
For example, my settings look like this:
You can see table names there. Essentially these are file name prefixes/suffixes, so you will need to adjust your test files accordingly. For example, a file name in your source bucket must contain one of the names listed under tables in the config, e.g. s3://your-source-bucket/daily_table1-2019-10-01.json
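To make the matching rule concrete, here is a small sketch. The config object stands in for config.json: only cred_bucket and the idea of a list of table names come from the text, while cred_key, gcs_bucket, the shape of the tables entries and the findTable helper are assumptions made for illustration.

```javascript
// Stand-in for config.json. Only cred_bucket and the notion of table names come from
// the text; cred_key, gcs_bucket and the shape of "tables" are hypothetical.
const config = {
  cred_bucket: 'your-credentials-bucket',
  cred_key: 'your-google-project-12345.json',
  gcs_bucket: 'your-destination-bucket',
  tables: [{ name: 'table1' }, { name: 'table2' }],
};

// Return the first configured table whose name appears in the file name, or null.
function findTable(key, tables) {
  const fileName = key.split('/').pop();
  return tables.find((table) => fileName.includes(table.name)) || null;
}

console.log(findTable('daily_table1-2019-10-01.json', config.tables));
console.log(findTable('unrelated.csv', config.tables));
```

With a rule like this, a file whose name matches none of the configured tables can simply be skipped.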
You’ll also need to change the “cred_bucket” entry to match your S3 bucket where you store your credentials. Change other variables to reflect your project too.
Test your lambda locally
Now let’s test our Lambda locally. You might notice that there is a test.js file in our project.
./test.js:
You can see that we load the ./event.json file to simulate the event S3 sends when an object is created. This file should have these contents:
./event.json:
All we really care about here is the bucket name and the object key, so change them to match your S3 test files and folder structure.
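As a rough sketch of the idea, the snippet below builds a cut-down S3 event and runs a handler against it. The event keeps only a handful of fields (real S3 events carry many more), runLocal is a hypothetical helper, and a stub stands in for the real handler so the example runs without cloud access.

```javascript
// test.js (sketch): invoke a handler locally with a simulated S3 event.
// Only bucket.name and object.key matter here; real S3 events carry many more fields.
const event = {
  Records: [
    {
      eventSource: 'aws:s3',
      eventName: 'ObjectCreated:Put',
      s3: {
        bucket: { name: 'your-source-bucket' },
        object: { key: 'daily_table1-2019-10-01.json' },
      },
    },
  ],
};

// Run the handler and report the outcome; resolves to true on success, false on failure.
async function runLocal(handler, testEvent) {
  try {
    const result = await handler(testEvent, {});
    console.log('Lambda finished:', result);
    return true;
  } catch (err) {
    console.error('Lambda failed:', err.message);
    return false;
  }
}

// Stub standing in for the real handler so the sketch runs without cloud access.
// In the project this would be: const { handler } = require('./index');
const stubHandler = async (e) => `copied ${e.Records[0].s3.object.key}`;
runLocal(stubHandler, event);
```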
Now run node test from the command line. It invokes the Lambda locally, using the S3 file you specified in the event’s object.key:
Now, if you go to the Google Cloud console, you should see the new file in Storage.
Deploy the lambda
Let’s deploy the lambda to AWS and add S3 event trigger to invoke it.
We will use a shell script for this. Create a file ./deploy.sh.
./deploy.sh:
You can see that we are using the role arn:aws:iam::12345693471:role/gcs-transfer_lambda here.
AWS requires this execution role so that your Lambda has permission to access files in S3.
This article explains how to create a role: Execution role
Great! Now let’s run ./deploy.sh. The output will be some JSON confirming that the Lambda has been successfully deployed or updated.
Cool! Now we have a totally useless Lambda which is supposed to copy files between accounts. Let’s add a trigger.
Go to AWS Console -> Lambda -> Functions -> your function -> Configuration and add an S3 trigger with your source bucket name.
Now try uploading a file to your S3 source bucket. Our Lambda will pick it up and copy it, and the same file should appear in the Google Cloud destination bucket.
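If you would rather script the trigger than click through the console, the same setting can be described in code. The sketch below only builds the parameters for the aws-sdk (v2) call s3.putBucketNotificationConfiguration; the bucket name and function ARN are placeholders, and buildTriggerParams is a hypothetical helper.

```javascript
// Sketch: parameters for s3.putBucketNotificationConfiguration (aws-sdk v2).
// The bucket name and function ARN passed in below are placeholders.
function buildTriggerParams(sourceBucket, functionArn) {
  return {
    Bucket: sourceBucket,
    NotificationConfiguration: {
      LambdaFunctionConfigurations: [
        {
          LambdaFunctionArn: functionArn,
          // Fire on every kind of object creation (put, post, copy, multipart upload).
          Events: ['s3:ObjectCreated:*'],
        },
      ],
    },
  };
}

const params = buildTriggerParams(
  'your-source-bucket',
  'arn:aws:lambda:<region>:<account-id>:function:<your-function>'
);
console.log(JSON.stringify(params, null, 2));
// With the SDK installed:
// await new AWS.S3().putBucketNotificationConfiguration(params).promise();
```

Two caveats if you go this way: the call replaces the bucket’s entire notification configuration, and S3 must separately be granted permission to invoke the function, something the console sets up for you.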
Part 1 is done!
We have just created a Lambda that syncs AWS S3 with Google Storage by transferring files between accounts, tested it locally and written a shell script to deploy and update it. Time to add some ETL scripts to load data into BigQuery.
In Part 2 we will see how to load different file formats into BigQuery, set up load monitoring and error handling, and create a few tables.