How to copy a large file from SFTP server to AWS S3 using lambda?

Summary

The article describes a method for automating the transfer of a large XML file from an SFTP server to AWS S3 using an AWS Lambda function.

Abstract

The author outlines a solution for automating the transfer of large files from an SFTP server to AWS S3, which involves using the Serverless framework to deploy an AWS Lambda function. The function utilizes the ssh2-sftp-client package to download the file and the xml2js package to convert XML to JSON. The file is temporarily stored in the Lambda /tmp directory before being uploaded to S3. The author emphasizes the need to configure the Lambda function with adequate memory and storage to handle large files, and notes that while the process is straightforward, it requires tuning for optimal performance and cost-efficiency on AWS.

Opinions

The author believes that while the process of transferring files is simple, it can be tricky when dealing with large files in a Lambda environment.
They suggest that tuning the Lambda function's memory and storage settings is crucial for handling large files, implying that default settings may be insufficient.
The author points out that increasing memory size and storage can lead to higher costs, indicating a trade-off between performance and cost.
They recommend the Serverless framework for ease of deployment and configuration of the Lambda function.

How to copy a large file from SFTP server to AWS S3 using lambda?

Recently, I had to help a friend automate the process of moving a large file from an SFTP location to S3 for further processing. It was a large(ish) XML file that needed to be downloaded, have its data loaded into a different data store, and then be archived for future reference.

The overall process is quite simple, but handling a large file on Lambda can be a little tricky. In this post, I will go through the process I followed and the things you should look out for when building a service like this.

I used the Serverless framework to write and deploy my Lambda function. My function definition in the serverless.ts file looks like this:

functions: {
    downloadXMLFile: {
      handler: "handler.downloadXMLFile",
      timeout: 900,
      memorySize: 5120,
      ephemeralStorageSize: 1024,
      events: [
        {
          http: {
            method: 'get',
            path: 'downloadXMLFile',
          },
        }
      ]
    }
  }
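
The values in this definition sit inside hard limits that Lambda enforces: a timeout of at most 900 seconds, memory between 128 MB and 10,240 MB, and ephemeral storage between 512 MB and 10,240 MB. As a minimal sketch, the helper below checks a function definition against those limits before deploying. `checkLambdaSizing` and `LambdaSizing` are hypothetical names used for illustration; they are not part of the Serverless framework or the AWS SDK.

```typescript
// Hypothetical helper, not part of the Serverless framework or the AWS SDK.
interface LambdaSizing {
  timeout: number;              // seconds
  memorySize: number;           // MB
  ephemeralStorageSize: number; // MB
}

// Returns a list of human-readable problems; an empty list means all values are within limits.
function checkLambdaSizing(cfg: LambdaSizing): string[] {
  const problems: string[] = [];
  if (cfg.timeout < 1 || cfg.timeout > 900) {
    problems.push(`timeout ${cfg.timeout}s is outside 1-900s`);
  }
  if (cfg.memorySize < 128 || cfg.memorySize > 10240) {
    problems.push(`memorySize ${cfg.memorySize}MB is outside 128-10240MB`);
  }
  if (cfg.ephemeralStorageSize < 512 || cfg.ephemeralStorageSize > 10240) {
    problems.push(`ephemeralStorageSize ${cfg.ephemeralStorageSize}MB is outside 512-10240MB`);
  }
  return problems;
}

// The values used in the function definition above
const sizingIssues = checkLambdaSizing({ timeout: 900, memorySize: 5120, ephemeralStorageSize: 1024 });
console.log(sizingIssues.length === 0 ? "sizing OK" : sizingIssues.join("\n"));
```

One thing to keep in mind with the http event: API Gateway waits only 29 seconds by default for an integration to respond, so the caller may see a timeout while the function itself keeps running up to its own 900-second limit.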

The downloadXMLFile function uses the ssh2-sftp-client (https://www.npmjs.com/package/ssh2-sftp-client) package to connect to the SFTP server and download the file. I also used the Lambda function's /tmp folder to temporarily store the file before converting it into a JSON object and then uploading both the XML and the JSON to an S3 bucket.

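import { parseString } from 'xml2js';
import { S3 } from "aws-sdk";
import ClientFtp from 'ssh2-sftp-client';
import fs from "fs";
// Types for the handler signature (from the @types/aws-lambda package)
import type { APIGatewayEvent, Handler } from "aws-lambda";
// formatJSONResponse, internalServerError and sftpSSHKey are defined elsewhere in the project and are not shown here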

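// On Lambda the SDK picks up the execution role's temporary credentials automatically.
// Passing only an access key and secret (without the session token) can fail with
// those credentials, so no keys are passed in here.
const s3 = new S3();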

export const downloadXMLFile: Handler = async (_event: APIGatewayEvent) => {
  // Date-based key prefix in day-month-year form
  const d = new Date();
  const today = `${d.getDate()}-${d.getMonth() + 1}-${d.getFullYear()}`;
  const sftp = new ClientFtp();

  try {
    await sftp.connect({
      host: process.env.SFTP_HOST,
      username: process.env.SFTP_USER,
      // sftpSSHKey holds the private key; its definition is not shown in this snippet
      privateKey: sftpSSHKey,
    });

    // Download to Lambda's ephemeral /tmp storage, then read the whole file into memory
    await sftp.fastGet("/upload/feed.xml", "/tmp/feed.xml");
    const xmlData = fs.readFileSync("/tmp/feed.xml", "utf-8");
    const jsonData = await parseXml(xmlData);

    // Upload the converted JSON
    await uploadtoS3({
      Bucket: process.env.FEED_BUCKET,
      Body: JSON.stringify(jsonData),
      ContentType: "application/json",
      Key: `${today}/feed.json`,
    });

    // Upload the original XML for archiving
    await uploadtoS3({
      Bucket: process.env.FEED_BUCKET,
      Body: xmlData,
      ContentType: "application/xml",
      Key: `${today}/feed.xml`,
    });

    return formatJSONResponse({
      message: "File downloaded and transformed successfully!",
    });
  } catch (error) {
    // Every step is awaited, so SFTP, parsing and upload failures all land here
    console.log("Catch Error: ", error);
    return internalServerError(error);
  } finally {
    // Close the SFTP connection; ignore errors if it was never opened
    await sftp.end().catch(() => undefined);
  }
};

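export async function parseXml(xmlString: string): Promise<any> {
  // Wrap xml2js's callback-style parseString in a promise
  return new Promise((resolve, reject) =>
    parseString(xmlString, (err, jsonData) => {
      if (err) {
        // Return here so resolve is not called after a failure
        return reject(err);
      }
      resolve(jsonData);
    })
  );
}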

export async function uploadtoS3(s3Data: S3.PutObjectRequest) {
  console.info("---- UPLOADING TO S3", s3Data.Bucket, s3Data.Key);

  try {
    return await s3.upload(s3Data).promise();
  } catch (error) {
    console.log(error);
    // Rethrow so the caller does not report success after a failed upload
    throw error;
  }
}
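
The handler builds its S3 key prefix as day-month-year without zero padding, so the resulting keys do not sort chronologically when the bucket is listed. If sortable keys matter (an assumption about the use case, not something this setup requires), one option is an ISO-style prefix. `isoDatePrefix` below is a hypothetical helper shown only as a sketch.

```typescript
// Hypothetical helper: builds a zero-padded, UTC-based YYYY-MM-DD prefix
// so that S3 keys sort chronologically when listed.
function isoDatePrefix(d: Date): string {
  // toISOString() always returns UTC in the form YYYY-MM-DDTHH:mm:ss.sssZ
  return d.toISOString().slice(0, 10);
}

const examplePrefix = isoDatePrefix(new Date(Date.UTC(2024, 2, 7)));
console.log(`${examplePrefix}/feed.xml`); // 2024-03-07/feed.xml
```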

Here, I first create a connection to the SFTP server and then use its fastGet function to download the file from a particular path on that server. The downloaded copy is stored temporarily in Lambda's /tmp folder. Note that there is a limit on how much you can store there: 512 MB by default, configurable up to 10 GB (https://aws.amazon.com/blogs/aws/aws-lambda-now-supports-up-to-10-gb-ephemeral-storage/).

Once I have the file in memory, I use the xml2js package (https://www.npmjs.com/package/xml2js) to convert the XML data into JSON.

It works very well locally, but when you deploy it on AWS, you have to tune it a little. The file I was downloading was just under 0.5 GB, but I still had to increase the function's memory size to 5120 MB, most likely because the raw XML string, the parsed object and the serialized JSON all sit in memory at the same time. Both memorySize and ephemeralStorageSize can be configured in the serverless.ts file (see the snippet above) or manually in the Lambda function's configuration, like this:
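
Because /tmp is capped at whatever ephemeralStorageSize is configured, it can help to check the remote file's size before downloading it. The sketch below assumes the size in bytes comes from something like the `size` field returned by ssh2-sftp-client's `stat()`. `fitsInTmp` and the 10% headroom are illustrative choices, not part of the original setup.

```typescript
const BYTES_PER_MB = 1024 * 1024;

// Hypothetical helper: returns true if a file of the given size fits in /tmp,
// keeping a fraction of the configured storage free as headroom.
function fitsInTmp(fileSizeBytes: number, ephemeralStorageMb: number, headroom: number = 0.1): boolean {
  const usableBytes = ephemeralStorageMb * BYTES_PER_MB * (1 - headroom);
  return fileSizeBytes <= usableBytes;
}

// A file of roughly 0.5 GB against the 1024 MB configured in serverless.ts
console.log(fitsInTmp(500 * BYTES_PER_MB, 1024));  // true
// A file close to 1 GB would not leave the 10% headroom
console.log(fitsInTmp(1000 * BYTES_PER_MB, 1024)); // false
```

In the handler, a check like this could run right after connecting and return an error response early instead of failing halfway through the download.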

(Screenshot: Lambda configuration, memorySize setting)
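
Rather than guessing at memorySize, it can be useful to log how much memory the function actually uses at each step. The sketch below formats the kind of object returned by Node's `process.memoryUsage()`. `formatMemoryUsage` is a hypothetical helper, and the numbers in the example call are made up for illustration, not measurements from this function.

```typescript
// Subset of the object returned by Node's process.memoryUsage() that is used here.
interface MemoryUsageLike {
  rss: number;      // resident set size, in bytes
  heapUsed: number; // V8 heap currently in use, in bytes
}

// Hypothetical helper: renders memory figures in whole megabytes for log output.
function formatMemoryUsage(label: string, usage: MemoryUsageLike): string {
  const toMb = (bytes: number): number => Math.round(bytes / (1024 * 1024));
  return `${label}: rss=${toMb(usage.rss)}MB heapUsed=${toMb(usage.heapUsed)}MB`;
}

// In the handler this could be called as:
//   console.log(formatMemoryUsage("after parseXml", process.memoryUsage()));
// Illustrative values only:
console.log(formatMemoryUsage("example", { rss: 3 * 1024 * 1024 * 1024, heapUsed: 1536 * 1024 * 1024 }));
// example: rss=3072MB heapUsed=1536MB
```

Lambda also prints a "Max Memory Used" figure in the REPORT line of each invocation's CloudWatch log, which gives a second data point when picking the setting.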

As I said at the start, it's quite simple; you just need to figure out the right memory setting. Remember that running a Lambda function with more memory and storage will cost you more (https://aws.amazon.com/lambda/pricing/).
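
Lambda bills compute in GB-seconds, that is, configured memory in GB multiplied by execution time in seconds, so the effect of a larger memorySize can be estimated up front. A rough sketch follows, assuming the per-GB-second rate is read off the pricing page for your region and architecture (it is a caller-supplied argument here, not a quoted price) and ignoring request charges and the separate charge for ephemeral storage above 512 MB. Both function names are hypothetical.

```typescript
// GB-seconds consumed by one invocation at a given memory setting.
function gbSeconds(memoryMb: number, durationSeconds: number): number {
  return (memoryMb / 1024) * durationSeconds;
}

// Compute cost of one invocation; pricePerGbSecond must come from the AWS pricing page.
function estimateComputeCost(memoryMb: number, durationSeconds: number, pricePerGbSecond: number): number {
  return gbSeconds(memoryMb, durationSeconds) * pricePerGbSecond;
}

// Worst case for the configuration in this post: 5120 MB for the full 900-second timeout
console.log(gbSeconds(5120, 900)); // 4500
```

The same run at 1024 MB would use 900 GB-seconds, which is why it pays to find the smallest memory setting that still completes reliably.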

Thanks for reading and if you like my content and want to support me, then please follow me and help all of us on this platform grow and produce useful content by becoming a member.
