D20 Technical Services

Custom Software Development and Cloud Experts

Data Ingestion with AWS Data Pipeline, Part 2

In our previous post, we outlined the requirements for a project integrating a line-of-business application with an enterprise data warehouse in the AWS environment. Our goal is to load data into DynamoDB from flat files stored in S3 buckets.

AWS provides two tools that are very well suited to situations like this:

Athena allows you to process data stored in S3 using standard SQL. Essentially, you put files into an S3 bucket, describe the format of those files using Athena’s DDL, and run queries against them. Athena provides a REST API for executing statements that dump their results to another S3 bucket, or you can use the JDBC/ODBC drivers to programmatically query the data. You can have multiple tables and join them together as you would with a traditional RDBMS. Under the hood, Athena uses Presto to do its thing.
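To make that concrete, here’s a minimal sketch of the Athena side. The helper below renders a `CREATE EXTERNAL TABLE` DDL statement for comma-delimited files under an S3 prefix; the table, bucket, and column names are illustrative placeholders, not details from the project.

```python
# Sketch: render Athena DDL for CSV files in S3.
# All names (table, bucket, columns) are hypothetical examples.

def extract_table_ddl(table, bucket, prefix, columns):
    """Build a CREATE EXTERNAL TABLE statement over an S3 prefix.

    columns: list of (name, athena_type) pairs.
    """
    cols = ",\n    ".join(f"{name} {typ}" for name, typ in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"    {cols}\n"
        ")\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION 's3://{bucket}/{prefix}/'"
    )

print(extract_table_ddl(
    "extracts",
    "example-extract-bucket",
    "incoming",
    [("id", "string"), ("updated", "timestamp"), ("payload", "string")],
))
```

A statement like this would then be submitted through Athena’s `StartQueryExecution` API (or over the JDBC/ODBC drivers), after which the table is immediately queryable with standard SQL.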

Data Pipeline is an automation layer on top of EMR that allows you to define data processing workflows that run on clusters. You can design your workflows visually, or even better, with CloudFormation.

In Data Pipeline, a processing workflow is represented as a series of connected objects that describe data, the processing to be performed on it, and the resources used in doing so. For our purposes, we are concerned with four classes of objects:

  • Activities. Commands and processes that implement the logic of our pipeline. There are several activity types; we’ll detail the ones our pipeline uses as we build it out.
  • Data Nodes. These are the inputs and outputs of activities, which are:
    • S3 Data Node. A directory or file stored in S3 that is either the input to or output of an activity.
    • DynamoDB Data Node. A DynamoDB table; in our case, the table to be loaded.
  • Resources. Configurations for the compute services that will support our pipeline; in our case, the EMR cluster our activities will run on. The resource object captures details like the number of machines, machine classes, EMR version, etc.
  • Actions. Other operations to be performed by the pipeline, such as raising an SNS alarm to notify of completion or failure.
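To show how these four classes fit together, here’s a sketch of a pipeline definition in the JSON shape that Data Pipeline’s `PutPipelineDefinition` API expects: each object has an `id`, a `name`, and a list of `fields`, where a field either holds a literal `stringValue` or a `refValue` pointing at another object. All ids, names, S3 paths, and the `HiveActivity` type below are illustrative assumptions, not the project’s actual definition.

```python
# Sketch of a Data Pipeline definition as PutPipelineDefinition objects.
# Every id, name, path, and the activity type are hypothetical placeholders.

def field(key, value, ref=False):
    """Build one pipeline-object field: a literal value or a reference."""
    return {"key": key, ("refValue" if ref else "stringValue"): value}

pipeline_objects = [
    # Resource: the EMR cluster the activities run on.
    {"id": "EmrCluster1", "name": "ProcessingCluster", "fields": [
        field("type", "EmrCluster"),
        field("coreInstanceCount", "2"),
        field("coreInstanceType", "m3.xlarge"),
    ]},
    # Data nodes: the S3 input and the DynamoDB output.
    {"id": "S3Input", "name": "ExtractFiles", "fields": [
        field("type", "S3DataNode"),
        field("directoryPath", "s3://example-extract-bucket/incoming"),
    ]},
    {"id": "DynamoTable", "name": "TargetTable", "fields": [
        field("type", "DynamoDBDataNode"),
        field("tableName", "example-table"),
    ]},
    # Activity: runs on the cluster, reading the input node, writing the output.
    {"id": "LoadActivity", "name": "LoadToDynamo", "fields": [
        field("type", "HiveActivity"),  # placeholder activity type
        field("runsOn", "EmrCluster1", ref=True),
        field("input", "S3Input", ref=True),
        field("output", "DynamoTable", ref=True),
    ]},
    # Action: notify on failure.
    {"id": "FailureAlarm", "name": "NotifyFailure", "fields": [
        field("type", "SnsAlarm"),
        field("topicArn", "arn:aws:sns:us-east-1:123456789012:pipeline-status"),
    ]},
]
```

Note how the `refValue` fields (`runsOn`, `input`, `output`) are what wire the activity to its resource and data nodes; this is the same graph you see when designing a pipeline visually.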

In addition, activities may have dependencies on resources, data nodes and even other activities. Our high-level plan of attack will be:

  1. Create the Athena structures for storing our data.
  2. Create a data pipeline that implements our processing logic. This is the most complex step in the process and we’ll detail it in the next few posts.
  3. Create a Lambda Function that activates our data pipeline, passing in details of the file to be processed.
  4. Configure an S3 Event Notification on the bucket that receives our ZIP extracts to call the function created in #3 above.
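Steps 3 and 4 can be sketched now, since they’re the simplest part of the plan. Below is a minimal Lambda handler: it pulls the uploaded object’s bucket and key out of the S3 event notification and activates the pipeline, passing them in as parameter values. The pipeline id and parameter names (`myInputBucket`, `myInputKey`) are placeholders we’ve invented for illustration; the `activate_pipeline` call itself is the real boto3 API.

```python
# Sketch of the activation Lambda. Pipeline id and parameter ids are
# hypothetical; your pipeline definition determines the real parameter names.

def pipeline_parameters(event):
    """Map an S3 event record to Data Pipeline parameter values."""
    record = event["Records"][0]["s3"]
    return [
        {"id": "myInputBucket", "stringValue": record["bucket"]["name"]},
        {"id": "myInputKey", "stringValue": record["object"]["key"]},
    ]

def handler(event, context):
    import boto3  # imported here so the helper above is testable offline
    client = boto3.client("datapipeline")
    client.activate_pipeline(
        pipelineId="df-EXAMPLE1234567",  # placeholder pipeline id
        parameterValues=pipeline_parameters(event),
    )
```

With this in place, step 4 is just pointing the bucket’s `s3:ObjectCreated:*` event notification at the function.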

In Part 3 (coming soon!) we’ll dig into the details of configuring Athena to store our data.