D20 Technical Services

Custom Software Development and Cloud Experts

Data Ingestion with AWS Data Pipeline, Part 1

Recently, we had the opportunity to work on an integration project for a client running on the AWS platform. It involved designing a system to regularly load information from an enterprise data warehouse into a line-of-business application that uses DynamoDB as its primary data store.

The scenario is:

  • A data syndication process periodically creates extracts from a data warehouse. This warehouse collects and integrates information from various applications across the business.
  • The extracts are flat files consisting of table dumps from the warehouse. There are many tables in its schema and each run of the syndication process dumps out the rows created since its last run. The flat files are bundled up into a single ZIP file which is deposited into a S3 bucket for consumption by downstream applications.
  • The integration warehouse can not be queried directly – the only access to its data is from the extracts.
  • Only a subset of information in the extracts is required by our application and we have created DynamoDB tables in the application to receive the extracted data.
  • Our application’s use of this data is read-only. The only writes to the DynamoDB table will be made by the process that consumes the extracts.
  • The extracts are produced several times per day and are of varying size. Our process should run on-demand and scale to the size of the data to be processed. We want to minimize costs across the process and provision only the compute resources needed for the job at hand.
  • The data should be visible in our application within one hour of a new extract becoming available.
  • Lastly, we need to maintain a rolling nine month copy of the data in our application. We have configured TTL on the DynamoDB table so the data ages out automatically, keeping our storage costs in check.

One of the key challenges with this scenario is that the extracts present their data in a highly normalized form. We need to analyze each file and reassemble their data into a composite, hierarchical record for use with our DynamoDB-based application. There are multiple one-to-many relationships in the extracts that we need to navigate, and such processing would entail making multiple passes over the files with many intermediate results.

If only there were a way to query files in S3 like tables in a RDBMS! Even better if we had a way to run jobs in parallel and a mechanism to glue such tools together without writing a lot of code!

Check out Part 2 for details on how we solved this problem.