DE by DL.AI - Course 2: Source Systems, Data Ingestion, and Pipelines (W2)

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!

Data Ingestion Overview

Overview

  • Recall: As a DE, you get raw data somewhere → turn it into something useful → make it available for downstream use cases.
    • Data ingestion → “get raw data somewhere”
  • Recall all the labs we’ve done so far:
  • Plan for this week:
    • This week takes a closer look at batch and streaming ingestion.
    • Talk to Data Analyst to identify requirements for batch ingestion from a REST API.
    • Talk to Software Engineers to investigate requirements for streaming ingestion: the data payload and event rates + how to configure the streaming pipeline.

Data ingestion on a continuum

  • The data you’re working with is unbounded (a continuous stream of events) - the stream has no particular beginning or end.
    • If we ingest events individually, one at a time → streaming ingestion
    • If we impose some boundaries and ingest all data within those boundaries → batch ingestion
  • Different ways to impose the boundaries:
    • Size-based batch ingestion: ingest whenever a size threshold is reached, e.g. 10 GB per chunk or 1K events per chunk.
    • Time-based batch ingestion: weekly / daily chunks, every hour,…
    • The more we increase the frequency of the ingestion → the closer we get to streaming ingestion.
      It depends on the use case and the source system which one to use (see the small batching sketch after this list).
  • Ways to ingest data from databases:
    • Connectors (JDBC/ODBC API) ← Lab 1 of this week.
      • How much can you ingest in one go? How frequently can you call the API? → no fixed answer → read the API docs + communicate with the data owners + write custom API connection code.
      • 👍 Recommend: For API data ingestion, use existing solutions when possible. Reserve custom connections for when no other options exist.
    • Ingestion Tool (AWS Glue ETL)
  • Ways to ingest data from files: use a secure file transfer protocol like SFTP or SCP.
  • Ways to ingest data from streaming systems: choose batch or streaming, or set up a message queue. ← Lab 2 of this week.
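A minimal sketch of the size/time boundary idea above, assuming a hypothetical event source; the thresholds and the loader call are placeholders, not part of the course labs:

```python
import time
from typing import Dict, Iterable, Iterator, List

def batch_events(events: Iterable[Dict],
                 max_records: int = 1_000,
                 max_seconds: float = 3600.0) -> Iterator[List[Dict]]:
    """Group an unbounded event stream into chunks, flushing whenever the
    record-count threshold OR the time window is reached (whichever comes first).
    Note: the time check only runs when a new event arrives - a simplification."""
    batch: List[Dict] = []
    window_start = time.monotonic()
    for event in events:
        batch.append(event)
        if len(batch) >= max_records or time.monotonic() - window_start >= max_seconds:
            yield batch                       # hand the chunk to the loader
            batch, window_start = [], time.monotonic()
    if batch:                                 # flush the leftovers at shutdown
        yield batch

# Hypothetical usage:
# for chunk in batch_events(read_from_source(), max_records=1_000, max_seconds=3600):
#     load_to_s3(chunk)
```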

Batch and Streaming tools

  • Batch ingestion tools:
    • AWS Glue ETL: Serverless service for data ingestion and transformation from various AWS sources. Uses Apache Spark for distributed ETL jobs. Enables code-based solutions for data processing (a small boto3 sketch for starting a Glue job follows the considerations list below). See AWS Glue Components & AWS Glue ETL guidance.
    • Amazon EMR (Big Data platform): Managed cluster platform for big data frameworks like Hadoop and Spark. Useful for ingesting and transforming petabyte-scale data from databases to AWS data stores. Offers serverless and provisioned modes.
    • Glue vs EMR: see again this note.
    • AWS DMS (Database Migration Service): For data ingestion without transformations. Syncs data between databases or to data stores like S3. Supports database engine migrations. Available in serverless or provisioned modes.
    • AWS Snow family: For migrating large on-premise datasets (100+ TB) to the cloud. Offers physical transfer appliances like Snowball and Snowcone, avoiding slow and costly internet transfers.
    • AWS Transfer Family: Enables secure file transfers to/from S3 using SFTP, FTPS, and FTP protocols.
    • Other non-AWS: Airbyte, Matillion, Fivetran
  • Key considerations for Batch vs Streaming Ingestion:
    • Use cases: Evaluate benefits of real-time data vs. batch processing. Consider stakeholder needs for ML and reporting.
    • Latency: Determine if millisecond-level ingestion is necessary or if micro-batch approach suffices.
    • Cost: Assess complexities and expenses of streaming vs. batch ingestion, including maintenance and team capabilities.
    • Existing systems: Check compatibility with source and destination systems. Consider impact on production instances and suitability for data types (e.g., IoT sensors).
    • Reliability: Ensure streaming pipeline stability, redundancy, and high availability of compute resources compared to batch services.
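Referring back to the AWS Glue ETL bullet above: a minimal boto3 sketch for kicking off an existing Glue job run. The job name, region, and arguments are made-up placeholders; the job itself (script, IAM role, connections) must already be defined in Glue.

```python
import boto3

# The job name and arguments below are placeholders; a real Glue ETL job
# must already exist before it can be run.
glue = boto3.client("glue", region_name="us-east-1")

response = glue.start_job_run(
    JobName="spotify_batch_ingestion",          # hypothetical job name
    Arguments={"--ingest_date": "2024-01-01"},  # custom args read by the job script
)
print("Started run:", response["JobRunId"])
```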

Batch Ingestion

Conversation with a Marketing Analyst

  • Learn what actions your stakeholders plan to take with the data:
    • DA: We would like to look into what kind of music people are listening to across the various regions where we sell our products and then compare that with product sales.
    • DE (restating more clearly): you would like to pull in public data from some external sources to get information on the music people are listening to.
  • A follow-up conversation may be required:
    • DE: wait for me to take a closer look at the Spotify API.
  • Key point: to identify the key requirements for the system you’ll need to build.
    • The key thing you learn here is that you’ll need to ingest data from a third-party API → batch ingestion (no need for real time).

ETL vs ELT

  • ETL (Extract - Transform - Load)
    • When resources in the warehouse were limited (in the past) → we had to transform data before loading so it could be stored efficiently.
    • It ensures data quality, but transformation takes time and may have to be redone if requirements change.
  • ELT (Extract - Load - Transform):
    • Has become popular nowadays because of the lower cost of infrastructure.
    • Store enormous amounts of data relatively cheaply.
    • Perform data transformation directly in the data warehouse.
    • Advantages:
      • It’s faster to implement.
      • It makes data available more quickly to end users.
      • Transformations can still be done efficiently. You can decide later to adopt different transformations.
    • Downside: you may lose the “T” and just extract and load any data → Data Swamp problem (the data becomes unorganized, unmanageable and useless)
→ Both ETL & ELT are reasonable batch processing approaches.
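A minimal ELT sketch to make the “load first, transform in the warehouse” idea concrete. The connection helper, table names, S3 path, and the Redshift-style COPY syntax are all illustrative assumptions, not the course’s actual setup:

```python
# ELT sketch: load raw data as-is, transform later inside the warehouse.
# `connect_to_warehouse()` is a hypothetical helper standing in for your
# actual warehouse client (psycopg2, redshift_connector, etc.).
conn = connect_to_warehouse()
cur = conn.cursor()

# E + L: copy raw files into a landing table without reshaping them first
# (Redshift-style COPY shown; real usage also needs credentials / an IAM role).
cur.execute("""
    COPY raw.listening_events
    FROM 's3://my-bucket/spotify/2024-01-01/'
    FORMAT AS JSON 'auto'
""")

# T: transform directly in the warehouse, once analysts know what they need.
cur.execute("""
    CREATE TABLE analytics.plays_by_region AS
    SELECT region, genre, COUNT(*) AS plays
    FROM raw.listening_events
    GROUP BY region, genre
""")
conn.commit()
```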

REST API

  • APIs solved inefficient data and service exchange between teams at Amazon and elsewhere. They established stable interfaces for sharing functionality, regardless of internal systems. This approach formed the basis for Amazon Web Services and influenced global data sharing practices.
  • API features: metadata, documentation, authentication, error handling.
  • The most common type is the REST API ← uses HTTP as the basis for communication.
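An illustrative REST call with the features above (auth header, basic error handling). The endpoint, parameters, and token are placeholders; the real paths, auth flow, and rate limits come from the provider’s API docs:

```python
import requests

# Placeholder endpoint and token - consult the provider's API documentation.
BASE_URL = "https://api.example.com/v1/tracks"
headers = {"Authorization": "Bearer <access_token>"}

response = requests.get(
    BASE_URL,
    headers=headers,
    params={"limit": 50, "offset": 0},  # typical pagination parameters
    timeout=10,
)
response.raise_for_status()             # basic error handling: fail loudly on 4xx/5xx
data = response.json()
print(len(data.get("items", [])), "records fetched")
```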

Lab 1: Batch processing to get data from an API

Streaming Ingestion

Conversation with a Software Engineer

Streaming ingestion details

Kinesis Data Streams Details

Change Data Capture (CDC)

General Considerations for choosing ingestion tools

Lab 2: Streaming ingestion