DE by DL.AI - C3 W2 - Storage Abstractions

Anh-Thi Dinh
👉 List of notes for this specialization

Data Warehouses & Data Lakes

Overview

This week, we focus on the first layer of the storage hierarchy.
 
Week labs:
Simple Data Lake
Build a Data Lakehouse.

Conversation with Bill Inmon

  • Introduction to Bill Inmon: Known as the creator of the data warehouse and a pioneer in the modern data industry. Began programming in 1965 at White Sands Missile Range, New Mexico.
  • Definition of a Data Warehouse: Described as corporate data that consolidates information across various functions (marketing, sales, finance, management) for a unified corporate view.
  • Challenges Before Data Warehousing: Before data warehousing, each application was developed in isolation, which made data integration across departments difficult.
  • Origins of ETL: Inmon helped create ETL (Extract, Transform, Load) processes to automate the collection, transformation, and loading of data into warehouses, reducing the need for manually coded programs.
  • Early Industry Resistance: IBM initially opposed data warehousing, focusing on transaction processing. Inmon’s suggestion of using data beyond transactions led to significant industry criticism.
  • Role of Marketing: Marketing departments were the early adopters and primary supporters of data warehousing, while technical teams were initially skeptical.
  • Development of Textual ETL: Inmon pioneered textual ETL to integrate text data into corporate databases, addressing an area where text was underutilized.
  • Legacy of Ed Yourdon: Inmon highlighted Ed Yourdon’s influence on programming, particularly in introducing structured design and analysis, which brought organization and methodology to early coding practices.

Data Warehouse - Key architectural ideas

A data warehouse is suited to OLAP rather than OLTP workloads (check these terms in C3W1).
Data Warehouse-Centric Architecture
Data Warehouse Implementation. In this figure: MPP (Massively Parallel Processing), Amazon Redshift, Google BigQuery, Snowflake.

Modern Cloud Data Warehouses

MPP Architecture for Amazon Redshift.
When you load data into Redshift, the leader node manages how the data is distributed across the node slices.
When you need to scale → add more nodes or upgrade to a larger node type.
With the support of MPP → a data warehouse can support ELT (Extract, Load, Transform) ← This results in faster ingestion, so you can quickly provide your downstream stakeholders with data (see the sketch below).
Another change we've seen with cloud data warehouses is a shift from row-based to columnar architecture (check C3W1) ← facilitates high-performance analytical queries.
Data is stored in object storage, allowing virtually limitless storage ← separation of compute and storage.
Data warehouses are great for performing analytical workloads on large volumes of structured and semi-structured data brought together from various sources. ← eg: You want to perform fast analytical queries on large volumes of tabular data collected from multiple departments such as marketing, sales, and product.
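
To make the ELT idea concrete, here is a minimal sketch, assuming a hypothetical Redshift cluster, S3 bucket, IAM role, and table names: raw data is loaded first with COPY (which the cluster parallelizes across node slices), and the transformation happens afterwards inside the warehouse.

```python
import redshift_connector  # official Redshift Python driver

# Connect to a hypothetical Redshift cluster (placeholder credentials).
conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="...",  # placeholder
)
cur = conn.cursor()

# Target table must exist before COPY (columns must match the Parquet schema).
cur.execute("CREATE TABLE IF NOT EXISTS raw_sales (region VARCHAR(32), amount FLOAT8);")

# Extract + Load: COPY ingests the S3 files, and the work is fanned out
# across node slices in parallel (hypothetical bucket and role ARN).
cur.execute("""
    COPY raw_sales
    FROM 's3://my-bucket/sales/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
""")

# Transform: runs inside the warehouse after loading (the "T" comes last in ELT).
cur.execute("""
    CREATE TABLE sales_by_region AS
    SELECT region, SUM(amount) AS total_amount
    FROM raw_sales
    GROUP BY region;
""")
conn.commit()
```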

Data Lakes - Key architectural ideas

  • A data lake is a central repository for storing diverse data types (structured, semi-structured, unstructured) without strict schemas, ideal for handling large volumes of data from various sources, a concept that emerged in the 2000s.
  • Data Lake 1.0 → has many disadvantages
    • Central repository for storing large volumes of data
    • No fixed schema or predefined set of transformations
    • Schema-on-read pattern: the reader determines the schema when reading the data (see the sketch after this list)
    • Data swamp: No proper data management; No data cataloging; No data discovery tools; No guarantee of data integrity and quality.
    • Data Manipulation Language (DML) operations were painful to implement
    • Difficult to comply with data regulations (e.g. GDPR in Europe)
    • No schema management and data modeling: Hard to process stored data; Data not optimized for query operations such as joins.
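
A minimal sketch of schema-on-read, assuming a hypothetical events.json file in the lake (reading s3:// paths with pandas requires the s3fs package): the stored bytes carry no enforced schema, so each reader imposes its own types at read time.

```python
import pandas as pd

# The lake stores raw JSON lines with no enforced schema (hypothetical path).
df = pd.read_json("s3://my-data-lake/raw/events.json", lines=True)

# Schema-on-read: each reader imposes its own types on the same bytes.
df["event_time"] = pd.to_datetime(df["event_time"])  # reader A: timestamps
# df["event_time"] = df["event_time"].astype(str)    # reader B: plain strings
```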

Next Generation Data Lakes

  • Use Data Zones to organize data in a data lake. ← Use formats like Parquet, Avro, or ORC to store data.
    • Input → Zone 1 → Zone 2.1 → Zone 2.2 → … → Zone 3 → Output
  • Data partitioning: in each zone (2 or 3), divide a dataset into smaller, more manageable parts based on a set of criteria (e.g. time, date, location recorded in the data) ← to improve query performance (see the sketch after this list)
  • Data catalog → metadata, schema → better search and locate data.
  • Separate Data Lakes & Data Warehouses → expensive solution → need Data Lakehouses
  • Data lakes are great for storing large volumes of unstructured data in a low-cost and scalable way.
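
A minimal sketch of partitioning plus cataloging with awswrangler (the AWS SDK for pandas used in the lab), assuming a hypothetical bucket, Glue database lake_db, and table orders: one call writes partitioned Parquet into a zone and registers the schema in the AWS Glue Data Catalog.

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "amount": [10.5, 20.0, 7.25],
    "order_date": ["2024-01-01", "2024-01-01", "2024-01-02"],
})

# Create the Glue database if it does not exist yet (hypothetical name).
wr.catalog.create_database("lake_db", exist_ok=True)

# Write partitioned Parquet into a lake zone and register the table schema
# in the Glue Data Catalog in a single call.
wr.s3.to_parquet(
    df=df,
    path="s3://my-data-lake/zone-2-cleaned/orders/",
    dataset=True,                   # dataset mode enables partitioning + cataloging
    partition_cols=["order_date"],  # one S3 prefix per date
    database="lake_db",
    table="orders",
)
```

Queries that filter on order_date then read only the matching partitions instead of scanning the whole dataset.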

Lab 1 - Simple Data Lake with AWS Glue

We’re going to use Terraform to perform these tasks.
We are going to process 2 JSON files.
  • awswrangler = AWS SDK for Pandas.
  • Test different cases with compression ← the higher the compression ratio, the slower the read/write speed! (see the sketch below)
    • Snappy optimizes for speed, Gzip for compression ratio.
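
A minimal sketch of the compression trade-off with awswrangler, using hypothetical S3 paths: the same DataFrame written twice, once per codec.

```python
import awswrangler as wr
import pandas as pd

df = pd.DataFrame({"value": range(1_000_000)})

# Same data, two codecs: Snappy favors read/write speed, Gzip favors a
# smaller file at the cost of slower reads and writes.
wr.s3.to_parquet(df=df, path="s3://my-bucket/lab/snappy/", dataset=True, compression="snappy")
wr.s3.to_parquet(df=df, path="s3://my-bucket/lab/gzip/", dataset=True, compression="gzip")
```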

Data Lakehouses

The Data Lakehouse Architecture

  • Databricks was the first company to introduce the notion of a data lakehouse.
  • Zones from Data Lakes + Data Management from Warehouse.
    • Bronze (landing/raw), Silver (cleaned/transformed), Gold (curated/enriched)
  • Supports ACID transactions.

Data Lakehouse Implementation

  • E.g. how Apache Iceberg works (see the sketch after this list).
  • When it comes to choosing between a data warehouse, a data lake, or a data lakehouse, it's really about choosing the right storage abstraction to support your organization's needs.
    • Instead of choosing between Data Lake or Warehouse, we may think of upgrading Data Lake to Data Lakehouse.
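
A minimal sketch of how an Iceberg table could be created and evolved through Athena (hypothetical database, table, and bucket names): the table_type = 'ICEBERG' property is what layers ACID transactions, schema evolution, and time travel on top of plain S3 files.

```python
import awswrangler as wr

# Create an Iceberg table via Athena DDL (hypothetical names).
wr.athena.start_query_execution(
    sql="""
    CREATE TABLE orders (
        order_id   bigint,
        amount     double,
        order_date date
    )
    PARTITIONED BY (order_date)
    LOCATION 's3://my-lakehouse/orders/'
    TBLPROPERTIES ('table_type' = 'ICEBERG')
    """,
    database="lakehouse_db",
)

# Schema evolution: add a column without rewriting the underlying data files.
wr.athena.start_query_execution(
    sql="ALTER TABLE orders ADD COLUMNS (customer_id bigint)",
    database="lakehouse_db",
)
```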

Lakehouse Architecture on AWS

  • Using AWS Lake Formation & Amazon Redshift Spectrum
  • AWS Lake Formation steps:
      1. Identify data sources like Amazon S3, relational DBs, NoSQL DBs.
      2. Use AWS Lake Formation to move data into the data lake.
      3. Use a crawler (Lake Formation) to crawl through the data and catalog it (using the AWS Glue Data Catalog) → get it ready for analytics (metadata).
      4. Manage permissions (using AWS IAM + AWS Glue) ← see the first sketch at the end of this section.
  • AWS Data Lakehouse architecture (an overview)
  • Storage:
    • S3 is used for structured, semi-structured or unstructured data. ← large, structured, and unstructured datasets
    • Amazon Redshift is used for highly curated, structured or semi-structured trusted data that fits into predefined schemas. ← high performance analytics on more structured datasets.
    • It would be nice if you could somehow integrate your data lake with your data warehouse natively → this is where Amazon Redshift Spectrum comes in!
      • Redshift Spectrum allows you to run queries on data stored in S3 without having to load it into Redshift first. ← It’s great because it eliminates the need for complex ETL pipelines to move data between your data lake and your data warehouse. ← preferred way! (see the second sketch at the end of this section)
  • Catalog: Datasets in storage often have evolving schemas and growing partitions. Thus, metadata catalog population requires ongoing maintenance, not a one-time effort.
  • For efficiently managing evolving schemas and large datasets, Apache Iceberg tables complement Lake Formation and Glue's data management and cataloging capabilities.
    • Apache Iceberg tables make it easier to make changes to your data schema without disrupting existing processes or underlying data. ← Schema & Data Versioning.
  • Consumption:
    • By using Redshift Spectrum, you can reduce data latency. ← by querying data in place, you can get insights faster without waiting for data to be moved or transformed.
    • Amazon Athena: query data in S3 directly using standard SQL. It’s serverless (no infrastructure to set up or manage); see the second sketch at the end of this section.
      • It also supports Federated Query ← query data that's outside of S3 (e.g. Redshift)
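
To ground steps 3 and 4 of the Lake Formation flow, here is a minimal boto3 sketch, assuming hypothetical role ARNs, bucket, database, and table names: a Glue crawler populates the Data Catalog, then Lake Formation grants read access on the resulting table.

```python
import boto3

glue = boto3.client("glue")
lakeformation = boto3.client("lakeformation")

# Step 3 (hypothetical names): a crawler scans the S3 data and writes table
# metadata into the AWS Glue Data Catalog.
glue.create_crawler(
    Name="lake-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="lake_db",
    Targets={"S3Targets": [{"Path": "s3://my-data-lake/raw/"}]},
)
glue.start_crawler(Name="lake-crawler")

# Step 4: Lake Formation grants a consumer role read access to the
# cataloged table.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/AnalystRole"},
    Resource={"Table": {"DatabaseName": "lake_db", "Name": "events"}},
    Permissions=["SELECT"],
)
```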
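
And a minimal consumption sketch, again with hypothetical names: the SQL string would run inside Redshift (e.g. via a SQL client or redshift_connector) to register the lake as a Spectrum external schema, while awswrangler queries the same cataloged data through Athena.

```python
import awswrangler as wr

# Redshift Spectrum: this DDL runs inside Redshift and exposes the Glue
# database 'lake_db' as an external schema, so the S3 data becomes
# queryable in place (hypothetical role ARN).
SPECTRUM_DDL = """
CREATE EXTERNAL SCHEMA spectrum_lake
FROM DATA CATALOG
DATABASE 'lake_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
"""

# Amazon Athena: serverless SQL straight against the cataloged S3 data;
# awswrangler returns the result as a pandas DataFrame.
df = wr.athena.read_sql_query(
    sql="SELECT order_date, SUM(amount) AS revenue FROM orders GROUP BY order_date",
    database="lake_db",
)
print(df.head())
```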

Lab 2 - Building a Data Lakehouse with AWS Lake Formation and Apache Iceberg

An overview of what we’re going to do
Check the lecture notes for more details. Below are the main steps.

Review the week