List of notes for this specialization
Lecture notes & Repository.
- Instructor: Joe Reis (the author of "Fundamentals of Data Engineering", available as a free download)
- Home page of Course 1: Introduction to Data Engineering
- DeepLearning.AI community for this course.
- My Github repository for resources in the course.
- Data-Centric AI: The discipline of systematically engineering the data used to build an AI system.
- This program is all about frameworks, principles, and getting you to think like a data engineer, plus building systems on AWS.
- Course 1: Intro to DE
- Course 2: Source Systems, Data Ingestion, and Pipelines.
- Course 3: Data Storage and Queries.
- Course 4: Data Modeling, Transformation, and Serving.
- Intermediate Python, Pandas
- Basic SQL
- Basic AWS Cloud.
- What is unique about this program?
- This program teaches you how to think like a data engineer
- Hands-on practice.
- Textbook: Fundamentals of Data Engineering
- Scenario
- First course → the big picture. The first week is only about how to think like a DE: no labs, no implementation.
- Week 1: High-level look at the field of DE
- DE lifecycle
- History of DE
- The DE among other stakeholders
- Business value
- Translation of stakeholder needs into requirements
- Week 2: DE lifecycle and undercurrents
- Week 3: Principles of good data architecture
- Week 4: Design and build out a data architecture
- Definition (by the author of the book): Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering
→ Your job is to get raw data from somewhere, turn it into something useful, and then make it available for downstream use cases!
- 1960s-1970s: Digital data emerges with computers. Relational databases and SQL are developed.
- 1980s-1990s: Data warehouses and BI tools emerge. Inmon and Kimball introduce data modeling approaches.
- Mid-1990s-Early 2000s: Internet boom drives web app growth. MapReduce and Hadoop revolutionize data processing.
- Late 2000s-2010s: Cloud platforms (AWS, Google Cloud, Azure) transform data applications. Shift to real-time processing and event streaming.
- Present: Data engineering focuses on scalable systems, cloud-first solutions, and technology integration to serve business goals.
- The DE among other stakeholders: two directions (upstream and downstream stakeholders)
- Business Value:
Focus on creating business value in data engineering. Don't chase every new technology. Prioritize solutions that deliver tangible benefits to the organization. Ultimately, business value is the driving force behind technological decisions in our industry. (Bill Inmon's advice)
- System Requirements: before we start writing any code or spinning up resources on the cloud, the most important step is requirements gathering.
(mock conversation between a Data Scientist and a DE)
- DS/DA receive requests from marketing for a real-time dashboard and recommendations, but they lack direct data access. They must process dumped data, 90% of which is irrelevant, and spend 80% of their time on formatting (2 days), which eliminates any real-time capability.
- Continuous data structure changes further delay the process by 2 days.
- An automated process for data formatting and handling is needed to allow data scientists to focus on analysis.
- The DE should clarify marketing's objectives (e.g., "real-time" frequency), identify key requirements, and outline their proposed solution for DS confirmation.
Key Elements of Requirements Gathering
- Learn what existing data systems or solutions are in place.
- Learn what pain points or problems there are with the existing solutions.
- Learn what actions stakeholders plan to take with the data. Tip: Repeat what you learned back to your stakeholders.
- Identify any other stakeholders you'll need to talk to if you're still missing information.
- Thinking like a DE: the steps below form a cycle
- High level mental framework and way of thinking like a data engineer is important for everything that follows
- As a data engineer, the actual set of tools and technologies you work with could be quite different from one company to the next.
- Public cloud: AWS, GCP (Google Cloud Platform), MS Azure.
- Intro to the AWS Cloud
- Pay as you go pricing
- IT Resources
- Advantage of building on cloud
- Cloud resources are scalable and elastic.
- No need to worry about the exact storage capacity needed
- No need to manage the scaling operations.
- AWS data centers are located all around the world → AWS regions (named after the geographic locations where they reside). → AWS Global Infrastructure
- Each region has Availability Zones: if one goes down, there are others. → Regions & Availability Zones
- A region consists of multiple availability zones, and an availability zone contains one or more data centers.
- Example of names:
- us-east-1 = the first one created in the eastern US
- us-east-1a = an availability zone in us-east-1 (Northern Virginia Region)
- To host your applications or data pipelines, you need to choose an AWS region. Consider these four main factors:
- Latency: choose a region close to where your end users are located to minimize latency;
- Cost: the resource costs may differ between regions;
- Compliance: certain regulations may require hosting your data in a specific geographic region;
- Service availability: not all services are available in all regions.
- COMPUTE
- EC2 (Amazon Elastic Compute Cloud): The service that provides virtual machines, or VMs, on AWS.
- instance type naming: t3a.micro (t: family name, 3: generation, a: optional capabilities, micro: size)
- Amazon EC2 Instance types
- Amazon EC2 instance type naming conventions
- Amazon EC2 billing and purchasing options
- Virtual machines or servers, where you can run any operating system and applications (a virtual computer that runs an OS)
- Each virtual "computer" is called an EC2 instance (you can use multiple instances for horizontal scaling)
- EC2 can be used as a dev machine for programming or to run a web server, container, or ML workload.
- AWS Lambda: serverless functions → host code that runs in response to triggers or events (a small sketch of both services follows below).
- Container hosting services: Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS)
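A minimal sketch (not from the course labs) of touching these compute services from Python with boto3; the AMI ID, region, and function body are placeholder assumptions.
import boto3

# Launch one virtual machine; the instance type encodes family (t), generation (3),
# optional capability (a = AMD processor), and size (micro).
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.run_instances(
    ImageId="ami-xxxxxxxxxxxxxxxxx",  # placeholder AMI ID, replace with a real one
    InstanceType="t3a.micro",
    MinCount=1,
    MaxCount=1,
)

# A minimal AWS Lambda handler: Lambda calls this function with the triggering
# event (e.g. an S3 upload notification) and a context object.
def lambda_handler(event, context):
    records = event.get("Records", [])
    return {"statusCode": 200, "body": f"processed {len(records)} records"}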
- NETWORK
- Whenever you create an EC2 instance or many other types of AWS resources, you need to place it into a network of some kind → an Amazon Virtual Private Cloud (VPC)
- VPCs are isolated from other networks.
- You choose the size of the private IP space.
- Partition the space into smaller networks called subnetworks or subnets.
- Your data and resources don't leave the region unless you specifically build your solutions to behave that way
→ Whenever you create certain AWS resources, like EC2 instances or instance-based databases, you need to select which VPC and which AZ you want to place them in
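A hedged sketch of creating this kind of network programmatically with boto3; the CIDR blocks and availability zone are illustrative assumptions, not lab values.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create an isolated VPC and choose the size of its private IP space.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Partition that space into a smaller subnet pinned to one availability zone.
subnet = ec2.create_subnet(
    VpcId=vpc_id,
    CidrBlock="10.0.1.0/24",
    AvailabilityZone="us-east-1a",
)
print(subnet["Subnet"]["SubnetId"])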
- STORAGE: 3 types
- Object Storage: most often used for storing unstructured data (logs, documents, photos, videos... or any kind of data) → Amazon Simple Storage Service (S3)
- Block Storage: used for database storage, virtual machine file systems, and other low-latency environments. → Amazon Elastic Block Store (EBS)
- File Storage: (the most familiar type of storage for non-technical users) data is organized into files and directories in a hierarchical structure (like the file system on your laptop) → Amazon Elastic File System (EFS)
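A small illustrative sketch of object storage: writing and reading one object in S3 with boto3. The bucket and key names are made up for this example.
import boto3

s3 = boto3.client("s3")

# Objects are just bytes stored under a key inside a bucket.
s3.put_object(
    Bucket="my-example-bucket",   # placeholder bucket name
    Key="logs/2024/01/app.log",
    Body=b"2024-01-01T00:00:00Z INFO app started\n",
)

obj = s3.get_object(Bucket="my-example-bucket", Key="logs/2024/01/app.log")
print(obj["Body"].read().decode())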
- DATABASES: use block storage behind the scenes + provide special functionality for managing structured data (complex querying, data indexing, ...). In these courses, you're going to become very familiar with
- Amazon Relational Database Service (RDS) → a cloud-based relational database service
- Amazon Redshift → a data warehouse service that allows you to store, transform, and serve data for end use cases.
- SECURITY (ref) → Shared Responsibility Model → AWS is responsible for security OF the cloud (like a skyscraper equipped with lots of security technology), and you are responsible for security IN the cloud (like having to lock your own door and follow the building's requirements)
IMPORTANT: Don't forget to stop or delete any resources when you are not using them, to avoid getting billed for them.
- EC2 → when an instance is stopped, you only get charged for the EBS volume attached to it.
- Account ID + regions
- Databases → Relational Databases or NoSQL Databases (Key-Value, Document Stores)
- Files → Text, MP3, MP4
- API → request and get back data formatted as .xml, .json, etc. (see the sketch after this list)
- Data Sharing Platform → Internal Data User or Third Party
- IoT devices (Internet of Things) → a "swarm" of IoT devices, streaming data
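A minimal sketch of pulling from an API source system: send a request and get JSON back. The endpoint and parameters are hypothetical placeholders.
import requests

# Hypothetical REST endpoint of a source system
response = requests.get(
    "https://api.example.com/v1/orders",
    params={"updated_since": "2024-01-01"},
    timeout=10,
)
response.raise_for_status()
orders = response.json()   # the payload comes back as JSON
print(len(orders))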
- In the real world, source systems are unpredictable:
- Systems go down
- Change in format/schema of data
- Change in data
- When accessing the source systems:
- How are the systems set up?
- What kinds of changes should you expect?
- It's good to work directly with source system owners to know the following → a good relationship is a crucial part of successful DE
- How they generate data
- How the data may change over time
- How the changes will impact the downstream systems
Data ingestion means moving raw data from source systems into your data pipeline for further processing.
- Source systems and data ingestion represent the biggest bottlenecks of DE. → work with the owners
- Frequency of ingestion: how often you need to move data from source systems into your data pipeline.
- Batch ingestion: in batches, once every hour or day
- Streaming ingestion: ingest data as a constant stream of events in real time. Events like clicks on websites, sensor measurements, ... (a minimal sketch follows below)
- Data is available to downstream systems a short time after it's produced. → use tools like an event-streaming platform or a message queue
- Costs more than batch ingestion: time, money, maintenance, downtime
- Change data capture (CDC): will the source system push data to you, or will you actively pull it from the source?
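A hedged sketch of streaming ingestion: sending one click event to a Kinesis data stream with boto3. The stream name and event shape are assumptions for illustration.
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"user_id": "u-123", "action": "click", "page": "/pricing"}

kinesis.put_record(
    StreamName="user-activity",              # placeholder stream name
    Data=json.dumps(event).encode("utf-8"),  # the event payload
    PartitionKey=event["user_id"],           # decides which shard receives the record
)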
- Raw hardware ingredients:
- Solid-state storage (USB drives, SD cards, SSDs)
- Magnetic disk (HDD): the backbone of modern data storage systems; 2-3x cheaper than solid-state
- RAM (Random Access Memory): faster reads and writes, 30-50x more expensive than solid-state, volatile.
- In most modern architectures, data will pass through: magnetic → solid-state → memory
- Storage systems: as a DE, you work with storage systems like database management systems, object storage like S3, Apache Iceberg, cache/memory-based storage, or streaming storage.
- Storage abstractions: combinations of storage systems arranged into abstractions like a data warehouse, a data lake, or a lakehouse.
Choose configuration parameters: latency, scalability, cost.
- From the bottom to the top: Raw storage ingredients > Storage systems > Storage abstractions.
- Recall: the big picture of DE → get raw data, turn it into something useful, and then make it available to end users.
- Transformation = turn it into something useful!
- DE lifecycle transformation = query, modeling, and transformation.
- Query: issuing a request to read records from a database or other storage systems. In this course, we focus on SQL.
- Poor queries: negative impact on the source database, can cause row explosion, downstream delays, ...
- Data modeling: choosing a coherent structure for your data to make it useful for the business.
- Data transformation: data is manipulated, enhanced, and saved for downstream use (see the sketch below).
- At the data source: e.g., adding timestamps, ...
- At any stage, before / in-flight / after ingestion → e.g., mapping to correct types, standard formats, ...
- Enrich records with additional fields and calculations, ...
- Even downstream: apply large-scale aggregations for reporting or featurize data for ML.
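A tiny pandas sketch of these kinds of transformations (type casting, standardizing a format, adding a timestamp, enriching with a calculated field); the column names and values are made up.
import pandas as pd

raw = pd.DataFrame({
    "order_id": ["1001", "1002"],
    "amount": ["120.50", "89.99"],
    "country": ["us", "vn"],
})

transformed = raw.assign(
    amount=lambda df: df["amount"].astype(float),   # map to the correct type
    country=lambda df: df["country"].str.upper(),   # standardize the format
    ingested_at=pd.Timestamp.now(tz="UTC"),         # add a timestamp
)

# Enrich with a calculated field
transformed["amount_with_tax"] = (transformed["amount"] * 1.08).round(2)
print(transformed)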
- Serving: the final stage of the DE lifecycle.
- Analytics: the process of identifying key insights and patterns within data.
- 3 common forms: business intelligence (BI), operational analytics, embedded analytics.
- BI: explore historical and current business data to discover insights.
- Operational Analytics: monitoring real-time data for immediate action.
- Embedded Analytics (new trend): external or customer-facing analytics. As a DE, your job would be serving real-time and historical data for use in user-facing applications.
- Machine Learning will be treated separately from other serving use cases because it involves additional complexities.
- Reverse ETL (Extract, Transform, Load): take transformed data as well as analytics and perhaps machine learning model output and feed it back into source systems.
DE now encompasses far more than just tools and technologies.
- Clients trust you with their information and private data. DEs must follow a set of principles, protocols, and best practices.
- Principle of Least Privilege: give users or applications access to only the essential data and resources they need, for only the duration required.
- Don't grant or operate with root or superuser permissions when not necessary!
- Data sensitivity (hide digits of credit card numbers, ...). Don't ingest the full data (with sensitive information) into your system in the first place. (A small masking sketch follows below.)
- Security in the Cloud: Identity and Access Management (IAM), Encryption Methods, Networking Protocols.
- Security is also about people! → defensive mindset (be cautious with sensitive data, design for potential attacks).
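A minimal illustration of handling sensitive fields: mask a credit card number so that only the last four digits ever reach downstream systems (the function and sample value are made up).
def mask_card_number(card_number: str) -> str:
    """Keep only the last four digits; mask everything else."""
    digits = card_number.replace(" ", "").replace("-", "")
    return "*" * (len(digits) - 4) + digits[-4:]

print(mask_card_number("4111-1111-1111-1234"))  # ************1234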
- DAMA International provides resources for effective data management. Their DAMA-DMBOK guide is a key reference.
- Data Management: Plans and practices that optimize data value throughout its lifecycle.
- Data Quality: High (accurate, complete, timely) vs Low (inaccurate, incomplete, delayed).
- Data architecture (DA) = a roadmap or blueprint for your data systems.
- Being able to think like an architect will make you more successful in your role as a DE.
- Principles of Good Data Architecture
- Choose common components wisely (CC = components used across your org)
- Plan for failure!
- Architect for scalability
- Architecture is leadership
- Always be architecting (constantly evaluating your systems)
- Build loosely coupled systems
- Make reversible decisions
- Prioritize security (principle of least privilege, zero-trust principle)
- Embrace FinOps (Finance and DataOps/DevOps) → optimize cost and revenue
- DevOps → software development (write & test code) & software deployment teams (deploy and maintain code). → The DevOps movement has resulted in faster release cycles and enhanced quality for software products.
- A similar idea applied to data → DataOps: improves the dev process and the quality of data products. It's a set of cultural habits and practices: Communication & Collaboration, Continuous Improvement, Rapid Iteration.
- DevOps practices grew out of the Agile methodology
- Pillars of DataOps:
- Automation: CI/CD (Continuous Integration & Continuous Delivery) → example: Airflow
- Observability & Monitoring: keep in mind that βEverything fails all the timeβ (Werner Vogels, CTO of AWS) β crucial aspect of the data systems you build
- Incident Response: As a data engineer, you should be proactively finding issues before they are reported to you by other stakeholders in your organization.
→ Goal: provide high-quality data products.
- Pure scheduling: get some specific tasks to run automatically.
- Problem: pure scheduling can't account for dependencies between tasks or react when something fails.
- Orchestration frameworks: Apache Airflow, Dagster, Prefect, Mage (a minimal Airflow sketch follows below).
- Automate pipeline with complex dependencies.
- Monitor pipeline.
- Set up monitoring & alerts.
- Directed Acyclic Graph (DAG)
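A minimal sketch of how an orchestration framework expresses a pipeline as a DAG, using Apache Airflow (2.x style); the task bodies and schedule are placeholder assumptions.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source system")

def transform():
    print("reshape the raw data")

def load():
    print("write the result to storage")

with DAG(
    dag_id="example_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",   # the pure-scheduling part
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Dependencies form a directed acyclic graph: extract → transform → load
    extract_task >> transform_task >> load_task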
- SE: the design, dev, deployment and maintenance of software applications.
- SE becomes DE
- A DE writes much less code than an SE does, but it's more important than ever that you can write great code and that the code you write is of top quality.
- SOURCE SYSTEMS
- Databases:
- Amazon Relational Database Service (RDS): MySQL, PostgreSQL.
- Amazon DynamoDB: serverless NoSQL database options.
- virtually unlimited in their total size
- suited for low-latency access to large volumes of data, like gaming, IoT, mobile apps, and real-time analytics
- flexible schema
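A hedged boto3 sketch of what "serverless NoSQL with a flexible schema" looks like with DynamoDB; the table name, key, and items are invented for illustration.
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-profiles")   # placeholder table with partition key "user_id"

# Flexible schema: items in the same table can carry different attributes.
table.put_item(Item={"user_id": "u-123", "name": "Ada", "favorite_game": "chess"})
table.put_item(Item={"user_id": "u-456", "name": "Linus", "device": "ios"})

item = table.get_item(Key={"user_id": "u-123"}).get("Item")
print(item)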
- Streaming sources:
- Amazon Kinesis Data Streams: set up as a source system streaming real-time user activities from a sales platform log.
- Amazon Simple Queue Service (SQS): handle messages when building your own data pipelines outside of these courses.
- Apache Kafka → Amazon Managed Streaming for Apache Kafka (MSK)
- INGESTION
- From a Database:
- AWS Database Migration Service (DMS): can migrate and replicate data from a source to a target in an automated way.
- AWS Glue (used most in these courses): offers features that support data integration processes.
- From a streaming source: Amazon Kinesis Data Streams, Amazon Data Firehose, Amazon SQS, Amazon MSK.
- STORAGE
- Traditional data warehouse: Amazon Redshift
- Object storage for a data lake: Amazon Simple Storage Service (S3)
→ Combine both: Lakehouse Arrangement (access structured data in your data warehouse and unstructured data in an object storage data lake)
- SERVING → 2 use cases
- Business Intelligence or Analytics
- Amazon Athena, Amazon Redshift: for querying structured and unstructured data. Also work with Jupyter notebooks, ...
- Amazon QuickSight, Superset, Metabase: dashboarding tools
- AI or Machine Learning: serve batch data for model training, and work with vector databases → product recommenders and large language models.
- The undercurrents on AWS range from the more conceptual to the more tools-oriented.
- SECURE
- Identity and Access Management (IAM): set up roles and permissions.
- Amazon Virtual Private Cloud (VPC), Security Groups (Instance level firewalls)
- DATA MANAGEMENT
- AWS Glue, AWS Glue Crawler, AWS Glue Data Catalog → discover, create, and manage metadata for data stored in Amazon S3 or other storage and database systems.
- AWS Lake Formation → centrally manage and scale fine-grained data access permissions.
- DATAOPS
- Amazon CloudWatch: Collects metrics and provides monitoring features for cloud resources, applications and on-premises resources.
- Amazon CloudWatch Logs: Store and analyze operational logs.
- Amazon Simple Notification Service (SNS): Sets up notifications between applications or via text/email that are triggered by events within your system.
- Other data observability tools: Monte Carlo, Bigeye.
- ARCHITECTURE: AWS Well-Architected (a set of principles and practices developed by AWS that can help you build systems with an eye towards operational efficiency, security, scalability, and sustainability)
- SOFTWARE ENGINEERING
- AWS Cloud9 (IDE for devs) hosted on Amazon Elastic Compute Cloud (EC2)
- AWS CodeDeploy (automate code deployment)
- Git, Github.
Make sure to log out of your personal account before practicing the lab in these courses!
- The main goal of this lab is to help you get started interacting with a data pipeline on AWS.
- Pipeline Scenario
- You are a DE working with a retailer that sells scale models of classic cars and other vehicles.
- The customer stores data in a relational database.
- You're asked to build a pipeline to transform the data and serve it to the data analysts on the marketing team.
- Data Modeling (course 4): Transform the data into a structure that is easier to understand and faster to query.
- In general, what we will do:
- Amazon RDS: the source system contains the SQL tables (provided)
- Glue ETL: a tool that allows you to ingest data from the source database and apply transformations on the fly to the ingested data
- Glue job: connects to the RDS database → extracts the raw data + transforms the data by modeling it using the provided star schema, and finally loads the transformed data into AWS object storage in an S3 bucket
- ETL = Extract + Transform + Load
- Glue Crawler: crawl over S3 and write metadata to a data catalog.
- Amazon Athena: query service to retrieve data from S3.
- We can manually create the bottom three resources (Glue ETL, S3, Glue Crawler) using the AWS console, or programmatically create them using Terraform (Infrastructure as Code, IaC). (Given here; we learn more in Course 2.)
- It enables users to define and provision infrastructure using a declarative language, describing components without specifying detailed implementation steps.
- Introduction to HashiCorp Terraform with Armon Dadgar - YouTube
- Jupyter notebook (AWS Cloud9) to perform some DA tasks.
- Cloud9: open the IDE (a VS Code-like environment). Choose the t3.small machine and enable SSH.
- Download the required resources into the IDE (don't forget to "Allow all cookies")
aws s3 cp --recursive s3://dlai-data-engineering/labs/c1w2-187976/ ./
# then install
source scripts/setup.sh
- Database: AWS Console → AWS RDS → Databases → check the "DB identifier", e.g.
de-c1w2-rds
aws rds describe-db-instances --db-instance-identifier de-c1w2-rds --output text --query "DBInstances[].Endpoint.Address"

# returns the endpoint, something like
# de-c1w2-rds.xxxx.us-east-1.rds.amazonaws.com
- Connect the database / Establish the connection to the RDS instance
mysql --host=de-c1w2-rds.xxxx.us-east-1.rds.amazonaws.com --user=admin --password=adminpwrd --port=3306
- Check the database
# Don't forget the semicolon ";"
use classicmodels;
show tables;

# exit the sql environment
exit;
- ETL Process Overview
- Extract: AWS Glue Job retrieves data from the OLTP database in RDS.
- Transform: Glue reshapes data into a star schema, improving readability and query efficiency for analysts. This may involve denormalization and aggregation.
- Load: Transformed data is stored in Amazon S3 as Parquet files, optimized for analytics in data lakes and warehouses.
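To get a feel for the "Load" step, here's a hedged sketch of writing a transformed table to S3 as Parquet with awswrangler; the DataFrame and bucket path are placeholders, since the lab's Glue job does this at scale.
import awswrangler as wr
import pandas as pd

# Tiny stand-in for one transformed star-schema table
fact_orders = pd.DataFrame({
    "order_key": [1, 2],
    "customer_key": [10, 11],
    "order_amount": [120.50, 89.99],
})

# Write it to the data lake as Parquet; the bucket/prefix is a placeholder.
wr.s3.to_parquet(
    df=fact_orders,
    path="s3://my-datalake-bucket/gold/fact_orders/",
    dataset=True,
    mode="overwrite",
)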
- Terraform: init → plan → apply
- plan: Previews infrastructure changes. Terraform analyzes configs, compares desired and current states, and calculates necessary actions.
cd infrastructure/terraform
terraform init
terraform plan
terraform apply
- Check Glue jobs in AWS Glue → ETL jobs → "Runs" tab
# Start the Glue job
aws glue start-job-run --job-name de-c1w2-etl-job | jq -r '.JobRunId'
# returns the JobRunId

# Check the status
aws glue get-job-run --job-name de-c1w2-etl-job --run-id <JobRunId> --output text --query "JobRun.JobRunState"
- In jupyter notebook
# Interact with AWS
import awswrangler as wr

# Interactive widgets
import ipywidgets as widgets
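A hedged usage example of what these imports enable in the notebook: querying the curated data in S3 through Athena via the Glue Data Catalog. The database and table names are placeholders; use the ones the crawler actually created.
# Query the transformed data with Athena
df = wr.athena.read_sql_query(
    sql="SELECT * FROM fact_orders LIMIT 10",
    database="de_c1w2_analytics_db",   # placeholder Glue Data Catalog database
)
df.head()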
- S3 → Buckets →
...-datalake-...