DE by DL.AI - Course 1: Introduction to DE

Anh-Thi Dinh
draft
⚠️
This is a quick & dirty draft, for me only!

Information

Introduction to Data Engineering

  • Data-Centric AI: The discipline of systematically engineering the data used to build an AI system.
  • This program is all about frameworks and principles, getting you to think like a data engineer + building systems on AWS.
  • Program:
      1. Course 1: Intro to DE
      2. Course 2: Source Systems, Data Ingestion, and Pipelines
      3. Course 3: Data Storage and Queries
      4. Course 4: Data Modeling, Transformation, and Serving
  • Prerequisites
    • Intermediate Python, Pandas
    • Basic SQL
    • Basic AWS Cloud
  • What is unique about this program?
    • This program teaches you how to think like a data engineer
    • Hands-on practice.
  • Scenario
    • Most development focuses only on the last stage → wastes time and is less effective
  • First course → the big picture. First week is only about how to think like a DE: no labs, no implementation.
  • Plan for course 1
    • Week 1: High-level look at the field of DE
      • DE lifecycle
      • History of DE
      • The DE among other stakeholders
      • Business value
      • Translation of stakeholder needs into requirements
    • Week 2: DE lifecycle and undercurrents
    • Week 3: Principles of good data architecture
    • Week 4: Design and build out a data architecture
  • Software Engineering (SE) → DE
  • Definition (by the author of the book): Data engineering is the development, implementation, and maintenance of systems and processes that take in raw data and produce high-quality, consistent information that supports downstream use cases, such as analysis and machine learning. Data engineering is the intersection of security, data management, DataOps, data architecture, orchestration, and software engineering
    • → Your job is to get raw data from somewhere, turn it into something useful, and then make it available for downstream use cases! (See the minimal sketch at the end of this list.)
  • DE lifecycle
    • Data Pipeline
  • History of DE
    • 1960s: The advent of computers marks the beginning of digital data. Computerized databases are introduced.
    • 1970s: Relational databases emerge, leading to the development of SQL (Structured Query Language) by IBM.
    • 1980s: The first data warehouse is developed by Bill Inmon, enabling data transformation for analytical decision making.
    • 1990s: Dedicated tools and data pipelines for reporting and business intelligence are developed. Data modeling approaches for analytics, such as Ralph Kimball and Bill Inmon's approaches, are introduced.
    • Mid-1990s: The Internet goes mainstream, leading to the growth of web applications and the need for backend systems like servers, databases, and storage solutions.
    • Early 2000s: The dotcom boom and subsequent bust highlight the need for handling large volumes of data. Google's publication on MapReduce inspires the development of Apache Hadoop by Yahoo, revolutionizing data technologies.
    • Late 2000s: Amazon creates Amazon Web Services (AWS), offering scalable computing and storage solutions. Public cloud platforms like AWS, Google Cloud, and Microsoft Azure become popular, transforming the way data applications are developed and deployed.
    • 2010s: The transition from batch computing to event streaming enables handling real-time data. The term "big data" loses momentum as data processing becomes more accessible and every company aims to derive value from their data.
    • Present: Data engineering plays a crucial role in building powerful, scalable data systems using tools and technologies developed by pioneers. Cloud-first, open-source, and third-party products simplify working with data at scale. Data engineering is increasingly focused on interoperation and connecting technologies to serve business goals.
  • The DE among other stakeholders: 2 directions (upstream and downstream)
  • Business Value:
    • I'm going to give them the same advice as if they were a bank robber. Go to where the money is. If you want to have long-term, great success in our industry, find business value. Don't get hung up on every technology that comes out. Every newfangled thing that comes out, go to where there's business value. Because at the end of the day, business value drives everything we do in technology. — (Bill Inmon's advice)
  • System Requirements: Before we start writing any code or spinning up resources on the Cloud
    • The most important step is Requirements Gathering
      Know how to translate high-level goals into requirements
  • Requirements Gathering Conversation (mock conversation between a Data Scientist and a DE)
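
A minimal sketch of the raw-data-in → useful-information-out idea from the definition above: ingest raw data from a source, transform it into consistent information, and serve it for downstream analysis or ML. The file names, columns, and transforms here are hypothetical (not from the course); it assumes pandas plus pyarrow for the parquet output.

```python
# Minimal sketch only: file names, columns, and transforms are hypothetical.
import pandas as pd

def ingest(path: str) -> pd.DataFrame:
    """Pull raw data out of a source system (here: a local CSV file)."""
    return pd.read_csv(path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Turn raw records into high-quality, consistent information."""
    df = raw.drop_duplicates()
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df = df.dropna(subset=["order_id", "order_date"])
    df["amount"] = df["amount"].astype(float)
    return df

def serve(df: pd.DataFrame, path: str) -> None:
    """Make the result available for downstream use cases (analytics, ML)."""
    df.to_parquet(path, index=False)  # needs pyarrow or fastparquet installed

if __name__ == "__main__":
    serve(transform(ingest("raw_orders.csv")), "clean_orders.parquet")
```

Splitting the flow into ingest / transform / serve mirrors the lifecycle stages above and keeps each step easy to test and swap out.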

Data Engineering on the Cloud