Data Engineering Foundations in Python Course Syllabus
Full curriculum breakdown — modules, lessons, estimated time, and outcomes.
This course introduces the foundations of data engineering with Python across eight modules and roughly 7.75 hours of material. It begins with environment setup on Google Cloud Platform and an orientation to data team structures and the data engineering lifecycle, then works through data ingestion with pandas and PySpark, dimensional modeling and SQL in BigQuery, workflow orchestration with Airflow, Dagster, and dbt, and data quality practices. It closes with a capstone project: an end-to-end Formula 1 data pipeline you can use in a portfolio.
Module 1: Getting Started
Estimated time: 0.5 hours
- Introduction to data engineering roles and responsibilities
- Understanding team structures in data organizations
- Setting up Google Cloud Platform (GCP) environment
- Reviewing the data engineering lifecycle stages
Module 2: Team Structures
Estimated time: 0.75 hours
- Differences between embedded and centralized data teams
- Role breakdown: Data Engineers, Analysts, and Data Scientists
- Collaboration patterns across data functions
- Strategic alignment of data teams with business goals
Module 3: Data Lifecycle & Cloud Architecture
Estimated time: 1.25 hours
- End-to-end data engineering lifecycle
- Data lakes vs data warehouses
- Cloud architecture patterns: Lambda and Kappa
- Key checkpoints in pipeline development
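The Lambda pattern covered in this module can be sketched in plain Python. This is a conceptual illustration only (the function and field names are made up for the example): a batch layer recomputes a complete view over historical events, a speed layer keeps an incremental view over recent events, and a serving layer merges the two at query time.

```python
# Conceptual sketch of the Lambda architecture pattern.
from collections import Counter

def batch_view(historical_events):
    """Batch layer: full recompute over the historical event log."""
    return Counter(e["user"] for e in historical_events)

def speed_view(recent_events):
    """Speed layer: incremental view over events not yet batch-processed."""
    return Counter(e["user"] for e in recent_events)

def serve(batch, speed):
    """Serving layer: merge the batch and real-time views at query time."""
    return batch + speed

historical = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
recent = [{"user": "a"}]
print(serve(batch_view(historical), speed_view(recent)))
```

The Kappa pattern, by contrast, drops the batch layer entirely and reprocesses everything through the streaming path.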
Module 4: Data Ingestion
Estimated time: 1.5 hours
- Batch vs streaming ingestion methods
- Change Data Capture (CDC) techniques
- API-based and file system data ingestion
- Building ingestion pipelines with pandas and PySpark
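A minimal batch-ingestion sketch with pandas, in the spirit of this module. The source data, column names, and audit column are illustrative, not taken from the course: read a raw CSV extract, normalize headers, and stamp each row with a load timestamp.

```python
import io
import pandas as pd

# Stand-in for a raw CSV extract (e.g. a file landed in a data lake bucket).
raw_csv = io.StringIO("Order ID,Amount\n1,19.99\n2,5.00\n")

def ingest_batch(source):
    df = pd.read_csv(source)
    # Normalize headers to snake_case so downstream SQL models stay consistent.
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    # Record when the batch was loaded — a common ingestion-audit column.
    df["loaded_at"] = pd.Timestamp.now(tz="UTC")
    return df

df = ingest_batch(raw_csv)
print(len(df), list(df.columns))
```

The same shape carries over to PySpark, where `spark.read.csv` replaces `pd.read_csv` and the transformations become DataFrame operations on the cluster.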
Module 5: Data Modeling & SQL
Estimated time: 1 hour
- Dimensional modeling using Kimball methodology
- Writing DDL and DML statements in SQL
- SQL query lifecycle in BigQuery
- Solving real-world SQL challenges
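The DDL/DML and dimensional-modeling topics above can be tried locally before touching BigQuery. This sketch uses the stdlib `sqlite3` module with an illustrative Kimball-style star schema (the table and column names are invented for the example): DDL creates one dimension and one fact table, DML loads rows, and a join rolls revenue up by customer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: one dimension table and one fact table referencing it.
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT
    );
    CREATE TABLE fact_sales (
        sale_id      INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        amount       REAL
    );
""")

# DML: load the dimension, then the facts.
conn.executemany("INSERT INTO dim_customer VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(10, 1, 19.99), (11, 1, 5.00), (12, 2, 7.50)])

# Analytical query: join the fact to its dimension and aggregate.
rows = conn.execute("""
    SELECT c.customer_name, ROUND(SUM(f.amount), 2) AS revenue
    FROM fact_sales f
    JOIN dim_customer c USING (customer_key)
    GROUP BY c.customer_name
    ORDER BY c.customer_name
""").fetchall()
print(rows)
```

BigQuery's SQL dialect differs in details (datasets, partitioning, no `USING` restrictions of note), but the modeling pattern is the same.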
Module 6: Orchestration Tools
Estimated time: 1.5 hours
- Directed Acyclic Graphs (DAGs) in Apache Airflow
- Introduction to Dagster for workflow orchestration
- Using dbt for transformation workflows
- Building and managing end-to-end DAG pipelines
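Airflow and Dagster both model a pipeline as a Directed Acyclic Graph of tasks, where the scheduler runs a task only after its upstream dependencies finish. The core idea can be sketched with the stdlib `graphlib` module, no orchestrator installed (the task names here are illustrative, not an Airflow API):

```python
from graphlib import TopologicalSorter

# Upstream dependencies: task -> set of tasks that must complete first.
deps = {
    "extract": set(),
    "transform": {"extract"},
    "quality_check": {"transform"},
    "load": {"quality_check"},
}

def run(task):
    # Stand-in for real work (an Airflow operator, a dbt model, etc.).
    return f"ran {task}"

# A valid execution order: every task after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
results = [run(t) for t in order]
print(order)
```

In Airflow the same graph would be declared with operators and `>>` dependencies inside a `DAG` object; the scheduler, not your code, decides when each node runs.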
Module 7: Data Quality
Estimated time: 0.75 hours
- Schema validation with Avro and Protobuf
- Implementing data quality checks in pipelines
- Testing and monitoring data integrity
- Integrating dbt for automated testing
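The module covers schema validation with Avro/Protobuf and automated dbt tests; the underlying pattern can be shown with a hand-rolled stdlib sketch (the schema and check names are invented for illustration): validate every row against an expected schema, check uniqueness, and collect violations rather than failing on the first one.

```python
# Expected schema: field name -> required Python type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float}

def check_rows(rows):
    errors = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Schema check: required fields present with the expected types.
        for field, ftype in EXPECTED_SCHEMA.items():
            if not isinstance(row.get(field), ftype):
                errors.append(f"row {i}: bad or missing {field}")
        # Uniqueness check, analogous to dbt's built-in `unique` test.
        if row.get("order_id") in seen_ids:
            errors.append(f"row {i}: duplicate order_id")
        seen_ids.add(row.get("order_id"))
    return errors

rows = [
    {"order_id": 1, "amount": 19.99},
    {"order_id": 1, "amount": 5.0},      # duplicate id
    {"order_id": 2, "amount": "oops"},   # wrong type
]
print(check_rows(rows))
```

In practice these checks live inside the pipeline (e.g. as a task between transform and load), so bad data is quarantined before it reaches consumers.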
Module 8: Capstone & Epilogue
Estimated time: 0.5 hours
- Building an end-to-end Formula 1 data pipeline
- Integrating ingestion, transformation, and orchestration
- Reviewing GCP billing and cost management
Prerequisites
- Familiarity with Python programming
- Basic understanding of SQL
- Access to a Google Cloud Platform account
What You'll Be Able to Do After This Course
- Design and implement data engineering pipelines on GCP
- Use Python, PySpark, and SQL for data processing
- Orchestrate workflows using Airflow and dbt
- Apply data modeling and quality assurance practices
- Build a production-grade data pipeline for portfolio use