Data systems as pipelines
Students connect storage, processing, orchestration, streaming, and observability as parts of one architecture rather than isolated product names.
Washington University in St. Louis
A graduate data engineering course I created for students who need to reason about real systems: batch and streaming pipelines, warehouses, orchestration, probabilistic sketches, and the tradeoffs that show up once data stops fitting neatly in a notebook.
Course shape
The course is intentionally practical without worshiping tools. Tools change. The important part is learning how to ask what each component is responsible for, where the bottlenecks are, and what tradeoffs follow from a design choice.
Assignments ask students to make real tools work, read documentation, debug setup issues, and turn conceptual diagrams into running systems.
The oral midterm and project work emphasize tradeoffs, scaling, failure modes, communication, and explaining why a system is built the way it is.
Topics
Students should leave with a mental model of how data moves through real organizations and how to reason when a tool, scale target, or constraint changes.
S3, data lakes, Snowflake, schema design, cost, performance, and the difference between storing data and making it usable.
Airflow, Spark, shuffles, joins, scheduling, observability, and the operational shape of repeatable data workflows.
Kafka, Flink, Spark Streaming, real-time pipelines, late data, fault tolerance, and the discipline of testing on smaller streams first.
Sketches and approximation as tools for making scale manageable, with attention to what is gained and what uncertainty remains.
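As one illustration of the sketching tradeoff (a minimal example under stated assumptions, not necessarily one used in the course), a Count-Min sketch counts item frequencies in a fixed-size table instead of keeping one counter per distinct key. The class name, table dimensions, and synthetic stream below are all hypothetical choices made for this example. What is gained is bounded memory; what remains uncertain is that hash collisions can inflate an estimate, although an estimate never falls below the true count.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Fixed-size frequency table. Estimates may overcount, never undercount."""

    def __init__(self, width=64, depth=4):
        # width and depth are illustrative; larger values reduce collision error.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One salted hash per row, so each row acts as a different hash function.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only ever add to a cell, so the minimum across rows
        # is the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


def build_demo():
    # Synthetic stream: 50 users seen 20 times each, plus one heavy hitter.
    stream = [f"user_{i % 50}" for i in range(1000)] + ["user_0"] * 500
    sketch = CountMinSketch()
    for event in stream:
        sketch.add(event)
    return sketch, Counter(stream)


if __name__ == "__main__":
    sketch, truth = build_demo()
    for key in ("user_0", "user_1", "never_seen"):
        print(key, "true:", truth[key], "estimate:", sketch.estimate(key))
```

The table here holds 64 x 4 counters no matter how many distinct keys arrive, which is the point of the technique. The exact estimates depend on how keys happen to collide, so the only property worth relying on is the one-sided error: each estimate is at least the true count.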
Student voice
I do not want student advice trapped in a spreadsheet. The course advice page shares cleaned, anonymous student responses from previous semesters, grouped by semester and response type.
"Do not just aim to finish the assignments. Try to leave each topic with a mental picture of what problem that tool solves and why someone would choose it."
Evidence and context
I publish one representative raw evaluation report for CSE 5114 and keep the larger archive private. The advice page is intentionally anonymized, and responses are edited only for readability.