Data systems as pipelines
Students connect storage, processing, orchestration, streaming, and observability as parts of one architecture rather than as isolated product names.
Washington University in St. Louis
A graduate data engineering course I created for students who need to reason about real systems: warehouses, NoSQL databases, batch and streaming pipelines, orchestration, cardinality and frequency estimation, responsible data use, and the tradeoffs that appear once data stops fitting neatly in a notebook.
Course shape
The course stays practical without tool worship. Tools change. Students learn to ask what each component owns, where the bottlenecks are, and what tradeoffs follow from a design choice.
Assignments ask students to make real tools work, read documentation, debug setup issues, and turn conceptual diagrams into running systems.
The oral midterm and project work emphasize tradeoffs, scaling, failure modes, communication, and explaining why a system is built the way it is.
Topics
Students should leave with a mental model of how data moves through real organizations and how to reason when a tool, scale target, or constraint changes.
S3, data lakes, Snowflake, schema design, cost, performance, and the difference between storing data and making it usable.
Airflow, Spark, shuffles, joins, scheduling, observability, and the operational shape of repeatable data workflows.
Kafka, Flink, Spark Streaming, real-time pipelines, late data, fault tolerance, and the discipline of testing on smaller streams first.
Responsible data use and de-identification; cardinality and frequency estimators such as count-min sketches, HyperLogLog, and historical inverse probability estimators; and a semester-long project from proposal through final paper and code.
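To make one of the frequency estimators named above concrete, here is a minimal count-min sketch in Python. It is an illustration under stated assumptions, not course material: the class name, the default width and depth, and the choice of salted BLAKE2b hashes are all made up for this example. The idea is that each item increments one counter per row, and the estimate is the smallest of those counters, so collisions can inflate a count but never shrink it.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table (illustrative only)."""

    def __init__(self, width=256, depth=4):
        # width and depth are hypothetical defaults chosen for the example;
        # wider tables reduce collisions, more rows reduce the chance of a bad estimate.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting BLAKE2b with the row number.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment the item's counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The minimum across rows is the least-inflated counter,
        # so the result is always >= the true count.
        return min(
            self.table[row][self._index(item, row)] for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch()
    for word in ["kafka", "spark", "kafka", "airflow", "kafka"]:
        sketch.add(word)
    print("kafka:", sketch.estimate("kafka"))
    print("flink:", sketch.estimate("flink"))
```

The tradeoff is the usual one for sketches: memory stays fixed at width times depth counters no matter how many distinct items arrive, in exchange for estimates that may overcount.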
Student voice
I do not want student advice trapped in a spreadsheet. The course advice page shares cleaned, anonymous student responses from previous semesters, grouped by semester and response type.
"Aim past finishing the assignments. Try to leave each topic with a mental picture of what problem that tool solves and why someone would choose it."
Evidence and context
The Spring 2026 numbers above reflect the latest run of CSE 5114. As a full, raw public sample, I share the Fall 2025 evaluation report: one complete section report from the first offering, with the rough edges and praise left in place. The advice page anonymizes student reflections and edits them only for readability.