Washington University in St. Louis

CSE 5114: Data Manipulation and Management at Scale

A graduate data engineering course I created for students who need to reason about real systems: batch and streaming pipelines, warehouses, orchestration, probabilistic sketches, and the tradeoffs that show up once data stops fitting neatly in a notebook.

Course shape

Students build enough to understand what the abstractions are buying them.

The course is intentionally practical without worshipping tools. Tools change. What matters is learning to ask what each component is responsible for, where the bottlenecks are, and what tradeoffs follow from a design choice.

01

Data systems as pipelines

Students connect storage, processing, orchestration, streaming, and observability as parts of one architecture rather than isolated product names.

02

Hands-on assignments

Assignments ask students to make real tools work, read documentation, debug setup issues, and turn conceptual diagrams into running systems.

03

Design judgment

The oral midterm and project work emphasize tradeoffs, scaling, failure modes, communication, and the ability to explain why a system is built the way it is.

Topics

Modern data engineering, taught as connected decisions.

Students should leave with a mental model of how data moves through real organizations and how to reason when a tool, scale target, or constraint changes.

Storage and warehouses

S3, data lakes, Snowflake, schema design, cost, performance, and the difference between storing data and making it usable.

Orchestration and batch work

Airflow, Spark, shuffles, joins, scheduling, observability, and the operational shape of repeatable data workflows.

Streaming systems

Kafka, Flink, Spark Streaming, real-time pipelines, late data, fault tolerance, and the discipline of testing on smaller streams first.

Probabilistic reasoning

Sketches and approximation as tools for making scale manageable, with attention to what is gained and what uncertainty remains.
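
To make "what is gained and what uncertainty remains" concrete, here is a minimal Count-Min Sketch in Python. It is an illustration under stated assumptions, not course material: the class name, the default width and depth, and the use of salted BLAKE2 hashing are all choices made for this example. What is gained is fixed memory (width × depth counters, no matter how many distinct items arrive). What remains uncertain is the size of the overestimate, since hash collisions can inflate a count but never shrink it.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter with fixed memory (illustrative sketch)."""

    def __init__(self, width=256, depth=4):
        # width and depth are assumed defaults; larger values reduce error.
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Salting the hash with the row number gives one hash function per row.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Each item increments exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only add to counters, so the minimum across rows is
        # the tightest available upper bound on the true count.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch(width=64, depth=4)
    events = ["login"] * 50 + ["purchase"] * 5 + [f"page-{i}" for i in range(200)]
    for event in events:
        sketch.add(event)
    # Estimates are never below the true counts (50 and 5); they may be above.
    print("login:", sketch.estimate("login"))
    print("purchase:", sketch.estimate("purchase"))
```

Shrinking `width` makes the tradeoff visible: memory drops, and the estimates for rare items drift further above their true counts.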

Student voice

Past students have practical advice for future students.

I do not want that advice trapped in a spreadsheet. The course advice page shares cleaned, anonymous student responses from previous semesters, grouped by semester and response type.

"Do not just aim to finish the assignments. Try to leave each topic with a mental picture of what problem that tool solves and why someone would choose it."

Evidence and context

A representative public record, with student privacy kept in mind.

I publish one representative raw evaluation report for CSE 5114 and keep the rest of the archive private. The advice page is intentionally anonymized, and responses are edited only for readability.