The scenario
Right now, as you read this, someone is vandalizing Wikipedia and someone else is fixing a typo. Across all languages, Wikipedia absorbs dozens of edits every second, and almost all of them are made in good faith. A steady trickle are not: subtle hoaxes, sneaky vandalism, paid promotional edits, the occasional blank-the-whole-page.
Students join the team that guards the record. Their weapon is a language model that can read an edit the way a human reviewer would and judge whether it smells like vandalism. The catch is that the model is slow and rate-limited: each call takes about a second, and the free tier allows only a handful of requests per minute, while the edit stream runs roughly a hundred times faster. Pipe every edit into the model and you are more than an hour behind after the first minute. The assignment is to build a two-tiered bouncer: a fast, cheap filter in Flink that handles the full firehose and resolves the obvious cases instantly, escalating only the genuinely ambiguous edits to the expensive model.
That cheap-filter-then-expensive-model cascade is exactly how real infrastructure works. Cloudflare drops or allows the vast majority of web requests with cheap edge rules and escalates only the ambiguous remainder to heavier machine learning. Wikipedia's own anti-vandalism bots score edits with a fast model before anything reaches a person. Students are building a real, enterprise-grade streaming archetype on live data.
The aha that the assignment is built around
Every beginner reaches for the same obvious rules, and every one of them is wrong on its own. Flag all anonymous edits? Most anonymous edits are good-faith. Flag all large deletions? Most large deletions are legitimate cleanup and merges. Flag the bots? The bots are mostly the good guys. No single cheap rule separates good from bad.
So the cheap tier's job is not to judge. It is to triage: cheaply resolve the obvious majority, and spend the precious model only on the genuinely ambiguous edges. Internalizing that reframing is the whole point of the assignment, and it is the kind of insight a student cannot get by pasting the prompt into a chatbot, because the naive rules are precisely what the chatbot will reach for first.
The architecture and the constraints
The pipeline runs in five parts, each with a real engineering constraint behind it. Edits arrive from Kafka. A Flink job keys the stream by editor, keeps per-user state such as a rolling edit count, and runs a triage decision on every element. Borderline edits, and only borderline edits, are sent to the model. Every escalation is written to an append-only audit log, one JSON object per line, and a separate batch job reads that log the next morning to report what happened.
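To make the triage step concrete, here is a minimal, framework-free sketch in plain Python. It is not the assignment's reference solution: the `Edit` fields, the `Triage` class, the thresholds, and the "two of three signals" rule are all hypothetical choices made for illustration. In a real Flink job the per-editor deque would live in keyed state, and `expire_idle` stands in for what timers or state time-to-live would do.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass
class Edit:
    user: str           # editor name, or IP address for anonymous edits
    is_anonymous: bool
    bytes_delta: int    # negative means content was removed
    timestamp: float    # event time in seconds


class Triage:
    """Per-editor rolling edit count plus a cheap pass/escalate decision."""

    def __init__(self, window_s=60.0, burst=5, big_delete=-2000):
        # All three thresholds are illustrative assumptions, not given values.
        self.window_s = window_s
        self.burst = burst
        self.big_delete = big_delete
        self._recent = defaultdict(deque)  # user -> timestamps inside the window

    def rolling_count(self, edit):
        q = self._recent[edit.user]
        q.append(edit.timestamp)
        # Evict timestamps that have slid out of the rolling window.
        while q and q[0] <= edit.timestamp - self.window_s:
            q.popleft()
        return len(q)

    def decide(self, edit):
        # No single signal is trusted on its own; escalate only when
        # at least two weak signals agree.
        signals = (
            edit.is_anonymous,
            edit.bytes_delta <= self.big_delete,
            self.rolling_count(edit) >= self.burst,
        )
        return "escalate" if sum(signals) >= 2 else "pass"

    def expire_idle(self, now):
        # Drop editors with no edits inside the window so state stays bounded.
        idle = [u for u, q in self._recent.items() if q[-1] <= now - self.window_s]
        for u in idle:
            del self._recent[u]
        return len(idle)
```

With these made-up thresholds, an anonymous edit that removes 5,000 bytes is escalated, while the same deletion from a logged-in editor passes. Whether that is the right trade-off is exactly the kind of decision students have to defend.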
The constraints are where the learning lives:
- Budget. The model sustains only about 15 calls per minute while the stream delivers roughly 1,500 edits per minute, so students can escalate on the order of one percent of traffic. The triage threshold is a design decision they make and defend, not a value I hand them.
- Resilience. The free tier pushes back with rate-limit and server-overload errors. Students implement retry with exponential backoff and random jitter so that the pipeline never crashes and never drops a flagged edit; after a bounded number of tries it gives up gracefully with a safe "uncertain" verdict.
- Bounded state. Tracking a per-user count forever would eventually exhaust memory, so students reason about state expiration with timers or time-to-live, the way a production stream processor has to.
- Auditability. Security decisions must be reviewable, so the audit log captures the features, the model's label, its confidence, and its stated reason for every escalation.
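The resilience and auditability constraints above can be sketched together in a few lines of standard-library Python. Everything named here is an assumption for illustration: `TransientModelError` stands in for whatever rate-limit or overload exception the real client raises, `call_model` is any callable that returns a dict with `label`, `confidence`, and `reason`, and the retry limits are placeholders. The jitter variant shown sleeps for a random fraction of the exponential delay.

```python
import json
import random
import time


class TransientModelError(Exception):
    """Hypothetical stand-in for rate-limit or server-overload errors."""


def classify_with_retry(call_model, features, max_tries=5, base_delay=1.0,
                        max_delay=30.0, sleep=time.sleep, rng=random.random):
    """Call the model, retrying transient failures with backoff and jitter."""
    for attempt in range(max_tries):
        try:
            return call_model(features)
        except TransientModelError:
            if attempt == max_tries - 1:
                break  # out of tries; fall through to the safe verdict
            # Exponential backoff capped at max_delay, scaled by random jitter.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * delay)
    # Give up gracefully instead of crashing or dropping the edit.
    return {"label": "uncertain", "confidence": 0.0,
            "reason": "model unavailable after retries"}


def append_audit(path, features, verdict):
    """Append one JSON object per line to the audit log."""
    record = {"features": features, **verdict}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Passing `sleep` and `rng` as parameters is a design choice that keeps the retry logic testable without real waiting. Because the fallback verdict is written to the log like any other, the next morning's batch job can count how often the model tier gave up.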
Two different AIs, and a policy about both
This assignment runs at what I call an AI-collaborative level, and it deliberately puts two distinct AIs in the room. The first is the student's coding assistant, the co-pilot that helps write PyFlink. Using it is encouraged, because real data engineers lean on coding agents to navigate complex distributed streaming APIs, and I want students building that habit. The second is the language model the pipeline itself calls to classify edits. One is a tool you build with; the other is a component you build into the system. Keeping them straight is part of the work.
The policy on the coding assistant is simple: you are the lead engineer, and you must deeply understand, independently explain, and critically evaluate every line of code you submit. If you could not walk a TA through why a line is there, it is not done. A one-shot copy-paste of the prompt is both prohibited and ineffective, because the deliverables are run-specific artifacts from your own live data, and because the assistant routinely hallucinates streaming configurations and APIs. I guarantee real understanding underneath the AI assistance the same way I do everywhere: with grading that rewards mastery and an oral defense where the chatbot simply is not in the room.
The question underneath the code
The final report includes an accuracy check: of the edits the model called vandalism, how many did Wikipedia actually revert? That comparison turns the assignment into an argument. Where the model's judgment disagrees with what human editors actually did, who was right? And what does that tell you about deploying a language model as an autonomous moderator versus keeping a human in the loop? When to trust an AI to decide, and when a person must stay involved, is a question these students will face for the rest of their careers.
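The accuracy check amounts to measuring precision against reverts. A minimal sketch, assuming each audit record carries a hypothetical `rev_id` field alongside the model's `label`, that the label value is the string "vandalism", and that the set of reverted revision ids has been gathered separately:

```python
def revert_precision(records, reverted_ids):
    """Share of edits labeled vandalism that Wikipedia actually reverted."""
    flagged = [r for r in records if r["label"] == "vandalism"]
    if not flagged:
        return None  # nothing was flagged, so precision is undefined
    hits = sum(1 for r in flagged if r["rev_id"] in reverted_ids)
    return hits / len(flagged)


# Illustrative records only; these are not real revisions or results.
sample = [
    {"rev_id": 1, "label": "vandalism"},
    {"rev_id": 2, "label": "vandalism"},
    {"rev_id": 3, "label": "ok"},
]
print(revert_precision(sample, reverted_ids={1}))  # 0.5
```

The number only opens the argument. A flagged edit that was never reverted may be a model error, or vandalism that human editors missed.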
The assignment then asks them to scale the design in writing: how to partition the stream across many workers, how to keep a single hot key such as a busy bot from swamping one worker, how to keep distributed state bounded, and how to keep the expensive model tier from falling behind. There is no single right answer; I am looking for sound systems reasoning and explicit trade-offs. The same cascade generalizes well past Wikipedia, to fraud detection, content moderation, network security, and IoT anomaly detection, which is the real reason it is worth building once by hand.
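As one illustration of the hot-key question, and not the expected answer, a common option is key salting: known-hot editors get a small suffix appended to their key so their edits spread across several workers. The function names, the salt count, and the use of CRC32 as the partitioner are assumptions for this sketch; a real stream processor uses its own key-group hashing.

```python
import zlib


def salted_key(user, event_id, hot_users, n_salts=8):
    """Spread a hot editor's events across n_salts sub-keys."""
    if user not in hot_users:
        return user  # ordinary editors keep a single key and a single state
    salt = zlib.crc32(event_id.encode("utf-8")) % n_salts
    return f"{user}#{salt}"


def partition_for(key, n_workers):
    """Deterministic stand-in for the engine's key-to-worker assignment."""
    return zlib.crc32(key.encode("utf-8")) % n_workers
```

The trade-off is explicit: once a hot editor's edits are split across sub-keys, no single worker sees that editor's full rolling count, so a second stage has to merge the partial counts. Whether that extra hop is worth it is part of the reasoning the write-up should defend.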
Related
This assignment is part of CSE 5114: Data Manipulation and Management at Scale, and it is a concrete instance of the approach in Teaching Computer Science in the Age of AI. The assessment ideas it leans on are written up in Multiplicative Grading and Scaling Oral Exams. Fittingly, the assignment itself was built with heavy AI assistance and a lot of human direction and review.