The Death of model.fit(): What Data Scientists Actually Do in the Age of AI Agents

A practitioner's view on finding your footing when there's no model to train.

Koren Gast

A few months ago, I joined a team building two AI-agent products:

  • Monday Magic, which turns natural language into fully configured monday.com solutions
  • Monday Vibe, which generates entire web applications from a text description

Both had already shipped. Neither had ever had a data scientist on the team.

My first week, I opened a Jupyter notebook out of habit. Then I closed it. There was no training set, no features to engineer, no model.fit(X_train, y_train) waiting to be called. The agents orchestrated foundation models. The “intelligence” came from a model someone else trained. The entire codebase was TypeScript. No notebooks, no model, no Python. The toolbox I’d spent years filling was, on its surface, irrelevant.

So what, exactly, was I supposed to do?

The answer turned out to be hiding in a simple framework. 

Every AI agent has three layers. The foundation model provides raw intelligence. The engineering provides the body: tools, APIs, orchestration, and product surfaces. But the behavior of the agent – what it actually does when a user shows up – is shaped by the context, prompts, policies, schemas, and guardrails that surround the model. That’s the brain of the system. Not the neural network itself, but the cognitive architecture built on top of it.

Someone needs to own the quality of that brain; to make it legible, to understand its failure modes, measure its consistency, map its weaknesses, and create the feedback loops that systematically make it smarter. That someone, it turns out, is the data scientist. Not as a model trainer, but as the team’s methodologist.

The question that drives this work is deceptively simple: “Is this system actually getting better, or are we just shipping changes?”

The Empty Notebook Is the Point

For a decade, the center of gravity in applied data science was the supervised learning workflow: curate datasets, engineer features, tune hyperparameters, deploy. The model.fit() call was the culmination of all that thinking: the moment where your understanding crystallized into a mathematical artifact.

In agentic systems, that moment doesn’t exist. You’re not training anything. You’re orchestrating pre-trained foundation models that already “know” far more than any model you could train on your company’s data. The agent’s behavior emerges from the interaction between prompts, tool definitions, retrieval contexts, structured output schemas, and orchestration logic – what’s increasingly being called the agent’s context architecture. When the system fails (and it fails in fascinating, compounding ways), the failure isn’t in a loss curve. It’s in the trace.

This is disorienting if your identity is tied to training models. It’s liberating if your identity is tied to understanding systems through data.

The Real Problem Isn’t Building Agents. It’s Knowing If They Work

Here’s something that doesn’t get said enough: a first version of an agent product can absolutely be an engineering-only effort. Talented AI engineers can wire together an orchestrator, connect tools, define structured output schemas, and ship a working prototype in days. The barrier to entry for building intelligent features has collapsed. When a team is racing toward a v1, a research-minded data scientist hovering over the shoulder asking “but how will we measure this?” can genuinely slow things down.

The inflection point comes after launch. Once real users show up with real expectations, and the team shifts from “make it work” to “make it work reliably,” the game changes completely. You’re no longer building a demo. You’re maintaining a system that users depend on, and every change you make could break something that was working yesterday.

This is where the barrier to reliability skyrockets.

Autonomous agents introduce a level of non-determinism that traditional software engineering never had to cope with. A seemingly innocent change to a system prompt can trigger cascading regressions in the agent’s reasoning. Adding a new tool to the agent’s repertoire can cause it to abandon a previously used tool. Tweaking a retrieval strategy can surface the right documents but cause the agent to ignore them.

Here’s the thing: these failures are not random. They follow patterns. But you can only see those patterns if you have the methodological discipline to look for them systematically, not through ad-hoc spot-checking, but through structured evaluation.

This is where most agent teams hit a wall. The skills required – systematic measurement, experimental design, statistical rigor – aren’t optional. They’re the difference between a team that ships confidently and one that ships hopefully. Whether that skillset lives in a data scientist, an ML engineer, or a particularly methodical product manager matters less than whether it exists on the team at all.

Evaluation-Driven Development: Your New Training Loop

Anthropic’s guide on evaluating AI agents provides a useful starting vocabulary: tasks, trials, graders, transcripts, and evaluation harnesses. What the guide doesn’t fully address, and what matters most in practice, is how to build the feedback loop between evaluation results and development decisions. The vocabulary is a starting point, not a methodology.

This is the new model.fit() – the evaluation harness is the development feedback loop.

In Evaluation-Driven Development, evaluation isn’t a terminal testing phase you bolt on before launch. It’s a continuous governing function that drives every decision in agent development. You encode intended behavior as tasks and rubrics. You treat prompts, tool schemas, and orchestration logic as versioned artifacts. You run them against your evaluation suite. You measure. You iterate. The eval suite serves as the functional specification for the agent system.
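The loop above can be sketched in a few lines. This is a minimal illustration, not a production harness: `toy_agent` is a stub standing in for the real orchestrator, and the task names and graders are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """One eval case: an input plus a grader that scores the agent's output."""
    name: str
    user_input: str
    grader: Callable[[str], bool]  # deterministic pass/fail check

def run_suite(agent: Callable[[str], str], tasks: list[Task]) -> dict:
    """Run every task against the agent and return an aggregate report."""
    results = {t.name: t.grader(agent(t.user_input)) for t in tasks}
    return {"pass_rate": sum(results.values()) / len(results), "results": results}

# Stub agent standing in for the real (prompt + tools + orchestration) system.
def toy_agent(user_input: str) -> str:
    return '{"board": "CRM", "columns": ["status", "owner"]}'

# The tasks encode intended behavior; the suite is the functional spec.
tasks = [
    Task("creates_board", "Set up a CRM", lambda out: '"board"' in out),
    Task("adds_owner_column", "Set up a CRM", lambda out: '"owner"' in out),
]
report = run_suite(toy_agent, tasks)
```

Every prompt or orchestration change is then a candidate you run through `run_suite` before merging, exactly like re-running a validation set after changing a model.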

This maps cleanly onto classical ML, just shifted into a different space. Instead of optimizing a loss function over numerical features, you’re optimizing agent behavior over natural language space. Instead of gradient descent, you use programmatic prompt optimization. Instead of a confusion matrix, you have an error taxonomy. Instead of a validation set, you have a golden dataset.

The intellectual muscles are the same: measurement, sampling, statistical inference, bias detection, and error analysis. The application surface has changed entirely.

What the Work Actually Looks Like

Let me get concrete, because the abstraction can obscure how deeply technical this work really is.

Error taxonomies. When an agent that generates application code produces a broken UI, “it didn’t work” is not a diagnosis. The data scientist’s job is to systematically categorize how it didn’t work. Was it a tool selection error? A parameter error? A schema violation? A reasoning failure? A premature conclusion?

Building this taxonomy means manually reviewing dozens, eventually hundreds, of agent traces. It’s qualitative research applied to AI systems. The output transforms vague “it feels worse” into precise, actionable categories. Once you have the taxonomy, you can construct transition matrices showing exactly where in the agent’s workflow specific failure types cluster. Engineers stop guessing which prompt to tweak and start making targeted fixes.
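The transition-matrix idea reduces to a cross-tabulation over annotated traces. A minimal sketch, with hypothetical workflow steps and failure categories standing in for a real taxonomy:

```python
from collections import Counter

# Hypothetical annotations: (workflow step, failure category) per failed trace.
labeled_failures = [
    ("tool_selection", "wrong_tool"),
    ("tool_selection", "wrong_tool"),
    ("tool_call", "bad_parameter"),
    ("tool_call", "schema_violation"),
    ("final_answer", "premature_conclusion"),
]

def failure_matrix(failures: list[tuple[str, str]]) -> dict:
    """Cross-tabulate workflow step x failure type to show where errors cluster."""
    counts = Counter(failures)
    steps = sorted({s for s, _ in failures})
    types = sorted({t for _, t in failures})
    return {s: {t: counts.get((s, t), 0) for t in types} for s in steps}

matrix = failure_matrix(labeled_failures)
```

Read down a row and you see which failure types dominate a given step; that is the table that turns “it feels worse” into “tool selection regressed.”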

Golden datasets. The cornerstone of offline evaluation is a curated collection of representative inputs paired with quality criteria. This is not a one-time artifact. It’s a living dataset that must evolve with the product. A critical pitfall is relying exclusively on clean, synthetic test cases. Agents that ace sanitized inputs routinely fail when confronted with real users who provide ambiguous instructions, reference disjointed contexts, or combine contradictory requests in a single prompt. (Users are creative. Frustratingly, beautifully creative.) The data scientist continuously mines production traces to identify actual failures, annotates them with domain experts, and integrates these edge cases back into the golden dataset.
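The “living dataset” discipline can be made concrete with a small schema. The field names and the promotion helper here are illustrative, but the two ideas they encode are the ones that matter: every case carries its provenance (synthetic vs. production), and annotated production failures get promoted into permanent cases.

```python
from dataclasses import dataclass, field

@dataclass
class GoldenCase:
    case_id: str
    user_input: str
    quality_criteria: list  # rubric items a correct output must satisfy
    source: str             # "synthetic" vs "production": track provenance
    tags: list = field(default_factory=list)

def promote_failure_to_golden(dataset, trace, annotation):
    """Turn an annotated production failure into a permanent regression case."""
    dataset.append(GoldenCase(
        case_id=f"prod-{len(dataset):04d}",
        user_input=trace["user_input"],
        quality_criteria=annotation["criteria"],
        source="production",
        tags=annotation.get("tags", []),
    ))
    return dataset

golden = [GoldenCase("syn-0001", "Build me a CRM", ["has board"], "synthetic")]
# A real user's ambiguous request, annotated with a domain expert:
golden = promote_failure_to_golden(
    golden,
    {"user_input": "make the thing from before but blue-ish"},
    {"criteria": ["resolves ambiguous reference", "applies color"],
     "tags": ["ambiguous"]},
)
```

Tracking `source` lets you report pass rates on synthetic and production-mined cases separately, which is how you catch the “aces sanitized inputs, fails real users” gap.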

LLM-as-judge calibration. When you need to evaluate open-ended outputs (the quality of a generated application, the appropriateness of a solution’s architecture, the faithfulness of a response to user intent), traditional deterministic metrics fall short. The industry has widely adopted frontier models as automated evaluators. But indiscriminately delegating evaluation to another LLM introduces its own risks: model bias, sensitivity to evaluation prompt phrasing, and scoring drift over time. The data scientist’s job is to engineer precise evaluation rubrics (not “is this good?” but robust scales for measures like user intent alignment, delightfulness, and structural correctness) and continuously measure the statistical correlation between the automated judge’s annotations and the human experts’ (think Cohen’s Kappa for inter-rater agreement, not just raw accuracy). When they drift apart, you treat the judge as a model that needs recalibration. Evaluating the evaluator!
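The calibration check itself is a few lines of statistics. A sketch of Cohen’s Kappa over paired human/judge labels, with made-up labels for illustration:

```python
from collections import Counter

def cohens_kappa(human: list, judge: list) -> float:
    """Chance-corrected agreement between human and LLM-judge labels."""
    assert len(human) == len(judge) and human
    n = len(human)
    # Observed agreement: fraction of cases where the two raters match.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Expected agreement under chance, from each rater's label distribution.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

human_labels = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "pass"]
kappa = cohens_kappa(human_labels, judge_labels)  # 2/3 on this toy sample
```

Recomputing kappa on a fresh annotation batch each sprint is the drift alarm: when it falls below whatever threshold you’ve calibrated against, the judge prompt goes back into development.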

Structured output correctness. Agents that generate structured outputs (JSON schemas, API calls, database queries, application configurations) offer a gift to the evaluator: their output format is contractual. Schema adherence, required field presence, type correctness, and enum validity: these can all be checked with deterministic graders. This is where the data scientist builds Level 1 evaluations (pure code assertions) that run in CI/CD and gate every pull request. When a prompt change causes the agent to start omitting a required field in 3% of cases, the regression gate catches it before any user is affected.
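A Level 1 grader is exactly what it sounds like: plain assertions over the output contract. The schema below (board name, column types) is a hypothetical stand-in for a real product schema, but the shape of the grader is the point: deterministic, fast, and safe to run on every pull request.

```python
import json

# Illustrative contract: required fields and an allowed enum.
REQUIRED_FIELDS = {"board_name": str, "columns": list}
ALLOWED_COLUMN_TYPES = {"status", "text", "date", "person"}

def grade_structured_output(raw: str) -> list:
    """Level 1 grader: pure code checks that can gate a PR in CI.
    Returns a list of violations; empty means the output passes."""
    violations = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in data:
            violations.append(f"missing_field:{field_name}")
        elif not isinstance(data[field_name], expected_type):
            violations.append(f"wrong_type:{field_name}")
    for col in data.get("columns", []):
        if col.get("type") not in ALLOWED_COLUMN_TYPES:
            violations.append(f"invalid_enum:{col.get('type')}")
    return violations

good = '{"board_name": "CRM", "columns": [{"type": "status"}]}'
bad = '{"columns": [{"type": "emoji"}]}'
```

Run this over the golden dataset on every change and a prompt tweak that starts dropping a required field shows up as a failing check, not a user complaint.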

The Sprint Velocity Trap

There’s a pattern I’ve seen in fast-moving AI teams: the sprint velocity trap. The team ships prompt changes, adds new tools, and restructures orchestration logic, all at impressive speed. But without systematic measurement, nobody can distinguish between changes that improved quality, changes that were neutral, and changes that introduced subtle regressions that won’t surface until a particular class of user input triggers them.

The data scientist’s role here isn’t to slow the team down. It’s to make sure the team knows where it is. Every change produces data. Every failure, examined carefully, informs the next iteration. Evaluation results don’t just validate; they drive the roadmap. When your error taxonomy shows that 40% of failures in a text-to-app agent stem from incorrect tool parameter generation, that’s not just a quality metric. It’s a strategic signal about where engineering effort will have the highest return.

That’s the “research mindset as strategic quality function”: hypothesis-driven iteration that prevents organizations from confusing shipping with improving.

Hamel Husain put it simply: “Teams that succeed barely talk about tools. They obsess over measurement and iteration.” His observation that “evals are just data science applied to AI” is both validating and clarifying. There’s nothing fundamentally new here. There’s a system producing outputs, and the question is how you measure and debug those outputs. That’s always been the question.

Where DS Ends and Engineering Begins

The rise of the “AI Engineer” role is real and healthy. It resolves a longstanding identity crisis where “data scientist” was an umbrella term for everyone from the analyst writing SQL to the engineer deploying Kubernetes clusters. With AI engineers owning the build, the data scientist can finally specialize in what the academic discipline always intended: rigorous measurement, experimental design, and systematic quality improvement.

This isn’t about gatekeeping. Any engineer can write an evaluation check. But there’s a meaningful difference between writing a test and building a trustworthy evaluation system: one where you understand sampling bias, can calculate inter-rater reliability, know when your automated judge is drifting, and can design rubrics that actually correlate with user-perceived quality. That difference is the data science contribution.

So Is It All About Evals?

Not quite. If you’ve read this far, you might assume I think the data scientist’s job in the agent era is “build evaluation pipelines.” That’s the mechanism. 

And honestly, evals are only half the story. I could write an entire separate post (or a few) about context engineering and management: the art and science of shaping what the model sees before it reasons. If Evaluation-Driven Development is the new model.fit(), then context engineering is the new feature engineering. No amount of hyperparameter tuning could compensate for garbage inputs, and the same principle applies here. The quality of the context you construct – what goes into the prompt, what gets retrieved, how instructions are structured, what examples are selected, and how conversation history is managed – determines the ceiling of agent behavior. No amount of eval-driven iteration will fix an agent whose context is poorly designed.
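To make “shaping what the model sees” tangible: a deliberately simple context builder. The section ordering, the history and retrieval budgets, and all the names here are illustrative assumptions, but they show that context construction is a designed artifact with tunable parameters, not string concatenation that happens to work.

```python
def build_context(system_policy: str, retrieved_docs: list, history: list,
                  user_request: str, max_history_turns: int = 6,
                  max_docs: int = 3) -> str:
    """Assemble the prompt context deliberately: policy first, then top-k
    retrieval, truncated conversation history, then the live request.
    The ordering and budgets are the engineered 'features' here."""
    recent_history = history[-max_history_turns:]
    top_docs = retrieved_docs[:max_docs]
    sections = [
        f"## Policy\n{system_policy}",
        "## Retrieved context\n" + "\n".join(f"- {d}" for d in top_docs),
        "## Conversation\n" + "\n".join(f"{role}: {msg}" for role, msg in recent_history),
        f"## Request\n{user_request}",
    ]
    return "\n\n".join(sections)

prompt = build_context(
    "Only produce valid board configurations.",
    ["Boards need at least one column.", "Status columns have fixed labels."],
    [("user", "Build me a CRM"), ("agent", "Created board 'CRM'.")],
    "Add an owner column",
)
```

Every parameter in that function is something you can now vary and measure against the eval suite, which is where the two halves of the job meet.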

And just as with feature engineering, context engineering demands the same blend of domain expertise, creativity, and staying relentlessly current with rapidly evolving techniques. New retrieval strategies, prompt structures, and approaches to context window management emerge constantly.

The actual value is something more fundamental: the data scientist is the guardian of the quality of the brain of the system. Like they always were.

The Quiet Case for Measurement

I want to end where I started: on something personal. When I joined an agent team that had been shipping successfully without a data scientist, I didn’t walk in and declare that everything was broken. It wasn’t. The engineers had built something impressive, and their intuitions about quality were often right. Getting a product to market without a DS is not a failure of process. Sometimes it’s exactly the right call for that stage of the product.

And the TypeScript thing? It turned out to be clarifying rather than catastrophic. Losing Python forced me to confront which of my skills were language-dependent and which were language-agnostic. Pandas proficiency is language-dependent. The ability to design a sampling strategy for a golden dataset, define a rubric that distinguishes between a tool selection error and a parameter error, or calculate whether your automated judge is drifting from human consensus: that’s language-agnostic. The thinking transfers. The import pandas as pd doesn’t.

What I could offer was resolution. Instead of “this prompt change feels better,” I could show that it improved tool selection accuracy by 12% while causing a 4% regression in schema adherence for a specific class of inputs. Instead of “users seem happier,” I could demonstrate that the error taxonomy had shifted: fewer catastrophic failures, more subtle reasoning errors that required a different intervention strategy. Instead of “we should try this,” I could run the change against the evaluation suite in twenty minutes and know, with calibrated confidence, whether it was worth shipping.

The model may no longer need fitting. But the system still needs understanding. Its evaluations need designing, its context needs engineering, and both demand the same curiosity about what’s working and what’s not, the same commitment to staying at the frontier of rapidly evolving techniques, and the same creative instinct that once went into crafting the perfect feature set. Understanding systems through data is what data scientists have always done.

If you’re a data scientist navigating this shift (or a manager wondering whether you need one), I’d love to hear how your team is handling it.