
Using AI to Shift E2E Test Maintenance Left
E2E tests rarely fail because of a single dramatic change. More often, they fail because of small, reasonable changes that accumulate over time. A selector is renamed, a component is refactored, or a feature is gated. Each change makes sense on its own, yet the impact on E2E tests is usually discovered only after the code is merged, by which time context is already lost.
I work on the testing infrastructure team at monday.com, and we see this pattern repeatedly. It is not a problem of discipline or test coverage, but of feedback timing. By the time a failure appears in CI, the developer who made the change has often moved on, and the person debugging the failure has to reconstruct intent from scratch.
At some point, this stopped feeling like normal E2E friction. The same pattern kept repeating: a harmless-looking UI change, a merge, and then a failing test in CI. The test did its job and failed. But the signal arrived without context.
As the team responsible for testing infrastructure, we were constantly pulled into investigations where the fix was often trivial, yet the effort was not. The real work was reconstructing which change caused the failure and what assumption it broke, long after the original decision had already been made.
That was the moment we started asking a different question. Instead of asking how to fix E2E failures faster, we asked if we could help engineers understand failures sooner, while the change context was still accessible.
Shifting E2E Feedback into Code Review
CI is very good at telling us that something broke. What it is not good at is explaining why.
In practice, many E2E failures are regressions introduced by recent UI changes. The failure shows up in CI, but the connection to the original pull request is no longer obvious. By the time someone looks at the failure, the PR is merged, the reviewer has moved on, and the failure feels disconnected from the original decision.
We wanted to shift that experience. Instead of treating E2E maintenance as a purely post-merge activity, we explored how AI could help reconnect failures to the changes that caused them and surface that information during code review.
The goal was not to replace CI or prevent all failures. CI remains the detector. Our focus was on what happens next.
How It Works in Practice
At a high level, the system treats a failing E2E test as a signal that needs explanation, not something to retry, mute, or “stabilize”.
When CI reports an E2E failure, the goal is not to ask “how do we make the test pass again?”
The goal is to ask a more useful question:
What assumption did this change invalidate?
That shift is what drives the entire design.

Phase 1: Explanation and Analysis
A pull request is updated. CI runs Playwright E2E tests as usual.
If E2E tests fail, a composite action is triggered. That action:
- collects failed job logs and stack traces
- inspects the pull request diff
- invokes an AI agent to reason about the failure
Instead of looking at failures in isolation, the agent correlates what broke with what changed. UI structure, selectors, test IDs, and behavioral changes are analyzed together, in the context of the pull request that introduced them.
The result is posted back into the pull request, inline and contextual:
- What changed
- Which assumption broke
- And what kind of response makes sense
Crucially, this feedback shows up where the change was made, not buried in CI logs or owned by a different team days later. The failure is reattached to its cause, while the intent of the change is still visible.
A Generic Composite Action Example
The intelligence lives in the agent, but CI still needs a thin orchestration layer. A composite action is a good fit because it is portable, explicit, and easy to adopt.
Here is a simplified, generic example:
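This sketch assumes GitHub Actions and the gh CLI; the agent invocation (`scripts/analyze-e2e-failure.js`), the input names, and the file paths are placeholders rather than our actual implementation.

```yaml
# action.yml (simplified): correlates a failing E2E run with the PR diff
name: "E2E Failure Analysis"
description: "Correlate failing E2E tests with the pull request diff and post an explanation"

inputs:
  pr-number:
    description: "Pull request number to analyze"
    required: true
  run-id:
    description: "CI run containing the failed E2E jobs"
    required: true

runs:
  using: "composite"
  steps:
    - name: Collect failed job logs and stack traces
      shell: bash
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        gh run view "${{ inputs.run-id }}" --log-failed > e2e-failure.log

    - name: Capture the pull request diff
      shell: bash
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        gh pr diff "${{ inputs.pr-number }}" > pr.diff

    - name: Ask the analysis agent to explain the failure
      shell: bash
      run: |
        # Hypothetical script: sends the logs and diff to the analysis agent
        # and writes a markdown explanation to analysis.md.
        node scripts/analyze-e2e-failure.js \
          --logs e2e-failure.log \
          --diff pr.diff \
          --out analysis.md

    - name: Post the explanation back on the pull request
      shell: bash
      env:
        GH_TOKEN: ${{ github.token }}
      run: |
        gh pr comment "${{ inputs.pr-number }}" --body-file analysis.md
```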

This is intentionally simplified. The exact mechanics matter less than the pattern. What matters is that CI becomes a place where failures are interpreted, not just detected.
The Agent: Where Judgment Lives
The agent is the most important part of the system. It encodes how to think about E2E failures, not just how to fix them.
Below is a simplified example of such an agent:
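The exact format depends on the agent framework in use; the following is a hypothetical YAML definition, shown only to illustrate what the instructions encode.

```yaml
# Hypothetical agent definition: the file format and field names are
# illustrative, not a specific framework's syntax.
name: e2e-failure-analyst
description: >
  Explains why a Playwright E2E test failed in the context of the pull
  request that triggered it. Never modifies code.

inputs:
  - failed_test_logs     # stack traces and failure messages from CI
  - pull_request_diff    # the full diff of the PR under review

instructions: |
  1. Identify which change in the diff most plausibly caused the failure
     (renamed selectors, changed test IDs, removed components, altered behavior).
  2. State the assumption the test relied on and explain how the change
     invalidated it.
  3. Classify the impact as one of:
     - selector/UI maintenance: same UI concept, different identifier
     - behavioral change: the feature the test asserts was disabled, removed, or altered
     - ambiguous: the connection between the change and the failure is unclear
  4. For selector/UI maintenance, propose a minimal page object update.
     For behavioral changes, flag a potential regression and require human
     confirmation. For ambiguous impact, recommend targeted regression runs
     instead of code changes.
  5. Never rewrite tests just to make them pass. Explain first; suggest only
     when the fix is mechanical and safe.

output:
  format: markdown   # posted as a pull request comment
```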

How the Agent Thinks
Instead of “test failed → patch test”, the agent classifies impact:
- Selector or UI maintenance
  – Same UI concept, different identifier (test IDs, labels, structure).
  – Safe to suggest small page object updates.
- Behavioral change
  – A feature is disabled, removed, or altered in a way tests explicitly assert.
  – Do not update tests. Flag as a potential regression and require human confirmation.
- Ambiguous impact
  – The change might affect tests, but the intent is unclear.
  – Recommend targeted regression runs instead of code changes.
This distinction is what prevents the system from becoming an auto-mutating test bot.
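To make that distinction actionable in the next phase, the agent can also emit its verdict in a structured form alongside the prose explanation. One possible shape, with illustrative field names and values:

```yaml
# Illustrative analysis verdict; these field names are not a fixed schema.
classification: selector-maintenance        # or: behavioral-change | ambiguous
what_changed: "data-testid 'search-input' renamed in SearchBar.tsx"
broken_assumption: "the search page object still locates the field by the old test ID"
recommendation: suggest-page-object-update  # or: flag-regression | run-targeted-regressions
```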
Phase 2: Execution by Explicit Intent
One important design choice we made early on was to separate understanding a failure from fixing it.
The analysis agent described above never modifies code. Its responsibility is limited to restoring context: identifying which change caused the failure, what assumption was invalidated, and what kind of response is appropriate.
Importantly, not every analysis results in a suggested fix. Depending on what it finds, the agent can produce one of three outcomes:
- Explain-only: the change altered behavior in a way that may be intentional, so the agent explains the impact but does not suggest any code changes.
- Fix suggested: the failure is caused by a clear, mechanical mismatch (for example, a renamed selector), and the agent proposes a concrete update to the relevant page object or test.
- Further validation recommended: the impact is ambiguous, so the agent recommends targeted regression runs rather than code changes.
When a fix is suggested, it is included directly in the pull request comment. Applying it is always an explicit developer action, triggered via a slash command in the pull request. Nothing happens automatically.
The command:
/apply-e2e-fixes
Or, the scope can be deliberately limited to a specific source file:
/apply-e2e-fixes SearchBar.tsx
This command triggers a second workflow whose sole responsibility is execution. It does not reason about intent. It does not infer new changes. It applies a narrowly scoped plan that was already reviewed and approved in the pull request.
This clear boundary proved critical.
- Phase 1 explained what broke and why
- Phase 2 executed a fix, only after human approval
Engineers remained in control at every step.
What the Fix Workflow Actually Does
Under the hood, the fix workflow runs in response to the slash command and performs a very constrained set of actions:
- It fetches the pull request comments.
- It extracts a structured JSON block describing the approved fixes (sketched below).
- It applies only those exact changes to E2E files.
- It commits the result back to the same branch.
- It reports the outcome in the pull request.
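The structured block lives in the analysis comment itself. Its exact schema is an implementation detail; a minimal illustration, with hypothetical field names and paths, could look like this:

```json
{
  "approved_fixes": [
    {
      "file": "e2e/pages/SearchPage.ts",
      "find": "getByTestId('search-input')",
      "replace": "getByTestId('search-bar-input')",
      "reason": "data-testid renamed in SearchBar.tsx"
    }
  ]
}
```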
No application code is touched. No new decisions are made. Here is a simplified example of what such a workflow looks like:
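This sketch assumes GitHub Actions; the extraction and apply scripts are placeholders for whatever the fix agent actually runs.

```yaml
# .github/workflows/apply-e2e-fixes.yml (simplified)
name: Apply approved E2E fixes

on:
  issue_comment:
    types: [created]

jobs:
  apply-fixes:
    # Only react to PR comments that start with the slash command
    if: github.event.issue.pull_request && startsWith(github.event.comment.body, '/apply-e2e-fixes')
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write
    steps:
      - uses: actions/checkout@v4

      - name: Check out the pull request branch
        env:
          GH_TOKEN: ${{ github.token }}
        run: gh pr checkout "${{ github.event.issue.number }}"

      - name: Extract the approved fix plan from the PR comments
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh pr view "${{ github.event.issue.number }}" --json comments \
            --jq '.comments[].body' > comments.txt
          # Hypothetical script: pulls the structured JSON block out of the
          # analysis comment and writes it to fixes.json.
          node scripts/extract-approved-fixes.js comments.txt > fixes.json

      - name: Apply only the approved changes to E2E files
        run: |
          # Hypothetical script: applies each find/replace from fixes.json,
          # restricted to the E2E test directory; unsafe changes are skipped.
          node scripts/apply-e2e-fixes.js fixes.json --only-path e2e/

      - name: Commit the result back to the same branch
        run: |
          git config user.name "e2e-fix-bot"
          git config user.email "e2e-fix-bot@users.noreply.github.com"
          git commit -am "test(e2e): apply approved E2E fixes" || echo "Nothing to commit"
          git push

      - name: Report the outcome on the pull request
        env:
          GH_TOKEN: ${{ github.token }}
        run: |
          gh pr comment "${{ github.event.issue.number }}" \
            --body "Applied the approved E2E fixes. See the latest commit on this branch."
```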

This example omits authentication, filtering, and error handling, but it captures the pattern. The workflow is intentionally boring. All the judgment has already happened.
The Fix Agent: Execution Without Interpretation
The fix agent operates on a narrowly scoped, pre-approved plan extracted from the pull request itself. When the analysis phase identifies a safe, mechanical fix, such as updating a selector in a page object, that change is described explicitly in the PR comment.
When a developer triggers the slash command, the fix agent reads only those approved instructions and applies them verbatim. For each change, it locates the exact line, performs the replacement, and moves on. If a change cannot be applied safely, it is skipped and reported back to the pull request.
The agent does not decide whether a fix is correct, nor does it attempt to “make tests pass.” It simply executes what was already reviewed and approved.
That constraint is what keeps the system trustworthy. Engineers know that nothing will change unless they explicitly ask for it, and even then, only the changes they already saw will be applied.
Why This Separation Matters
Technically, this system adds:
- a composite action
- an analysis agent
- a fix agent
- and some structured glue
But the real shift is not technical.
E2E failures stop being post-merge surprises, CI noise, or someone else’s problem. They become part of the review and recovery loop, explained in context, owned by the change, and acted on deliberately.
The AI does not take responsibility away from developers.
It gives them their context back.
This system began as an experiment during an internal hackathon.
Not as a prototype of “AI in CI”, but as a controlled attempt to test a different idea: What happens if failures are explained where they originate, instead of being debugged later?
The constraints were intentional. We ran it against real pull requests, real failures, and real production code. There was no silent automation, no background mutation, and no bypassing existing workflows.
The goal wasn’t to ship something clever. It was to see whether restoring context earlier would actually change how failures were understood and handled.
What It Looks Like in Practice
When a regression is detected, the agent leaves a contextual comment directly on the relevant part of the pull request. The comment explains which change caused the failure and how it affected the test.

If the fix is straightforward, the developer can explicitly apply it using a command in the pull request. Nothing happens automatically. The action is intentional and visible.

This balance proved critical. Engineers could get help when they needed it, without losing control over their changes.
Explanation Before Action in Critical Systems
Testing and CI are sensitive systems. Silent automation, even when correct, can quickly erode trust.
From our experience, engineers are far more comfortable with tools that explain before they act. Knowing why a test failed and which change caused it fundamentally changes how the failure is perceived. It turns a frustrating surprise into an understandable consequence.
For that reason, AI is used strictly as an advisory layer. When potential impact is detected, it is explained inline in the pull request, on the relevant lines of code. Suggestions may be offered, but applying them always requires an explicit human decision.
This keeps ownership clear and makes the system predictable. Over time, that predictability mattered more than maximizing automation.
Preserving Consistency Through Existing Frameworks
As AI assistance became part of the workflow, maintaining consistency became essential.
To avoid fragmentation, AI was deliberately constrained to operate within our existing Playwright-based E2E framework. Established abstractions, conventions, and page object patterns remain the source of truth. Assistance aligns with how tests are already written and maintained, reinforcing standards rather than introducing new ones.
This was not about limiting capability. It was about keeping the system understandable as it evolves.
From Automation to Interaction
This started as an attempt to shift E2E maintenance further left, so developers wouldn’t have to deal with broken tests after a merge. What we learned instead was that automation alone wasn’t the answer.
The real issue wasn’t fixing failures; it was understanding them. Failures were arriving without context, and once context is lost, even a correct failure becomes expensive to act on.
That realization changed the direction entirely. Instead of building a system that automatically fixes tests, we started building one that keeps the change, the failure, and the reasoning connected and makes that reasoning accessible to the developer. The goal is not to remove humans from the loop, but to give them something meaningful to engage with.
We’re now applying the same approach beyond E2E, exploring how this model can work for other test types and failure signals across the SDLC, where the cost of lost context shows up in similar ways.


