Keeping CI Green with Trunk Flaky Tests
BetterUp

How BetterUp Saved 10,000+ PR Runs with Trunk


Challenges

CI reliability dropped to 70%
Flaky tests slipped through homegrown solution
DevEx manually nagged engineers to fix tests

Solution

Unified test history across parallel runs
Automated quarantining unblocks PRs instantly
Custom 350-pass threshold confirms fixes

Results

CI reliability jumped to 90%+
10,000+ PR runs unblocked
Team shifted from firefighting to investing in AI

“We shifted engineering resources from tool maintenance to building internal AI agents.”
Travis Roberts, Staff Full-Stack Engineer

CI Reliability Was the Top Complaint

When BetterUp's regular DX survey revealed that CI was engineers' number one pain point, the DevEx team knew they had a problem. Their CI success rate had dropped to 70%, meaning nearly one in three builds failed for reasons unrelated to actual code changes.

The existing custom tool for tracking flaky tests only caught failures that passed on retry on main, and even then, test owners routinely ignored the Slack notifications because retries worked.

The DevEx team found themselves manually nagging engineers to fix tests, creating friction without fixing the underlying issue.

Data Challenges Behind the Flakiness Problem (Knapsack + RSpec)

BetterUp's CI architecture made accurate flake detection difficult. They use Knapsack Pro to split tests into parallel chunks, meaning a single CI job contains multiple suite runs. Combined with RSpec's dynamically generated test names, this made tracking any individual test's history a data consistency nightmare: flaky tests slipped through, while stable tests were sometimes flagged incorrectly.

The DevEx team integrated Trunk's rspec_trunk_flaky_tests gem to solve this at the source. The integration correctly groups dynamic RSpec names across parallel Knapsack chunks, giving every test a single coherent history. With accurate data flowing into Trunk, the team finally had reliable visibility into which tests were actually flaky.
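To make the grouping problem concrete, here is a minimal sketch in Python. It is not Trunk's implementation and not the gem's API. It assumes each parallel chunk reports records of the form (chunk id, spec file, test name, passed), and that a stable identity can be derived by collapsing run-specific fragments of the name. The normalization rule and every name below are hypothetical.

```python
import re
from collections import defaultdict


def stable_key(spec_file: str, test_name: str) -> tuple[str, str]:
    """Map a dynamically generated test name to one stable key.

    Assumption: the run-specific parts are hex object ids or plain numbers.
    A real rule would have to avoid merging genuinely distinct tests.
    """
    name = re.sub(r"0x[0-9a-f]+", "<id>", test_name)
    name = re.sub(r"\d+", "<n>", name)
    return spec_file, name


def merge_chunks(results):
    """Combine results from all parallel chunks into one history per test."""
    history = defaultdict(list)
    for _chunk_id, spec_file, test_name, passed in results:
        history[stable_key(spec_file, test_name)].append(passed)
    return dict(history)


# Two chunks report the "same" test under different generated names.
results = [
    (1, "spec/user_spec.rb", "creates user 41", True),
    (2, "spec/user_spec.rb", "creates user 97", False),
]
merged = merge_chunks(results)
```

With a key like this, a pass in one chunk and a failure in another land in the same history, which is the signal a flake detector needs.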

Quarantining Without Blocking PRs

With detection working, BetterUp enabled Trunk's automated quarantining. When Trunk identifies a flaky test, it quarantines failures at runtime so they don't block PRs. The tests still run and collect diagnostic data in the background, but engineers can merge without waiting for a retry lottery.
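The effect of runtime quarantining on a job's outcome can be sketched as a simple set difference. This is an illustrative Python sketch under the assumption that the set of quarantined test names is known when the job finishes; the function names are hypothetical and do not reflect Trunk's internals.

```python
def blocking_failures(failed: set[str], quarantined: set[str]) -> list[str]:
    """Return the failures that should still block the PR."""
    return sorted(failed - quarantined)


def job_passes(failed: set[str], quarantined: set[str]) -> bool:
    """A job passes when every failing test is quarantined."""
    return not blocking_failures(failed, quarantined)


quarantined = {"spec/billing_spec.rb: retries payment"}
flaky_only = {"spec/billing_spec.rb: retries payment"}
with_real_bug = flaky_only | {"spec/user_spec.rb: validates email"}
```

Quarantined tests still run and report their results; only their influence on the job's pass/fail status is removed.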

Since enabling quarantining, Trunk has unblocked 10,729 PR runs, including 3,220 in the last 30 days, and saved 7,810 CI jobs from failure. To ensure fixes are confirmed, BetterUp configured Trunk to require 350 consecutive passes before a quarantined test can block PRs again.

Customizable Detection and Transitions

Flaky tests fail intermittently by definition, and they often behave differently across environments. Confirming a fix requires many runs in production. Trunk lets teams customize detection logic and health transitions to match their standards. BetterUp chose 350 consecutive passes before a flaky test is reclassified as healthy, ensuring fixes are truly stable before a test can block PRs again.
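A consecutive-pass rule like this one can be modeled as a streak counter that resets on any failure. The Python sketch below is illustrative only: the threshold of 350 comes from the text above, while the class, its fields, and the assumption that a single failure resets the streak to zero are hypothetical.

```python
from dataclasses import dataclass

REQUIRED_PASSES = 350  # BetterUp's configured threshold, per the case study


@dataclass
class FlakyTestState:
    name: str
    consecutive_passes: int = 0
    healthy: bool = False

    def record(self, passed: bool) -> None:
        """Update the streak; any failure starts the count over."""
        if passed:
            self.consecutive_passes += 1
            if self.consecutive_passes >= REQUIRED_PASSES:
                self.healthy = True
        else:
            self.consecutive_passes = 0
            self.healthy = False


state = FlakyTestState("spec/billing_spec.rb: retries payment")
for _ in range(349):
    state.record(True)
state.record(False)          # one failure wipes out the 349-pass streak
for _ in range(349):
    state.record(True)
almost = state.healthy       # still flaky: only 349 passes in a row
state.record(True)
confirmed = state.healthy    # 350th consecutive pass reclassifies the test
```

A high threshold trades slower reclassification for more confidence that a rarely failing test has actually been fixed.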

From Firefighting to Building AI Workflows

With CI reliability now above 90% and the manual nagging workflow eliminated, BetterUp's DevEx team has shifted its energy to higher-leverage work. They're building internal AI agents that consume Trunk's API to automatically investigate flaky tests and attempt fixes.

As Travis Roberts put it, they moved from maintaining tools to building AI workflows, which is exactly the kind of work a DevEx team should be doing.