Keeping CI Green with BetterUp
How BetterUp Saved 10,000+ PR Runs with Trunk

“We shifted engineering resources from tool maintenance to building internal AI agents.”
When BetterUp's regular DX survey revealed that CI was engineers' number one pain point, the DevEx team knew they had a problem. Their CI success rate had dropped to 70%, meaning nearly one in three builds failed for reasons unrelated to actual code changes.
The existing custom tool for tracking flaky tests caught only failures that passed on retry on main, and even then test owners routinely ignored its Slack notifications because the retries worked.
The DevEx team found themselves manually nagging engineers to fix tests, creating friction without fixing the underlying issue.
BetterUp's CI architecture made accurate flake detection difficult. The team uses Knapsack Pro to split tests into parallel chunks, so a single CI job contains multiple suite runs. Combined with RSpec's dynamically generated test names, this made tracking any individual test's history a data consistency nightmare: flaky tests slipped through while stable tests sometimes got flagged incorrectly.
The DevEx team integrated Trunk's rspec_trunk_flaky_tests gem to solve this at the source. The integration correctly groups dynamic RSpec names across parallel Knapsack chunks, giving every test a single coherent history. With accurate data flowing into Trunk, the team finally had reliable visibility into which tests were actually flaky.
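To make the grouping idea concrete, here is a minimal sketch of merging results from parallel chunks into one history per test. It is not the gem's implementation. The record shape (`file`, `name`, `status`) and the normalization rule (replacing hex object ids) are assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def normalize_name(name: str) -> str:
    """Collapse run-specific parts of a generated test name.

    Assumption: the only volatile part is a hex object id such as 0x7f3a.
    A real integration would use whatever stable identifier the test
    framework provides.
    """
    name = re.sub(r"0x[0-9a-fA-F]+", "<id>", name)
    return " ".join(name.split())


def merge_chunk_results(chunks):
    """Merge per-chunk results into one status list per (file, name) key."""
    history = defaultdict(list)
    for chunk in chunks:
        for result in chunk:
            key = (result["file"], normalize_name(result["name"]))
            history[key].append(result["status"])
    return dict(history)


# Two parallel chunks report the same test under different generated names.
chunks = [
    [{"file": "spec/a_spec.rb", "name": "creates user 0x7f3a", "status": "passed"}],
    [{"file": "spec/a_spec.rb", "name": "creates user 0x91bc", "status": "failed"}],
]
merged = merge_chunk_results(chunks)
```

With a stable key, the pass and the fail land in the same history, which is what makes a mixed record visible as flakiness rather than as two unrelated tests.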
With detection working, BetterUp enabled Trunk's automated quarantining. When Trunk identifies a flaky test, it quarantines failures at runtime so they don't block PRs. The tests still run and collect diagnostic data in the background, but engineers can merge without waiting for a retry lottery.
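The quarantine decision can be pictured as a set difference: a job fails only if some failing test is not quarantined. The sketch below is illustrative only; the function names are hypothetical and do not reflect Trunk's API.

```python
def blocking_failures(failed_tests, quarantined):
    """Return the failed tests that should still block the PR."""
    return sorted(set(failed_tests) - set(quarantined))


def job_passes(failed_tests, quarantined):
    """A job passes when every failure belongs to a quarantined test."""
    return not blocking_failures(failed_tests, quarantined)


# Hypothetical test ids: one known-flaky test and one real regression.
quarantined = {"spec/billing_spec.rb[1:2]"}
failed = ["spec/billing_spec.rb[1:2]", "spec/users_spec.rb[3:1]"]
still_blocking = blocking_failures(failed, quarantined)
```

Quarantined tests still run in this model. Their failures are simply excluded from the pass/fail verdict, so diagnostic data keeps accumulating.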
Since enabling quarantining, Trunk has unblocked 10,729 PR runs, with 3,220 in the last 30 days and 7,810 CI jobs saved from failure. To ensure fixes are confirmed, BetterUp configured Trunk to require 350 consecutive passes before a quarantined test can block PRs again.
Flaky tests fail intermittently by definition, and they often behave differently across environments, so confirming a fix requires many runs under real conditions. Trunk lets teams customize detection logic and health transitions to match their standards. BetterUp set the bar at 350 consecutive passes before a flaky test is reclassified as healthy, so a fix has to prove stable before the test can block PRs again.
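A consecutive-pass rule like this amounts to a counter that resets on any failure. The sketch below assumes a simplified two-state model (`flaky` and `healthy`); the class name and fields are hypothetical, and only the threshold of 350 comes from BetterUp's configuration.

```python
from dataclasses import dataclass


@dataclass
class HealthTracker:
    """Tracks one test's status under a consecutive-pass rule (simplified)."""

    status: str = "flaky"
    consecutive_passes: int = 0

    def record(self, passed: bool, threshold: int = 350) -> str:
        if not passed:
            # Any failure resets the streak and keeps (or makes) the test flaky.
            self.status = "flaky"
            self.consecutive_passes = 0
        else:
            self.consecutive_passes += 1
            if self.consecutive_passes >= threshold:
                self.status = "healthy"
        return self.status


tracker = HealthTracker()
for _ in range(349):
    tracker.record(True)
status_after_349 = tracker.status   # still one pass short of the threshold
tracker.record(False)               # a single failure resets the streak
streak_after_failure = tracker.consecutive_passes
for _ in range(350):
    tracker.record(True)
status_after_full_streak = tracker.status
```

A high threshold trades slower recovery for confidence: a test that fails once in a hundred runs would rarely survive 350 passes in a row.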
With CI reliability now above 90% and the manual nagging workflow eliminated, BetterUp's DevEx team has shifted its energy to higher-leverage work. They're building internal AI agents that consume Trunk's API to automatically investigate flaky tests and attempt fixes.
As Travis Roberts put it, they moved from maintaining tools to building AI workflows, exactly the kind of work a DevEx team should be doing.