How I used Claude Code to crush flaky tests

Leverage GitHub workflows and Claude Code to automatically detect and fix flaky tests, eliminating tedious debugging work and improving CI reliability.

By Katia Wheeler


TL;DR: Use GitHub workflows to let Claude Code kick flaky tests to the curb.

Imagen 4’s interpretation of Claude Code fixing flaky tests

Flaky tests are a pain in our repos. They’re pesky, hard to reproduce, and often even trickier to resolve. They add toil to our workdays and eat up developer time on mundane fixes when we’d rather be creating and enhancing products. Enter Claude Code and a GitHub workflow.

How to track flaky tests

Every company tracks flaky tests differently, but here’s how we do it at Shop.

On each CI run, we take the coverage output, convert it into a structured JSON format, and upload it to a Google Cloud Storage bucket, where it is then loaded into a BigQuery table. Next, a nightly GitHub workflow authenticates to GCS and runs a script that pulls the last n days’ worth of CI test runs and calculates the failure rates of specific tests. If a test’s failures exceed a set threshold (we tend to use more than 3 failures), the workflow creates a new GitHub issue with the failure information (number of failures, file and test line number, test name) and assigns it a Flaky Test label. This is where I made Claude Code do our dirty work with a little 🪄magic🪄.
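To make the threshold step concrete, here is a minimal Python sketch of the counting logic. It is not our actual script: the `CiTestResult` shape and the `find_flaky_tests` name are made up for illustration, and it assumes the recent CI results have already been pulled into memory.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CiTestResult:
    """One test outcome from one CI run (hypothetical shape, not our real schema)."""
    test_name: str
    file_path: str
    line: int
    passed: bool


def find_flaky_tests(results, failure_threshold=3):
    """Map (test_name, file_path, line) to its failure count,
    keeping only tests that failed MORE than failure_threshold times."""
    failures = Counter(
        (r.test_name, r.file_path, r.line)
        for r in results
        if not r.passed
    )
    return {
        test: count
        for test, count in failures.items()
        if count > failure_threshold
    }
```

Each entry in the returned dictionary carries the details that go into the issue: the test name, the file, the line number, and how many times it failed.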

Get to the good stuff already

The workflow runs on two events: workflow_dispatch and issues.labeled.

  • workflow_dispatch allows us to manually test the workflow (since GitHub workflows are notoriously a pain to test locally)
  • issues.labeled triggers the workflow automatically when an issue receives a new label

Because we support both triggers, the workflow needs to check for either the github.event object or the inputs object at several points.
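The same either/or check can be sketched in Python against the event payload that GitHub passes to a run. This is only an illustration of the branching: the `issue_number` input name and the `resolve_issue_number` helper are assumptions, not names taken from our workflow.

```python
def resolve_issue_number(event_name, payload):
    """Return the issue number from whichever trigger started the run."""
    if event_name == "workflow_dispatch":
        # Manual run: the number comes from a workflow input.
        # "issue_number" is a hypothetical input name.
        raw = (payload.get("inputs") or {}).get("issue_number")
    elif event_name == "issues":
        # Automatic run: the labeled issue is part of the event payload.
        raw = (payload.get("issue") or {}).get("number")
    else:
        raise ValueError(f"unsupported event: {event_name}")

    if raw is None:
        raise ValueError(f"no issue number found for event: {event_name}")
    # Workflow inputs arrive as strings, so normalise to an int.
    return int(raw)
```

Resolving the value once and passing it along keeps the "which trigger was it?" question out of every later step.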

The full workflow gist is below. Some parts are tailored specifically to our codebase, but the practices are solid. For example, when Claude Code opens a new PR, it copies all of the labels from the issue (except “Flaky Test”). One of those labels is a team label, which triggers another GitHub workflow that randomly auto-assigns reviewers from that team to the PR.
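Separately from the gist, the label-copying rule on its own is small enough to sketch in Python. The `labels_for_pr` helper is hypothetical and the case-insensitive match is an assumption. In our setup Claude Code does this from the workflow prompt, not from a script like this one.

```python
def labels_for_pr(issue_labels, excluded=("Flaky Test",)):
    """Return the issue's labels, in order, minus any excluded ones."""
    # Compare case-insensitively so "flaky test" is dropped too (an assumption).
    excluded_names = {name.casefold() for name in excluded}
    return [
        name for name in issue_labels
        if name.casefold() not in excluded_names
    ]
```

With labels such as `["Flaky Test", "team-checkout"]` (made-up names), only `team-checkout` would carry over to the PR, which is what lets the reviewer-assignment workflow pick it up.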

Try it out. Automate the boring stuff. Fix flaky tests before they slow you down. Happy coding!

Originally published on Medium.