Root Cause Analysis for Flaky Tests: A Systematic Framework
Every engineering team has experienced the frustration: a test fails in CI, you rerun the pipeline, it passes, and you move on. Over time, this pattern erodes trust in your test suite and trains developers to ignore legitimate failures hidden among the noise. The solution is not to retry harder but to perform proper flaky test root cause analysis on every intermittent failure.
Root cause analysis (RCA) for flaky tests is not the same as debugging a broken test. A broken test fails consistently and the cause is usually obvious. A flaky test fails sporadically, and the cause is typically a hidden assumption about the test environment, execution order, timing, or data state. This guide provides a systematic framework that transforms flaky test investigation from guesswork into a repeatable engineering process.
The Five Categories of Test Flakiness
After analyzing thousands of flaky tests across hundreds of codebases, a clear pattern emerges. Nearly every flaky test falls into one of five categories. Identifying which category a failure belongs to is the first and most important step in flaky test root cause analysis.
1. Timing-Dependent Failures
Timing issues are the most common category, responsible for roughly 40% of all flaky tests. They occur when a test makes assumptions about how quickly something will happen.
Symptoms:
- Test passes on fast machines, fails on slow CI runners
- Failures correlate with system load
- Adding `sleep()` or increasing timeouts "fixes" the test

Common causes:
- Hardcoded timeouts instead of event-based waiting
- Race conditions between test setup and assertion
- Animation or transition timing
- Network request timing assumptions
```bash
# Run the test with CPU throttling to expose timing issues
# On Linux:
cpulimit -l 25 -- npx jest path/to/test.spec.js

# Or use nice to lower priority:
nice -n 19 npx jest path/to/test.spec.js
```
```javascript
// Pattern: Replace setTimeout-based waiting with event-based waiting

// FLAKY
await new Promise(resolve => setTimeout(resolve, 2000));
expect(screen.getByText('Loaded')).toBeInTheDocument();

// STABLE
await waitFor(() => {
  expect(screen.getByText('Loaded')).toBeInTheDocument();
}, { timeout: 10000 });
```
2. Order-Dependent Failures
Order-dependent tests pass when run in a specific sequence but fail when run in isolation or in a different order. They account for approximately 25% of flaky tests.
Symptoms:
- Test passes in the full suite but fails when run alone
- Test fails when another test is added or removed from the suite
- Random test ordering reveals failures

Common causes:
- Shared mutable state (global variables, singletons, module-level caches)
- Missing teardown of side effects
- Database records created by previous tests
- Environment variables set by earlier tests
```bash
# Run tests in random order to expose ordering issues (Jest 29.4+)
npx jest --randomize

# Run the failing test in isolation
npx jest --testPathPattern="failing-test.spec.ts"

# If it passes alone, find the test that it depends on.
# Binary search through the test suite:
npx jest --testPathPattern="(testA|testB|failing)" --verbose
```
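Once the culprit pair is found, the fix is almost always to reset shared state in teardown. A minimal sketch of the pattern, using a hypothetical module-level `userCache` (not from the original codebase) as the shared state:

```javascript
// Module-level cache: the classic source of order-dependent flakiness.
// Test A primes it; test B silently depends on it being primed.
const userCache = new Map();

function getUser(id, fetchFn) {
  // Whether this is a cache hit depends on what earlier tests left behind
  if (!userCache.has(id)) {
    userCache.set(id, fetchFn(id));
  }
  return userCache.get(id);
}

// STABLE: expose a reset helper so every test starts from a clean slate,
// e.g. call it from afterEach(() => resetUserCache()) in Jest.
function resetUserCache() {
  userCache.clear();
}
```

The key design choice is that the module owning the state also owns the reset, so test files cannot forget which internals need clearing.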
3. Resource-Dependent Failures
These failures occur when tests compete for shared resources: ports, files, database connections, or external services.
Symptoms:
- Failures increase with parallelism
- "Address already in use" or "Connection refused" errors
- Timeouts when connecting to external services
- File lock or permission errors

Common causes:
- Hardcoded ports for test servers
- Shared database without transaction isolation
- File system operations without unique paths
- Connection pool exhaustion
```javascript
// Diagnose port conflicts
const getPort = require('get-port');

beforeAll(async () => {
  // FLAKY: Hardcoded port
  // server = app.listen(3000);

  // STABLE: Dynamic port allocation
  const port = await getPort();
  server = app.listen(port);
  baseUrl = `http://localhost:${port}`;
});
```
4. Environment-Dependent Failures
Environment flakiness occurs when tests behave differently across machines, operating systems, or CI environments.
Symptoms:
- "Works on my machine" syndrome
- Tests pass on macOS but fail on Linux CI
- Failures only occur in containerized environments
- Timezone or locale-dependent failures

Common causes:
- Filesystem path separator differences (Windows vs Unix)
- Timezone assumptions
- Locale-dependent string formatting
- Missing system dependencies in CI
- Different Node.js or runtime versions
```javascript
// Diagnose environment differences
beforeAll(() => {
  console.log('Node version:', process.version);
  console.log('Platform:', process.platform);
  console.log('Timezone:', Intl.DateTimeFormat().resolvedOptions().timeZone);
  console.log('Locale:', Intl.DateTimeFormat().resolvedOptions().locale);
  console.log('CPU cores:', require('os').cpus().length);
  console.log('Free memory:', require('os').freemem() / 1024 / 1024, 'MB');
});
```
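Logging the differences is step one; pinning them is step two. On POSIX hosts (Linux and macOS), Node picks up a `TZ` environment variable set before any date work happens, so a test setup file can force every machine onto the same timezone. A minimal sketch, assuming a POSIX CI runner:

```javascript
// Pin the timezone before any date handling runs, e.g. in a Jest
// setupFiles entry, which executes before the test framework loads.
process.env.TZ = 'UTC';

const stamp = new Date('2026-06-15T10:00:00Z');
// With TZ pinned, local-time APIs like getHours() agree across CI hosts
console.log('hour:', stamp.getHours());
```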
5. Data-Dependent Failures
Data-dependent flakiness arises when test outcomes depend on the specific data used, and that data is not fully controlled by the test.
Symptoms:
- Tests fail around date boundaries (midnight, month-end, year-end)
- Failures correlate with specific randomly generated input values
- Tests break when seed data changes
- Sorting-related assertions fail intermittently

Common causes:
- `Date.now()` or `Math.random()` without seeding
- Relying on database auto-increment IDs
- Non-deterministic sort order for equal elements
- Floating-point comparison without tolerance
```javascript
// Freeze time to eliminate date-dependent flakiness
beforeEach(() => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2026-06-15T10:00:00Z'));
});

afterEach(() => {
  jest.useRealTimers();
});

// Use deterministic random values
function seededRandom(seed) {
  const x = Math.sin(seed) * 10000;
  return x - Math.floor(x);
}
```
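The last two causes deserve concrete patterns as well. JavaScript's `Array.prototype.sort` has been specified as stable since ES2019, but database `ORDER BY` and older engines make no such guarantee for equal keys, so an explicit tiebreaker keeps the assertion deterministic either way; floating-point results need a tolerance rather than strict equality. A sketch with made-up sample data:

```javascript
// Non-deterministic order for equal keys: add an explicit tiebreaker
const rows = [
  { name: 'b', score: 10 },
  { name: 'a', score: 10 },
];
// FLAKY (against an unstable sort or a database ORDER BY score):
// ties between 'a' and 'b' can come back in either order
// STABLE: break ties on a unique field
rows.sort((x, y) => y.score - x.score || x.name.localeCompare(y.name));

// Floating-point comparison with tolerance instead of strict equality
const total = 0.1 + 0.2;
const withinTolerance = Math.abs(total - 0.3) < 1e-9;
```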
The Investigation Checklist
When you encounter a flaky test, work through this checklist systematically. Do not skip steps, and do not assume you know the cause until you can reproduce the failure deterministically.
Step 1: Gather Evidence
Before changing any code, collect information about the failure:
- [ ] Record the exact error message and stack trace
- [ ] Note the CI run URL and build number
- [ ] Check if the test has failed before (search CI history)
- [ ] Record the test execution time (slow tests are more likely to be timing-dependent)
- [ ] Note what else was running on the CI runner
- [ ] Check if the failure correlates with recent code changes
Step 2: Classify the Failure
Use the five categories above to classify the failure. Ask these questions:
| Question | If Yes, Likely Cause | Category |
|----------|----------------------|----------|
| Does it pass when run alone? | Order-dependent | Ordering |
| Does it fail more on slow machines? | Timing-dependent | Timing |
| Does it fail more with parallelism? | Resource contention | Resource |
| Does it work locally but not in CI? | Environment-dependent | Environment |
| Does it fail near date boundaries? | Data-dependent | Data |
Step 3: Reproduce the Failure
This is the most critical step. A flaky test you cannot reproduce is a flaky test you cannot fix. Here are techniques for each category:
```bash
# Timing: run with reduced resources
docker run --cpus=0.5 --memory=512m your-test-image npm test

# Ordering: randomize and repeat
for i in $(seq 1 50); do
  npx jest --randomize --bail 2>/dev/null || echo "FAILED on run $i"
done

# Resource: increase parallelism
npx jest --maxWorkers=8

# Environment: test in a CI-equivalent container
docker run -it --rm node:20 bash -c "npm ci && npm test"

# Data: run around date boundaries (requires the faketime utility)
TZ=UTC faketime '2026-12-31 23:59:50' npx jest path/to/test
```
Step 4: Isolate the Root Cause
Once you can reproduce the failure, narrow down the cause:
```bash
# Git bisect to find the commit that introduced flakiness
git bisect start
git bisect bad HEAD
git bisect good v1.0.0  # Last known stable version

# At each step, run the test multiple times:
git bisect run bash -c 'for i in $(seq 1 10); do npx jest path/to/test || exit 1; done'
```
Step 5: Fix and Verify
After identifying the root cause, apply the fix and verify it eliminates the flakiness:
```bash
# Run the test many times to confirm the fix
for i in $(seq 1 100); do
  npx jest path/to/test 2>/dev/null || { echo "STILL FLAKY on run $i"; exit 1; }
done
echo "All 100 runs passed - fix verified"
```
Bisecting Flaky Tests
Git bisect is an underused tool for flaky test root cause analysis. The challenge is that standard bisect expects deterministic pass/fail, but flaky tests are probabilistic. The solution is to run multiple iterations at each bisect step:
```bash
#!/bin/bash
# bisect-flaky.sh - Run a test N times to determine if it's flaky at this commit
TEST_PATH=$1
ITERATIONS=${2:-20}
FAILURES=0

for i in $(seq 1 $ITERATIONS); do
  if ! npx jest "$TEST_PATH" --silent 2>/dev/null; then
    FAILURES=$((FAILURES + 1))
  fi
done

echo "Failed $FAILURES out of $ITERATIONS runs"

# Consider the commit "bad" if the test fails more than 10% of the time
if [ $FAILURES -gt $((ITERATIONS / 10)) ]; then
  exit 1  # bad
else
  exit 0  # good
fi
```
Usage:

```bash
git bisect start
git bisect bad HEAD
git bisect good abc123
git bisect run bash bisect-flaky.sh "src/__tests__/flaky.test.ts" 30
```
This approach typically finds the guilty commit within minutes, even for tests that only fail 5-10% of the time.
Building a Flaky Test Knowledge Base
Each flaky test root cause analysis produces valuable knowledge. Capture it systematically so your team learns from every investigation:
```markdown
## Flaky Test Report: UserDashboard.test.tsx

Test: "should display recent activity after login"
Category: Timing
Failure rate: ~15% in CI, 0% locally

Root cause: The test asserted on the activity list immediately after
calling login(), but the activity data was fetched asynchronously.
On CI runners with limited CPU, the fetch did not complete before the
assertion ran.

Fix: Replaced getByTestId('activity-list') with
await findByTestId('activity-list').

Time to diagnose: 2 hours

Lessons learned: All data-fetching assertions should use findBy
queries, not getBy. Added a lint rule to catch this pattern.
```
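The lint rule mentioned in that report can be as simple as enabling an off-the-shelf check. If the project uses `eslint-plugin-testing-library`, its `prefer-find-by` rule flags `waitFor` blocks wrapping `getBy*` queries where a single `findBy*` query belongs. A hypothetical `.eslintrc` fragment:

```json
{
  "plugins": ["testing-library"],
  "rules": {
    "testing-library/prefer-find-by": "error"
  }
}
```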
Over time, this knowledge base reveals patterns. You might discover that 80% of your flaky tests are timing-related, pointing to a systemic issue with how your team writes async assertions. Or you might find that a specific test utility is responsible for most ordering issues.
Preventing Flaky Tests Before They Merge
The cheapest flaky test to fix is one that never reaches your main branch. Implement these preventive measures:
Pre-Merge Flake Detection
Run new or modified tests multiple times before merging:
```yaml
# .github/workflows/flake-detection.yml
name: Flake detection
on: pull_request

jobs:
  flake-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # fetch all branches so origin/main exists for the diff
      - name: Find changed test files
        id: changed
        run: |
          # "|| true" keeps the step green when no test files changed
          FILES=$(git diff --name-only origin/main | grep '\.test\.' | tr '\n' ' ' || true)
          echo "files=$FILES" >> $GITHUB_OUTPUT
      - name: Run changed tests 10 times
        if: steps.changed.outputs.files != ''
        run: |
          for i in $(seq 1 10); do
            echo "=== Run $i ==="
            npx jest ${{ steps.changed.outputs.files }} || exit 1
          done
```
Code Review Checklist for Tests
Add these items to your PR review template:
```markdown
## Test Quality Checklist
- [ ] No hardcoded timeouts or sleep calls
- [ ] All async operations use proper await/waitFor patterns
- [ ] Tests clean up after themselves (mocks, state, connections)
- [ ] No dependency on test execution order
- [ ] Environment-specific values are properly configured
- [ ] Date/time values are controlled, not using system clock
```
The Cost of Skipping RCA
Teams that skip flaky test root cause analysis and simply retry or quarantine flaky tests are accumulating technical debt. Each ignored flaky test:
- Erodes developer confidence in the test suite
- Makes it harder to identify real regressions hidden behind flaky noise
- Trains developers to ignore CI failures
- Slows down the entire team when flaky tests block deployments
A proper RCA takes 30 minutes to 2 hours. The cost of not doing it compounds indefinitely.
Automate Your Flaky Test Investigation
Manual investigation does not scale. When your test suite grows to hundreds or thousands of tests, you need automated tools to continuously monitor test reliability, detect flakiness patterns, and prioritize which tests need attention. DeFlaky performs continuous flaky test root cause analysis across your entire test suite, categorizing failures, tracking failure rates over time, and providing actionable fix recommendations.
Start analyzing your test suite today:
npx deflaky run
DeFlaky identifies flaky tests, classifies them by root cause category, and gives your team a prioritized remediation plan. Stop guessing why tests fail intermittently and start applying systematic flaky test root cause analysis to build a test suite your team can trust.