Root Cause Analysis for Flaky Tests: A Systematic Framework
Every engineering team has experienced the frustration: a test fails in CI, you rerun the pipeline, it passes, and you move on. Over time, this pattern erodes trust in your test suite and trains developers to ignore legitimate failures hidden among the noise. The solution is not to retry harder but to perform proper flaky test root cause analysis on every intermittent failure.
Root cause analysis (RCA) for flaky tests is not the same as debugging a broken test. A broken test fails consistently and the cause is usually obvious. A flaky test fails sporadically, and the cause is typically a hidden assumption about the test environment, execution order, timing, or data state. This guide provides a systematic framework that transforms flaky test investigation from guesswork into a repeatable engineering process.
The Five Categories of Test Flakiness
After analyzing thousands of flaky tests across hundreds of codebases, a clear pattern emerges. Nearly every flaky test falls into one of five categories. Identifying which category a failure belongs to is the first and most important step in flaky test root cause analysis.
1. Timing-Dependent Failures
Timing issues are the most common category, responsible for roughly 40% of all flaky tests. They occur when a test makes assumptions about how quickly something will happen.
Symptoms:
- Test passes on fast machines, fails on slow CI runners
- Failures correlate with system load
- Adding `sleep()` or increasing timeouts "fixes" the test

Common causes:
- Hardcoded timeouts instead of event-based waiting
- Race conditions between test setup and assertion
- Animation or transition timing
- Network request timing assumptions
```bash
# Run the test with CPU throttling to expose timing issues
# On Linux:
cpulimit -l 25 -- npx jest path/to/test.spec.js

# Or use nice to lower priority:
nice -n 19 npx jest path/to/test.spec.js
```
```javascript
// Pattern: Replace setTimeout-based waiting with event-based waiting

// FLAKY
await new Promise(resolve => setTimeout(resolve, 2000));
expect(screen.getByText('Loaded')).toBeInTheDocument();

// STABLE
await waitFor(() => {
  expect(screen.getByText('Loaded')).toBeInTheDocument();
}, { timeout: 10000 });
```
2. Order-Dependent Failures
Order-dependent tests pass when run in a specific sequence but fail when run in isolation or in a different order. They account for approximately 25% of flaky tests.
Symptoms:
- Test passes in the full suite but fails when run alone
- Test fails when another test is added or removed from the suite
- Random test ordering reveals failures

Common causes:
- Shared mutable state (global variables, singletons, module-level caches)
- Missing teardown of side effects
- Database records created by previous tests
- Environment variables set by earlier tests
```bash
# Run tests in random order to expose ordering issues (Jest 29.4+)
npx jest --randomize

# Run the failing test in isolation
npx jest --testPathPattern="failing-test.spec.ts"

# If it passes alone, find the test that it depends on.
# Binary search through the test suite:
npx jest --testPathPattern="(testA|testB|failing)" --verbose
```
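Once the culprit pair is found, the fix is almost always to reset shared state in teardown. A minimal sketch of the pattern, using a hypothetical module-level `userCache` (not from the original codebase) as the shared state:

```javascript
// Module-level cache: the classic source of order-dependent flakiness.
// Test A primes it; test B silently depends on it being primed.
const userCache = new Map();

function getUser(id, fetchFn) {
  // Whether this is a cache hit depends on what earlier tests left behind
  if (!userCache.has(id)) {
    userCache.set(id, fetchFn(id));
  }
  return userCache.get(id);
}

// STABLE: expose a reset helper so every test starts from a clean slate,
// e.g. call it from afterEach(() => resetUserCache()) in Jest.
function resetUserCache() {
  userCache.clear();
}
```

The key design choice is that the module owning the state also owns the reset, so test files cannot forget which internals need clearing.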
3. Resource-Dependent Failures
These failures occur when tests compete for shared resources: ports, files, database connections, or external services.
Symptoms:
- Failures increase with parallelism
- "Address already in use" or "Connection refused" errors
- Timeouts when connecting to external services
- File lock or permission errors

Common causes:
- Hardcoded ports for test servers
- Shared database without transaction isolation
- File system operations without unique paths
- Connection pool exhaustion
```javascript
// Diagnose port conflicts
const getPort = require('get-port');

beforeAll(async () => {
  // FLAKY: Hardcoded port
  // server = app.listen(3000);

  // STABLE: Dynamic port allocation
  const port = await getPort();
  server = app.listen(port);
  baseUrl = `http://localhost:${port}`;
});
```
4. Environment-Dependent Failures
Environment flakiness occurs when tests behave differently across machines, operating systems, or CI environments.
Symptoms:
- "Works on my machine" syndrome
- Tests pass on macOS but fail on Linux CI
- Failures only occur in containerized environments
- Timezone or locale-dependent failures

Common causes:
- Filesystem path separator differences (Windows vs Unix)
- Timezone assumptions
- Locale-dependent string formatting
- Missing system dependencies in CI
- Different Node.js or runtime versions
```javascript
// Diagnose environment differences
beforeAll(() => {
  console.log('Node version:', process.version);
  console.log('Platform:', process.platform);
  console.log('Timezone:', Intl.DateTimeFormat().resolvedOptions().timeZone);
  console.log('Locale:', Intl.DateTimeFormat().resolvedOptions().locale);
  console.log('CPU cores:', require('os').cpus().length);
  console.log('Free memory:', require('os').freemem() / 1024 / 1024, 'MB');
});
```
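Logging the differences is step one; pinning them is step two. On POSIX hosts (Linux and macOS), Node picks up a `TZ` environment variable set before any date work happens, so a test setup file can force every machine onto the same timezone. A minimal sketch, assuming a POSIX CI runner:

```javascript
// Pin the timezone before any date handling runs, e.g. in a Jest
// setupFiles entry, which executes before the test framework loads.
process.env.TZ = 'UTC';

const stamp = new Date('2026-06-15T10:00:00Z');
// With TZ pinned, local-time APIs like getHours() agree across CI hosts
console.log('hour:', stamp.getHours());
```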
5. Data-Dependent Failures
Data-dependent flakiness arises when test outcomes depend on the specific data used, and that data is not fully controlled by the test.
Symptoms:
- Tests fail around date boundaries (midnight, month-end, year-end)
- Failures correlate with specific randomly generated input values
- Tests break when seed data changes
- Sorting-related assertions fail intermittently

Common causes:
- `Date.now()` or `Math.random()` without seeding
- Relying on database auto-increment IDs
- Non-deterministic sort order for equal elements
- Floating-point comparison without tolerance
```javascript
// Freeze time to eliminate date-dependent flakiness
beforeEach(() => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2026-06-15T10:00:00Z'));
});

afterEach(() => {
  jest.useRealTimers();
});

// Use deterministic random values
function seededRandom(seed) {
  const x = Math.sin(seed) * 10000;
  return x - Math.floor(x);
}
```
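The last two causes deserve concrete patterns as well. JavaScript's `Array.prototype.sort` has been specified as stable since ES2019, but database `ORDER BY` and older engines make no such guarantee for equal keys, so an explicit tiebreaker keeps the assertion deterministic either way; floating-point results need a tolerance rather than strict equality. A sketch with made-up sample data:

```javascript
// Non-deterministic order for equal keys: add an explicit tiebreaker
const rows = [
  { name: 'b', score: 10 },
  { name: 'a', score: 10 },
];
// FLAKY (against an unstable sort or a database ORDER BY score):
// ties between 'a' and 'b' can come back in either order
// STABLE: break ties on a unique field
rows.sort((x, y) => y.score - x.score || x.name.localeCompare(y.name));

// Floating-point comparison with tolerance instead of strict equality
const total = 0.1 + 0.2;
const withinTolerance = Math.abs(total - 0.3) < 1e-9;
```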
The Investigation Checklist
When you encounter a flaky test, work through this checklist systematically. Do not skip steps, and do not assume you know the cause until you can reproduce the failure deterministically.
Step 1: Gather Evidence
Before changing any code, collect information about the failure:
- [ ] Record the exact error message and stack trace
- [ ] Note the CI run URL and build number
- [ ] Check if the test has failed before (search CI history)
- [ ] Record the test execution time (slow tests are more likely to be timing-dependent)
- [ ] Note what else was running on the CI runner
- [ ] Check if the failure correlates with recent code changes
Step 2: Classify the Failure
Use the five categories above to classify the failure. Ask these questions:
| Question | If Yes, Likely Cause | Category |
|----------|----------------------|----------|
| Does it pass when run alone? | Order-dependent | Ordering |
| Does it fail more on slow machines? | Timing-dependent | Timing |
| Does it fail more with parallelism? | Resource contention | Resource |
| Does it work locally but not in CI? | Environment-dependent | Environment |
| Does it fail near date boundaries? | Data-dependent | Data |
Step 3: Reproduce the Failure
This is the most critical step. A flaky test you cannot reproduce is a flaky test you cannot fix. Here are techniques for each category:
```bash
# Timing: run with reduced resources
docker run --cpus=0.5 --memory=512m your-test-image npm test

# Ordering: randomize and repeat
for i in $(seq 1 50); do
  npx jest --randomize --bail 2>/dev/null || echo "FAILED on run $i"
done

# Resource: increase parallelism
npx jest --maxWorkers=8

# Environment: test in a CI-equivalent container
docker run -it --rm node:20 bash -c "npm ci && npm test"

# Data: run around date boundaries (requires the faketime utility)
TZ=UTC faketime '2026-12-31 23:59:50' npx jest path/to/test
```
Step 4: Isolate the Root Cause
Once you can reproduce the failure, narrow down the cause:
```bash
# Git bisect to find the commit that introduced flakiness
git bisect start
git bisect bad HEAD
git bisect good v1.0.0  # Last known stable version

# At each step, run the test multiple times:
git bisect run bash -c 'for i in $(seq 1 10); do npx jest path/to/test || exit 1; done'
```
Step 5: Fix and Verify
After identifying the root cause, apply the fix and verify it eliminates the flakiness:
```bash
# Run the test many times to confirm the fix
for i in $(seq 1 100); do
  npx jest path/to/test 2>/dev/null || { echo "STILL FLAKY on run $i"; exit 1; }
done
echo "All 100 runs passed - fix verified"
```
Bisecting Flaky Tests
Git bisect is an underused tool for flaky test root cause analysis. The challenge is that standard bisect expects deterministic pass/fail, but flaky tests are probabilistic. The solution is to run multiple iterations at each bisect step:
```bash
#!/bin/bash
# bisect-flaky.sh - Run a test N times to determine if it's flaky at this commit
TEST_PATH=$1
ITERATIONS=${2:-20}
FAILURES=0

for i in $(seq 1 $ITERATIONS); do
  if ! npx jest "$TEST_PATH" --silent 2>/dev/null; then
    FAILURES=$((FAILURES + 1))
  fi
done

echo "Failed $FAILURES out of $ITERATIONS runs"

# Consider the commit "bad" if the test fails more than 10% of the time
if [ $FAILURES -gt $((ITERATIONS / 10)) ]; then
  exit 1  # bad
else
  exit 0  # good
fi
```
Usage:

```bash
git bisect start
git bisect bad HEAD
git bisect good abc123
git bisect run bash bisect-flaky.sh "src/__tests__/flaky.test.ts" 30
```
This approach typically finds the guilty commit within minutes, even for tests that only fail 5-10% of the time.
Building a Flaky Test Knowledge Base
Each flaky test root cause analysis produces valuable knowledge. Capture it systematically so your team learns from every investigation:
```markdown
## Flaky Test Report: UserDashboard.test.tsx

Test: "should display recent activity after login"
Category: Timing
Failure rate: ~15% in CI, 0% locally

Root cause: The test asserted on the activity list immediately after
calling login(), but the activity data was fetched asynchronously.
On CI runners with limited CPU, the fetch did not complete before the
assertion ran.

Fix: Replaced getByTestId('activity-list') with
await findByTestId('activity-list').

Time to diagnose: 2 hours

Lessons learned: All data-fetching assertions should use findBy
queries, not getBy. Added a lint rule to catch this pattern.
```
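The lint rule mentioned in that report can be as simple as enabling an off-the-shelf check. If the project uses `eslint-plugin-testing-library`, its `prefer-find-by` rule flags `waitFor` blocks wrapping `getBy*` queries where a single `findBy*` query belongs. A hypothetical `.eslintrc` fragment:

```json
{
  "plugins": ["testing-library"],
  "rules": {
    "testing-library/prefer-find-by": "error"
  }
}
```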
Over time, this knowledge base reveals patterns. You might discover that 80% of your flaky tests are timing-related, pointing to a systemic issue with how your team writes async assertions. Or you might find that a specific test utility is responsible for most ordering issues.
Preventing Flaky Tests Before They Merge
The cheapest flaky test to fix is one that never reaches your main branch. Implement these preventive measures:
Pre-Merge Flake Detection
Run new or modified tests multiple times before merging:
```yaml
# .github/workflows/flake-detection.yml
name: Flake detection
on: pull_request

jobs:
  flake-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # fetch all branches so origin/main exists for the diff
      - name: Find changed test files
        id: changed
        run: |
          # "|| true" keeps the step green when no test files changed
          FILES=$(git diff --name-only origin/main | grep '\.test\.' | tr '\n' ' ' || true)
          echo "files=$FILES" >> $GITHUB_OUTPUT
      - name: Run changed tests 10 times
        if: steps.changed.outputs.files != ''
        run: |
          for i in $(seq 1 10); do
            echo "=== Run $i ==="
            npx jest ${{ steps.changed.outputs.files }} || exit 1
          done
```
Code Review Checklist for Tests
Add these items to your PR review template:
```markdown
## Test Quality Checklist
- [ ] No hardcoded timeouts or sleep calls
- [ ] All async operations use proper await/waitFor patterns
- [ ] Tests clean up after themselves (mocks, state, connections)
- [ ] No dependency on test execution order
- [ ] Environment-specific values are properly configured
- [ ] Date/time values are controlled, not using system clock
```
The Cost of Skipping RCA
Teams that skip flaky test root cause analysis and simply retry or quarantine flaky tests are accumulating technical debt. Each ignored flaky test:
- Erodes developer confidence in the test suite
- Makes it harder to identify real regressions hidden behind flaky noise
- Trains developers to ignore CI failures
- Slows down the entire team when flaky tests block deployments
A proper RCA takes 30 minutes to 2 hours. The cost of not doing it compounds indefinitely.
Automate Your Flaky Test Investigation
Manual investigation does not scale. When your test suite grows to hundreds or thousands of tests, you need automated tools to continuously monitor test reliability, detect flakiness patterns, and prioritize which tests need attention. DeFlaky performs continuous flaky test root cause analysis across your entire test suite, categorizing failures, tracking failure rates over time, and providing actionable fix recommendations.
Start analyzing your test suite today:
npx deflaky run
DeFlaky identifies flaky tests, classifies them by root cause category, and gives your team a prioritized remediation plan. Stop guessing why tests fail intermittently and start applying systematic flaky test root cause analysis to build a test suite your team can trust.