Flaky Tests in GitHub Actions: Detection, Prevention, and Monitoring
GitHub Actions is the most popular CI/CD platform for open-source projects and an increasingly common choice for enterprise teams. Its ephemeral runner model -- where each job gets a fresh virtual machine that is destroyed after the job completes -- creates unique flakiness patterns that differ from traditional CI servers like Jenkins.
This guide covers GitHub Actions-specific causes of test flakiness, provides ready-to-use workflow configurations for detection, and shows how to build a monitoring pipeline that catches flaky tests before they become problems.
Why GitHub Actions Introduces Unique Flakiness
Ephemeral Environments
Every GitHub Actions job starts on a fresh VM. This eliminates the "works on my CI server because of cached state" problem but introduces a new one: cold-start variability. The first time a job runs, it must install dependencies, warm up caches, and start services from scratch. The time this takes varies between runs, sometimes significantly.
Shared Runner Infrastructure
GitHub-hosted runners share physical infrastructure with other customers. During peak hours, your runner may have less available CPU, memory, and network bandwidth than during off-peak hours. Tests with tight timing assumptions may pass at 2 AM but fail at 2 PM.
Network Variability
GitHub-hosted runners access the internet through shared network infrastructure. npm install, pip install, Docker pulls, and API calls to external services all depend on network performance that varies between runs.
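You cannot control the network, but you can make the package manager more tolerant of it. A minimal sketch, assuming npm (the fetch-retry settings below are standard npm config keys; the values are illustrative):
- name: Configure npm retries
  run: |
    # Retry registry fetches a few times before giving up
    npm config set fetch-retries 5
    npm config set fetch-retry-mintimeout 20000
    npm config set fetch-retry-maxtimeout 120000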
Runner Image Updates
GitHub periodically updates runner images with new OS versions, browser versions, and system library versions. A test that depends on a specific browser rendering behavior may start flaking after a runner image update.
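One way to reduce surprises is to pin the runner label to a specific OS release instead of ubuntu-latest. This does not freeze individual tool versions (GitHub still refreshes the image), but it avoids an unannounced jump to a new OS release; a sketch using ubuntu-22.04 as an example:
jobs:
  test:
    # Pin the OS release; move to a newer label on your own schedule
    runs-on: ubuntu-22.04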
GitHub Actions-Specific Flakiness Patterns
Pattern 1: Dependency Installation Failures
# FLAKY: npm install can fail due to registry timeouts
steps:
- uses: actions/checkout@v4
- run: npm install
- run: npm test
The npm registry occasionally has latency spikes. A timeout during npm install fails the entire job, which looks like a test failure but is actually an infrastructure issue. Fix: cache the npm cache directory, install with npm ci, and set an explicit step timeout.
steps:
- uses: actions/checkout@v4
- name: Cache node_modules
uses: actions/cache@v4
with:
path: ~/.npm
key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
restore-keys: ${{ runner.os }}-node-
- name: Install dependencies
run: npm ci --prefer-offline
timeout-minutes: 5
- name: Run tests
run: npm test
npm ci is deterministic (uses the lock file exactly), --prefer-offline uses cached packages when available, and the explicit timeout-minutes prevents the job from hanging indefinitely.
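If you already use actions/setup-node (as the detection and PR workflows later in this guide do), its built-in cache option gives you the same npm caching with less configuration:
steps:
  - uses: actions/checkout@v4
  - uses: actions/setup-node@v4
    with:
      node-version: 20
      cache: 'npm'   # caches ~/.npm keyed on the lock file
  - name: Install dependencies
    run: npm ci --prefer-offline
    timeout-minutes: 5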
Pattern 2: Service Container Startup Race
# FLAKY: Tests might start before PostgreSQL is ready
services:
postgres:
image: postgres:16
env:
POSTGRES_DB: test
POSTGRES_PASSWORD: test
ports:
- 5432:5432
steps:
- uses: actions/checkout@v4
- run: npm test # Database might not be accepting connections yet
Fix: Add a health check to the service definition.
services:
postgres:
image: postgres:16
env:
POSTGRES_DB: test
POSTGRES_PASSWORD: test
ports:
- 5432:5432
options: >-
--health-cmd pg_isready
--health-interval 10s
--health-timeout 5s
--health-retries 5
steps:
- uses: actions/checkout@v4
- name: Wait for PostgreSQL
run: |
until pg_isready -h localhost -p 5432; do
echo "Waiting for PostgreSQL..."
sleep 2
done
- run: npm test
Pattern 3: Resource Exhaustion on Shared Runners
GitHub-hosted runners have limited resources (typically 2 cores, 7 GB RAM for Linux runners). Running too many parallel tests or memory-intensive browser tests can cause OOM kills and timeouts.
Fix: Limit parallelism and monitor resources.
steps:
- uses: actions/checkout@v4
- run: npm ci
# Limit Jest workers to match available cores
- name: Run tests
run: npx jest --maxWorkers=2
# Or for Playwright, limit browsers
- name: Run E2E tests
run: npx playwright test --workers=1
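If capping workers is not enough, sharding splits the suite across several runners so each job stays within its resource budget. A sketch using Jest's --shard flag (available since Jest 28; Playwright has an equivalent --shard option):
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2, 3]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      # Each job runs one third of the suite, keeping CPU and memory per runner low
      - run: npx jest --shard=${{ matrix.shard }}/3 --maxWorkers=2
Sharding also shortens wall-clock time per job, which narrows the window for timeout-based flakiness.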
Pattern 4: Timezone-Dependent Failures
GitHub-hosted runners use UTC by default. Tests that depend on a specific timezone will fail unless the timezone is explicitly set.
steps:
- uses: actions/checkout@v4
- name: Set timezone
run: sudo timedatectl set-timezone America/New_York
- name: Run tests
run: npm test
env:
TZ: America/New_York
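To flush out timezone assumptions proactively, you can also run the suite under several timezones with a matrix; a minimal sketch (the timezone list is illustrative):
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        tz: ['UTC', 'America/New_York', 'Asia/Tokyo']
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - name: Run tests in ${{ matrix.tz }}
        run: npm test
        env:
          TZ: ${{ matrix.tz }}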
Building a Flaky Test Detection Workflow
This scheduled workflow runs your test suite multiple times and compares the results to detect flaky tests automatically; it can also be triggered manually via workflow_dispatch.
# .github/workflows/flaky-detection.yml
name: Flaky Test Detection
on:
schedule:
# Run every Monday and Thursday at 3 AM UTC
- cron: '0 3 * * 1,4'
workflow_dispatch:
inputs:
runs:
description: 'Number of test runs'
default: '5'
type: string
jobs:
detect:
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
run: [1, 2, 3, 4, 5]  # five fixed runs; wiring in the workflow_dispatch 'runs' input would require fromJSON
steps:
- uses: actions/checkout@v4
- name: Setup Node.js
uses: actions/setup-node@v4
with:
node-version: 20
cache: 'npm'
- run: npm ci
- name: Run tests (attempt ${{ matrix.run }})
run: |
npx jest \
--json \
--outputFile=test-results-${{ matrix.run }}.json \
--forceExit
continue-on-error: true
- name: Upload results
uses: actions/upload-artifact@v4
with:
name: results-run-${{ matrix.run }}
path: test-results-${{ matrix.run }}.json
analyze:
needs: detect
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Download all results
uses: actions/download-artifact@v4
with:
pattern: results-run-*
merge-multiple: true
- name: Analyze flakiness
run: |
echo "## Flaky Test Detection Report" >> $GITHUB_STEP_SUMMARY
echo "" >> $GITHUB_STEP_SUMMARY
# Compare results across runs
node -e "
const fs = require('fs');
const results = {};
for (let i = 1; i <= 5; i++) {
const file = 'test-results-' + i + '.json';
if (!fs.existsSync(file)) continue;
const data = JSON.parse(fs.readFileSync(file, 'utf-8'));
data.testResults.forEach(suite => {
suite.testResults.forEach(test => {
const key = test.fullName;
if (!results[key]) results[key] = [];
results[key].push(test.status);
});
});
}
const flaky = Object.entries(results)
.filter(([, statuses]) => {
const unique = new Set(statuses);
return unique.size > 1;
})
.map(([name, statuses]) => ({
name,
passes: statuses.filter(s => s === 'passed').length,
fails: statuses.filter(s => s === 'failed').length,
}))
.sort((a, b) => b.fails - a.fails);
if (flaky.length === 0) {
console.log('No flaky tests detected across 5 runs.');
} else {
console.log('Flaky tests detected: ' + flaky.length);
flaky.forEach(t => {
console.log(' ' + t.name + ' (passed: ' + t.passes + ', failed: ' + t.fails + ')');
});
}
"
Monitoring Flaky Tests Across PR Builds
Add flaky test monitoring to every pull request to catch new flakiness before it merges.
# .github/workflows/pr-tests.yml
name: PR Tests
on:
pull_request:
branches: [main]
jobs:
test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: 20
cache: 'npm'
- run: npm ci
- name: Run tests with JUnit output
run: |
npx jest \
--reporters=default \
--reporters=jest-junit \
--forceExit
env:
JEST_JUNIT_OUTPUT_DIR: ./test-results
JEST_JUNIT_OUTPUT_NAME: results.xml
continue-on-error: true
- name: Analyze with DeFlaky
run: |
npx deflaky analyze \
--input test-results/results.xml \
--format junit \
--threshold 0.05
continue-on-error: true
- name: Upload test results
if: always()
uses: actions/upload-artifact@v4
with:
name: test-results
path: test-results/
- name: Push results to DeFlaky Dashboard
if: always()
run: |
npx deflaky push \
--input test-results/results.xml \
--project ${{ github.repository }} \
--commit ${{ github.sha }} \
--branch ${{ github.head_ref }}
env:
DEFLAKY_TOKEN: ${{ secrets.DEFLAKY_TOKEN }}
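If you also want the JUnit results surfaced directly on the pull request as a check, a test-reporter action can publish them. A sketch using dorny/test-reporter, one option among several; it needs checks: write permission on the job:
      - name: Publish test report
        if: always()
        uses: dorny/test-reporter@v1
        with:
          name: Jest results
          path: test-results/results.xml
          reporter: jest-junit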
Retry Configuration for GitHub Actions
GitHub Actions has no built-in job-level retry, so when you need infrastructure-level retries, use one of these patterns instead.
Test-Level Retries (Preferred)
Configure retries in your test framework rather than at the workflow level. Jest has no retry flag on the CLI; call jest.retryTimes(2) in a setup file instead (it requires the default jest-circus runner). Playwright exposes retries directly:
- name: Run E2E tests with retries
  run: npx playwright test --retries=2
Step-Level Retries
Use the nick-fields/retry action for steps that might fail due to infrastructure issues.
- name: Run tests with step retry
uses: nick-fields/retry@v3
with:
timeout_minutes: 15
max_attempts: 3
command: npm test
Job-Level Retries via Reusable Workflow
# .github/workflows/test-with-retry.yml
name: Tests with Retry
on: [push]
jobs:
test:
uses: ./.github/workflows/run-tests.yml
retry-on-failure:
needs: test
if: failure()
uses: ./.github/workflows/run-tests.yml
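For this pattern to work, run-tests.yml itself must be callable, i.e. declare a workflow_call trigger. A minimal sketch of what that file might look like:
# .github/workflows/run-tests.yml
name: Run Tests
on:
  workflow_call:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npm test
Keep in mind that the retry job re-runs the whole callable workflow, so every test executes again, not just the ones that failed.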
Artifact-Based Debugging
When a test fails in GitHub Actions, you need artifacts to debug it. Configure comprehensive artifact collection.
- name: Run Playwright tests
run: npx playwright test
continue-on-error: true
- name: Upload test artifacts
if: always()
uses: actions/upload-artifact@v4
with:
name: test-artifacts-${{ github.run_id }}
path: |
test-results/
playwright-report/
retention-days: 14
For Playwright specifically, enable trace capture on retries:
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
retries: process.env.CI ? 2 : 0,
use: {
trace: 'on-first-retry',
screenshot: 'only-on-failure',
video: 'on-first-retry',
},
});
Long-Term Monitoring with DeFlaky
For ongoing monitoring across all your GitHub Actions workflow runs, integrate DeFlaky into your pipeline.
# Add to your main test workflow
- name: Push results to DeFlaky
if: always()
run: |
deflaky push \
--input test-results.xml \
--project "${{ github.repository }}" \
--commit "${{ github.sha }}" \
--branch "${{ github.ref_name }}" \
--run-id "${{ github.run_id }}"
env:
DEFLAKY_TOKEN: ${{ secrets.DEFLAKY_TOKEN }}
The DeFlaky Dashboard aggregates results across all workflow runs, computing FlakeScore per test and per suite. You can see at a glance which tests are the most flaky, whether flakiness is trending up or down, and which GitHub Actions workflow runs were affected.
Set up alerts to get notified when a previously stable test becomes flaky:
# Configure alerts in DeFlaky
deflaky alerts create \
--project "my-org/my-repo" \
--condition "flakescore < 80" \
--channel slack \
--webhook "$SLACK_WEBHOOK"
Conclusion
GitHub Actions is an excellent CI platform, but its shared, ephemeral runner model introduces flakiness patterns that teams on dedicated CI servers may not have encountered. Dependency installation failures, service startup races, resource exhaustion, and runner image changes all contribute to tests that pass locally but fail intermittently in CI.
The solutions are straightforward: cache aggressively, health-check service containers, limit parallelism, and set explicit environment variables. For ongoing monitoring, integrate DeFlaky into your GitHub Actions workflows to track test reliability across every run and catch new flakiness before it becomes entrenched.
Build the detection workflow from this article, run it weekly, and review the results. Within a month, you will have a clear map of your flaky tests and a prioritized list of fixes. That is the first step toward a CI pipeline your team can trust.