How to Detect Flaky Tests in GitHub Actions with DeFlaky

Flaky tests silently erode confidence in your CI pipeline. A test that passes locally but fails randomly in GitHub Actions wastes developer time, blocks deployments, and eventually leads teams to ignore CI failures entirely. The solution is not to retry and hope -- it is to systematically detect which tests are flaky and fix them.

DeFlaky automates this detection. It runs your test suite multiple times, identifies tests that produce inconsistent results, calculates a FlakeScore, and pushes everything to a dashboard where you can track reliability over time. This guide walks through setting it up in GitHub Actions from scratch.

Why Run Flaky Test Detection in CI

Running DeFlaky locally tells you about flakiness on your machine. Running it in CI tells you about flakiness where it actually matters -- in the environment where your tests gate deployments.

There are several reasons CI-based detection is essential:

Environment differences: GitHub Actions runners have different CPU, memory, and network characteristics than developer machines. Tests that are stable locally may flake in CI due to resource contention or network latency.

Consistency: Every run happens on a clean VM with identical dependencies. This removes the "works on my machine" variable and gives you reliable flakiness data.

Automation: Schedule weekly detection runs without anyone having to remember. New flaky tests get caught before they become entrenched.

PR gating: Fail PRs that introduce new flaky tests before they merge into main.

Historical tracking: The DeFlaky dashboard aggregates results across all CI runs, showing trends over time.

Prerequisites

Before starting, you need:

A GitHub repository with a test suite (Playwright, Cypress, Jest, Pytest, or any other framework)

A DeFlaky account -- sign up at deflaky.com

A DeFlaky API token (it begins with the prefix df_)

Step 1: Get Your DeFlaky Token

Go to the DeFlaky Dashboard and sign in.

Click New Project and give it a name matching your repository.

Copy the generated API token. It looks like df_a1b2c3d4-e5f6-7890-abcd-ef1234567890.

Keep this token -- you will add it to GitHub in the next step.

Step 2: Add the Token as a GitHub Secret

Your DeFlaky token must never be committed to source control. GitHub Secrets keeps it encrypted and only exposes it to workflows at runtime.

Open your GitHub repository in a browser.

Go to Settings > Secrets and variables > Actions.

Click New repository secret.

Set the name to DEFLAKY_TOKEN and paste your token as the value.

Click Add secret.

The token is now available in your workflows as ${{ secrets.DEFLAKY_TOKEN }}.

Step 3: Create the Workflow File

Create a new file at .github/workflows/deflaky.yml in your repository. Below is a complete workflow for a Playwright project:

name: DeFlaky - Flaky Test Detection

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1'  # Weekly on Monday at 2am UTC

jobs:
  deflaky:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - run: npm ci

      - name: Install Playwright browsers
        run: npx playwright install --with-deps

      - name: Install DeFlaky CLI
        run: npm install -g deflaky-cli

      - name: Run DeFlaky
        run: deflaky run -c "npx playwright test" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}
        env:
          DEFLAKY_TOKEN: ${{ secrets.DEFLAKY_TOKEN }}

This workflow triggers on three events:

Push to main: Detect flakiness in code that just merged.

Pull requests: Catch flaky tests before they merge.

Weekly schedule: Ongoing monitoring even when no code changes.
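The Run DeFlaky step in the workflow above passes the token inline and also sets it in env. As a minimal sketch, assuming only the --token flag shown in this guide (whether the CLI reads DEFLAKY_TOKEN from the environment on its own is not confirmed here), the step can map the secret to an environment variable once and let the shell expand it:

```yaml
- name: Run DeFlaky
  # The shell expands $DEFLAKY_TOKEN at runtime, so the secret expression
  # appears in exactly one place (the env mapping below).
  run: deflaky run -c "npx playwright test" -r 3 --push --token "$DEFLAKY_TOKEN"
  env:
    DEFLAKY_TOKEN: ${{ secrets.DEFLAKY_TOKEN }}
```

Either form works with GitHub's secret masking; this variant simply avoids repeating the secret expression in every command.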

Framework-Specific Examples

For the most part, the only thing that changes between frameworks is the test command passed to deflaky run -c, plus any framework-specific setup steps. Here are ready-to-use examples for the most popular frameworks.

Playwright

- name: Install Playwright browsers
  run: npx playwright install --with-deps

- name: Run DeFlaky
  run: deflaky run -c "npx playwright test" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}

For single-browser testing to save CI minutes:

- name: Run DeFlaky (Chromium only)
  run: deflaky run -c "npx playwright test --project=chromium" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}

Cypress

- name: Run DeFlaky
  run: deflaky run -c "npx cypress run" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}

For a specific spec file:

- name: Run DeFlaky
  run: deflaky run -c "npx cypress run --spec cypress/e2e/checkout.cy.ts" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}

Jest

- name: Run DeFlaky
  run: deflaky run -c "npx jest --ci" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}

The --ci flag tells Jest that it is running in a CI environment. In particular, new snapshots are not written automatically; a test that would need one fails instead.

Pytest

For Python projects, make sure Python and your dependencies are installed first:

- uses: actions/setup-python@v5
  with:
    python-version: '3.12'

- run: pip install -r requirements.txt

- name: Install DeFlaky CLI
  run: npm install -g deflaky-cli

- name: Run DeFlaky
  run: deflaky run -c "pytest" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}

For Pytest with JUnit XML output for richer reporting:

- name: Run DeFlaky
  run: deflaky run -c "pytest --junitxml=report.xml" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}

Configuration Options

DeFlaky supports several flags that control behavior in CI:

| Flag | Description | Example |
|------|-------------|---------|
| -c | Test command to run | -c "npx playwright test" |
| -r | Number of runs | -r 5 |
| --push | Push results to dashboard | --push |
| --token | API token | --token ${{ secrets.DEFLAKY_TOKEN }} |
| --fail-threshold | Fail CI if FlakeScore below N% | --fail-threshold 90 |
| --verbose | Show detailed output per run | --verbose |

Failing CI on Low FlakeScore

Use --fail-threshold to enforce a minimum FlakeScore. If the score drops below the threshold, the workflow step exits with code 1 and the CI check fails:

- name: Run DeFlaky (strict mode)
  run: deflaky run -c "npx playwright test" -r 5 --push --token ${{ secrets.DEFLAKY_TOKEN }} --fail-threshold 90

This is useful for preventing PRs from merging when they introduce flaky tests.
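One way to apply the threshold only where it gates merging is to split the step by event. This is a sketch under the assumption that you want strict mode on pull requests and report-only runs elsewhere; it uses only the flags listed above plus the standard github.event_name context:

```yaml
# Strict mode: fails the check on pull requests when the FlakeScore is below 90.
- name: Run DeFlaky (strict, pull requests only)
  if: github.event_name == 'pull_request'
  run: deflaky run -c "npx playwright test" -r 5 --push --token ${{ secrets.DEFLAKY_TOKEN }} --fail-threshold 90

# Report-only mode: pushes and scheduled runs record results without a threshold.
- name: Run DeFlaky (report only)
  if: github.event_name != 'pull_request'
  run: deflaky run -c "npx playwright test" -r 5 --push --token ${{ secrets.DEFLAKY_TOKEN }}
```

Exactly one of the two steps runs per workflow execution, so the dashboard still receives results for every trigger.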

Scheduling Weekly Runs

The schedule trigger in GitHub Actions uses cron syntax. Here are common schedules:

on:
  schedule:
    # Every Monday at 2am UTC
    - cron: '0 2 * * 1'

    # Every day at midnight UTC
    # - cron: '0 0 * * *'

    # Every Monday and Thursday at 3am UTC
    # - cron: '0 3 * * 1,4'

Weekly runs give you a reliable baseline even during periods with no active development. The DeFlaky dashboard shows trends over time so you can see whether your test suite is becoming more or less reliable.
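If you also want to start a detection run by hand, for example to verify a new schedule, a small sketch is to pair the schedule with workflow_dispatch. That trigger is a standard GitHub Actions feature, not something DeFlaky requires:

```yaml
on:
  schedule:
    # Five fields: minute, hour, day of month, month, day of week (all times UTC)
    - cron: '0 2 * * 1'
  # Adds a "Run workflow" button on the Actions tab for manual runs
  workflow_dispatch:
```

Scheduled workflows run against the latest commit on the default branch, so the workflow file has to be merged there before the schedule takes effect.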

Viewing Results on the Dashboard

After your workflow runs, results are automatically pushed to the DeFlaky Dashboard. There you can see:

FlakeScore trend: A graph showing your test suite's reliability over time.

Flaky test list: Every test that produced inconsistent results, sorted by severity.

Pass rate per test: How many runs passed vs. failed for each flaky test.

Stack traces: The actual error messages from failed runs.

First seen / last seen: When each flaky test was first detected and when it last flaked.

Filter by branch, date range, or test name to drill into specific issues.

PR Comments with FlakeScore (Coming Soon)

We are working on a feature that automatically posts a comment on every pull request with the FlakeScore and a summary of any flaky tests detected. The comment will include:

Overall FlakeScore for the PR

List of flaky tests with pass/fail counts

Comparison against the main branch baseline

Direct link to the test details on the dashboard

Until this feature ships, you can achieve the same result using actions/github-script in your workflow. See the full example workflow for a working implementation that posts PR comments.
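As a rough sketch of the actions/github-script approach, the job below captures the CLI's console output in a file and posts the tail of it as a PR comment. The file name deflaky-output.txt and the comment layout are illustrative assumptions, and the sketch posts raw output rather than a parsed FlakeScore because this guide does not describe a machine-readable output format:

```yaml
jobs:
  deflaky:
    runs-on: ubuntu-latest
    permissions:
      contents: read        # needed by actions/checkout
      pull-requests: write  # needed to post the comment
    steps:
      # ... checkout, setup, and install steps as in the main workflow ...

      - name: Run DeFlaky and capture its output
        # Explicit bash runs with pipefail, so tee does not hide a failing exit code.
        shell: bash
        run: |
          deflaky run -c "npx playwright test" -r 3 --push --token "$DEFLAKY_TOKEN" 2>&1 | tee deflaky-output.txt
        env:
          DEFLAKY_TOKEN: ${{ secrets.DEFLAKY_TOKEN }}

      - name: Comment on the pull request
        # always() keeps this step running even when the DeFlaky step failed.
        if: always() && github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const output = fs.existsSync('deflaky-output.txt')
              ? fs.readFileSync('deflaky-output.txt', 'utf8')
              : 'No DeFlaky output was captured.';
            // Build the code-fence marker programmatically and keep only the
            // last 6000 characters so the comment stays readable.
            const fence = '`'.repeat(3);
            await github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `### DeFlaky results\n\n${fence}\n${output.slice(-6000)}\n${fence}`,
            });
```

On pull requests from forks the default GITHUB_TOKEN is read-only, so the comment step would fail there; this sketch is meant for branches within the same repository.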

Troubleshooting

"deflaky: command not found"

Make sure you install the CLI before running it:

- run: npm install -g deflaky-cli

Or use npx:

- run: npx deflaky-cli run -c "npx playwright test" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}

Token Not Working

Verify the secret name matches exactly: DEFLAKY_TOKEN in both the secret and the workflow reference.

Tokens begin with df_. Make sure you copied the full token from the dashboard.

Secrets are not passed to workflows triggered by pull requests from forks. If you accept PRs from forks, consider the pull_request_target event, but use it with caution: it runs with access to your secrets, so it should not check out and execute untrusted code from the PR.
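A more conservative alternative to pull_request_target is to skip the DeFlaky step when the pull request comes from a fork. The condition below is a sketch that relies only on standard GitHub Actions contexts:

```yaml
- name: Run DeFlaky
  # Runs on pushes and schedules, and on pull requests whose head branch lives
  # in this repository. Fork PRs are skipped because secrets.DEFLAKY_TOKEN
  # would be empty there.
  if: github.event_name != 'pull_request' || github.event.pull_request.head.repo.full_name == github.repository
  run: deflaky run -c "npx playwright test" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}
```

Fork contributions then show the step as skipped instead of failing with an authentication error, and the push to main after merge still records results.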

Tests Timing Out

GitHub Actions has a default job timeout of 6 hours. For faster feedback, set an explicit timeout:

jobs:
  deflaky:
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps: ...

Since DeFlaky runs your tests multiple times, multiply your normal test duration by the number of runs, then add some buffer for checkout and dependency installation.
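As a worked example with hypothetical numbers, suppose one run of the suite takes about 8 minutes and the workflow uses -r 3:

```yaml
jobs:
  deflaky:
    runs-on: ubuntu-latest
    # Hypothetical sizing: 8 min per run x 3 runs = 24 min of test time,
    # plus about 6 min for checkout, npm ci, and browser installation.
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm install -g deflaky-cli
      - run: deflaky run -c "npx playwright test" -r 3 --push --token ${{ secrets.DEFLAKY_TOKEN }}
```

If you raise -r later, scale timeout-minutes with it; otherwise the job may be cancelled partway through a run.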

Playwright Browser Installation Fails

Always use --with-deps to install system dependencies:

- run: npx playwright install --with-deps

This installs the browser binaries and all required system libraries (like libgbm and libwoff2) on the Ubuntu runner.

Complete Example Workflow

For a ready-to-copy workflow file with PR comments, step outputs, and multi-framework support, download the example workflow and place it at .github/workflows/deflaky.yml in your repository.

What's Next

Read the full DeFlaky documentation for advanced configuration

Set up dashboard alerts to get notified when FlakeScore drops

Explore the CLI reference for all available commands and flags

Check out our guide on fixing flaky tests for strategies to eliminate flakiness at the source

Detecting flaky tests is the first step. The goal is to fix them and build a test suite your team can trust. DeFlaky gives you the data to prioritize which flaky tests to fix first, based on how frequently they flake and how much CI time they waste.