How to Build a Flaky Test Monitoring Dashboard for Your Team
You cannot fix what you cannot see. Most engineering teams know they have flaky tests, but they lack the data to quantify the problem, prioritize fixes, or measure improvement. A flaky test dashboard changes that by making test reliability visible, measurable, and actionable.
Without a flaky test dashboard, teams rely on gut feelings: "I think the checkout test is flaky" or "CI has been red a lot lately." That is not a strategy. It is a recipe for slow, frustrating debugging sessions and eroded confidence in your test suite.
This guide covers everything you need to build a meaningful test monitoring system -- from the metrics that matter to the alerting strategies that keep your team informed without drowning them in noise.
Why You Need a Flaky Test Dashboard
Before diving into the how, let us be clear about the why. A dedicated flaky test dashboard serves three critical purposes:
1. Visibility. Everyone on the team can see which tests are unreliable. This eliminates the "works on my machine" debates and the finger-pointing that often follows CI failures.
2. Prioritization. Not all flaky tests are equally harmful. A dashboard shows you which tests fail most often, which block the most pipelines, and which waste the most developer time. You fix the worst offenders first.
3. Accountability. When flake rates are visible, teams naturally take ownership. A dashboard that shows improvement over time is motivating. One that shows degradation prompts action before things spiral.

Essential Metrics for Your Flaky Test Dashboard
The difference between a useful dashboard and a wall of noise comes down to choosing the right metrics. Here are the metrics that actually drive decisions.
Flake Rate
The most fundamental metric. Flake rate is the percentage of test runs that produce inconsistent results over a given time window.
Flake Rate = (Flaky Runs / Total Runs) x 100
A "flaky run" is any run where a test passes on retry after initially failing, or where a test produces different results across multiple executions of the same commit.
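As a minimal sketch of this calculation (the run schema here is illustrative, not from any particular CI system), counting flaky runs over a window looks like:

```python
def flake_rate(runs):
    """Percentage of runs in the window that were flaky.

    Each run is a dict with a boolean 'flaky' flag -- set, for example,
    when the test passed on retry after an initial failure. The schema
    is illustrative; adapt it to however your CI stores results.
    """
    if not runs:
        return 0.0
    flaky_runs = sum(1 for run in runs if run["flaky"])
    return flaky_runs / len(runs) * 100


runs = [{"flaky": False}] * 95 + [{"flaky": True}] * 5
print(flake_rate(runs))  # 5.0
```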
Track flake rate at three levels:
FlakeScore
Raw flake rate does not capture the full impact of a flaky test. A test that flakes once a month in a rarely-run nightly suite is very different from a test that flakes daily in your PR pipeline.
FlakeScore combines multiple dimensions into a single priority number:
FlakeScore = Flake Rate x Run Frequency x Pipeline Criticality x Time-to-Fix
Where:
- Flake Rate: the test's flake percentage over the measurement window
- Run Frequency: how often the test executes (e.g., runs per day)
- Pipeline Criticality: a weight reflecting how blocking the pipeline is (a PR gate outranks a nightly suite)
- Time-to-Fix: how long the test has been flaky without resolution
A test with a 5% flake rate that runs 50 times per day in your PR pipeline has a much higher FlakeScore than a test with a 20% flake rate that runs once nightly. FlakeScore tells you which to fix first.
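A minimal sketch of that comparison (the formula is the one above; the criticality weights are assumptions for illustration):

```python
def flake_score(flake_rate, runs_per_day, criticality, days_unfixed=1):
    """FlakeScore = flake rate x run frequency x pipeline criticality
    x time-to-fix (days the test has been flaky)."""
    return flake_rate * runs_per_day * criticality * days_unfixed


# Assumed weights: blocking PR pipelines matter most, nightly least.
CRITICALITY = {"pr": 3.0, "main": 2.0, "nightly": 1.0}

pr_test = flake_score(5.0, 50, CRITICALITY["pr"])            # 750.0
nightly_test = flake_score(20.0, 1, CRITICALITY["nightly"])  # 20.0
print(pr_test > nightly_test)  # True
```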
Mean Time to Detect (MTTD)
How long does it take from when a test becomes flaky to when your team notices? If your MTTD is measured in weeks, flaky tests are silently accumulating damage -- slowing CI, training developers to ignore failures, and masking real bugs.
A good flaky test dashboard reduces MTTD to hours by automatically flagging new flaky tests as they appear.
Mean Time to Resolve (MTTR)
Once a flaky test is identified, how long does it take to fix or quarantine it? Track this metric to understand whether your team is actually addressing flakiness or just acknowledging it.
Flaky Test Inventory
A running count of known-flaky tests, broken down by status:
- Active: detected but not yet addressed
- Quarantined: excluded from blocking pipelines while awaiting a fix
- Fixed: resolved, but monitored in case flakiness returns
Historical Trends
Point-in-time metrics are useful, but trends tell the real story. Track these over weekly and monthly windows:
- Is the total number of flaky tests increasing or decreasing?
- Is the suite-level flake rate improving?
- How many flaky tests were fixed this sprint vs. how many new ones appeared?
- Which teams are improving? Which are falling behind?
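These trend questions reduce to simple aggregation once lifecycle events are recorded. A sketch, assuming each event is a (sprint, kind) pair -- the schema is illustrative:

```python
from collections import Counter


def sprint_trend(events):
    """Tally 'new' vs 'fixed' flaky-test events per sprint.

    events: iterable of (sprint, kind) pairs, where kind is 'new' or
    'fixed'. Illustrative schema -- in practice these would come from
    your test-results store.
    """
    counts = Counter(events)
    sprints = sorted({sprint for sprint, _ in events})
    return {s: {"new": counts[(s, "new")], "fixed": counts[(s, "fixed")]}
            for s in sprints}


events = [("S1", "new"), ("S1", "new"), ("S1", "fixed"),
          ("S2", "fixed"), ("S2", "fixed"), ("S2", "new")]
print(sprint_trend(events))
# {'S1': {'new': 2, 'fixed': 1}, 'S2': {'new': 1, 'fixed': 2}}
```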
Designing Your Dashboard Layout
A well-designed flaky test dashboard has three zones: the executive summary, the investigation view, and the detail view.
Executive Summary (Top of Dashboard)
This section answers the question: "How healthy is our test suite right now?"
Include these widgets:
Investigation View (Middle)
This section helps engineers diagnose and prioritize:
Detail View (Drill-Down)
When an engineer clicks on a specific test, they should see:
- Full test name and file path
- Flake rate over the last 30 days
- Last 20 run results (pass/fail timeline)
- Associated error messages and stack traces
- Commit that introduced the test
- Commit where flakiness was first detected
- Suggested root cause category
Collecting the Data
The hardest part of building a flaky test dashboard is collecting reliable data. Here are three approaches, from simplest to most comprehensive.
Approach 1: Parse CI Logs
Extract test results from your CI system's logs or JUnit XML reports:
import xml.etree.ElementTree as ET
from datetime import datetime

def parse_junit_xml(xml_path, build_id, commit_sha):
    tree = ET.parse(xml_path)
    root = tree.getroot()
    results = []
    for testsuite in root.findall('.//testsuite'):
        for testcase in testsuite.findall('testcase'):
            result = {
                'name': testcase.get('name'),
                'classname': testcase.get('classname'),
                'duration': float(testcase.get('time', 0)),
                'build_id': build_id,
                'commit': commit_sha,
                'timestamp': datetime.utcnow().isoformat(),
                'status': 'pass',
            }
            failure = testcase.find('failure')
            if failure is not None:
                result['status'] = 'fail'
                result['error'] = failure.get('message')
            elif testcase.find('skipped') is not None:
                result['status'] = 'skip'
            results.append(result)
    return results
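Once parsed results accumulate across builds, a first-pass flake detector is a grouping problem: flag any test that both passed and failed on the same commit. A minimal sketch (the field names match the parser's output above):

```python
from collections import defaultdict


def detect_flaky(results):
    """Flag tests that both passed and failed on the same commit.

    results: dicts with 'name', 'commit', and 'status' keys -- the
    fields emitted by parse_junit_xml above.
    """
    statuses = defaultdict(set)
    for r in results:
        statuses[(r["name"], r["commit"])].add(r["status"])
    return sorted({name for (name, _), seen in statuses.items()
                   if {"pass", "fail"} <= seen})


results = [
    {"name": "test_checkout", "commit": "abc123", "status": "fail"},
    {"name": "test_checkout", "commit": "abc123", "status": "pass"},
    {"name": "test_login", "commit": "abc123", "status": "pass"},
]
print(detect_flaky(results))  # ['test_checkout']
```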
Approach 2: Test Framework Plugins
Many frameworks have plugins or reporters that can send results directly to a database:
// Custom Jest reporter that sends results to your dashboard API
class FlakeReporter {
  constructor(globalConfig, options) {
    this.apiEndpoint = options.apiEndpoint;
  }

  onTestResult(test, testResult) {
    const results = testResult.testResults.map(r => ({
      name: r.fullName,
      status: r.status,
      duration: r.duration,
      retries: r.invocations - 1,
      errorMessage: r.failureMessages?.join('\n'),
      file: test.path,
      timestamp: new Date().toISOString()
    }));

    fetch(this.apiEndpoint, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ results })
    });
  }
}

module.exports = FlakeReporter;
Approach 3: Use DeFlaky
DeFlaky collects test results automatically, detects flaky tests using statistical analysis, calculates FlakeScore, and provides a ready-made dashboard with all the metrics described in this guide:
npx deflaky run --report
This approach skips the custom infrastructure entirely and gives you a production-ready flaky test dashboard in minutes rather than weeks.
Setting Up Alerts
A dashboard that nobody checks is useless. Alerts ensure the right people see the right information at the right time.
Alert Tiers
Tier 1 -- Immediate (Slack/Teams notification):
- Suite flake rate exceeds threshold (e.g., > 10%)
- A new test becomes flaky in a PR pipeline
- A previously fixed test starts flaking again

Tier 2 -- Daily digest:
- Summary of flaky tests detected in the last 24 hours
- Tests that have been flaky for more than 7 days without action
- FlakeScore changes for top offenders

Tier 3 -- Weekly report:
- Overall trend analysis
- Team-by-team flake rate comparison
- ROI of flakiness fixes (time saved)
Example Alert Configuration
alerts:
  flake_rate_spike:
    condition: "suite_flake_rate > 0.10"
    channel: "#test-reliability"
    severity: "high"
    message: "Suite flake rate has exceeded 10% ({current_rate}%)"

  new_flaky_test:
    condition: "test.is_newly_flaky == true"
    channel: "#test-reliability"
    severity: "medium"
    message: "New flaky test detected: {test_name} (flake rate: {rate}%)"

  stale_flaky_test:
    condition: "test.days_since_detection > 14 AND test.status == 'active'"
    channel: "#engineering-leads"
    severity: "medium"
    message: "{test_name} has been flaky for {days} days without action"
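A config like this needs something to evaluate it. A minimal sketch of the flake-rate-spike rule (the payload shape is illustrative, not any particular chat API):

```python
def evaluate_flake_rate_alert(suite_flake_rate, threshold=0.10,
                              channel="#test-reliability"):
    """Return an alert payload when the suite flake rate exceeds the
    threshold, else None. Mirrors the flake_rate_spike rule above."""
    if suite_flake_rate <= threshold:
        return None
    return {
        "channel": channel,
        "severity": "high",
        "message": (f"Suite flake rate has exceeded "
                    f"{threshold:.0%} ({suite_flake_rate:.1%})"),
    }


print(evaluate_flake_rate_alert(0.13))
print(evaluate_flake_rate_alert(0.05))  # None
```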
Avoiding Alert Fatigue
The biggest risk with alerting is noise. Follow these principles:
Team Process Integration
A flaky test dashboard is most effective when it is woven into your team's existing processes.
Sprint Planning
Review the flaky test dashboard at the start of each sprint. Allocate time to fix the top 3-5 offenders based on FlakeScore. This makes flakiness reduction a visible, planned activity rather than something engineers do when they are frustrated.
PR Reviews
When reviewing a PR that adds new tests, check whether those tests have been run multiple times to verify they are not flaky. Some teams require new tests to pass 5 consecutive runs before merging.
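That pre-merge check is easy to script. A sketch that runs an arbitrary test command N times and stops at the first failure (the command is whatever invokes your framework -- pytest, jest, and so on):

```python
import subprocess
import sys


def passes_consecutively(cmd, runs=5):
    """Run a test command `runs` times; return False on the first failure."""
    for attempt in range(1, runs + 1):
        if subprocess.run(cmd).returncode != 0:
            print(f"run {attempt}/{runs} failed -- treat the test as flaky")
            return False
    print(f"{runs}/{runs} consecutive passes")
    return True


# Example: a trivially passing command stands in for a real test run.
print(passes_consecutively([sys.executable, "-c", "pass"], runs=3))  # True
```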
Incident Response
When a CI pipeline blocks a release due to test failures, the flaky test dashboard should be the first place the team checks. If the failing test is a known flaky test, the team can make an informed decision about whether to retry, quarantine, or investigate.
Retrospectives
Include test reliability metrics in your sprint retrospectives. Celebrate improvements. Discuss what caused new flakiness. The goal is to make test reliability part of the team's definition of quality.
Measuring ROI of Your Dashboard
To justify the investment in test monitoring, track the impact:
What Makes DeFlaky's Dashboard Different
While you can build a custom flaky test dashboard with the approaches above, DeFlaky provides a purpose-built solution that includes:
The goal is to give you a production-ready test reliability dashboard without the weeks of custom development.
Conclusion
Building a flaky test dashboard is one of the highest-leverage investments a testing team can make. It transforms flakiness from a vague annoyance into a quantifiable, manageable problem. The metrics described in this guide -- flake rate, FlakeScore, MTTD, MTTR, and historical trends -- give you the data you need to prioritize fixes, measure improvement, and hold the line against test suite degradation.
Whether you build your own or use an existing tool, the important thing is to start tracking. You cannot improve what you do not measure.
Get a flaky test dashboard for your team in minutes, not months. DeFlaky provides automatic detection, FlakeScore ranking, and team-level visibility out of the box:

npx deflaky run