QUIP 24: Rules and guidelines for ignoring results of flaky automatic tests

by Anna Wojciechowska, Axel Spörl, Edward Welbourne

Details

  • Number: 24
  • Title: Rules and guidelines for ignoring results of flaky automatic tests
  • Author: Anna Wojciechowska, Axel Spörl, Edward Welbourne
  • Status: Active
  • Type: Process
  • Created: 19 July 2024

Motivation

The blacklisting functionality was introduced in Qt 5.4, replacing the previous insignificant_test mechanism. Blacklisting was implemented as a tool to mitigate the CI (Continuous Integration)-blocking effect of flaky tests. A blacklisted test runs as usual, but if it fails, it prints BFAIL instead of FAIL and the exit code is successful. If it passes, it prints BPASS instead of PASS. Blacklisting thus allows integrations to proceed despite a failing automated test, which creates technical debt. While it solves an immediate problem, it carries the risk that faulty commits are merged instead of being stopped by a test failure.
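The reporting rule described above can be modelled in a few lines. This is an illustrative sketch only, not the Qt Test implementation; the function name `report` and its parameters are hypothetical, and outcomes other than plain pass and fail are left out.

```python
def report(passed: bool, blacklisted: bool) -> tuple[str, int]:
    """Return (printed result, contribution to the exit code) for one test function."""
    if blacklisted:
        # A blacklisted test still runs, but a failure never affects the exit code.
        return ("BPASS" if passed else "BFAIL", 0)
    return ("PASS", 0) if passed else ("FAIL", 1)


for blacklisted in (False, True):
    for passed in (True, False):
        print(blacklisted, passed, report(passed, blacklisted))
```

The only difference between the two branches is the label and the exit-code contribution, which is why a blacklisted failure cannot block an integration.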

Since its introduction, there has been a steady increase in the number of blacklisted automated tests. There are different approaches to calculating the number of blacklisted tests. In this document, we refer to the blacklisted functions metric, defined as the number of distinct blacklisted test functions (without the distinction of data tags) that are run as part of the Health Check Coin task in the dev branch. Using this metric, we counted 560 blacklisted test functions, which constituted 2.8% of the total test functions as of July 8, 2024.

The absolute number continues to rise despite efforts to unblacklist tests; the relative number has remained steady over the last year, owing to the growth of the Qt Project. Some tests have remained blacklisted since the mechanism’s introduction. As of July 2024, the average age of a blacklisted test is 3.6 years, and 260 tests have been blacklisted for more than three years. These numbers show that blacklisting sometimes has the effect of permanently excluding test results, even though CI time is still spent running the tests.

Recommendations

This QUIP sets out recommended practice for the handling of flaky tests. It clarifies that blacklisting is a form of quarantine, selectively ignoring a flaky test so that it does not block development. Since that may allow regressions that the test would have caught to be integrated, this quarantine does not change the fact that the flaky test needs fixing. This technical debt should be tracked and visible to contributors responsible for or interested in the relevant code.

A test that sporadically crashes cannot be guarded against with blacklisting. In such a case, other mechanisms for avoiding the problem should be used; see Best practices for test for guidance.

Review Guideline for Patches that Blacklist Tests

A test may be blacklisted if all of the following conditions are met:

  • The test has demonstrated flakiness in integrations. See more at general information about qt test system.
  • A preliminary investigation has shown that it is difficult to provide an immediate fix.
  • There is a relevant ticket at https://bugreports.qt.io that is open, assigned, and prioritized as at least important (P2), ensuring that the matter is followed up.
  • The ticket is mentioned in a comment in the BLACKLIST file.

Blacklisting the test means adding an entry to the test directory’s BLACKLIST file. This entry comprises a heading for each test to be blacklisted followed by condition lines that identify the contexts in which the test is known to be flaky. These conditions should be narrowly specified, to avoid ignoring the test in contexts in which it is not known to be flaky.
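The structure of such an entry can be illustrated with a small parser. This is a minimal sketch under stated assumptions: the bracketed heading and `#` comment syntax, the test name `resizeWindow`, the keywords `macos ci`, and the bug number `QTBUG-00000` are all placeholders chosen for illustration, and the helper `parse_blacklist` is hypothetical rather than part of Qt.

```python
# Hypothetical BLACKLIST content; the test name, keywords and bug number are
# placeholders, and the syntax shown is an assumption of this sketch.
SAMPLE = """\
# QTBUG-00000 (placeholder bug number)
[resizeWindow]
macos ci
"""


def parse_blacklist(text: str) -> dict[str, list[str]]:
    """Map each heading to its condition lines, skipping blanks and comment lines."""
    entries: dict[str, list[str]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            entries[current] = []
        elif current is not None:
            # Keep only the condition itself, dropping any trailing comment.
            # Condition lines before the first heading are skipped in this sketch.
            entries[current].append(line.split("#", 1)[0].strip())
    return entries


print(parse_blacklist(SAMPLE))  # {'resizeWindow': ['macos ci']}
```

In this sketch a condition line that names both a platform and a CI-only keyword is narrower than one naming the platform alone, which is the kind of narrowing the paragraph above asks for.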

Each condition line added should be marked with a comment identifying a bug report that shall remain open once the blacklisting has been integrated. Such bug reports should also be identified by Task-number footers in the commit message, so that those who later work on the ticket can find and revert the blacklisting. Such bug reports should be assigned, prioritized and otherwise annotated so that they are visible to relevant contributors, who can then prioritize work to fix the flakiness within the context of their work on Qt.

Reviewers should verify, in addition to the preconditions above, that each condition line added to the BLACKLIST file has a comment referencing a bug report, as above, and that each such bug report:

  • Appears in a Task-number footer of the commit message (and not in a Fixes footer),
  • Describes the flakiness clearly, including relevant excerpts from logs of builds exhibiting flaky failures (not just links to the logs), and
  • Is suitably assigned, prioritized and annotated to ensure it shall get the attention of relevant contributors.
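The mechanical part of these reviewer checks could be automated along the following lines. This is a hedged sketch, not an existing Qt tool: the function `review` is hypothetical, ticket keys are assumed to look like `QTBUG-<digits>`, a comment is assumed to apply either to the condition line it trails or to the condition lines directly below it, and the sample data is invented. Whether a ticket describes the flakiness clearly and is suitably assigned still needs human judgment.

```python
import re

BUG_KEY = re.compile(r"\bQTBUG-\d+\b")  # assumed ticket key format
FOOTER = re.compile(r"^(Task-number|Fixes):\s*(\S+)\s*$")


def review(blacklist_lines, commit_message):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    referenced = set()
    pending = set()  # bug keys from comment lines directly above a condition
    for raw in blacklist_lines:
        line = raw.strip()
        if not line:
            pending = set()  # a blank line ends the reach of earlier comments
        elif line.startswith("#"):
            pending |= set(BUG_KEY.findall(line))
        elif line.startswith("["):
            continue  # heading: comments above it still apply to its conditions
        else:
            bugs = pending | set(BUG_KEY.findall(line.partition("#")[2]))
            if not bugs:
                problems.append(f"no bug report referenced for condition: {line}")
            referenced |= bugs

    footers = {"Task-number": set(), "Fixes": set()}
    for raw in commit_message.splitlines():
        match = FOOTER.match(raw.strip())
        if match:
            footers[match.group(1)].add(match.group(2))

    for bug in sorted(referenced):
        if bug not in footers["Task-number"]:
            problems.append(f"{bug} is missing from the Task-number footers")
        if bug in footers["Fixes"]:
            problems.append(f"{bug} must not appear in a Fixes footer")
    return problems


ADDED = ["[resizeWindow]", "macos ci # QTBUG-00000"]  # invented example lines
OK = "Blacklist resizeWindow on macOS\n\nTask-number: QTBUG-00000\n"
BAD = "Blacklist resizeWindow on macOS\n\nFixes: QTBUG-00000\n"
print(review(ADDED, OK))
print(review(ADDED, BAD))
```

The Fixes check matters because the bug report is meant to remain open after the blacklisting is integrated.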

When to Retract a Blacklisting

Ideally, the bug report relating to the blacklisted test will be addressed, and the blacklisting retracted as part of fixing the problem that caused the flakiness.

Contributors may, in any case, make changes that cause the blacklisted test to cease being flaky. One of the concerns with blacklisting is that a faulty contribution that the test should have caught may get through integration. This may lead to the formerly flaky test always failing. Equally, since the underlying cause of the flakiness is typically unknown, a contribution may fix it without those involved being aware they are doing so.

The Continuous Integration system keeps track of the history of results from blacklisted tests (among others). This makes it possible to review the past results for a particular test and configuration condition to determine whether it is still flaky. A blacklisted test is flaky only if it produces a mix of BPASS and BFAIL outcomes. If it ceases being flaky, its blacklisting should be retracted. See Best practices for test for how to resolve that.
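The rule for judging a result history can be sketched as follows. The function `classify` and the list-of-strings input are assumptions made for illustration; how the history is obtained from the Continuous Integration system, and how many runs make a judgment reliable, are not specified here.

```python
def classify(history):
    """Classify the recorded outcomes of one blacklisted test in one configuration."""
    outcomes = set(history)
    if outcomes == {"BPASS", "BFAIL"}:
        return "still flaky"
    if outcomes == {"BPASS"}:
        return "consistently passing"  # no longer flaky: retract the blacklisting
    if outcomes == {"BFAIL"}:
        return "consistently failing"  # no longer flaky: possibly a regression got through
    return "undetermined"  # empty history or outcomes this sketch does not model


print(classify(["BPASS", "BFAIL", "BPASS"]))
print(classify(["BPASS", "BPASS", "BPASS"]))
print(classify(["BFAIL", "BFAIL"]))
```

Both non-flaky results call for retracting the blacklisting; they differ in what else needs doing, since a consistently failing test points at a defect to fix.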

References