Ah, those good, old flaky test suites! Sooner or later you’ll encounter one of them. They are test suites that sometimes pass, sometimes fail, depending on certain environmental conditions. A lot has been written about flaky tests and what causes them, but in this post I’d like to discuss a specific type of flaky test –order dependant test failures–, and how to help debug them using GitHub Actions as part of your CI/CD pipelines.

Order dependant test failures

An order dependant test failure is one that happens when:

  • There is more than one test being run as part of the suite.
  • One of the test fails only when the suite is run in a specific order.

Let’s simplify things and assume you have a very small test suite consisting of two tests: Test A and Test B. This post will assume we’re using ruby as our language of choice, and rspec as our testing framework, however the fundamentals apply to any other language and good testing framework. In this case, we might be dealing with a situation like this:

  1. When we run Test A, it passes.
  2. When we run Test B, it passes.
  3. When we run Test A and Test B, they both pass.
  4. When we run Test B and Test A, Test B passes but Test A fails.

Test scenarios

If using rspec in its default configuration, you are probably running your test suite in a random order. This makes rspec generate a random seed and use that seed to determine in which order tests should be run. When running the above test suite using rspec in a random order, you can expect your suite to break roughly 50% of the times.

However, order dependant test failures can be very pernicious because they are introduced silently, they can make your test suite fail only occasionally, which leads to developers being lazy and use the retry the tests until they pass technique. The bogus test doesn’t get dealt with until it’s too late: the test suite now fails often, causing delays in releases, frustration, or even panic situations when the need for a quick release arises: there’s nothing worse than having to hotfix a production issue quickly and not being able to because your test suite keeps failing.

Bisect to the rescue

One of the features of rspec is the ability to run a bisect. Once you discover an order dependant failure and can consistently reproduce it with a fixed seed, it can still be difficult to determine which test is causing the issue. In our example we only have 2 tests, but in bigger test suites the failing test might be executed after other hundreds of tests, making it hard to determine which one of them is the bad apple. Bisect solves that problem by consistently running all your tests to try and determine the minimal set of examples that reproduce the same failures. The way rou run bisect is by providing rspec with the exact same options and seed that caused the order dependant failure, and adding the --bisectflag to the CLI. Interally bisect will split tests into two chunks, run those tests, discard the chunk that does not fail, and carry on recursively until the smallest failing number of tests is found.

Our example

I have created a proof of concept gem with a test suite that has an order dependant failure. The repository can be checked at brafales/flaky_specs_poc.

If you’re not interested in the nitty gritty of why this particular test suite is problematic and are only interested in the GitHub Actions Workflow file, please skip this section.

The problematic spec in this gem is spec/flaky_specs_poc/job_two_spec.rb. This proof of concept uses Sidekiq to show a common testing issue with this popular background job processing framework.

Sidekiq works on the basis of jobs, which get pushed into a queue, and then picked up by a worker process. Sidekiq will use a backend to store jobs, for example a redis instance; however, when running your tests you might not want to have to mess around with having a redis instance available for use. For this reason, Sidekiq in test mode uses a virtual backend which will queue jobs in memory, and doesn’t process them by default.

If you want to test a bit of code that queues a Sidekiq job, you do it like this:

# frozen_string_literal: true

require "sidekiq/testing"

RSpec.describe FlakySpecsPoc::JobOne do
  it "queues an HttpJob" do
    expect do
      subject.perform
    end.to change(FlakySpecsPoc::HttpJob.jobs, :size).by(1)
  end
end

This is a good way to check that your code did the right thing (queue a job) without having to worry about the specifics about what that job does. It’s essentially the same as mocking a third party HTTP request.

However, sometimes you might want to know not only that a job was queued, but also that a certain side effect of that job having run took place. One might argue that this is a bad test since we should not be testing for side effects, but the reality is these kind of tests (especially feature or end to end tests) are ubiquitous. For this, Sidekiq provides a special method that allows you to run jobs that get queued immediately, in an in-line fashion. This method can be used in two ways:

  • With a block, where inline test mode will be enabled for the code that runs inside the block, and disabled once the code in the block has been executed.
  • Without a block, which enables inline testing globally.

And it’s very easy to do something like this in a spec where you want your Sidekiq jobs to run inline:

# frozen_string_literal: true

require "sidekiq/testing"

RSpec.describe FlakySpecsPoc::JobOne do
  before do
    Sidekiq::Testing.inline!
  end

  it "checks something done by the HttpJob" do
    VCR.use_cassette("job_one") do
      subject.perform
    end
    expect(true).to eq(true)
  end
end

What the code above will do when run is to enable Sidekiq inline testing and leaving it on for the rest of the test suite execution. The problem with this is that if another test after this runs and queues a Sidekiq job, that job will be run inline instead of being queued in memory. If that test does not expect that, it’ll fail only if run after the first test.

I’ve recreated this scenario in my gem by having a spec that tests that a job is queued, then having a spec that mistakenly enables inline testing for Sidekiq globally, and finally by having the Sidekiq job that gets queued make an HTTP request. I’m using VCR to record and then mock external HTTP calls.

So what happens is the following:

  • If the test that checks if a job is queued runs first, it passes, because no external HTTP calls are made, since the Sidekiq job simply gets queued in memory, but never executed inline.
  • If the test that sets inline testing runs first, then when the other test runs after it, the Sidekiq job will run, make an HTTP call and cause a failure since VCR does not expect that external call to be made.

For reference, this is what a correct way to write this spec is:

# frozen_string_literal: true

require "sidekiq/testing"

RSpec.describe FlakySpecsPoc::JobOne do
  around do |spec|
    Sidekiq::Testing.inline! do
      spec.call
    end
  end

  it "checks something done by the HttpJob" do
    VCR.use_cassette("job_one") do
      subject.perform
    end
    expect(true).to eq(true)
  end
end

You can easily recreate this by running the following command on the gem source code:

bundle exec rspec --order=rand --seed=55702

Which should give you this output:


Randomized with seed 55702

FlakySpecsPoc::JobOne
  checks something done by the HttpJob

FlakySpecsPoc::HttpJob
  gets a response from a server

FlakySpecsPoc::JobOne
  queues an HttpJob (FAILED - 1)

Failures:

  1) FlakySpecsPoc::JobOne queues an HttpJob
     Failure/Error: res = Net::HTTP.get_response(uri)

     VCR::Errors::UnhandledHTTPRequestError:


       ================================================================================
       An HTTP request has been made that VCR does not know how to handle:
         GET https://reqbin.com/echo/get/json

       There is currently no cassette in use. There are a few ways
       you can configure VCR to handle this request:

         * If you're surprised VCR is raising this error
           and want insight about how VCR attempted to handle the request,
           you can use the debug_logger configuration option to log more details [1].
         * If you want VCR to record this request and play it back during future test
           runs, you should wrap your test (or this portion of your test) in a
           `VCR.use_cassette` block [2].
         * If you only want VCR to handle requests made while a cassette is in use,
           configure `allow_http_connections_when_no_cassette = true`. VCR will
           ignore this request since it is made when there is no cassette [3].
         * If you want VCR to ignore this request (and others like it), you can
           set an `ignore_request` callback [4].

       [1] https://www.relishapp.com/vcr/vcr/v/6-1-0/docs/configuration/debug-logging
       [2] https://www.relishapp.com/vcr/vcr/v/6-1-0/docs/getting-started
       [3] https://www.relishapp.com/vcr/vcr/v/6-1-0/docs/configuration/allow-http-connections-when-no-cassette
       [4] https://www.relishapp.com/vcr/vcr/v/6-1-0/docs/configuration/ignore-request
       ================================================================================
     # ./lib/flaky_specs_poc/http_job.rb:13:in `perform'
     # ./lib/flaky_specs_poc/job_one.rb:10:in `perform'
     # ./spec/flaky_specs_poc/job_one_spec.rb:8:in `block (3 levels) in <top (required)>'
     # ./spec/flaky_specs_poc/job_one_spec.rb:7:in `block (2 levels) in <top (required)>'

Finished in 0.03305 seconds (files took 0.85052 seconds to load)
3 examples, 1 failure

Failed examples:

rspec ./spec/flaky_specs_poc/job_one_spec.rb:6 # FlakySpecsPoc::JobOne queues an HttpJob

Randomized with seed 55702

Run the same test suite with a different seed though:

bundle exec rspec --order=rand --seed=3164

And everything’s good:

Randomized with seed 3164

FlakySpecsPoc::HttpJob
  gets a response from a server

FlakySpecsPoc::JobOne
  queues an HttpJob

FlakySpecsPoc::JobOne
  checks something done by the HttpJob

Finished in 0.02717 seconds (files took 0.39785 seconds to load)
3 examples, 0 failures

Randomized with seed 3164

In this case, given we have very little tests, this could be relatively easy to debug manually, but with a bigger test suite we can use rspect bisect:

bundle exec rspec --order=rand --seed=55702 --bisect

Which will give us the following:

Bisect started using options: "--order=rand --seed=55702"
Running suite to find failures... (0.10595 seconds)
Starting bisect with 1 failing example and 2 non-failing examples.
Checking that failure(s) are order-dependent... failure appears to be order-dependent

Round 1: bisecting over non-failing examples 1-2 .. ignoring example 2 (0.19095 seconds)
Bisect complete! Reduced necessary non-failing examples from 2 to 1 in 0.25318 seconds.

The minimal reproduction command is:
  rspec './spec/flaky_specs_poc/job_one_spec.rb[1:1]' './spec/flaky_specs_poc/job_two_spec.rb[1:1]' --order=rand --seed=55702

And now we know how to consistently reproduce the error with the minimum number of tests, which will make pinpointing the sneaky bogus test easier.

Automating bisects

The next step is clear: automate it! I’m going to show you a GitHub Actions Workflow that will automatically run a bisect on a failing test suite.

First of all a couple of disclaimers:

  • This has not been productionised, so as usual, use at your own risk ;)
  • This flow does a bisect on failing test suites. This will make your test pipeline slower, since a bunch of failing tests will be run twice, including failures which are not caused by flaky tests!

Here’s the complete flow:

name: Ruby

on:
  push:
    branches:
      - main

  pull_request:

jobs:
  RunTests:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v2
    - name: Set up Ruby
      uses: ruby/setup-ruby@v1
      with:
        bundler-cache: true
    - name: Run the tests
      id: tests
      continue-on-error: true
      run: bundle exec rspec --order=rand -f j -o tmp/rspec_results.json
    - name: Bisect flaky specs
      if: steps.tests.outcome != 'success'
      run: bundle exec rspec --order=rand --seed $(cat tmp/rspec_results.json | jq '.seed') --bisect

The first bit of the flow is a pretty standard way of doing things. The bits that interest us are the Run the testsand Bisect flaky specs steps.

This step will run our tests:

- name: Run the tests
  id: tests
  continue-on-error: true
  run: bundle exec rspec --order=rand -f j -o tmp/rspec_results.json
  • --order=rand will ensure the suite is run in random order.
  • -f j will make sure the output of the tests is in JSON format. This is important since we need to be able to parse the test results easily.
  • -o tmp/rspec_results.json sends the results into a file instead of STDOUT.
  • We also use continue-on-error: true to tell GitHub Actions that when the tests fail, the rest of the steps will still be executed, otherwise on a test failure the flow would immediately end.

And this is the step that will run a bisect:

- name: Bisect flaky specs
  if: steps.tests.outcome != 'success'
  run: bundle exec rspec --order=rand --seed $(cat tmp/rspec_results.json | jq '.seed') --bisect

A few noteworthy bits:

  • if: steps.tests.outcome != 'success' will ensure this step is only run if the original test suite failed.
  • We use cat tmp/rspec_results.json | jq '.seed' to get the seed that was originally used to run the tests, so we can pass it to the bisect.

For reference, this is what an rspec result in JSON format looks like:

{
    "version": "3.11.0",
    "seed": 55702,
    "examples": [
        {
            "id": "./spec/flaky_specs_poc/job_two_spec.rb[1:1]",
            "description": "checks something done by the HttpJob",
            "full_description": "FlakySpecsPoc::JobOne checks something done by the HttpJob",
            "status": "passed",
            "file_path": "./spec/flaky_specs_poc/job_two_spec.rb",
            "line_number": 16,
            "run_time": 0.009731,
            "pending_message": null
        },
        {
            "id": "./spec/flaky_specs_poc/http_job_spec.rb[1:1]",
            "description": "gets a response from a server",
            "full_description": "FlakySpecsPoc::HttpJob gets a response from a server",
            "status": "passed",
            "file_path": "./spec/flaky_specs_poc/http_job_spec.rb",
            "line_number": 4,
            "run_time": 0.003383,
            "pending_message": null
        },
        {
            "id": "./spec/flaky_specs_poc/job_one_spec.rb[1:1]",
            "description": "queues an HttpJob",
            "full_description": "FlakySpecsPoc::JobOne queues an HttpJob",
            "status": "failed",
            "file_path": "./spec/flaky_specs_poc/job_one_spec.rb",
            "line_number": 6,
            "run_time": 0.021981,
            "pending_message": null,
            "exception": {
                "class": "VCR::Errors::UnhandledHTTPRequestError",
                "message": "\n\n================================================================================\nAn HTTP request has been made that VCR does not know how to handle:\n  GET https://reqbin.com/echo/get/json\n\nThere is currently no cassette in use. There are a few ways\nyou can configure VCR to handle this request:\n\n  * If you're surprised VCR is raising this error\n    and want insight about how VCR attempted to handle the request,\n    you can use the debug_logger configuration option to log more details [1].\n  * If you want VCR to record this request and play it back during future test\n    runs, you should wrap your test (or this portion of your test) in a\n    `VCR.use_cassette` block [2].\n  * If you only want VCR to handle requests made while a cassette is in use,\n    configure `allow_http_connections_when_no_cassette = true`. VCR will\n    ignore this request since it is made when there is no cassette [3].\n  * If you want VCR to ignore this request (and others like it), you can\n    set an `ignore_request` callback [4].\n\n[1] https://www.relishapp.com/vcr/vcr/v/6-1-0/docs/configuration/debug-logging\n[2] https://www.relishapp.com/vcr/vcr/v/6-1-0/docs/getting-started\n[3] https://www.relishapp.com/vcr/vcr/v/6-1-0/docs/configuration/allow-http-connections-when-no-cassette\n[4] https://www.relishapp.com/vcr/vcr/v/6-1-0/docs/configuration/ignore-request\n================================================================================\n\n",
                "backtrace": [
                    "REDACTED FOR LEGIBILITY"
                ]
            }
        }
    ],
    "summary": {
        "duration": 0.037856,
        "example_count": 3,
        "failure_count": 1,
        "pending_count": 0,
        "errors_outside_of_examples_count": 0
    },
    "summary_line": "3 examples, 1 failure"
}

What we do with this file is send it to the jq tool for parsing, and telling it to get us the value for top level key seed. jq is a really useful and powerful tool so I suggest you check it out if you’re unfamiliar with it.

Below you can see a screenshot of this flow successfully bisecting our example test suite.

GitHub Actions Workflow

Conclusions

In this post we have learned about a specific, pernicious test failure that manifests itself when a test suite is run in a specific order. We have then seen how a technique called bisecting can help determine what test of potentially many is causing te failure. Last but not least, we have shown a GitHub Actions Workflow that will automatically run the bisect task when a test suite fails to execute.

This is a very small, toy example of how to make this work. Your real life test suites are probably a lot more complex, bigger, and so this example might not work for you, but the fundamentals should be the same.