Complementing BDD with Session-based Exploratory Testing

Watch me present "Don't Fire the Testers" at the Agile Testing & BDD eXchange in London, November 2016, or read on to discover what I've got to say about the need for a balanced test strategy.

A definition of BDD can be quite hard to come by, due to the evolving nature of BDD and continuous refinement of its practices, but I quite like using Matt Wynne’s definition: 

“BDD practitioners explore, discover, define, then drive out the desired behaviour of software using conversations, concrete examples and automated tests."

When I first started applying BDD 7 years ago, I didn't have Matt's definition and I thought it was about the tools (Cucumber, SpecFlow, etc.) and about testing, particularly test automation. While Matt's definition includes test automation, I've learnt that the greatest value is generated from using the conversations and concrete examples to discover the desired behaviour and guide development of the solution. Many people have written and talked about this, so I'm not going to focus on BDD itself, but instead on the test automation aspect.

With our BDD customers and in the testing community, I often hear these two questions:
  • Do we need testers on our team now we are applying BDD?
  • Should testers learn to code, to help with the test automation effort?

If I had been writing this blog post 7 years ago, I would have been one of those converts furiously writing that "testers should learn how to code" and "retrain as automated testers". I was on that bandwagon. However, over the last 7 years I have seen the error of my ways, and I see many teams adopting BDD starting to make the same mistakes. So today I'm writing about some of what I've learnt: how I'm not replacing all the testers with SDITs/SDETs (automators), and how I am complementing BDD and automation with Session-based Testing.

Automation - a false sense of security

In a study by Stuart Oskamp, clinical psychologists were given one paragraph of information (Stage 1 in the chart below) about a patient and asked to make a diagnosis and rate their confidence in it. This was repeated, incrementally providing more information, until all the information had been given (Stage 4 in the chart below). At Stage 1, the mean confidence and accuracy scores were uncannily close, but you can see that the subjects' confidence increased disproportionately compared to the real increase in the accuracy of their diagnoses. This is an example of overconfidence bias.

[Chart from the BDDX presentation: confidence vs accuracy across the four stages]

There are some quite famous public incidents of the overconfidence bias; this is just one example:

"The odds of a meltdown are one in 10,000 years"
Vitali Sklyarov, Minister of Power and Electrification of Ukraine
(cited in Rylsky, February 1986), two months before the Chernobyl accident

What does this have to do with Automated testing?

Automated testing, like other testing activities, provides us with information. But it provides this information in the abstracted form of a binary pass/fail result. The positive nature of all our tests "passing" makes us very susceptible to the overconfidence bias, as we have a large volume of positive data. This can lead us to release software that doesn't deliver customer value, or that simply doesn't work. One reason is that the binary results of test automation cannot account for false positives and false negatives, and the overconfidence bias makes us overlook this vulnerability in our testing strategy.

We've all had those situations where the tests pass but a bug is released into production. This is caused by a false positive test, a missing test case (a gap in coverage), or a quality attribute automation cannot assess (such as usability).

False Positives

I like to compare test automation to the humble line-following robot. The black line marks its intended route and the robot is programmed to follow the line. When it hits a problem like additional lines or breaks in the line, it just stops or, if it's programmed with some additional rules, it will attempt to resolve the problem, sometimes with non-deterministic results.

In automated tests, the black line equates to the route programmed for the automated test; it cannot deviate from this programming and if it comes across a problem, it just fails. It might be programmed with additional rules to handle common problems but all these must be explicitly planned for and programmed.

An automated test might observe a problem on a page but, if the test is not programmed to inspect that area of the page or doesn't have the capability to do so (visual aspects of test automation are still in their infancy), the problem will be ignored. That is a false positive… another bug released into production. A sentient human running through a test script would be likely to pick up this issue, but there are many other reasons not to rely on scripted manual testing, which I won't go into here.

When we are writing automated tests, we make assumptions about how the application will be used and the value it provides to the user. The reality that these assumptions were based on is continuously evolving, but because we have codified them within our automation they quickly become out of date, and this again causes problems to be missed.

Test Coverage

Targeting high test coverage within automation is another way of creating information that triggers the overconfidence bias, and gives a false sense of security. 

If you are creating and discussing scenarios to "explore, discover, define" your user stories, and then choose to automate those scenarios, the resulting automated test suite will have low test coverage, because the scenarios only cover key examples and don't test all the minute variations that a tester would focus on. A good example: when discussing a user story with a product owner, you wouldn't create scenarios for all the different field validation rules on a form.

When this low coverage from applying BDD is challenged, it is tempting to start adding more scenarios to improve the test coverage; it’s something I certainly did in my early years and I’ve observed new adopters doing this. To continue with the line-following robot analogy, each line represents a scenario or test, and we have to keep adding lines to increase coverage.

 There are consequences to this approach:
  • The increased volume of scenarios significantly increases the complexity of navigating the living documentation and creates an additional overhead in its creation, curation, and maintenance.
  • The style of the new scenarios can often diverge from 'examples' that guide us in developing the solution towards an imperative style that looks like a traditional test script.
  • There is a small overhead in automating when using the 'Given When Then' format and BDD automation tools. Is it worthwhile making these additional automated tests readable to the stakeholders? No: you can use a more appropriate tool in the programming language of your choice and share the same automation abstractions between them, e.g. Cucumber-JVM -> JUnit; Cucumber -> RSpec; SpecFlow -> NUnit.
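To make the last point concrete, here is a minimal sketch of what sharing an automation abstraction might look like. Everything here is hypothetical (the `CheckoutDriver` class and its methods are invented for illustration, not taken from any real project): the idea is that the BDD step definitions and the plain detail-level tests call the same driver, so the extra edge-case tests never need to pass through Gherkin.

```python
# Hypothetical automation abstraction shared between BDD step
# definitions and plain unit-style tests. All names are illustrative.

class CheckoutDriver:
    """Drives the checkout feature. Both the BDD steps (for the key
    examples) and plain tests (for the minute variations) call it."""

    def __init__(self):
        self.items = []

    def add_item(self, price):
        self.items.append(price)

    def apply_discount(self, percent):
        self.items = [p * (1 - percent / 100) for p in self.items]

    def total(self):
        return round(sum(self.items), 2)


# A BDD step definition (behave-style, shown as a comment) would
# simply delegate to the shared driver:
#
#   @when('I apply a {percent:d}% discount')
#   def step_apply_discount(context, percent):
#       context.driver.apply_discount(percent)

# The detailed edge-case tests skip Gherkin and use the driver directly:
def test_discount_on_empty_basket():
    driver = CheckoutDriver()
    driver.apply_discount(10)
    assert driver.total() == 0.0
```

The stakeholders still read the handful of key scenarios; the long tail of variations lives in ordinary tests that reuse the same plumbing.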

These consequences add friction to the application of BDD practices, particularly keeping the other amigos (especially the product owner) involved in the process, and the creation of living documentation becomes burdensome: just too much documentation!

Some bugs are unlikely to be picked up by Automation

The sequence of events for an automated test is different to the sequence of events for real human usage of an application. Typically, an automated test will flow through an application in a linear sequence of state transitions, whereas a real user can make seemingly random state transitions backwards and forwards. Just think of a user filling in an HTML form, triggering a field validation error, correcting the mistake and resubmitting the form. I know of a project that had great automation covering all the form validation rules; they just forgot to test that users can re-submit after correcting a validation error - oops!
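The missed path above is cheap to cover once you notice it. Here is a minimal sketch, with an invented `SignupForm` standing in for the real application, of a regression test that exercises the error-then-correct-then-resubmit sequence that linear automated flows tend to skip:

```python
# A toy form model, invented for illustration. The point is the test
# below: it walks the back-and-forth path (fail validation, correct,
# re-submit) rather than a single linear happy path.

class SignupForm:
    def __init__(self):
        self.errors = []
        self.submitted = False

    def submit(self, email):
        self.errors = []          # clear errors from a previous attempt
        if "@" not in email:
            self.errors.append("invalid email")
            return False
        self.submitted = True
        return True


def test_resubmit_after_validation_error():
    form = SignupForm()
    # First attempt triggers validation...
    assert form.submit("not-an-email") is False
    # ...and this is the step the project forgot: the second attempt.
    assert form.submit("user@example.com") is True
    assert form.submitted
```

The real bug hid in exactly this transition; a test suite that only ever submits once, valid or invalid, can never find it.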

Another example (after all, BDD is about examples): forgetting to disable a button until a response comes back from the server allows a user to send multiple requests, creating a self-inflicted DDoS. A skilled tester has a good chance of discovering this, whereas an automated test needs to be explicitly designed to catch it.

The above examples are rather simple but are common areas where automation fails. But think about testing whole subject areas like usability and security - how much of that is covered by your automation, or even could be? How do you make an automated test consider usability?

Don’t Fire the Testers

I see many teams getting obsessed with test automation due to the false sense of security caused by the overconfidence bias. Management teams then factor this into their test strategies and hiring policies to replace skilled software testers with Automators. What is needed is a balanced strategy that does not rely upon one source of information:
  • Test automation for fast, valuable feedback and to check our core functionality is being delivered.
  • At least one other testing technique that can cover the weaknesses of automation, and check that it's doing its job. (Who talks about testing the tests?)

The best solution I have found so far is complementing BDD and Test Automation with Session-based Exploratory Testing.

What is Exploratory Testing?

I like Elisabeth Hendrickson's definition of Exploratory Testing, which is adapted from James Bach's definition:

“Simultaneously designing and executing tests to learn about the system, using your insights from the last experiment to inform the next” 

Exploratory testing needs to be conducted by skilled testers: a good knowledge of test case design, previously used for creating test scripts, is now used to dynamically design and execute test cases in real time, without creating mountains of documentation. This relies on the testers' ingenuity (developers might wrongly call it evilness) and their ability to examine the software from multiple user viewpoints.

The 12th agile principle, "At regular intervals, the team reflects on how to become more effective, then tunes and adjusts its behavior accordingly", can be summarised as Inspect and Adapt. Exploratory testing completely embodies this principle: the information observed and discovered in one test informs and guides the design of the next. This cycle follows up identified risks and drives the exploratory testing to investigate them, which is important if you are trying to cost-effectively supplement your existing test automation and minimise the gaps in it.

Exploratory testing can turn into a meandering process with no end in sight. When you are trying to gather evidence to create well-reasoned confidence in the software, while being economically efficient, you need to stop somewhere. Jon Bach came up with the concept of Session-based Test Management (SBTM), which lets you chunk your test effort into time-boxed sessions, each with a targeted mission or charter. Exploratory testing within sessions requires complete focus and should avoid interruptions; to support this, a session should last a minimum of 45 minutes and a maximum of 2 hours.
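As a rough sketch of what a time-boxed session record could look like (the field names and structure here are my own, not from any standard SBTM tool), note how the time box itself is part of the definition:

```python
from dataclasses import dataclass, field

# Illustrative SBTM session record: a charter plus a time box that is
# validated against the 45-120 minute guidance from the text.

@dataclass
class TestSession:
    charter: str                  # the session's targeted mission
    duration_minutes: int         # fixed time box
    notes: list = field(default_factory=list)  # findings captured as we go

    def __post_init__(self):
        if not 45 <= self.duration_minutes <= 120:
            raise ValueError("a session should last 45-120 minutes")


session = TestSession(
    charter="Explore checkout discount handling for rounding risks",
    duration_minutes=90,
)
```

Making the time box a validated property is one way to keep sessions from quietly sprawling back into open-ended wandering.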

It's important to vary the ideas within a session so that many different viewpoints and risks can be considered. One such technique that Elisabeth suggests is personas and roleplay: using the software from different users' perspectives.

Note: We've taken this concept further by separating the role/job from the personas and creating decks of cards. We can iterate through combinations of the cards, combining a persona with a role, and decide whether the combination is worth investigating. This trick is really useful for varying our charters, but can also be used to discover scenarios when discussing user stories.
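The card-deck idea amounts to taking the cross product of the two decks. Here is a minimal sketch (the personas and roles below are invented examples, not our actual decks) of generating the candidate combinations to triage:

```python
from itertools import product

# Invented example decks: one for personas, one for roles/jobs.
personas = ["impatient power user", "first-time visitor", "screen-reader user"]
roles = ["shopper", "customer-support agent", "warehouse admin"]

# Cross the decks to produce candidate charter variations.
candidate_charters = [
    f"Explore the order flow as a {persona} acting as a {role}"
    for persona, role in product(personas, roles)
]

# 3 personas x 3 roles = 9 combinations; keep the ones worth a
# session, discard the ones that aren't worthwhile.
```

Most combinations get discarded quickly; the value is in the one or two pairings nobody had thought to test.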

A test session doesn’t have a pass-fail binary output that gives us a false sense of confidence, but multiple outputs of information (including bug reports, concerns/risks, questions, surprises, positives*) which, when passed to the team via a debrief, can support and balance our confidence that the software delivers value.

*Credit to Dan Ashby for the 'positives' idea,  as testers often unintentionally come across as negative.

Session-based Testing and BDD

When developing a user story, the team creates test sessions with charters that focus on important risks for the story (and additional sessions may be created depending on how they progress). We like to extend the session charters by adding key BDD scenarios from the story as landmarks. These landmarks are must-see/explore items during the session, or starting points for it. We include scenarios even if they have been automated, as the test session is likely to look at them from a different point of view, and this allows the testers to explore around the scenario. Think back to my earlier example of not creating scenarios for all the validation rules on a form: this is something that can be explored.
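A charter extended with landmarks can be represented very simply. This is an illustrative sketch (the structure and the `landmarks_visited` helper are invented for this post, not a real tool): the landmarks are the story's key scenarios, and the debrief can check which ones the session actually visited.

```python
# Illustrative charter structure: a mission plus BDD scenarios from
# the story acting as landmarks (must-see points during the session).

charter = {
    "mission": "Explore form validation beyond the agreed scenarios",
    "landmarks": [
        "Scenario: Submitting the form with a missing email shows an error",
        "Scenario: A corrected form can be re-submitted successfully",
    ],
    "time_box_minutes": 60,
}


def landmarks_visited(charter, visited):
    """Return the landmark scenarios the session has not yet covered,
    useful at the debrief to spot gaps."""
    return [s for s in charter["landmarks"] if s not in visited]
```

The landmarks anchor the session without scripting it: the tester must pass through them, but is free to explore around each one.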

We can also use the information that’s created in the sessions to aid with the discovery of missing scenarios - taking the questions and surprises, consulting with the Product Owner and, as a result, new scenarios may get created that weren't previously considered. We couldn’t discover all this information up front, but because we’ve done our exploratory testing it gives us the opportunity to feed back before we think we are 'done'.

In summary, test automation can be an addiction; companies should break that addiction, add balance to their automation, and start hiring testers again. Automation is not perfect, and throwing more automated tests at a problem isn't going to solve it.

I believe exploratory testing fits into the Agile mould; you’re designing and executing a test, immediately designing another test, driving down on risks. Don’t fire the testers! Complement BDD and test automation with something else. 

Note: A lot of this subject matter is similar to the Checking vs Testing debate within the testing community. I have deliberately not used some of the language definitions from that debate, as I've found they can hinder discussions with people not familiar with the subject.