Investing in Testing

Thursday, April 21, 2016

Test Brain: how we multi-threaded our testing in C#

When I joined my current role, SpecFlow was already in use as the tool for BAs to write their specs. I hadn’t worked in C# land before and thought that I could quickly build up a working mobile testing framework, then easily parallelize it using Selenium Grid and the C# version of whatever Java/Python tool I had used in the past. I quickly came to realize that this was not the case: although Microsoft has done a tremendous job recently improving its tooling, it is still behind what I was using 2+ years ago in other languages.

One of the biggest problems with UI tests (and especially mobile) is that they are very slow. With SpecFlow being unable to run tests in parallel (there has since been an update that allows this functionality, but I have not tested it myself), we were stuck with test suites with run times of over 8 hours. As our automation grew (with testing times hitting 24 hours+), even with splitting the tests by device type and then into bundles we were still hovering at the 10 hour mark for getting results from our regression runs.

Not only are UI tests slow to run, they are also very unreliable. Look at the schedule of any automation conference and you will see presentations on flaky tests or on making your tests more reliable.

To mitigate these problems we developed the Test Brain. At its core the Test Brain is a tool which looks at your infrastructure, determines the number of test threads that can be run in parallel, spawns these, runs the tests and can rerun any failed tests. The Test Brain has a queue of tests ordered by run priority, and as new builds come in their tests are added to the queue. Let’s go through these in a little more detail.

Number of Test Threads - The Test Brain can run a script which calculates the number of threads that can be executed in parallel. For example, with mobile tests it checks the Selenium Grid for devices that match the test run criteria and are online and available. For tests which alter the environment it might check the number of environments that are available and match the build requirement.
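As a rough illustration of the device check, here is a minimal sketch in Python (not our actual C# script). The Device record and the count_test_threads function are hypothetical names, and the device list is hard coded where the real script would query the grid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    # Hypothetical record describing one device registered on the grid.
    name: str
    platform: str
    online: bool
    busy: bool


def count_test_threads(devices, platform, max_threads=None):
    """Return one thread per device that matches the platform, is online and is free."""
    available = [
        d for d in devices
        if d.platform == platform and d.online and not d.busy
    ]
    count = len(available)
    if max_threads is not None:
        # Optional cap, e.g. to leave some devices free for manual testing.
        count = min(count, max_threads)
    return count


# Hard-coded sample data; a real script would ask the grid for this.
devices = [
    Device("iphone-6", "iOS", online=True, busy=False),
    Device("iphone-5s", "iOS", online=True, busy=True),
    Device("ipad-air", "iOS", online=False, busy=False),
    Device("nexus-5", "Android", online=True, busy=False),
]

ios_threads = count_test_threads(devices, "iOS")
```

The same shape works for environment-bound tests: swap the device list for a list of environments and the platform filter for the build requirement.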

Spawning Test Threads - Similar to ClusterRunner (http://www.clusterrunner.com/), the Test Brain creates copies of the project, distributes them to available resources, then spins up the required number of threads and starts executing tests!
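The thread spawning part can be sketched with a standard thread pool. This is a simplified Python illustration, not the Test Brain itself: it skips copying and distributing the project, and run_in_parallel is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_parallel(tests, thread_count):
    """Run (name, callable) pairs on thread_count worker threads.

    Each callable returns True for a pass and False for a fail.
    Returns a dict mapping test name to its result.
    """
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # Submit everything up front; the pool hands tests to free threads.
        futures = {name: pool.submit(test) for name, test in tests}
        return {name: bool(future.result()) for name, future in futures.items()}


# Stand-in tests; real ones would drive a device or browser.
sample_tests = [
    ("login", lambda: True),
    ("checkout", lambda: False),
    ("search", lambda: True),
]

parallel_results = run_in_parallel(sample_tests, thread_count=2)
```

The thread count passed in here would come from the calculation described under Number of Test Threads.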

Rerun Failed Tests – Sometimes we have issues which are unrelated to our tests, and sometimes tests are just flaky, especially when run on mobile phones (more on this another day), so runs can have rerun counts and pass % requirements. This fits into our reporting infrastructure so that we can see which tests were rerun and the result of each run.
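To make the rerun count and pass % idea concrete, here is a small Python sketch. The function name, the default values and the rule that a test is only rerun when its first attempt fails are all assumptions for illustration.

```python
def run_with_reruns(test, max_reruns=2, required_pass_rate=0.5):
    """Run a test, rerunning it up to max_reruns times if the first attempt fails.

    Returns (passed, results), where results holds the outcome of every
    attempt so a report can show each run.
    """
    results = [bool(test())]
    if not results[0]:
        for _ in range(max_reruns):
            results.append(bool(test()))
    pass_rate = sum(results) / len(results)
    return pass_rate >= required_pass_rate, results


# A stand-in flaky test: fails once, then passes.
outcomes = iter([False, True, True])
flaky_passed, flaky_results = run_with_reruns(lambda: next(outcomes))
```

Keeping the full list of attempts, rather than only the final verdict, is what lets the report show which tests needed a rerun.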

Prioritizing Tests – If a smoke test suite comes into the queue and we still have 1000 tests left in a regression run, then we prioritize the smoke test run finishing first. Similarly, if we see the same test twice we can cut the duplicate out of the queue.
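A priority queue with duplicate removal can be sketched in a few lines of Python. The class name, the convention that a lower number means higher priority, and the choice to treat "same test" as "same test name on the same build" are assumptions for this example.

```python
import heapq
import itertools


class PriorityTestQueue:
    """Queue of tests ordered by priority, ignoring duplicates."""

    def __init__(self):
        self._heap = []
        self._queued = set()
        # Tie-breaker so tests with equal priority keep arrival order.
        self._counter = itertools.count()

    def add(self, test_name, build, priority):
        """Queue a test; return False if it is already waiting."""
        key = (test_name, build)
        if key in self._queued:
            return False
        self._queued.add(key)
        heapq.heappush(self._heap, (priority, next(self._counter), test_name, build))
        return True

    def pop(self):
        """Remove and return the (test_name, build) with the highest priority."""
        _, _, test_name, build = heapq.heappop(self._heap)
        self._queued.discard((test_name, build))
        return test_name, build

    def __len__(self):
        return len(self._heap)


queue = PriorityTestQueue()
queue.add("regression_001", "build-7", priority=5)
queue.add("smoke_login", "build-8", priority=1)
duplicate_added = queue.add("regression_001", "build-7", priority=5)
```

Here the smoke test arrives second but is popped first, and the repeated regression test is dropped.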

We’ve extended the Test Brain and are working on extending it further. Here are some of the additional pieces of functionality:

Fail Builds Early – For some of our test suites if they fail a single test then we need to instantly report this failure, rather than progressing and executing the rest of the test suite.
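A minimal Python sketch of the fail early behaviour, assuming a per-suite fail_fast flag (a hypothetical name):

```python
def run_suite(tests, fail_fast=False):
    """Run (name, callable) pairs in order and return a dict of results.

    With fail_fast set, execution stops at the first failure so the
    build can be reported as failed straight away.
    """
    results = {}
    for name, test in tests:
        results[name] = bool(test())
        if fail_fast and not results[name]:
            break
    return results


executed = []


def make_test(name, outcome):
    # Helper that records which tests actually ran.
    def test():
        executed.append(name)
        return outcome
    return name, test


suite = [make_test("a", True), make_test("b", False), make_test("c", True)]
early_results = run_suite(suite, fail_fast=True)
```

Test "c" never runs here, which is the point: the failure of "b" is available to report immediately.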

Defect Creation – We have yet to implement this feature as I’m not sure whether it should be included in our Test Brain or in Repauto, but being able to instantly raise a defect with the information found during test execution is something we are considering. Given that Repauto already has some built-in triaging functionality, including manually updating test status, it might be easier for this functionality to live there.

Test Tagging – Similar to defect creation except this involves changing the tags on test scripts to move them into different test suites.

Number of Tests Run – Each build has a different chance of introducing bugs. We would like to calculate defect density, code complexity, or even just change size/what has changed for each of our builds and feed this detail into the Test Brain as part of the calculation of which tests should be run.

Tests to be Run – Similar to the above, except here we would like to keep track of which tests have been run against which builds and from this determine which tests should be run against each build, skipping tests on older builds that are already passing on more recent builds, etc.

The Test Brain has helped us more than we anticipated: there is an immeasurable difference in the value we get from the automation now that it is fast and reliable. Our teams now believe in automation and are checking the results daily.

Tuesday, April 19, 2016

Automated Performance Testing

Every company I've worked at previously has done performance testing using OpenSTA, JMeter, VS Test or LoadRunner. We've run our tests at the end of the project and prayed that everything worked.

At my last company we started using RUM performance testing, reusing Selenium tests to generate HAR files. This gave us some interesting results and we ran the tests during development, but we still left running larger loads until the end of the project.

At my current job we face a period of over 3 months at the end of each project called stabilization, during which we run performance testing among other activities. One of our goals has been to shorten this period before each release, and to do this my team has been working on moving our performance testing into our development phase.

So we automated the deployment of each build (using Chef and Octopus Deploy) and the start of performance testing. This was scheduled to run each night, with results available in the morning. We would then perform additional performance tests depending on what was needed for the project. This worked OK, but it required us to store the results from each performance test and the environment data, and to have these checked after each run.

I've talked before about Repauto, our reporting tool for automation. We extended this to store performance test results. Rather than storing the results from each run from our performance testing tool, we installed Prometheus (an open source monitoring tool similar to Splunk or ELK). Prometheus pulls data from the servers and stores it for us to query.

In Repauto each performance run has a summary with the most important stats of the run, including the build, environments used and test details. We also have two tabs: one which has the data stored in Prometheus graphed using Grafana, and another which details all of the environment stats, which we pull out using Chef. It looks a little like this.

For each of the projects we have created a set of alerts, so that if performance degrades we can quickly look into the run and start assessing what has changed. Further, we can compare runs, looking for any differences in the environments (patches applied etc.), something we previously had no archive of. So now we have nightly performance runs and constant and consistent monitoring. We will see soon if this helps shorten our stabilization cycle.
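The two checks described above, spotting degraded performance and spotting environment differences between runs, can be sketched in plain Python. This does not use Prometheus or Chef; the metric names, the 10% tolerance, the assumption that a higher value is worse, and both function names are illustrative only.

```python
def degraded_metrics(baseline, current, tolerance=0.10):
    """Return metrics that grew by more than tolerance relative to baseline.

    Assumes higher values are worse (response times, CPU usage).
    """
    degraded = {}
    for name, base_value in baseline.items():
        value = current.get(name)
        if value is None or base_value <= 0:
            continue
        change = (value - base_value) / base_value
        if change > tolerance:
            degraded[name] = round(change, 3)
    return degraded


def environment_differences(previous_env, current_env):
    """Return {setting: (old, new)} for every setting that differs between runs."""
    keys = set(previous_env) | set(current_env)
    return {
        key: (previous_env.get(key), current_env.get(key))
        for key in sorted(keys)
        if previous_env.get(key) != current_env.get(key)
    }


last_night = {"p95_response_ms": 400.0, "cpu_percent": 50.0}
tonight = {"p95_response_ms": 500.0, "cpu_percent": 52.0}
alerts = degraded_metrics(last_night, tonight)

env_changes = environment_differences(
    {"os_patch": "KB1", "dotnet": "4.5"},
    {"os_patch": "KB2", "dotnet": "4.5"},
)
```

Pairing the two outputs is what makes triage quick: a degraded metric next to a changed patch level gives an obvious first thing to investigate.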

Sunday, February 7, 2016

Repauto - Reporting made better

I was lucky enough to get to present at the 2015 Selenium Conference. Xiaoxing Hu and I gave a talk on Repauto, our reporting dashboard for automated tests; you can see the talk here. Unfortunately open sourcing it has been more difficult than expected, but we have managed to open source a slightly less functional version which can be found here.

I will write up a more substantial post about the functionality and how we use it. But for now it would be great to hear some feedback.

Thursday, July 16, 2015

Separation of Tests

Go to any automation conference and there will no doubt be a couple of talks on flaky tests; they are one of the larger pain points when dealing with automation. There are a number of approaches to reducing the problem of flaky tests, and I would suggest watching talks from previous GTAC or Selenium conferences. Here I would like to talk about splitting the test results, or the executions themselves.

We frequently had a problem where the build was constantly red because at least one test had failed. When this happens PMs stop seeing the benefit of automation, testers themselves start to lose hope that they will ever see a green build, and worst of all the results stop being valued by the team.

We started by moving tests that were flaky or had defects attached to them into a separate run. We continued to execute these defect/flaky tests, looking to see whether the defect tests failed earlier or started to pass and making sure that the flaky tests were flaky.

Unfortunately testers are quite often pressed for time during projects, so with this setup we still had tests sitting in the regression suite waiting for someone to investigate whether the failure was the result of a flaky test or a real defect. This led us to create a final run called investigation, where all tests that failed in the previous regression run are rerun. These are usually run straight after completion of the regression run. The results from this hopefully enable us to allocate each test to the correct run (flaky or defect).

In the future we hope to automate the process of allocating the tests into the correct run.
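One possible rule for that automation, sketched in Python: a test that failed in regression but passes at least once in the investigation run is treated as flaky, and one that fails every rerun is treated as a defect candidate. The rule itself and the allocate_run name are assumptions; a person would still confirm the allocation.

```python
def allocate_run(investigation_results):
    """Suggest a run for a test based on its investigation reruns.

    investigation_results is a list of booleans, one per rerun of a
    test that failed in the regression run.
    """
    if not investigation_results:
        # Not rerun yet, so it stays in the investigation run.
        return "investigation"
    if any(investigation_results):
        # Failed in regression but passed on a rerun: inconsistent.
        return "flaky"
    # Failed every time: likely a real defect.
    return "defect"


suggested = {
    "transfer_funds": allocate_run([False, False, False]),
    "login_mobile": allocate_run([True, False, True]),
    "not_rerun_yet": allocate_run([]),
}
```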

Monday, July 13, 2015

Breaking vs Failing Tests

So your test failed, what does it mean? One of the first things you need to determine is whether it failed in the test itself or in the setup. Sadly I've frequently found that tests fail before they get to the actual part of the system they are testing, especially with GUI testing.

The problem with tests failing in this way is that we often report them as failures, and this can massively skew our results. If we have a suite of 50 tests targeted at the Accounts functionality of our system, but the accounts tab has been removed so we are unable to navigate to it, should this be reported as 1 failure or 50?

By marking one test as failing due to a check that the accounts tab is available and the rest as breaking, we solve this problem. Suddenly we have 1 test failure and 49 grouped breakages, which is far more indicative of the actual state of the system.

So a breaking test is a test that fails before it gets to what it is checking/testing/asserting. I highly recommend incorporating the breakdown of failures to include breaking tests into your automation reports.
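The accounts tab example can be sketched in Python by splitting each test into a setup phase and a check phase. The run_test function, the NavigationError exception and the fake tab list are all hypothetical; the idea is only that an exception during setup is reported as broken while a failed assertion is reported as failed.

```python
from collections import Counter


class NavigationError(Exception):
    """Raised when setup cannot reach the page under test."""


def run_test(setup, check):
    """Return 'passed', 'failed' or 'broken' for one test."""
    try:
        context = setup()
    except Exception:
        # The test never reached the thing it was meant to check.
        return "broken"
    try:
        check(context)
    except AssertionError:
        return "failed"
    return "passed"


TABS = ["Home", "Settings"]  # the Accounts tab has been removed


def open_home():
    return TABS


def open_accounts():
    if "Accounts" not in TABS:
        raise NavigationError("accounts tab not found")
    return "accounts page"


def check_accounts_tab_present(tabs):
    assert "Accounts" in tabs


def check_account_feature(page):
    assert page == "accounts page"


# One test checks the tab exists; the other 49 need to navigate through it.
suite = [(open_home, check_accounts_tab_present)]
suite += [(open_accounts, check_account_feature)] * 49
summary = Counter(run_test(setup, check) for setup, check in suite)
```

The summary for this suite comes out as 1 failed and 49 broken, matching the breakdown described above.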

Thursday, June 25, 2015

Gherkin Reuse

Unfortunately there is no silver bullet for automation. As with every tool there are both positives and negatives; to get the most from an automation tool we need to accentuate the positives and mitigate the negatives. With gherkin based tests we can reuse the same language and stories for multiple tests, mitigating the problem of maintaining stories.

Gherkin best practices state that, where not directly testing the GUI, tests should be UI agnostic. UI agnostic tests require less frequent changes and are often shorter and easier to maintain.

UI agnostic example:

Given the user has an open account
When they close their account
Then the account has a status of closed

UI specific example:

Given the user has an open account
And the user is on the account status page
When they click close account
And select yes from the prompt
Then the user is taken to the account status page
And the accounts status is displayed as closed

Although fictitious, this example is similar to what I've regularly seen from testers who are new to writing gherkin. The UI specific test:

Will require more maintenance, as the test is bound to the UI and any dialog/button name changes need to be reflected in the test.
Is longer, making test sets more difficult to read through quickly.
Is difficult to move to another UI/platform. For example, if we were going to run this test on a mobile device the action click is not as relevant and should possibly be replaced by tap.

In the latest framework I've developed, tags on a test cause the test to be run multiple times on different platforms. For example a test tagged @soap, @iOS, @chrome would be run as a SOAP test, on an iOS device and in the Chrome browser. With the UI agnostic test we can use scope-bound steps to keep the same story but have it executed in the appropriate way for the SUT.
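The framework itself is built on SpecFlow in C#, so the following Python sketch only mimics the idea: the same step text is registered once per platform, and the tags on a test decide which implementations run. The step decorator, the registry and the returned strings are all made up for illustration.

```python
PLATFORM_TAGS = ("@soap", "@iOS", "@chrome")

# Maps (step text, platform) to the function that implements it.
STEP_IMPLEMENTATIONS = {}


def step(text, platform):
    """Register a platform-specific implementation of a step."""
    def register(func):
        STEP_IMPLEMENTATIONS[(text, platform)] = func
        return func
    return register


@step("they close their account", "soap")
def close_account_via_soap():
    return "sent CloseAccount SOAP request"


@step("they close their account", "iOS")
def close_account_on_ios():
    return "tapped Close Account"


@step("they close their account", "chrome")
def close_account_in_chrome():
    return "clicked Close Account"


def run_step_on_tagged_platforms(text, tags):
    """Run one step once for every platform tag on the test."""
    results = {}
    for tag in tags:
        if tag in PLATFORM_TAGS:
            platform = tag.lstrip("@")
            results[platform] = STEP_IMPLEMENTATIONS[(text, platform)]()
    return results


step_results = run_step_on_tagged_platforms(
    "they close their account", ["@soap", "@iOS", "@regression"]
)
```

The story text "they close their account" never mentions clicking or tapping, which is what allows one UI agnostic scenario to serve every platform.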

Additionally, for one of our latest projects we have the same execution platform but different authentication methods. We were able to set the tests to run by default using both authentication methods, with only one underlying method changing between the two test executions.

One of the complaints often registered against gherkin tests is that maintaining the stories is time consuming/difficult. By having good practices/documentation/training in story creation and reusing the tests across different executions, we can continue to gain the benefits of gherkin testing whilst mitigating some of the difficulties.

Increasing Automation ROI - Reuse of Automation

Both creating and maintaining an automation framework or a suite of automated tests require a large investment and buy-in from management. As with all investments, managers are seeking a return on their investment (ROI). It is often said that the ROI from automation increases after each execution. Unfortunately this is not always the case: where the product is stable and the test does not alter its data, running the tests multiple times is unlikely to discover any bugs in the system, thus only providing a sense of security. Without the data of the test or the system under test (SUT) changing, a test is unlikely to find any new bugs, so the question becomes how can we get value from our tests?

Although difficult to achieve, randomizing the data of a test can increase its value. Imagine a test as a trail through a forest: once created with set data, each run follows the same trail, not exploring the unknown parts of the forest. Bugs in this metaphor will often be found in the unexplored areas. Changes in the data used by the test can expand the coverage of the test and increase its value.
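One way to keep randomized data debuggable is to derive it from a seed that is recorded with the test result, so a failing run can be replayed with exactly the same data. A minimal Python sketch, with hypothetical field names and value ranges:

```python
import random


def random_account(seed=None):
    """Build randomized account data and include the seed that produced it."""
    if seed is None:
        seed = random.randrange(2**32)
    rng = random.Random(seed)
    return {
        "seed": seed,
        "account_type": rng.choice(["savings", "current", "joint"]),
        "opening_balance": rng.randint(0, 100_000),
        "holder_name_length": rng.randint(1, 64),
    }


# A fresh run explores new data each time...
first_run = random_account()
# ...and the recorded seed reproduces it exactly when a failure needs investigating.
replayed = random_account(seed=first_run["seed"])
```

Without the recorded seed, a failure found on random data may be impossible to reproduce, which would cancel out most of the extra value.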

In a similar fashion, when tests are executed in a random order, or the test actions themselves execute in a random order (see model based testing), additional value can be derived from the same tests.

The easiest way though to increase the ROI of your automation suite is to change the SUT. This could mean running your test against:

Different operating systems. Especially valuable now in mobile testing where operating systems are changing more rapidly.
Different browsers.
Different devices.
Different integrated environments/components.

As a manual tester I've done browser upgrades, and never has my desire to come into work been dampened to the same extent. So a side benefit of executing automation against a changed SUT is that it reduces the mind-numbing, arduous tasks that need to be completed by manual testers.