Building In-House Test Labs #4: How to Create Enterprise Grade Reliability for In-house Test Labs

(Last Updated On: 8 Dec 2016)

Last week we covered the hardware infrastructure you need to set up a robust device lab. This week we cover the configuration and software aspects of making sure your in-house test lab runs without failures and produces reproducible results on every test run. Let’s take a look at the most important mechanisms for achieving enterprise-grade robustness:

Always run tests on clean devices

One of the most common causes of flaky test results is that the environment where you run your tests, i.e. your test device, is not in exactly the same state in every test session. There may be different processes running, different applications or background services installed, a different amount of free memory or a different amount of storage space available. To avoid test flakiness, make sure there is a proper cleaning cycle between every test session that uninstalls all unwanted apps, restores the device storage to the state it was in before the test and reboots the device so that no processes from the previous test session are left running. Learn more about enterprise grade reliability of an in-house testing lab in our free ebook.
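The cleaning cycle described above could be sketched roughly as follows for Android devices driven over `adb`. This is a minimal sketch, not the lab's actual implementation: the `KEEP_PACKAGES` allow-list, the `SCRATCH_DIR` path and the function names are hypothetical, and the command runner is injectable so the logic can be exercised without a device attached.

```python
import subprocess
from typing import Callable, Iterable, List

# Hypothetical values for illustration: adjust to your own lab.
KEEP_PACKAGES = {"com.example.labagent"}   # apps that must survive the cleanup
SCRATCH_DIR = "/sdcard/lab-scratch"        # directory the tests write into

Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises on a non-zero exit code."""
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True, timeout=300
    ).stdout


def adb(serial: str, *args: str) -> List[str]:
    """Build an adb command line that targets one device."""
    return ["adb", "-s", serial, *args]


def third_party_packages(pm_output: str) -> List[str]:
    """Parse `pm list packages -3` output lines such as 'package:com.foo'."""
    return [
        line.split(":", 1)[1].strip()
        for line in pm_output.splitlines()
        if line.startswith("package:")
    ]


def clean_device(
    serial: str, keep: Iterable[str] = KEEP_PACKAGES, runner: Runner = run
) -> List[str]:
    """Uninstall unwanted apps, clear test storage and reboot one device."""
    keep = set(keep)
    listing = runner(adb(serial, "shell", "pm", "list", "packages", "-3"))
    removed = []
    for package in third_party_packages(listing):
        if package not in keep:
            runner(adb(serial, "uninstall", package))
            removed.append(package)
    # Remove the (assumed) scratch directory used by the previous session.
    runner(adb(serial, "shell", "rm", "-rf", SCRATCH_DIR))
    # Reboot so that no process from the previous session survives.
    runner(adb(serial, "reboot"))
    # Returns once adb sees the device again, not when boot has completed.
    runner(adb(serial, "wait-for-device"))
    return removed
```

In a real lab you would probably also wait for the boot to finish and verify free storage before handing the device to the next session.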

Create intelligent retry mechanisms 

Some test execution failures are not actually related to the application under test but are caused by the test infrastructure. The good news is that these failures are easy to identify, and tests that failed for such reasons can be automatically retried as many times as needed to get a test run that is free from infrastructure related failures. The most common reasons for infrastructure related failures are: the connection between the device control server and the device fails, the connection between the device and your back end server fails, either the device or the device control server runs out of storage space, or the device runs out of power in the middle of a test session. All of these can be identified during the test session and the tests can be retried automatically, so that the only remaining failures are real failures related to the application under test or the test scripts.
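A retry mechanism along these lines might look like the sketch below. The marker strings in `INFRA_MARKERS`, the `RunResult` type and the function names are assumptions made for illustration; a real lab would derive the markers from its own error logs. The sketch also caps the number of attempts, which is a design choice of this example so that a permanently broken device cannot loop forever.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical substrings that indicate an infrastructure problem.
INFRA_MARKERS = (
    "device offline",
    "connection refused",
    "no space left on device",
    "battery",
)


@dataclass
class RunResult:
    passed: bool
    message: str = ""


def is_infra_failure(result: RunResult) -> bool:
    """True when a failed run looks like a lab problem rather than an app problem."""
    text = result.message.lower()
    return not result.passed and any(marker in text for marker in INFRA_MARKERS)


def run_with_retry(
    run_test: Callable[[], RunResult], max_attempts: int = 3
) -> Tuple[RunResult, int]:
    """Re-run a test only while it keeps failing for infrastructure reasons.

    Returns the final result and the number of attempts that were made.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        result = run_test()
        if not is_infra_failure(result):
            # A pass or a genuine app/test-script failure: report it as is.
            return result, attempt
    return result, max_attempts
```

Genuine assertion failures are returned on the first attempt, so retries never hide a real defect in the application or the test script.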

Check and automatically reconnect the USB and wireless data connections

Losing either the USB or the wireless data connection at some point is unfortunately very common on both Android and iOS devices. To make matters worse, quite often the device reports a live connection over USB, Wi-Fi or wireless data even though no data is moving. You need to verify automatically that both the USB connection and the Wi-Fi connection are actually transferring data and, if they are not, automatically disconnect and reconnect them to get them back up again.
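One way to check that data really moves, rather than trusting the reported link state, is to push a small round trip through each connection. The sketch below is Android-only and makes several assumptions: `PROBE_HOST` is a hypothetical host in your lab, the USB check is a trivial `adb shell echo`, the Wi-Fi check is a single `ping` issued from the device, and recovery uses `adb reconnect` and the `svc wifi` shell command. The runner is injectable so the decision logic can be tested without hardware.

```python
import subprocess
from typing import Callable, List

PROBE_HOST = "lab-backend.example"  # hypothetical host the devices must reach

Runner = Callable[[List[str]], bool]


def succeeds(cmd: List[str], timeout: float = 10.0) -> bool:
    """Run a command; True only if it exits with 0 within the timeout."""
    try:
        return subprocess.run(cmd, capture_output=True, timeout=timeout).returncode == 0
    except (subprocess.TimeoutExpired, OSError):
        return False


def usb_alive(serial: str, runner: Runner = succeeds) -> bool:
    """A round trip over adb proves the USB link actually carries data."""
    return runner(["adb", "-s", serial, "shell", "echo", "ok"])


def wifi_alive(serial: str, runner: Runner = succeeds, host: str = PROBE_HOST) -> bool:
    """Ping from the device so that its own network path is exercised."""
    return runner(["adb", "-s", serial, "shell", "ping", "-c", "1", "-W", "2", host])


def ensure_connections(
    serial: str, runner: Runner = succeeds, host: str = PROBE_HOST
) -> List[str]:
    """Probe both links and try to recover dead ones; returns the actions taken."""
    actions: List[str] = []
    if not usb_alive(serial, runner):
        runner(["adb", "-s", serial, "reconnect"])
        actions.append("usb-reconnect")
        if not usb_alive(serial, runner):
            # Wi-Fi cannot be probed without a working adb link.
            actions.append("usb-still-down")
            return actions
    if not wifi_alive(serial, runner, host):
        runner(["adb", "-s", serial, "shell", "svc", "wifi", "disable"])
        runner(["adb", "-s", serial, "shell", "svc", "wifi", "enable"])
        actions.append("wifi-toggle")
    return actions
```

After a Wi-Fi toggle the device needs a few seconds to reassociate, so a scheduler would typically re-run the probe before assigning the device to a test session.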

Automate all configuration changes

When you have hundreds of devices and tens of device control servers in your test lab, you cannot (and should not) make any changes to settings or configurations manually, because the risk of not applying the changes identically on all devices or servers is simply too high. We automate everything we can with OpsWorks Chef, and we have implemented our own on-device services that take care of changing the device settings. Even though we have been very systematic about this, every now and then problems appear because some small setting was changed by hand and was not replicated identically on all devices or servers. This is one area where you cannot afford to cut corners if your goal is 99.99% reliability of your in-house test lab.
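The core idea behind automated configuration, whatever tool applies it, is to declare the desired state once and let a program detect and correct drift on every device. The sketch below illustrates that idea for Android system settings through `adb shell settings`; it is not the Chef setup or the on-device service mentioned above, and the entries in `DESIRED` are example values only.

```python
import subprocess
from typing import Callable, Dict, Iterable, List, Tuple

SettingKey = Tuple[str, str]  # (namespace, key)

# Example desired state; the values are illustrative assumptions.
DESIRED: Dict[SettingKey, str] = {
    ("global", "stay_on_while_plugged_in"): "3",
    ("system", "screen_brightness_mode"): "0",
}

Runner = Callable[[List[str]], str]


def run(cmd: List[str]) -> str:
    """Run a command and return its stdout; raises on a non-zero exit code."""
    return subprocess.run(
        cmd, check=True, capture_output=True, text=True, timeout=30
    ).stdout


def read_settings(
    serial: str, keys: Iterable[SettingKey], runner: Runner = run
) -> Dict[SettingKey, str]:
    """Read the current value of each setting from one device."""
    return {
        (ns, key): runner(["adb", "-s", serial, "shell", "settings", "get", ns, key]).strip()
        for ns, key in keys
    }


def find_drift(
    desired: Dict[SettingKey, str], actual: Dict[SettingKey, str]
) -> Dict[SettingKey, str]:
    """Return the desired values for every setting that differs on the device."""
    return {k: v for k, v in desired.items() if actual.get(k) != v}


def enforce_settings(
    serial: str, desired: Dict[SettingKey, str] = DESIRED, runner: Runner = run
) -> Dict[SettingKey, str]:
    """Bring one device to the desired state; returns what had to be changed."""
    drift = find_drift(desired, read_settings(serial, desired, runner))
    for (ns, key), value in drift.items():
        runner(["adb", "-s", serial, "shell", "settings", "put", ns, key, value])
    return drift
```

Running the same function against every device on a schedule makes a hand-made change show up as drift instead of as a mysterious test failure, and a second run on a corrected device changes nothing.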

Use professional monitoring for your test lab hardware

The last, but by no means least important, way to ensure the robustness of your in-house test lab is to set up professional monitoring for all aspects of your test lab infrastructure. The most important areas to monitor are:

Disk space – Automated tests create a lot of data in the form of logs, screenshots, videos, memory dumps and network dumps, and when you are running tens or even hundreds of devices, sometimes 24/7, you may run out of disk space seemingly out of the blue.

Network connections, latencies, packet loss rates – These have an impact on everything in your test lab, and once you start troubleshooting the strangest and most randomly appearing problems, the root cause is most often networking related.

Power – Disruptions in the power supply (even very short ones) can cause very strange problems when the system has a lot of moving parts that all depend on a steady power supply. Use UPS backed power whenever you can, and make sure that your servers shut down and come back up in a controlled way when a longer power disruption happens.

You can use any generic monitoring software for this. In Testdroid Cloud we currently use AWS monitoring, Loggly and New Relic.
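If you want a lightweight check alongside such tools, the disk space item from the list above is easy to sketch with the Python standard library. The threshold, the function names and the idea of passing in the directories to watch are assumptions of this example, not a description of the monitoring stack named above.

```python
import shutil
from typing import Callable, Iterable, List


def disk_free_fraction(path: str) -> float:
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def low_disk_paths(
    paths: Iterable[str],
    min_free: float = 0.15,
    probe: Callable[[str], float] = disk_free_fraction,
) -> List[str]:
    """Return the paths whose filesystem has less than `min_free` free space."""
    return [path for path in paths if probe(path) < min_free]
```

A device control server could call `low_disk_paths` on its log and artifact directories before starting each test session and refuse new sessions, or trigger cleanup, when the returned list is not empty.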

For a complete picture, you can watch our newly recorded webinar about building a large scale in-house test lab.

What are the most common areas of failure and how often do they happen?

To get a high level view of the most common areas of failure and how often they appear, we processed all error messages generated in Testdroid Cloud during Q1/2014 and grouped them into ten different buckets. This shows whether the failures are real failures related to the application under test (or the test script running the app) or whether they are related to our infrastructure.


From the graph above you can see that the green bars (Assertion failure, Element not found and Instrumentation run failure) are most commonly related to the application and the test scripts, while the ones in red (Device or resource busy, Google IAB Library is missing, Service not registered, Credential failure and Network error) are most commonly infrastructure related and can be fully avoided by using the best practices described in this blog post.
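You can run the same kind of analysis on your own lab's logs. The sketch below groups raw error messages into the buckets named above and totals them per side. It assumes, purely for illustration, that each message contains the bucket name as a substring; real log lines usually need more careful patterns, and this is not the processing that produced the graph.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Bucket names taken from the text above.
APP_BUCKETS = (
    "Assertion failure",
    "Element not found",
    "Instrumentation run failure",
)
INFRA_BUCKETS = (
    "Device or resource busy",
    "Google IAB Library is missing",
    "Service not registered",
    "Credential failure",
    "Network error",
)


def bucket_of(message: str) -> str:
    """Assign a message to the first bucket whose name it contains."""
    text = message.lower()
    for name in APP_BUCKETS + INFRA_BUCKETS:
        if name.lower() in text:
            return name
    return "Other"


def summarize(messages: Iterable[str]) -> Tuple[Counter, Dict[str, int]]:
    """Count messages per bucket and per side (application or infrastructure)."""
    buckets = Counter(bucket_of(message) for message in messages)
    totals = {
        "application": sum(buckets[name] for name in APP_BUCKETS),
        "infrastructure": sum(buckets[name] for name in INFRA_BUCKETS),
        "other": buckets["Other"],
    }
    return buckets, totals
```

Tracking the infrastructure total over time gives a simple measure of whether the practices in this post are paying off in your lab.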

This blog post covered best practices for improving the reliability of your in-house test lab, so that you can always rely on the test results and your organization does not suffer any testing downtime due to test infrastructure. Next week we will look into the operational aspects and best practices of running an enterprise grade device lab.


