This post was originally published on the blog of Dependable Systems Lab at EPFL.
Today, software quality assurance is largely synonymous with software testing. Broadly speaking, testing means that developers think hard about all the relevant inputs a program (or subcomponent) can receive, then write code that executes the program with those inputs and checks that it behaves correctly.
Unfortunately, writing and maintaining a thorough test suite is a very expensive manual process. First, testing is done at many levels in a system (e.g., unit, component, or integration level), and for each level developers must write code that sets up the environment, initializes the component, and so on. Second, the number of tests needed to properly cover a component can be very large; for example, the test suite of Chrome’s browser engine contains on the order of 100,000 tests. Finally, software changes all the time in response to new features and requirements, so test suites must be adjusted accordingly. Statistics show that, on average, developers spend as much as half of their time on testing, and companies allocate up to a quarter of their IT budgets to it.
Over the years, several frameworks (e.g., xUnit), methodologies (e.g., TDD), and test automation runtimes (e.g., Jenkins, BuildBot) have helped streamline the testing process. However, the fundamental problem persists: developers must manually provide the test scenarios, and these almost never cover the immense set of possible program behaviors in its entirety. Instead of providing correctness guarantees, testing often turns into a mere hunt for bugs, and test suites mainly serve to increase a subjective notion of confidence in the codebase.
It should come as no surprise, then, that despite all the effort and resources invested, bugs escape testing and make it into releases. According to Steve McConnell’s “Code Complete”, the industry average for defect density is a staggering 15 errors per 1,000 lines of shipped code, and even the most thorough testing processes only bring this down to 0.1 errors per 1,000 lines.
To make matters worse, virtually all computer systems today are online and hence exposed to security attacks that exploit bugs in the software running on them. In 2014, for example, an average of 19 vulnerabilities were officially recorded per day, not counting those that are never reported but directly exploited (the so-called “zero-day” vulnerabilities). The resulting exploits cause major disruptions and leaks of sensitive data in businesses and governments alike, costing money, reputation, and sometimes even personal safety. To put a number on the issue, studies show that in 2015 a data breach cost $3.8 million on average, with the most prominent cases costing over $100 million.
Things are not standing still, though. A large body of academic and industrial research aims to improve software reliability and security. With respect to the ideal of software correctness, this research can be broadly divided into two categories: (a) work that tries to steer software development back onto a rigorous track by providing languages, methodologies, and tools for building software that is correct by construction; and (b) work that takes software as it is written today and provides best-effort solutions for automating program analysis and testing, ultimately addressing the fundamental problem faced by manual testing described above. Research in the second category has enjoyed the most practical success, but recently there has been growing interest, and there are exciting results, in the first category as well.