More states, more bugs


Here's a basic fact about software engineering: the more states your program can be in, the more likely it is that bugs can creep in.

Consider that the time you can allocate to testing a given release is approximately constant, relative to the size of your product. The number of possible states, on the other hand, generally expands at a much higher rate, and if not managed with care can even grow exponentially.

What this means is that there will be more and more states that receive less and less testing time per release.

This post explores null pointer exceptions, the importance of type systems as a way of controlling possible states and some common features that, when handled without care, can result in state explosion.

The billion dollar mistake

Tony Hoare is famous for two big ideas in computer science: Quicksort (1960) and null references (1965). A brilliant computer scientist, he now refers to the null pointer as the billion dollar mistake.

Why? Imagine you have a function that takes two boolean variables as input, e.g. calculate(a, b). There are 4 possible cases in total, which means we could easily handle, and even unit test, every case:

expect(calculate(false, false)).toEqual(someValue1)
expect(calculate(false, true)).toEqual(someValue2)
expect(calculate(true, false)).toEqual(someValue3)
expect(calculate(true, true)).toEqual(someValue4)

What if the input variables are nullable? The number of cases grows to 9, more than double the original state space.

What if you have 3 boolean variables instead? The number of cases grows from 8 (without nulls) to 27 (with nulls).

Here's a table that shows the difference for a few values:

number of bool arguments    with nulls    without nulls
3                           27            8
4                           81            16
5                           243           32
6                           729           64

Whether your codebase has functions that reflect this particular example is irrelevant. The point is, null pointers make the number of possible states explode, and the more states your program has, the more bugs it can have.
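The counts above are just 2^n versus 3^n, and a brute-force enumeration makes that easy to check. This is a minimal sketch: enumerateInputs is a helper name made up for this example and is not part of any library.

```javascript
// Build every combination of `argCount` arguments drawn from `values`.
// enumerateInputs is a hypothetical helper, used only for illustration.
function enumerateInputs(values, argCount) {
  let combos = [[]];
  for (let i = 0; i < argCount; i++) {
    const next = [];
    for (const combo of combos) {
      for (const value of values) {
        next.push([...combo, value]);
      }
    }
    combos = next;
  }
  return combos;
}

const withoutNulls = [false, true];
const withNulls = [false, true, null];

for (let n = 2; n <= 6; n++) {
  const plain = enumerateInputs(withoutNulls, n).length; // 2^n
  const nullable = enumerateInputs(withNulls, n).length; // 3^n
  console.log(`${n} args: ${nullable} with nulls, ${plain} without nulls`);
}
```

Each element of the returned array is one input tuple, so the same helper could in principle feed a table-driven unit test.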

Type systems

Why are type systems useful? Essentially, they are a way of reducing the state space. Let's go back to our previous example. Assuming we're using JavaScript, calculate(a,b)'s parameters can actually have 4 states each: null, undefined, true and false, which gives us a total of 16 states. That's a lot of cases to unit test. Unless we don't have to: maybe it turns out that you never use undefined in this function, which cuts the state space from 16 to 9, almost in half.

Type systems serve as a way to restrict the possible states that a program can be in. Fewer possible states means fewer cases to test, which means higher quality and fewer bugs.
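Plain JavaScript has no static type checker, so the closest stand-in is a runtime guard at the function boundary. The sketch below is only an approximation of what a type system would enforce before the code runs. assertBoolean is a made-up helper, and the body of calculate is a placeholder, since its real logic is not specified here.

```javascript
// Hypothetical guard: throws unless the value is strictly true or false.
function assertBoolean(value, name) {
  if (typeof value !== "boolean") {
    throw new TypeError(`${name} must be true or false, got ${String(value)}`);
  }
}

// Placeholder implementation; the real calculate logic is an assumption.
function calculate(a, b) {
  assertBoolean(a, "a");
  assertBoolean(b, "b");
  // Past this point only 4 input combinations are possible, not 16.
  return a && b;
}
```

The difference from a real type system is timing: the guard rejects null and undefined when the function is called, while a type checker would reject the offending call site before release.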

The dangers of A/B Testing

A/B testing systems are extremely useful as a way of improving product usage while keeping feature risk low. They come with one big risk, though: just like the infamous null pointer, A/B testing systems, when not handled with care, can result in explosions of possible states that you will simply not have enough time to test.

The mathematics is essentially the same. If you have a single A/B test running, there are 2 variants of your program in production. You can be confident that after enough time, about 50% of your active users will be on variant A and about 50% on variant B. What if you have n=5 A/B tests? That's 2^5 = 32 variants of your program, so only 3.125% of your users will land on each variant.

You might argue that my numbers are wrong, as there is usually total independence between some variants (e.g. a variant in the sign-up flow vs a variant in the checkout). I completely agree. I'm just illustrating the fact that a poorly managed A/B testing initiative can be detrimental to a product's quality. If you have hundreds of A/B tests in a relatively small product, there's a very big chance that this is affecting quality.
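To put rough numbers on the independence argument: if tests are grouped into areas that truly cannot affect each other, each area can be checked on its own, so the combinations add up instead of multiplying. The sketch below assumes full independence between areas, which is a strong assumption, and the area names and test counts are invented for the example.

```javascript
// Number of A/B tests running in each product area (made-up figures).
const testsPerArea = { signUp: 2, checkout: 3 };

// Worst case: every test can interact with every other test.
function variantsIfAllInteract(areas) {
  const totalTests = Object.values(areas).reduce((sum, n) => sum + n, 0);
  return 2 ** totalTests;
}

// Independent areas: combinations are counted per area and then summed.
function variantsIfAreasIndependent(areas) {
  return Object.values(areas).reduce((sum, n) => sum + 2 ** n, 0);
}

console.log(variantsIfAllInteract(testsPerArea));      // 2^5
console.log(variantsIfAreasIndependent(testsPerArea)); // 2^2 + 2^3
```

Either way the count still grows exponentially inside each area, so the number of tests per area is the figure worth watching.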

The way to properly manage A/B test systems is to keep the number of variants low. Consider that if you have 1,000 WAUs (weekly active users) and n=5, that means only about 31 users on each variant of your product.
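The arithmetic is simple enough to keep in a small helper. This sketch assumes every test splits traffic evenly in two and that every user is enrolled in every test. The function name usersPerVariant is made up for illustration.

```javascript
// Assumes even 50/50 splits and that each user takes part in every test.
function usersPerVariant(activeUsers, testCount) {
  const variants = 2 ** testCount;
  return {
    variants,
    usersPerVariant: activeUsers / variants,
    sharePercent: 100 / variants,
  };
}

console.log(usersPerVariant(1000, 5));
console.log(usersPerVariant(1000, 10));
```

With 10 concurrent tests there are 1,024 variants, which is already more variants than weekly users in this example.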

The dangers of RBAC systems

If you build enterprise software, you will eventually need a role management system (RBAC stands for role based access control). The idea is that a user can have a set of roles assigned to them and these roles will determine what the user can or cannot do.

When implementing these types of systems, there is a temptation to allow for very fine permission granularity to maximize flexibility. Can the Admin role view users? What about users' personal details? Their payment history? Gender? Address? One solution is to create a very specific permission for each bit of information.

You know about the "state explosion" issue, so you define just two roles: Admin and regular user. You map all the granular permissions into these two roles. Problem solved, right?

This can work in theory, but you need to be very careful to make sure that every user only has the permissions given to them by their role. CS and Sales teams will, over time, come to you with very specific requests from customers asking to grant or remove individual permissions, thus slowly leaking a giant state space into your product.

The solution to managing RBAC systems correctly is to monitor the number of "effective roles". By effective roles I mean the number of actual permission combinations in use. For example, a big customer of yours might have the Admin role with a few extra permissions added over time. Another big customer comes in and asks for a different set of permissions for the Admin: instead of freely assigning permissions, try to see if you can create a Super Admin role that can suit both.
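One way to monitor this is to count the distinct permission sets that users actually end up with. The sketch below is hypothetical throughout: the data model (a role plus optional granted and revoked lists), the permission names and the function names are all assumptions made for the example, not a description of any particular RBAC library.

```javascript
// Hypothetical role definitions.
const rolePermissions = {
  admin: ["users.view", "users.edit"],
  user: ["profile.view"],
};

// Resolve a user's final permission set into a canonical string key.
function effectivePermissions(user) {
  const perms = new Set(rolePermissions[user.role] ?? []);
  for (const p of user.granted ?? []) perms.add(p);
  for (const p of user.revoked ?? []) perms.delete(p);
  return [...perms].sort().join(",");
}

// Number of distinct permission sets across all users.
function countEffectiveRoles(users) {
  return new Set(users.map(effectivePermissions)).size;
}

const users = [
  { role: "admin" },
  { role: "admin", granted: ["payments.view"] },
  { role: "admin", granted: ["payments.view"] },
  { role: "user" },
  { role: "user", revoked: ["profile.view"] },
];

console.log(countEffectiveRoles(users));
```

Here two nominal roles have already turned into four effective ones. Tracking that number over time shows whether one-off permission changes are creeping in.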

Final thoughts

A software engineer's job is not to program, it's to make products that solve real problems for real people, and to make sure that these systems don't break over time. Managing complexity is key to making such systems.

Identifying areas where complexity can increase exponentially is key to being able to manage a growing product. Failing to do so can result in reduced quality and lengthier development times.

If you liked this article, follow me on twitter. I write about once a month about software engineering practices and programming in general.



© Fernando Hurtado Cardenas.