Travel Needs an Upgrade

Over the last year I have been working as a contract CTO, and one of my clients was in the travel sector: a London-based hotel chain. The client wanted to ensure that the technology was rock solid from a transactional stability point of view. Even one false payment or booking wasn't an option. (You may assume these things only happen rarely, but as complexity increases so do failures.) To achieve the desired stability even in the event of a total data centre outage, applying banking-style transaction management seemed a sensible place to start. This left me with a challenge: to deliver the project in the time scales required, we needed a more 'agile' approach.

Lean / Agile Works for Start-ups but...

If you have visited many of the London start-up co-working spaces you will be familiar with the usual stack of 'Agile' and 'Lean' books that seem to litter them. The challenge is how to build systems that are as stable as you would expect after an exhaustive build and test (à la waterfall) process. This was a particular challenge for us as we had a tiny team and needed to outsource the majority of the work. For me personally, the ethos of "don't out-source anything you care about" was also ringing in my ears. The different back-end vendors we evaluated seemed to come in two varieties:

White-labelling vendors, who primarily offered the same solution to most clients with some re-branding.

Niche agencies that offered a (sometimes) slick and customised product.

When it came to it, we thought that we could do better. We wanted to apply start-up style rapid development to the back-end and out-source the front-end build to companies better placed to deliver an industry-leading solution.

Brittleness of Agile

When talking to clients they always seem to like the idea of Agile because they like to see results. From that point of view the user story generation approach to requirements capture was also something I enjoyed. User stories are a great way of capturing specific behavioural expectations.

The challenge with all of this 'speed' is deciding what to put into the things that you build. There are a lot of great design patterns, like reversibility (being able to undo a decision by encapsulating replaceable components where there is a high risk of changing the solution in the future), to help you out of a pinch, but the hard part is still choosing where to invest your time and code. For me this is the most challenging and fascinating problem: let's assume that you only have time to write 10K lines of code. What do you want to spend them on? Features? Tests? Integration testing?

By definition - if you are using an Agile / Lean approach then it is likely that you have to choose some things to leave out! The goal is to get your system live as soon as possible so it is up to you to pick the right things to write. For me system tests are a must as they provide the contract that guarantees that our back-end does what we say it does. After that we need some unit testing of complex logic but 100% coverage is rarely a practical goal.
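To make the idea of a system test as a contract concrete, here is a minimal sketch in Python. It is not Upgrade's code: the `/bookings` endpoint, the status codes and the in-memory "back-end" are all assumptions invented for illustration. The point is only that the test talks to the service over real HTTP, the same way a front-end vendor would, and pins down one promise: a room cannot be booked twice.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_server():
    """Build a toy booking back-end (hypothetical) on a free local port."""
    booked = set()            # rooms that have already been taken
    lock = threading.Lock()

    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            room = json.loads(self.rfile.read(length))["room"]
            with lock:
                # 409 Conflict for a double booking, 201 Created otherwise.
                status = 409 if room in booked else 201
                booked.add(room)
            self.send_response(status)
            self.send_header("Content-Length", "0")
            self.end_headers()

        def log_message(self, *args):
            pass              # keep test output quiet

    return HTTPServer(("127.0.0.1", 0), Handler)


def post_booking(base_url, room):
    """POST a booking over HTTP and return the status code."""
    data = json.dumps({"room": room}).encode("utf-8")
    req = urllib.request.Request(base_url + "/bookings", data=data, method="POST")
    # Bypass any configured proxy so the request stays on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(req, timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def run_contract_test():
    """Start the server, exercise the contract, and return the three statuses."""
    server = make_server()
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base_url = "http://127.0.0.1:%d" % server.server_address[1]
    try:
        first = post_booking(base_url, "101")
        second = post_booking(base_url, "101")   # same room again
        other = post_booking(base_url, "102")
    finally:
        server.shutdown()
        server.server_close()
    return first, second, other
```

A test runner would then assert that `run_contract_test()` returns `(201, 409, 201)`. Because the test only uses the public HTTP interface, the implementation behind it can be rewritten freely as long as that assertion keeps passing.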

To compound things further, the importance of code quality and readability is amplified by Agile. You expect to go back to your code and modify it far more frequently than you would for, say, a legacy system that is expected to run for years without modification. To put this into context - I wanted to engineer our platform so that we could do multiple orchestrated releases per day with system tests running in continuous integration. There are a few things you need to make sure you get right if you want to run this way:

Tests need to run in seconds not minutes.

You need to be able to run your tests on all of the different versions of your application at the same time.

You will probably need to be able to build multiple clusters of homogeneous servers.

You will probably want to run local, staged, live staged and live release clusters.

Remember: all software has bugs (feature and functional) so it is important that you set yourself up for success with the tools you need to do a 'good job' of dealing with so much uncertainty.

Detecting Issues

The biggest value for us came from the things that are simplest to do but all too easy to get wrong. The first is traceability - being able to perform a detailed investigation into the root cause of an issue. This let us respond quickly to issues because we could see a lot of detail about what was going wrong. Each investigation produced new bug reports; for each bug we wrote a failing test first, then fixed it, then released. This process became fairly fluid after a while, but logging wasn't quite enough.
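One common way to get this kind of traceability is to stamp every log line with an identifier for the request that produced it, so that all the lines for one failed booking can be pulled out together. The sketch below shows the idea using Python's standard `logging` and `contextvars` modules. The logger name, the `handle_booking` function and the log format are assumptions made up for the example, not a description of how Upgrade does it.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical: holds the ID of the request currently being handled.
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


def build_logger(stream):
    """Return a logger that prefixes each line with the request ID."""
    logger = logging.getLogger("booking")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(request_id)s] %(message)s"))
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    return logger


def handle_booking(logger, room, nights):
    """Toy request handler: every line it logs carries the same ID."""
    token = request_id.set(uuid.uuid4().hex[:8])
    try:
        logger.info("booking requested room=%s nights=%d", room, nights)
        if nights <= 0:
            logger.error("rejected: nights must be positive")
            return False
        logger.info("booking confirmed")
        return True
    finally:
        request_id.reset(token)
```

With this in place, searching the logs for one ID returns the full story of a single request, which is usually the first step in turning a production issue into a failing test.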

Again, thinking back to companies I have worked at before - one of the most useful features to build on day one is support for live site monitoring - and I don't mean TCP monitoring. Every application has a different set of indicators of health and happiness. This is often something like healthz or varz which can then be monitored by your friendly monitoring technology. For Upgrade we actually created a servlet library that made exporting stats a cinch.
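For readers who have not met the healthz / varz convention, the sketch below shows roughly what such endpoints look like. It is written in Python rather than as a servlet, and it is not the Upgrade library: the `Stats` class, the counter names and the JSON output format are all assumptions for illustration. `/healthz` answers "am I alive?" and `/varz` dumps application counters for a monitoring system to scrape.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Stats:
    """Thread-safe named counters (hypothetical stand-in for a stats library)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counters = {}

    def increment(self, name, amount=1):
        with self._lock:
            self._counters[name] = self._counters.get(name, 0) + amount

    def snapshot(self):
        with self._lock:
            return dict(self._counters)


STATS = Stats()


def respond(path, stats=STATS):
    """Map a request path to (status code, body); kept pure so it is easy to test."""
    if path == "/healthz":
        return 200, "ok\n"
    if path == "/varz":
        return 200, json.dumps(stats.snapshot(), sort_keys=True) + "\n"
    return 404, "not found\n"


class StatsHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper around respond()."""

    def do_GET(self):
        status, body = respond(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Serve the endpoints on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), StatsHandler).serve_forever()
```

Application code would call something like `STATS.increment("bookings_ok")` at the interesting points, and the monitoring system would poll `/varz` and alert on the numbers that matter for that particular application.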

Today Upgrade represents a major code base approaching 500 KLOC and can still run on a model A Pi. I am proud to say that we have had only a very small number of bugs found in production. The bottom line: if you have an API, write some full stack tests. You will thank yourself later.

Damien Allison - Personal Blog

2014/12/01