Background

Over the last decade I have worked at a number of very large and very small companies, the smallest being just two people and the largest having over one hundred thousand. On day one a startup faces many challenges; however, the most important one is usually survival. The first couple of months are critical while going from 'zero to one', as the saying goes. The top priority has to be securing customers and launching an initial offering, these days often referred to as a Minimum Viable Product. In this mode engineers need to build just enough to make the product work. As with Occam's Razor, all non-essential work should likely be avoided, and there is little room for high principles in the search for results.

One thing that changes as a startup moves from prototyping to iterating is how to survive the first big customer. The focus of the team often has to shift from optimizing for a viable solution to considering stability, reliability and correctness. How much these matter varies by sector, compared with rapidly reaching a position where feedback can be sought to verify product hypotheses.

In this series of posts I am going to share a collection of 'rules' which emerged through a number of projects, including the rationale, signals and counter cases for each.

Note: these rules are based on experience from startups and may not reflect common practices in larger companies. These insights are shared purely based on experiences from:

SiteMorph: An SEO / SEM marketing tool for SMBs.
ClickDateLove.com (Muster): A dating site employing basic ML approaches to create better profiles.
Shomei / Futreshare: Ad attribution heuristic modelling for advertisers with billions of ad impressions.
Upgrade Digital: Hospitality booking platform built for developers, with one of the fastest build times for web developers available in the world at the time.

The objective of these rules was to have standard solutions to everyday questions based on real world lessons. Having de-facto solutions to everyday problems meant that development could go faster. Going faster for a startup means less cost, faster iteration and more feedback. Some of the rules may seem to contradict this when they add overhead. The point here is that the solutions were born out of necessity. This necessity drove iteration to a viable solution.

Rule 1. Storage, prefer insertions to updates

Advice

When a data attribute of an entity may be written or updated by a number of writers, prefer refactoring that attribute into a separate concept and inserting it into a different store rather than updating a field on an existing entity. Typical candidates include:

Payment authorization code for a payment
Approval for a change where multiple people can approve
Any transaction sensitive attribute which could be the source of a race condition.

Example

Consider a hotel booking for a single stay. This can be expressed in normal form along the lines of:

Hotel Booking

user : who books the stay
checkin : date of arrival
rooms : ... details of the required rooms
total cost: sum of all room night rates and fees.
payment request: payment transaction token used to initiate the transaction.
payment confirmation: payment completion token from transaction processor.
payment cancellation: the cancellation token passed by the transaction processor.

This seems pretty reasonable and has all of the fields associated with the booking. However, without good locking of the entity, multiple actors are able to update the fields, leading to a lost update race condition. Locking isn't such a bad thing, you may argue. However, underlying locking semantics typically lead to centralization, as decentralized consistency isn't offered by many storage engines and the CAP theorem comes into play. Rather than updating attributes of the existing booking, one typically safe solution, well aligned with the eventual consistency offered by many storage engines, is to always insert. To achieve this the entities need to be separated like so:

Hotel booking

user
checkin
rooms
total cost

Payment request

hotel booking reference
payment request token

Payment confirmation

payment request reference
payment confirmation token

Payment cancellation

payment confirmation reference
payment cancellation token
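
The separated entities above can be sketched as insert-only tables. This is a minimal illustration using Python's built-in sqlite3 module, not the storage engine or schema Upgrade Digital actually used. The table names, column names and the `payment_state` helper are assumptions made for the example. The point it shows is that the payment state is derived from which rows exist, so no writer ever updates the booking row.

```python
import sqlite3

# Hypothetical schema: one insert-only table per entity listed above.
SCHEMA = """
CREATE TABLE hotel_booking (
    id INTEGER PRIMARY KEY,
    booked_by TEXT NOT NULL,
    checkin TEXT NOT NULL,
    rooms TEXT NOT NULL,
    total_cost_cents INTEGER NOT NULL
);
CREATE TABLE payment_request (
    id INTEGER PRIMARY KEY,
    booking_id INTEGER NOT NULL REFERENCES hotel_booking(id),
    request_token TEXT NOT NULL
);
CREATE TABLE payment_confirmation (
    id INTEGER PRIMARY KEY,
    payment_request_id INTEGER NOT NULL REFERENCES payment_request(id),
    confirmation_token TEXT NOT NULL
);
CREATE TABLE payment_cancellation (
    id INTEGER PRIMARY KEY,
    payment_confirmation_id INTEGER NOT NULL REFERENCES payment_confirmation(id),
    cancellation_token TEXT NOT NULL
);
"""


def open_store():
    """Create an in-memory database holding the example schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def insert_row(conn, sql, params):
    """Append one row and return its id; existing rows are never modified."""
    with conn:  # commits on success, rolls back on error
        return conn.execute(sql, params).lastrowid


def payment_state(conn, booking_id):
    """Derive the payment state of a booking from the rows that exist."""
    requests, confirmations, cancellations = conn.execute(
        "SELECT COUNT(DISTINCT r.id), COUNT(DISTINCT c.id), COUNT(DISTINCT x.id) "
        "FROM payment_request r "
        "LEFT JOIN payment_confirmation c ON c.payment_request_id = r.id "
        "LEFT JOIN payment_cancellation x ON x.payment_confirmation_id = c.id "
        "WHERE r.booking_id = ?",
        (booking_id,),
    ).fetchone()
    if cancellations:
        return "cancelled"
    if confirmations:
        return "confirmed"
    if requests:
        return "requested"
    return "unpaid"


conn = open_store()
booking = insert_row(
    conn,
    "INSERT INTO hotel_booking (booked_by, checkin, rooms, total_cost_cents) "
    "VALUES (?, ?, ?, ?)",
    ("alice", "2021-10-01", "1 double", 25000),
)
print(payment_state(conn, booking))  # unpaid
request = insert_row(
    conn,
    "INSERT INTO payment_request (booking_id, request_token) VALUES (?, ?)",
    (booking, "req-123"),
)
print(payment_state(conn, booking))  # requested
insert_row(
    conn,
    "INSERT INTO payment_confirmation (payment_request_id, confirmation_token) "
    "VALUES (?, ?)",
    (request, "conf-456"),
)
print(payment_state(conn, booking))  # confirmed
```

Because every writer only appends, two processes recording a confirmation and a cancellation at the same time cannot overwrite each other's field. Both rows survive and the reader decides how to interpret them.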

Reasoning

Avoiding locks helps us to scale better. Many storage engines only support table level locks, which can be a significant issue in online transaction processing systems. One payment provider I have worked with had median API response times in the 1000+ms range. Even the best available are often still in the 200ms range. Effectively this means that if you hold a lock on your booking or payment table for the duration of a 200ms provider call, you can only process ~5 transactions per second, and at 1000ms only ~1. Always inserting typically has O(1) performance semantics and is typically limited only by disk / network speed.
Avoiding lock release starvation is a significant gain. In the world of scaled data centres it's only a matter of time before one of your services crashes during a transaction. The law of large numbers says that as you run more services you are likely to observe more instance crashes. With 99.5% availability (a little better than two nines) you still have 12960 seconds of downtime per instance every 30-day month to contend with. Even using advanced monitoring you can't avoid some crashes at scale. Given that, it's essential to plan for them. When a process crashes, most distributed locking solutions will have to wait for the automated timeout of the lock. Eliding this problem by not locking is a significant win in degraded situations.
Minimise the window for issues. Recovery is always required, but writing updates with O(1) insert semantics dramatically narrows the window for lost writes. For our storage system at the time we were seeing insertion times in nanoseconds for first disk flush. At this point we only saw one crash during insert per year. We built a recovery task for that too. Keep an eye out for the future rule on self correction.
Minimise your entity storage's reliance on technology-specific sophisticated locking, e.g. relational database locks.
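
The throughput and downtime figures quoted in the points above can be checked with a couple of lines of arithmetic. The function names here are made up for the example, and the model is deliberately simple: it assumes a single lock held for the full duration of one provider call, and a 30-day month.

```python
def max_locked_throughput(lock_hold_seconds):
    """Transactions per second when each one holds the same lock exclusively."""
    return 1.0 / lock_hold_seconds


def monthly_downtime_seconds(availability, days=30):
    """Expected downtime per instance over a month at the given availability."""
    return (1.0 - availability) * days * 24 * 60 * 60


print(round(max_locked_throughput(0.2)))      # 5 per second at 200ms per call
print(round(max_locked_throughput(1.0)))      # 1 per second at 1000ms per call
print(round(monthly_downtime_seconds(0.995))) # 12960 seconds
```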

Context

For Upgrade Digital, one of our key value propositions was that our platform corrected inconsistencies between booking state and payment processing providers. One of the hotel chains we worked with regularly had rooms without payment and payment without rooms!

Some of the payment providers we used had delays in correction of up to 24 hours in production, so we had to recover elegantly. This might lead to retrying a transaction that had previously timed out, only to see that it later succeeded, so we needed to keep all request initialisation vectors.
Hotel room booking systems often allow manual overrides for room allocations as well as overbooking as a standard practice. This could mean that the actual product wasn't available for extended periods of time.

For general payment systems it's good practice to expect delays in callbacks and generally avoid overwriting fields, as race conditions and replays are regular occurrences.
For hospitality, the Upgrade Digital platform provided a consistent RESTful API across multiple Micros Opera versions and a number of payment processors. Our play / replay / check approach to async task execution automatically repaired numerous issues on either side of the platform, meaning we could sleep at night. For a small on-call team supporting bookings across 120 countries this is a must!
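
One way to make delayed and replayed callbacks harmless is to record each one as an insert guarded by a uniqueness constraint, so a replay becomes a no-op instead of an overwrite. The sketch below is an assumption about how that could look, not the Upgrade Digital implementation. It uses Python's sqlite3 module and SQLite's `INSERT OR IGNORE`, and the `payment_callback` table and the `record_callback` and `callbacks_for` helpers are hypothetical names.

```python
import sqlite3


def open_callback_log():
    """Create an in-memory, insert-only log of payment provider callbacks."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payment_callback ("
        " request_token TEXT NOT NULL,"
        " kind TEXT NOT NULL,"
        " provider_token TEXT NOT NULL,"
        " UNIQUE (request_token, kind, provider_token))"
    )
    return conn


def record_callback(conn, request_token, kind, provider_token):
    """Insert the callback; return False if it was a replay of a known one."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO payment_callback VALUES (?, ?, ?)",
            (request_token, kind, provider_token),
        )
    return cur.rowcount == 1


def callbacks_for(conn, request_token):
    """Return every distinct callback seen for a request, in arrival order."""
    return conn.execute(
        "SELECT kind, provider_token FROM payment_callback "
        "WHERE request_token = ? ORDER BY rowid",
        (request_token,),
    ).fetchall()


log = open_callback_log()
print(record_callback(log, "req-123", "confirmation", "conf-456"))  # True
print(record_callback(log, "req-123", "confirmation", "conf-456"))  # False
print(callbacks_for(log, "req-123"))  # [('confirmation', 'conf-456')]
```

A late success arriving after a timeout simply shows up as another row against the original request token, which a later check task can reconcile.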

Counter cases

Despite the general practice of always inserting there are notable counter cases where we did use basic locking functionality with a 'test and set' semantic:

In our task scheduling, the library used a task claim, compatible with AWS SQS, to claim async work. This claim required a test and set style storage engine, which was easy to achieve with SQL and some NoSQL storage engines like Dynamo.
Critical sections of code where exactly once semantics are required.
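
A 'test and set' claim of the kind described in these counter cases can be sketched in SQL as a conditional update: set the claimant only if nobody holds the task yet, then check whether a row actually changed. This is a minimal illustration with Python's sqlite3 module. The `task` table and the helper names are assumptions, and SQS compatibility and claim timeouts are not modelled.

```python
import sqlite3


def open_task_store():
    """Create an in-memory task table with a nullable claim column."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE task (id INTEGER PRIMARY KEY, claimed_by TEXT)")
    return conn


def add_task(conn):
    """Insert an unclaimed task and return its id."""
    with conn:
        return conn.execute("INSERT INTO task (claimed_by) VALUES (NULL)").lastrowid


def claim_task(conn, task_id, worker):
    """Test and set: claim the task only if it is currently unclaimed."""
    with conn:  # the test and the set happen in one atomic statement
        cur = conn.execute(
            "UPDATE task SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
            (worker, task_id),
        )
    return cur.rowcount == 1


store = open_task_store()
task_id = add_task(store)
print(claim_task(store, task_id, "worker-a"))  # True
print(claim_task(store, task_id, "worker-b"))  # False
```

Only the first claimant sees a changed row, so at most one worker proceeds with the task.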

Damien Allison - Personal Blog

2021/09/19

Startup ENG Rules Series. 1. Storage. Prefer Insertions to Updates