Background
Over the last decade I have worked at a number of very large and very small
companies, the smallest being just two people and the largest having over one
hundred thousand. On day one a startup faces many challenges, but the most
important one is usually survival. The first couple of months are critical
while going from 'zero to one', as the saying goes. The top priority has to be
securing customers and launching an initial offering, these days often
referred to as a Minimum Viable Product. In this mode engineers need to build
just enough to make the product work. Much like applying Occam's Razor, all
non-essential work should generally be avoided, and there is little room for
high principles in the search for results.
One thing that changes as a startup moves from prototyping into iterating is
how to survive the first big customer. The focus of the team often has to
shift from optimizing for a viable solution to considering aspects of
stability, reliability and correctness. In some sectors these matter more or
less compared to rapidly reaching a position where feedback can be sought to
verify product hypotheses.
In this series of posts I am going to share a collection of 'rules' which
emerged across a number of projects, along with the rationale, signals and
counter cases.
Note: these rules are based on experience from startups and may not reflect
common practices in larger companies. These insights are drawn purely from
experience at:
- SiteMorph: An SEO / SEM marketing tool for SMBs.
- ClickDateLove.com (Muster): A dating site employing basic ML approaches to
  create better profiles.
- Shomei / Futreshare: Ad attribution heuristic modelling for advertisers
  with billions of ad impressions.
- Upgrade Digital: A hospitality booking platform built for developers, with
  one of the fastest build times available to web developers in the world at
  the time.
The objective of these rules was to have standard solutions to
everyday questions based on real world lessons. Having de-facto solutions to
everyday problems meant that development could go faster. Going faster for a
startup means less cost, faster iteration and more feedback. Some of the
rules may seem to contradict this when they add overhead. The point here is
that the solutions were born out of necessity, and that necessity drove
iteration to a viable solution.
Rule 1. Storage, prefer insertions to updates
Advice
When a data attribute of an entity may be written or updated by a number of
writers, prefer refactoring that attribute into a separate concept and
inserting it into a different store rather than updating a field on an
existing entity. Examples include:
- Payment authorization code for a payment
- Approval for a change where multiple people can approve
- Any transaction-sensitive attribute which could be the source of a race
  condition.
Example
Consider a hotel booking for a single stay. This can be expressed in normal
form along the lines of:
- Hotel Booking
  - user: who books the stay
  - checkin: date of arrival
  - rooms: ... details of the required rooms
  - total cost: sum of all room night rates and fees.
  - payment request: payment transaction token used to initiate the
    transaction.
  - payment confirmation: payment completion token from the transaction
    processor.
  - payment cancellation: the cancellation token passed by the transaction
    processor.
This seems pretty reasonable and has all of the fields associated with the
booking. However, without good locking of the entity type, multiple actors
are able to update the fields, leading to a lost-update race condition. You
may argue that locking isn't such a bad thing, but the underlying locking
semantics typically lead to centralization, as decentralized consistency
isn't offered by many storage engines and the CAP theorem comes into play.
Rather than updating attributes of the existing booking, one typically safe
solution, well aligned with the eventual consistency offered by many storage
engines, is to always insert. To achieve this the entities need to be
separated like so (a storage sketch follows the list):
- Hotel booking
  - user
  - checkin
  - rooms
  - total cost
- Payment request
  - hotel booking reference
  - payment request token
- Payment confirmation
  - payment request reference
  - payment confirmation token
- Payment cancellation
  - payment confirmation reference
  - payment cancellation token
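To make this concrete, here is a minimal sketch of the insert-only layout,
assuming a SQL store; the table and column names are illustrative rather than
any real production schema:

```python
import sqlite3

# Illustrative insert-only schema: payment events are separate entities that
# reference the booking, rather than mutable fields on the booking itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hotel_booking (
    booking_id  INTEGER PRIMARY KEY,
    user_id     TEXT NOT NULL,
    checkin     TEXT NOT NULL,
    rooms       TEXT NOT NULL,
    total_cost  INTEGER NOT NULL
);
CREATE TABLE payment_request (
    request_id    INTEGER PRIMARY KEY,
    booking_id    INTEGER NOT NULL REFERENCES hotel_booking(booking_id),
    request_token TEXT NOT NULL
);
CREATE TABLE payment_confirmation (
    confirmation_id    INTEGER PRIMARY KEY,
    request_id         INTEGER NOT NULL REFERENCES payment_request(request_id),
    confirmation_token TEXT NOT NULL
);
CREATE TABLE payment_cancellation (
    cancellation_id    INTEGER PRIMARY KEY,
    confirmation_id    INTEGER NOT NULL REFERENCES payment_confirmation(confirmation_id),
    cancellation_token TEXT NOT NULL
);
""")

# Concurrent writers append new rows instead of racing to update a shared field.
conn.execute("INSERT INTO hotel_booking VALUES (1, 'user-42', '2024-06-01', '1x double', 250)")
conn.execute("INSERT INTO payment_request (booking_id, request_token) VALUES (1, 'req-abc')")
conn.execute("INSERT INTO payment_confirmation (request_id, confirmation_token) VALUES (1, 'conf-xyz')")
conn.commit()
```

Recording a payment event now means inserting a row into the relevant table;
the current state of a booking is derived by reading the related rows, which
is append-friendly and sits comfortably with eventual consistency.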
Reasoning
- Avoiding locks helps us to scale better. Many storage engines only support
  table-level locks, which can be a significant issue in online transaction
  processing systems. One payment provider I have worked with had median API
  response times in the 1000+ms range. Even the best available are often still
  in the 200ms range. Effectively this means that if you hold a lock to update
  your booking or payment table, you can only process ~5 transactions per
  second (see the worked numbers after this list). Always inserting typically
  has O(1) performance semantics and is usually only limited by disk / network
  speed.
- Avoiding lock release starvation is a significant gain. In the world of
  scaled data centres it's only a matter of time before one of your services
  crashes during a transaction. The law of large numbers says that as you run
  more instances you are likely to observe more instance crashes. With 99.5%
  availability you still have 12960 seconds of downtime per instance every
  month to contend with. Even using advanced monitoring you can't avoid some
  crashes at scale; given that, it's essential to plan for them. When a
  process crashes, most distributed locking solutions have to wait for an
  automated timeout of the lock. Sidestepping this problem by avoiding locks
  is a significant win in degraded situations.
- Minimise the window for issues. Recovery is always required, but writing
  with O(1) insert semantics dramatically narrows the window for lost writes.
  For our storage system at the time we were seeing insertion times in
  nanoseconds to first disk flush. At that point we saw only around one crash
  during an insert per year, and we built a recovery task for that too. Keep
  an eye out for the future rule on self correction.
- Minimise your entity storage's reliance on technology-specific,
  sophisticated locking, e.g. relational database locks.
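For reference, the back-of-the-envelope arithmetic behind the throughput and
downtime figures above (illustrative numbers only):

```python
# Rough arithmetic behind the reasoning bullets above (illustrative only).
lock_hold_seconds = 0.2                     # ~200 ms payment API call made while holding a lock
print(1 / lock_hold_seconds)                # => 5.0 transactions per second on the locked table

availability = 0.995                        # 99.5% uptime per instance
month_seconds = 30 * 24 * 3600              # ~2,592,000 seconds in a 30-day month
print(round((1 - availability) * month_seconds))  # => 12960 seconds of downtime per month
```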
Context
- For Upgrade Digital one of our key value propositions was that our platform included correction of booking and payment state across payment processing providers. One of the hotel chains we worked with regularly had rooms without payment and payments without rooms!
- Some of the payment providers we used had delays in correction of up to 24 hours in production, so we had to recover elegantly. This might mean retrying a transaction that had previously timed out, only to see that it had later succeeded, so we needed to keep all request initialisation vectors.
- Hotel room booking systems often allow manual overrides for room allocations, as well as overbooking, as standard practice. This could mean that the actual product wasn't available for extended periods of time.
- For general payment systems it's good practice to expect delays in callbacks and to avoid overwriting fields, as race conditions and replays are regular occurrences.
- For hospitality, the Upgrade Digital platform provided a consistent RESTful API across multiple Micros Opera versions and a number of payment processors. Our play / replay / check approach to async task execution automatically repaired numerous issues on either side of the platform, meaning we could sleep at night (a sketch of the check-before-replay idea follows this list). For a small on-call team supporting bookings across 120 countries this is a must!
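As a heavily simplified sketch of that check-before-replay idea: before replaying a payment request that timed out, look up its status with the provider using the stored request token. The provider client and status values below are hypothetical stand-ins, not the real platform's API:

```python
from dataclasses import dataclass


@dataclass
class PaymentRequest:
    booking_id: int
    request_token: str  # the stored initialisation vector for the original request


class FakeProvider:
    """Hypothetical stand-in for a payment processor client (illustrative only)."""

    def __init__(self, statuses):
        self._statuses = statuses

    def lookup(self, token):
        return self._statuses.get(token, "failed")


def settle(request: PaymentRequest, provider) -> str:
    """Check a timed-out request with the provider before replaying it."""
    status = provider.lookup(request.request_token)
    if status == "succeeded":
        return "record-confirmation"  # completed after our timeout; insert the confirmation
    if status == "pending":
        return "retry-later"          # provider still working; check again on the next pass
    return "replay"                   # genuinely failed; safe to issue a new request


provider = FakeProvider({"req-abc": "succeeded"})
print(settle(PaymentRequest(1, "req-abc"), provider))  # => record-confirmation
```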
Counter cases
Despite the general practice of always inserting, there are notable counter cases where we did use basic locking functionality with a 'test and set' semantic:
- In our task scheduling library we used a task claim, compatible with AWS SQS, to claim async work. This claim required a test-and-set style storage engine, which was easy to achieve with SQL and some NoSQL storage engines like Dynamo (see the sketch after this list).
- Critical sections of code where exactly once semantics are required.
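A minimal sketch of that test-and-set claim, assuming a SQL store; the task table, column names and lease length are illustrative rather than the real library's schema:

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE task (
    task_id      TEXT PRIMARY KEY,
    payload      TEXT NOT NULL,
    claimed_by   TEXT,     -- NULL until a worker claims the task
    claim_expiry INTEGER   -- epoch seconds; allows re-claiming after a crash
)""")
conn.execute("INSERT INTO task (task_id, payload) VALUES ('t-1', 'send-confirmation-email')")


def claim_task(conn, task_id, worker_id, now, lease_seconds=300):
    """Atomically claim a task: succeeds only if it is unclaimed or its lease expired."""
    cur = conn.execute(
        "UPDATE task SET claimed_by = ?, claim_expiry = ? "
        "WHERE task_id = ? AND (claimed_by IS NULL OR claim_expiry < ?)",
        (worker_id, now + lease_seconds, task_id, now),
    )
    return cur.rowcount == 1  # True means this worker won the claim


worker = str(uuid.uuid4())
print(claim_task(conn, "t-1", worker, now=1_000_000))   # => True: first claim wins
print(claim_task(conn, "t-1", "other", now=1_000_100))  # => False: lease still held
```

The conditional update succeeds for exactly one worker, giving a claim with exactly-once semantics without holding a lock for the duration of the task.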