Background
- SiteMorph: An SEO / SEM marketing tool for SMBs.
- ClickDateLove.com (Muster): A dating site employing basic ML approaches to create better profiles.
- Shomei / Futreshare: Ad attribution heuristic modelling for advertisers with billions of ad impressions.
- Upgrade Digital: A hospitality booking platform built for developers, with some of the fastest build times available to web developers in the world at the time.
The objective of these rules was to have standard solutions to everyday questions, based on real-world lessons. Having de-facto solutions to everyday problems meant that development could go faster, and going faster for a startup means less cost, faster iteration and more feedback. Some of the rules may seem to contradict this when they add overhead; the point is that the solutions were born out of necessity, and that necessity drove iteration to a viable solution.
Rule 1. Storage: prefer insertions to updates
Advice
When a data attribute of an entity may be written or updated by a number of writers, prefer refactoring that attribute into a separate concept and inserting it into a different store, rather than updating a field on an existing entity. Examples of such attributes include:
- Payment authorization code for a payment
- Approval for a change where multiple people can approve
- Any transaction-sensitive attribute that could be the source of a race condition.
Example
Consider a hotel booking for a single stay. This can be expressed in normal form along the lines of:
- Hotel Booking
- user : who books the stay
- checkin : date of arrival
- rooms : ... details of the required rooms
- total cost: sum of all room night rates and fees.
- payment request: payment transaction token used to initiate the transaction.
- payment confirmation: payment completion token from transaction processor.
- payment cancellation: the cancellation token passed by the transaction processor.
This seems pretty reasonable and has all of the fields associated with the booking. However, without good locking of the entity type, multiple actors are able to update the fields, leading to lost-update race conditions.
Locking isn't such a bad thing, you may argue, but the underlying locking semantics typically lead to centralization: decentralized consistency isn't offered by many storage engines, and the CAP theorem comes into play. Rather than updating attributes of the existing booking, one typically safe solution, well aligned with the eventual consistency offered by many storage engines, is to always insert. To achieve this the entities need to be separated like so (a sketch follows the list):
- Hotel booking
- user
- checkin
- rooms
- total cost
- Payment request
- hotel booking reference
- payment request token
- Payment confirmation
- payment request reference
- payment confirmation token
- Payment cancellation
- payment confirmation reference
- payment cancellation token
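To make the layout concrete, here is a minimal sketch of the insert-only model in Python with SQLite. The table and column names, the token values, and SQLite itself are illustrative assumptions, not the platform's actual schema; any store with cheap inserts works the same way.

```python
import sqlite3

# In-memory database purely for illustration; any engine with cheap,
# append-style inserts exhibits the same behaviour.
db = sqlite3.connect(":memory:")

# Each payment step is its own insert-only table that references its
# predecessor. No row is ever updated after it is written.
db.executescript("""
CREATE TABLE hotel_booking (
  id         INTEGER PRIMARY KEY,
  user       TEXT NOT NULL,
  checkin    TEXT NOT NULL,
  rooms      TEXT NOT NULL,
  total_cost INTEGER NOT NULL  -- minor units, e.g. cents
);
CREATE TABLE payment_request (
  id            INTEGER PRIMARY KEY,
  booking_id    INTEGER NOT NULL REFERENCES hotel_booking(id),
  request_token TEXT NOT NULL
);
CREATE TABLE payment_confirmation (
  id                 INTEGER PRIMARY KEY,
  request_id         INTEGER NOT NULL REFERENCES payment_request(id),
  confirmation_token TEXT NOT NULL
);
CREATE TABLE payment_cancellation (
  id                 INTEGER PRIMARY KEY,
  confirmation_id    INTEGER NOT NULL REFERENCES payment_confirmation(id),
  cancellation_token TEXT NOT NULL
);
""")

# The booking and every payment event are recorded as new rows; a slow
# or replayed payment callback simply becomes another insert instead of
# racing to overwrite a status field on the booking.
booking = db.execute(
    "INSERT INTO hotel_booking (user, checkin, rooms, total_cost) "
    "VALUES (?, ?, ?, ?)",
    ("alice", "2014-06-01", "1 x double", 25000))
request = db.execute(
    "INSERT INTO payment_request (booking_id, request_token) VALUES (?, ?)",
    (booking.lastrowid, "req-123"))
db.execute(
    "INSERT INTO payment_confirmation (request_id, confirmation_token) "
    "VALUES (?, ?)",
    (request.lastrowid, "conf-456"))
db.commit()
```

The current payment state of a booking is then derived from which rows exist, rather than read from a mutable status field.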
Reasoning
- Avoiding locks helps us to scale better. Many storage engines only support table-level locks, which can be a significant issue in online transaction processing systems. One payment provider I have worked with had median API response times above 1000 ms, and even the best available are often still around 200 ms. Effectively this means that if you hold a lock on your booking or payment table while you call the provider, you can only process ~5 transactions per second. Always inserting typically has O(1) performance semantics and is usually limited only by disk / network speed.
- Avoiding lock-release starvation is a significant gain. In the world of scaled data centres it's only a matter of time before one of your services crashes during a transaction; the law of large numbers says that as you run more instances you will observe more instance crashes. Even at 99.5% availability you still have 12,960 seconds of downtime per instance every month to contend with, and even with advanced monitoring you can't avoid some crashes at scale. Given that, it's essential to plan for them. When a process crashes while holding a lock, most distributed locking solutions have to wait for an automated timeout of the lock. Sidestepping this problem by not locking at all is a significant win in degraded situations.
- Minimise the window for issues. Recovery is always required, but writing with O(1) insert semantics dramatically narrows the window for lost writes. For our storage system at the time we were seeing insertion times in nanoseconds for first disk flush; at that rate we only saw about one crash during an insert per year, and we built a recovery task for that too. Keep an eye out for the future rule on self-correction.
- Minimise your entity storage's reliance on technology-specific, sophisticated locking, e.g. relational database locks (a read-side sketch of deriving state without locks follows this list).
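For completeness, here is a sketch of the read side, reusing the illustrative tables from the earlier example: the booking's payment state is derived purely from which rows exist, so neither readers nor writers ever need a lock. The state names are assumptions for illustration.

```python
def payment_state(db, booking_id):
    """Derive the payment state of a booking from row presence alone."""
    row = db.execute(
        """
        SELECT pr.id, pc.id, px.id
        FROM payment_request pr
        LEFT JOIN payment_confirmation pc ON pc.request_id = pr.id
        LEFT JOIN payment_cancellation px ON px.confirmation_id = pc.id
        WHERE pr.booking_id = ?
        ORDER BY pr.id DESC
        LIMIT 1
        """,
        (booking_id,)).fetchone()
    if row is None:
        return "unpaid"            # no payment ever requested
    _, confirmed, cancelled = row
    if cancelled is not None:
        return "cancelled"         # confirmation was later cancelled
    if confirmed is not None:
        return "confirmed"         # request completed successfully
    return "requested"             # request in flight or timed out
```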
Context
- For Upgrade Digital, one of our key value propositions was that our platform included correction of booking state and of payment processing state across providers. One of the hotel chains we worked with regularly had rooms without payment and payments without rooms!
- Some of the payment providers we used had correction delays of up to 24 hours in production, so we had to recover elegantly. This could mean retrying a transaction that had previously timed out, only to see later that it had succeeded, so we needed to keep all request initialisation vectors.
- Hotel room booking systems often allow manual overrides of room allocations, and overbooking is standard practice. This could mean that the actual product wasn't available for extended periods of time.
- For payment systems in general, it's good practice to expect delays in callbacks and to avoid overwriting fields, as race conditions and replays are regular occurrences.
- For hospitality, the Upgrade Digital platform provided a consistent RESTful API across multiple Micros Opera versions and a number of payment processors. Our play / replay / check approach to async task execution automatically repaired numerous issues on either side of the platform, meaning we could sleep at night. For a small on-call team supporting bookings across 120 countries, this is a must!
Counter cases
Despite the general practice of always inserting, there are notable counter cases where we did use basic locking functionality with 'test and set' semantics:
- In our task scheduling library we used a task claim, compatible with AWS SQS, to claim async work. This claim required test-and-set support from the storage engine, which was easy to achieve with SQL and with some NoSQL storage engines like DynamoDB (see the sketch after this list).
- Critical sections of code where exactly-once semantics are required.
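As a contrast with the insert-only pattern, here is a minimal sketch of such a test-and-set claim in SQL, assuming a hypothetical task table with an id and a nullable claimed_by column: the conditional UPDATE only matches an unclaimed row, so the affected row count tells each worker whether it won.

```python
def try_claim(db, task_id, worker_id):
    """Atomically claim a task. The WHERE clause only matches a row
    nobody has claimed yet, so at most one concurrent worker sees
    rowcount == 1 -- a test-and-set on the claimed_by column."""
    cur = db.execute(
        "UPDATE task SET claimed_by = ? "
        "WHERE id = ? AND claimed_by IS NULL",
        (worker_id, task_id))
    db.commit()
    return cur.rowcount == 1
```

On DynamoDB the equivalent is a conditional write (a PutItem or UpdateItem with a condition expression); the principle is the same in either case: the storage engine arbitrates exactly one winner.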