Protobuf Persistence for Fun and Profit

Note: This post is about the protostore project, which was recently open sourced as part of a SiteMorph Open Source initiative.

Ever since my days writing code at Google, I have been a huge fan of using protobuf to generate all of my simple POJO classes. There are a few good reasons for this:

Code that is generated by protoc doesn't have to be tested.

You avoid writing repetitive boilerplate code like get* and set*. Yes, I realise that your IDE can generate a lot of these for you...

The coding style of the generated protobuf classes makes them easy to use.

Protobuf supports data interchange between programming environments.

Protobuf messages in Java are inherently thread safe once built, as they are immutable.

Let's just say that writing protobuf code rocks. An example could be:


message Picture {
  required string urn = 1;
  required string url = 2;
  required string profileUrn = 3;
  required Moderated moderated = 4 [default = OK];
  optional int32 width = 5;
  optional int32 height = 6;
}

This proto message describes a picture. Note that I am using meta references here (profileUrn points at a profile by its URN rather than embedding it), a lot like you would reference rows in a relational database. Many of the REST APIs I am working on also use meta references rather than fully materialising related objects. Protobuf does have some disadvantages, though, which stem from 'what it is'. I typically use protobuf for the internal representation of data which is persisted. I use a different set of objects for external representations; for web services, for example, I tend to use Jackson.
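To give a feel for the coding style, here is a minimal sketch of how a class like Picture is used. To keep it runnable without protoc or the protobuf runtime, Picture below is a hand-written stand-in that only mirrors the shape of the generated API (newBuilder, toBuilder, get/has/set/clear) and models just three of the fields. The real generated class is much larger, and PictureDemo and resized are hypothetical names used only for illustration.

```java
// Hand-written stand-in for the class protoc would generate from the Picture
// message. Assumption: only urn, url and width are modelled in this sketch.
final class Picture {
    private final String urn;
    private final String url;
    private final Integer width; // null means "not set" in this sketch

    private Picture(Builder b) {
        this.urn = b.urn;
        this.url = b.url;
        this.width = b.width;
    }

    static Builder newBuilder() { return new Builder(); }

    // Copies this message into a fresh builder so a modified copy can be built.
    Builder toBuilder() {
        Builder b = new Builder();
        b.urn = urn;
        b.url = url;
        b.width = width;
        return b;
    }

    String getUrn() { return urn; }
    String getUrl() { return url; }
    boolean hasWidth() { return width != null; }
    int getWidth() { return width == null ? 0 : width; }

    // Builders are mutable; only the built Picture is immutable.
    static final class Builder {
        private String urn;
        private String url;
        private Integer width;

        Builder setUrn(String v) { urn = v; return this; }
        Builder setUrl(String v) { url = v; return this; }
        Builder setWidth(int v) { width = v; return this; }
        Builder clearWidth() { width = null; return this; }

        Picture build() {
            // Like a proto2 message, building with a required field unset fails.
            if (urn == null || url == null) {
                throw new IllegalStateException("missing required field");
            }
            return new Picture(this);
        }
    }
}

// Hypothetical helper showing the usual "modify a copy" idiom.
final class PictureDemo {
    static Picture resized(Picture original, int width) {
        return original.toBuilder().setWidth(width).build();
    }
}
```

Because a built message never changes, resized returns a new Picture and leaves the original untouched, which is what makes sharing messages between threads safe.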

In previous versions of my projects (SiteMorph, Shomei, Connect, Click Date Love) I used to write the data access code by hand or try to use libraries for object persistence. A few things struck me about these libraries:

They were very large, necessarily because they were general purpose.

Some were not very performance oriented; XML is usually going to be slow.

They are often tricky to configure to work with legacy databases.

Re-factoring isn't as clear a process as it could be.

Given all of that, one Saturday a few months ago I decided to try to create a very lightweight library that solved 80% of my database coding problems. The answer was to write a CRUD driver that stores protobuf messages in tables. The assumption is that you then use your database's SQL to re-factor your data across version changes. This might sound like an unnecessary consideration, right? Your database only changes every few months, right? Wrong: for some projects I am making more than 10 structural database changes per month. Simply mapping the proto to the table was a clear choice, as it lets you read and write messages through a really simple interface. The resulting library offers:

Very small library footprint. Version 2.6.1 packed as a jar is only 26KB.

Create a message by passing a partially constructed builder; the storage system sets the identifier.

Read with support for primary key (unique identifier) and secondary indexes. Also read all.

Update a message given a builder created from MyMessage.toBuilder().

Delete.

Basic ordering of records returned.

Support for auto ID primary key column.

Support for URN-keyed primary key column.

Support for reading basic message types as well as enumeration values but not nested messages.

A very simple iterator interface for accessing data.

Performance comparable to writing prepared statements manually.

No more writing SQL for basic operations.

Eliminate the need for testing SQL on version migrations.
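The shape of such a CRUD interface can be sketched as follows. This is not the protostore API: NoteStore, InMemoryNoteStore and Note are hypothetical names, Note is a tiny hand-written value standing in for a generated message, and an in-memory map replaces the database so the example runs on its own.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Tiny immutable value standing in for a generated protobuf message (assumption).
final class Note {
    final long id;      // 0 means "not yet assigned"
    final String text;

    Note(long id, String text) { this.id = id; this.text = text; }
    Note withId(long newId) { return new Note(newId, text); }
    Note withText(String newText) { return new Note(id, newText); }
}

// Hypothetical CRUD surface; NOT the real protostore interface.
interface NoteStore {
    Note create(Note partial);      // the store assigns the identifier
    Iterator<Note> read(long id);   // primary key lookup, zero or one result
    Iterator<Note> readAll();       // every record, ordered by id
    Note update(Note changed);      // replaces the record with the same id
    void delete(long id);
}

// In-memory implementation so the sketch runs without a database.
final class InMemoryNoteStore implements NoteStore {
    private final Map<Long, Note> rows = new TreeMap<>();
    private long nextId = 1;

    public Note create(Note partial) {
        Note stored = partial.withId(nextId++);
        rows.put(stored.id, stored);
        return stored;
    }

    public Iterator<Note> read(long id) {
        List<Note> out = new ArrayList<>();
        Note row = rows.get(id);
        if (row != null) {
            out.add(row);
        }
        return out.iterator();
    }

    public Iterator<Note> readAll() {
        // Copy the values so callers iterate over a stable snapshot.
        return new ArrayList<>(rows.values()).iterator();
    }

    public Note update(Note changed) {
        if (!rows.containsKey(changed.id)) {
            throw new IllegalArgumentException("no record with id " + changed.id);
        }
        rows.put(changed.id, changed);
        return changed;
    }

    public void delete(long id) { rows.remove(id); }
}
```

Every read returns an iterator, even for a primary key lookup, so calling code handles "zero, one or many" results in the same way.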

Don't get me wrong. Writing tests is a great thing. Avoiding the need for them is even better. Take this example re-factoring process, in which you rename and modify a field:

Do your SQL re-factoring, creating the renamed field and setting its new value.

Use your IDE to rename the generated methods for the proto field to match the new field name: get[Field] and has[Field] on the message, and set[Field] and clear[Field] on the builder. This updates all of your code to use the new naming convention.

Update your proto field names and rebuild. Everything should just work.

Deploy the updated code which now uses the new field.

Delete your old field in the database table.
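One way to see why the rebuild step in the process above can "just work": if the store derives its SQL from the proto field names, a renamed field changes the statement without anyone editing SQL by hand. The sketch below is an assumption about how such a mapping might be built, not a description of protostore internals. A real implementation would probably take the names from the message descriptor (Descriptor.getFields() and FieldDescriptor.getName() in the protobuf Java runtime); here a plain list is used so the example is self-contained. SqlSketch, insertSql and the new field name ownerUrn are made up for illustration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical helper: derives an INSERT statement from a table name and the
// message's field names, assuming columns are named after the proto fields.
final class SqlSketch {
    static String insertSql(String table, List<String> fields) {
        String columns = String.join(", ", fields);
        // One "?" placeholder per field, for use with a prepared statement.
        String marks = String.join(", ", Collections.nCopies(fields.size(), "?"));
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + marks + ")";
    }
}
```

With the fields urn, url and profileUrn this yields a statement naming the profileUrn column. After renaming the field to ownerUrn and rebuilding, the same call names ownerUrn instead, matching the column created in the SQL re-factoring step.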

Pretty simple, right? You may still have to change the semantic interpretation of the underlying data, and that is worth testing, but we have avoided a few very repetitive tasks and the need to test the code. Combined with the power of using protobuf to generate your object representations, this can make you a much more effective coder. To give an example, in most projects between 20% and 50% of the code is just object representations. Admittedly, proto generated code is somewhat verbose, but the point stands. Using the protostore means that 80% of your data access layer doesn't need to be written either. All you need to do is create a factory somewhere which returns a protostore for your type of database table.

Damien Allison - Personal Blog

2013/10/29
