In my last post, I gave you a quick introduction to Google’s protocol buffers, a.k.a. “protobuf” — and a quick peek under the hood.
Today, we’ll build on that understanding and talk about protobuf’s strategy for backward and forward compatibility. We’ll also talk about the pitfalls specific to this strategy that you should watch out for.
Adding Sweetener
We’re going to build on the Coffee message from my previous example and add another field. This time, we’ll use an enumeration. Let’s look at it first:
syntax = "proto3";
package cafe;
message Coffee {
  bool cream = 1;
  enum Sweetener {
    NONE = 0;
    SUGAR = 1;
    SUCRALOSE = 2;
  }
  Sweetener sweetener = 2;
}
Let’s start with coffee that has cream and sucralose (sorry, I don’t do sugar). Once again, we can decode it raw and then with the updated .proto file:
% protoc --decode_raw <coffee.protobuf
1: 1
2: 2
% protoc --decode=cafe.Coffee \
--proto_path=. \
cafe.proto <coffee.protobuf
cream: true
sweetener: SUCRALOSE
All is well. But now let’s try coffee with no sweetener (which is honestly what I prefer). I’ve set the sweetener field to NONE for this run.
% protoc --decode_raw <coffee.protobuf
1: 1
% protoc --decode=cafe.Coffee \
--proto_path=. \
cafe.proto <coffee.protobuf
cream: true
And the file is once again just two bytes long. Where’s the sweetener?
Default Values
There’s a design decision in protocol buffers you’ll need to get your head around in order to use them effectively, and that is the notion of default values.
Default values are the mechanism that lets your code cope with a .proto file that doesn’t match the one used to generate the serialized data you’re dealing with. Your .proto may be newer or older than that one.
Protobuf libraries, when asked to serialize a field that matches a default value, will simply omit that field from the serialized representation. Then, when deserializing, they will fill in the object with the default value.
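To make the omission concrete, here’s a rough sketch in plain Python of how a serializer could skip default values. This is not the real protobuf library: encode_varint, encode_varint_field, and encode_coffee are hypothetical helpers written for this post, and they handle only varint fields (which is all Coffee needs). The field numbers come from cafe.proto.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_varint_field(field_number: int, value: int) -> bytes:
    """Encode one varint field, or nothing at all if it holds the default."""
    if value == 0:
        return b""  # default value: leave the field out entirely
    tag = (field_number << 3) | 0  # wire type 0 means varint
    return encode_varint(tag) + encode_varint(value)


def encode_coffee(cream: bool, sweetener: int) -> bytes:
    # Hypothetical stand-in for a generated serializer.
    return encode_varint_field(1, int(cream)) + encode_varint_field(2, sweetener)


NONE, SUGAR, SUCRALOSE = 0, 1, 2

with_sucralose = encode_coffee(cream=True, sweetener=SUCRALOSE)
unsweetened = encode_coffee(cream=True, sweetener=NONE)
print(with_sucralose.hex())  # 08011002
print(unsweetened.hex())     # 0801
```

The second message is the two-byte file from the run above: field 2 never makes it onto the wire because NONE is zero.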
In the case of Sweetener, the default value of an enum is zero, so the default is always NONE. This means:
- If you’re deserializing an old Coffee message, sweetener will always be NONE. This makes sense, semantically, with the specific message we’re building here. When adding fields, be aware of this behavior and plan accordingly.
- If you’re serializing a Coffee message with NONE, that field will be omitted from the serialized data. The older .proto will be able to read it and deal exclusively with the cream field, and all is well.
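The reading side can be sketched the same way. The snippet below is a hand-rolled decoder in plain Python, not the generated code you’d actually use; read_varint, decode_coffee, and SWEETENER_NAMES are names I made up for illustration. It starts from the defaults and only overwrites the fields it finds in the data.

```python
SWEETENER_NAMES = {0: "NONE", 1: "SUGAR", 2: "SUCRALOSE"}


def read_varint(data: bytes, pos: int):
    """Return (value, next_pos) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_coffee(data: bytes) -> dict:
    """Decode a Coffee message using the newer schema (fields 1 and 2)."""
    coffee = {"cream": False, "sweetener": "NONE"}  # start from the defaults
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(data, pos)
        if field_number == 1:
            coffee["cream"] = bool(value)
        elif field_number == 2:
            coffee["sweetener"] = SWEETENER_NAMES.get(value, value)
    return coffee


old_message = bytes.fromhex("0801")  # written before sweetener existed
print(decode_coffee(old_message))
# {'cream': True, 'sweetener': 'NONE'}
```

Notice that the decoder can’t tell an old message apart from a new one that asked for NONE. Both arrive as the same two bytes.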
Unknown Fields
You’ll notice one thing we didn’t cover: what happens if we feed a newer Coffee message, with sweetener set to SUGAR or SUCRALOSE, to the older .proto that knows nothing about it?
Let’s try it, using the old cafe.proto and Coffee with cream (true) and SUCRALOSE.
% protoc --decode=cafe.Coffee \
--proto_path=. \
cafe.proto <coffee.protobuf
cream: true
2: 2
protoc didn’t complain, and neither will your generated deserializer. The unknown field doesn’t have a name, but it does still have its number.
If you wrote code around the original .proto file, you wouldn’t be aware of sweetener at all. Google’s protobuf library has methods to expose unknown fields to your code, but you won’t get any more information than this: you have a field numbered 2 whose value is 2.
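Here’s a rough sketch of what that looks like from the old code’s point of view. It’s plain Python rather than the real library’s unknown-field API, and read_varint and decode_old_coffee are hypothetical helpers that only understand varint fields. The decoder knows field 1 and simply collects everything else.

```python
def read_varint(data: bytes, pos: int):
    """Return (value, next_pos) for the varint starting at data[pos]."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return result, pos
        shift += 7


def decode_old_coffee(data: bytes):
    """Decode with the old schema, which only knows field 1 (cream)."""
    cream = False
    unknown = []  # (field_number, raw_value) pairs we have no name for
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(data, pos)
        if field_number == 1:
            cream = bool(value)
        else:
            unknown.append((field_number, value))  # keep it, don't fail
    return cream, unknown


cream, unknown = decode_old_coffee(bytes.fromhex("08011002"))
print(cream)    # True
print(unknown)  # [(2, 2)]
```

The pair (2, 2) is everything the old code gets: a field number and a raw value, with no name and no type.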
In fact, you won’t even know it’s an enum! Protobuf stores enums as their numeric value. It’s up to the deserializing code to make sense of that value. Scalar values are particularly tricky here, because the same bytes can be decoded many ways — but we digress.
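As a small illustration of that ambiguity, consider the raw varint 2 from our unknown field. Several protobuf scalar types share the varint wire type, so without the schema each of the readings below is equally plausible. This is just a sketch; zigzag_decode and the interpretations table are my own illustrative names, not library calls.

```python
def zigzag_decode(n: int) -> int:
    """Undo the ZigZag mapping that sint32/sint64 fields use."""
    return (n >> 1) ^ -(n & 1)


raw = 2  # the varint payload carried by unknown field 2

# The same payload, read as different varint-encoded types.
interpretations = {
    "int32 or uint32": raw,
    "bool": raw != 0,
    "sint32": zigzag_decode(raw),
    "enum": raw,  # only the enum definition can turn this into SUCRALOSE
}

for type_name, value in interpretations.items():
    print(f"{type_name}: {value}")
```

Only the .proto file tells you which of these readings the sender intended.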
The upshot is that protocol buffers allow you to handle newer Coffee messages with older code, but you just won’t know anything about the sweetener request. Is that important for your application? Maybe, maybe not.
Next Time: Enforcing Predictability
This time, we talked about adding a field to an existing message, and what happens to the serialized version when you try to read it with older or newer code.
In our final post, we’ll circle back around to my original requirement: making sure that we aren’t handling data we don’t understand (because the data is newer than our code) or data that isn’t complete (because the data is older than our code).