Using Protobuf, Part 2: Fields and Compatibility

In my last post, I gave you a quick introduction to Google’s protocol buffers, a.k.a. “protobuf” — and a quick peek under the hood.

Today, we’ll build on that understanding and talk about protobuf’s strategy for backward and forward compatibility. We’ll also talk about the pitfalls specific to this strategy that you should watch out for.

Adding Sweetener

We’re going to build on the Coffee message from my previous example and add another field. This time, we’ll use an enumeration. Let’s look at it first:

syntax = "proto3";

package cafe;

message Coffee {
  bool cream = 1;
  enum Sweetener {
    NONE = 0;
    SUGAR = 1;
    SUCRALOSE = 2;
  }
  Sweetener sweetener = 2;
}

Let’s start with coffee that has cream and sucralose (sorry, I don’t do sugar). Once again, we can decode it raw, and then decode it against the updated .proto file:

% protoc --decode_raw <coffee.protobuf
1: 1
2: 2
% protoc --decode=cafe.Coffee \
         --proto_path=. \
         cafe.proto <coffee.protobuf
cream: true
sweetener: SUCRALOSE

All is well. But now let’s try coffee with no sweetener (which is honestly what I prefer). I’ve set the sweetener field to NONE for this run.

% protoc --decode_raw <coffee.protobuf
1: 1
% protoc --decode=cafe.Coffee \
         --proto_path=. \
         cafe.proto <coffee.protobuf
cream: true

And the file is once again just two bytes long. Where’s the sweetener?

Default Values

There’s a design decision in protocol buffers you’ll need to get your head around in order to use them effectively, and that is the notion of default values.

Default values are how your code copes with data that was written against a different version of the .proto file, whether that version is newer or older than the one your code was generated from.

When a protobuf library is asked to serialize a plain proto3 field (like the ones in Coffee) whose value equals its default, it simply omits that field from the serialized representation. Then, when deserializing, it fills in any missing field with the default value.
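To make that concrete, here is a minimal hand-rolled sketch of how an encoder for Coffee could behave. This is not Google’s library code: encode_varint and encode_coffee are hypothetical helpers written for this post, and they only cover the two varint fields in our message.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_coffee(cream, sweetener):
    """Serialize Coffee, skipping any field that holds its default value.

    Hypothetical helper: sweetener is passed as its enum number
    (0 = NONE, 1 = SUGAR, 2 = SUCRALOSE).
    """
    out = bytearray()
    if cream:  # default is false, so only true is written
        out += encode_varint((1 << 3) | 0)  # field 1, wire type 0 (varint)
        out += encode_varint(1)
    if sweetener != 0:  # default is NONE (0), so NONE is never written
        out += encode_varint((2 << 3) | 0)  # field 2, wire type 0 (varint)
        out += encode_varint(sweetener)
    return bytes(out)


print(encode_coffee(True, 2).hex())  # 08011002
print(encode_coffee(True, 0).hex())  # 0801
```

The second call produces just two bytes: the cream field and nothing else.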

In the case of Sweetener, the default value of an enum is the value numbered zero, so the default is always NONE. This means:

  1. If you’re deserializing an old Coffee message, sweetener will always be NONE. This makes sense, semantically, with the specific message we’re building here. When adding fields, be aware of this behavior and plan accordingly.
  2. If you’re serializing a Coffee message with NONE, that field will be omitted from the serialized data. Code generated from the older .proto will be able to read it and deal exclusively with the cream field, and all is well.
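The reading side of those two cases can be sketched the same way. Again, this is an illustration under my own assumptions rather than what the generated code looks like: read_varint, decode_coffee, and the SWEETENERS table are hypothetical names, and the decoder only understands varint fields.

```python
SWEETENERS = {0: "NONE", 1: "SUGAR", 2: "SUCRALOSE"}


def read_varint(data, pos):
    """Read one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_coffee(data):
    """Decode Coffee with the newer schema, starting from default values."""
    coffee = {"cream": False, "sweetener": "NONE"}  # defaults filled in first
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(data, pos)
        if field_number == 1:
            coffee["cream"] = bool(value)
        elif field_number == 2:
            coffee["sweetener"] = SWEETENERS.get(value, value)
    return coffee


# An old two-byte message: sweetener was never written, so it reads as NONE.
print(decode_coffee(bytes.fromhex("0801")))
# A newer message that carries field 2.
print(decode_coffee(bytes.fromhex("08011002")))
```

Because the dictionary starts out holding the defaults, a missing field and a field explicitly set to its default are indistinguishable to the caller.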

Unknown Fields

You’ll notice one thing we didn’t cover: what happens if we feed a newer Coffee message, with sweetener set to SUGAR or SUCRALOSE, to the older .proto that knows nothing about it?

Let’s try it, using the old cafe.proto and Coffee with cream (true) and SUCRALOSE.

% protoc --decode=cafe.Coffee \
         --proto_path=. \
         cafe.proto <coffee.protobuf
cream: true
2: 2

protoc didn’t complain, and neither will your generated deserializer. The unknown field doesn’t have a name, but it does still have its number.

If you wrote code around the original .proto file, you wouldn’t be aware of sweetener at all. Google’s protobuf library has methods to expose unknown fields to your code, but you won’t get any more information — just that you have a field numbered 2 whose value is 2.
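Here is a rough sketch of what that looks like from the older code’s point of view. It does not use the real library’s unknown-field API; decode_old_coffee is a hypothetical decoder that knows only field 1 and sets aside everything else as (field number, value) pairs.

```python
def read_varint(data, pos):
    """Read one base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_old_coffee(data):
    """Decode with a reader that only knows about field 1 (cream)."""
    cream = False
    unknown = []  # (field_number, raw_value) pairs we have no name for
    pos = 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = read_varint(data, pos)
        if field_number == 1:
            cream = bool(value)
        else:
            unknown.append((field_number, value))
    return cream, unknown


cream, unknown = decode_old_coffee(bytes.fromhex("08011002"))
print(cream)    # True
print(unknown)  # [(2, 2)]
```

The pair (2, 2) is everything the old reader has to go on: a field number and a raw value, with no name and no type.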

In fact, you won’t even know it’s an enum! Protobuf stores enums as their numeric value. It’s up to the deserializing code to make sense of that value. Scalar values are particularly tricky here, because the same bytes can be decoded many ways: a varint on the wire might be an unsigned integer, a bool, an enum, or a zigzag-encoded sint32. But I digress.
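As a small illustration of that ambiguity, the sketch below takes one raw varint value and reads it several ways. The interpretations helper and SWEETENERS table are hypothetical names made up for this post, and the list of candidate types is not exhaustive.

```python
SWEETENERS = {0: "NONE", 1: "SUGAR", 2: "SUCRALOSE"}


def zigzag_decode(n):
    """Undo the zigzag mapping used by sint32/sint64 (0, -1, 1, -2, ...)."""
    return (n >> 1) ^ -(n & 1)


def interpretations(raw):
    """Show a few ways one raw varint value could be understood."""
    return {
        "uint32": raw,                          # taken at face value
        "bool": raw != 0,                       # any nonzero value is true
        "sint32": zigzag_decode(raw),           # zigzag-encoded signed value
        "Sweetener": SWEETENERS.get(raw, raw),  # only if you have the schema
    }


for name, value in interpretations(2).items():
    print(name, value)
```

For the raw value 2 this gives 2, True, 1, and SUCRALOSE. Without the .proto, nothing on the wire tells you which reading is the right one.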

The upshot is that protocol buffers allow you to handle newer Coffee messages with older code, but you just won’t know anything about the sweetener request. Is that important for your application? Maybe, maybe not.

Next Time: Enforcing Predictability

This time, we talked about adding a field to an existing message, and what happens to the serialized version when you try to read it with older or newer code.

In the final post, I’ll circle back around to my original requirement: making sure that we aren’t handling data we don’t understand (because the data is newer than our code) or data that isn’t complete (because the data is older than our code).