Using Protobuf, Part 3: Building Robust Code

In the first part of this series, I introduced Google’s protocol buffers, a.k.a. “protobuf.” In the second part, I talked about forward and backward compatibility, defaults, and unknown fields. Now it’s time to circle back to what led me down this path in the first place: given a system that produces and consumes serialized protobuf messages, how can we be assured that our code is equipped to handle messages that may be older or newer than the schema it was built against?

The “Unknown Fields” Canary

The system I was working with produced a lot of protobuf messages. And, like any system under development, those messages grew new fields and replaced existing ones.

We decided we would error if we didn’t think we could handle an incoming message. One case was easy: the message we read in had more fields than we knew what to do with. Protobuf decoders expose unknown fields, so we could just check if there were any.
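Here’s a minimal sketch of that check in Go, assuming the google.golang.org/protobuf module; the well-known Duration and Empty types stand in for a newer producer and an older consumer:

```go
package main

import (
	"fmt"
	"time"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/durationpb"
	"google.golang.org/protobuf/types/known/emptypb"
)

// hasUnknownFields reports whether the decoder retained bytes it could not
// map to a known field. Note this inspects only the top-level message;
// nested messages keep their own unknown-field buffers.
func hasUnknownFields(m proto.Message) bool {
	return len(m.ProtoReflect().GetUnknown()) > 0
}

func main() {
	// Simulate a newer producer: Duration carries a field that Empty lacks.
	raw, _ := proto.Marshal(durationpb.New(5 * time.Second))

	var msg emptypb.Empty
	_ = proto.Unmarshal(raw, &msg)

	fmt.Println(hasUnknownFields(&msg)) // true: field 1 was not recognized
}
```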

But not so fast: recall from part two that proto3 omits scalar fields whose values match their defaults when serializing.

So if 0 is a valid value for a field, we simply cannot tell whether the producer explicitly set it to 0 or never set it at all. In practice, this means a new field holding its default value never reaches the wire, so our unknown-field check cannot see it.
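A quick illustration of the blind spot, again using a well-known type rather than our actual schema:

```go
package main

import (
	"bytes"
	"fmt"

	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/durationpb"
)

func main() {
	// proto3 omits scalar fields that hold their default value, so a
	// Duration explicitly set to 0 seconds encodes to the same bytes
	// (none at all) as a Duration that was never populated.
	explicit, _ := proto.Marshal(&durationpb.Duration{Seconds: 0})
	unset, _ := proto.Marshal(&durationpb.Duration{})
	fmt.Println(bytes.Equal(explicit, unset)) // true: the 0 was never written
}
```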

Regardless, the unknown field test is a good start. We just need to understand its limitations.

Validation

Checking for unknown fields doesn’t solve the opposite problem at all: what if we need a field to be set, but the upstream protobuf message producer doesn’t know about it?

Default values strike again. An upstream that doesn’t know about the field naturally won’t set it, but downstream a missing field looks exactly like one explicitly set to its default, so we can’t tell the difference.

We can design these new fields so that their default values are invalid, then leverage a library like PGV (protoc-gen-validate) to reject non-conforming incoming messages.
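As a sketch of what that can look like (the message and rules here are invented, not our real schema):

```proto
syntax = "proto3";

package example;

import "validate/validate.proto";

message Order {
  // 0 is the proto3 default for uint64, so requiring id > 0 means an
  // unset field fails validation just like a bad one.
  uint64 id = 1 [(validate.rules).uint64.gt = 0];

  // The empty string is the default, so demand at least one character.
  string customer = 2 [(validate.rules).string.min_len = 1];
}
```

In Go, PGV generates a Validate() method on each message, so the consumer can call it on arrival and reject anything that fails before doing real work.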

This doesn’t solve the problem for fields that can legitimately be 0 or an empty string. But it flags other concerns, including out-of-bounds values, so it’s valuable in its own right.

Other Options

There are, of course, other approaches to validation. Some things we tossed around:

  • A field carrying a message version number, incremented every time we changed the .proto. We’d have to maintain that number by hand, and it wouldn’t guarantee that the rest of the message had our expected fields set, of course. But it would get us closer (see the sketch after this list).
  • Hacking around with wrapper messages, or abusing repeated fields, to force a field to be present even when it holds what would otherwise be a default value. Message-typed fields have explicit presence, so this works. Unfortunately, it also made our messages very unwieldy.
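Here is a rough sketch of the version-number idea from the first bullet; the constant and function are hypothetical, not part of any protobuf API:

```go
package main

import "fmt"

// minSupportedVersion is a hypothetical threshold the consumer bumps
// whenever it drops support for older message schemas.
const minSupportedVersion = 3

// checkSchemaVersion sketches the idea: the producer stamps each message
// with a hand-maintained schema version, and the consumer rejects anything
// older than it knows how to handle.
func checkSchemaVersion(got uint32) error {
	if got < minSupportedVersion {
		return fmt.Errorf("schema version %d is below the supported minimum %d", got, minSupportedVersion)
	}
	return nil
}

func main() {
	fmt.Println(checkSchemaVersion(2)) // rejected: below minimum
	fmt.Println(checkSchemaVersion(4)) // <nil>
}
```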

Weighing all the limitations of this system against the data our fields actually carried, we decided we were safe with a combination of the unknown field test and validation. But you may well decide something different for your project.

Upgrading Carefully

In order for all this to work, we also had to carefully follow rules for upgrading protobuf messages. They’re pretty simple, but critical:

  1. Never re-use an old field number, even one belonging to a deprecated field. A consumer still compiled with the old definition will decode the new data against it and can end up initialized unpredictably, particularly if the new field has a different data type.
  2. If you change the type of a field, deprecate the old field and add the new one under a new field number. Some conversions are wire-compatible: because protobuf varint-encodes integers, a 32-bit int can safely be read as a 64-bit int. But a 64-bit value that doesn’t fit in 32 bits cannot go the other way; it gets truncated to its low 32 bits, which is almost certainly useless. The sketch after this list shows both rules in practice.
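Here’s what those rules look like in a .proto file, using an invented message; reserved statements make protoc refuse any accidental reuse of a retired number or name:

```proto
syntax = "proto3";

package example;

message Reading {
  // Rule 2: "value" used to be an int32 at field number 2. Rather than
  // change its type in place, we retired the field...
  reserved 2;
  reserved "value";

  string sensor_id = 1;

  // ...and re-added it as an int64 under a fresh number (rule 1: the
  // old number 2 is never reused).
  int64 value_v2 = 3;
}
```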

Finally, where mistakes had already been made, the best solution in our case was to deprecate the old problematic fields, reserve their numbers, and move on.

Protobuf Messages: The End

We were pretty confident we had done our best to make this system work well within the constraints of protobuf, without sacrificing its advantages: speed and small size, both critical for the large volume of data we were working with.

You could, of course, argue (and we considered it) that protobuf wasn’t the right fit for our needs. But the advantages above were so important that we stuck with it.

Our setup did error appropriately when upstream and downstream fell out of sync. Given the default-value blind spot, we couldn’t get to 100%, but we were confident we’d done well.

I’d love to hear if you have worked in a similar system with a similar requirement, and what you may have done to address this issue. And regardless, I hope you’ve learned something interesting!