I love choices! But there are a lot of binary serialization formats out there. I recently surveyed them for a project with fairly loose requirements for transferring text and binary messages between servers and embedded devices.
The result was a list of most of the available binary serialization formats and libraries, with comments on each. I’ll spoil the ending and say that for the project in question, we chose MessagePack. But along the way I learned how little of this list I knew before I really dug into the task. I’m certain the research will come in handy some day when I need to make the same choice with different constraints, and maybe it can help you make your own choices, or at least know where to start.
The list isn’t in any particular order, just organized into three categories. “Dynamically Typed” refers to the library interface more than the format: it means that, like JSON or YAML, you can just shovel objects, Hashes, strings, or whatever into the serializer and it figures out types, sans schema. This is in contrast to what I call “Statically Typed”, which refers to formats where the messages must be described with a grammar or schema, and must conform to that before serialization is possible. This is a little loose; some (Avro, MessagePack) straddle the boundaries and not all of these projects fit snugly into their categories. The final category is orthogonal: it includes the formats that I did some research on but that were not really in the race for some reason I describe. They’re here mostly for completeness. Lastly, I list some of my many starting points and sources at the end of the post.
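To make the two categories concrete, here is a minimal sketch using only the Python standard library. It is an illustration of the distinction, not any of the formats in this list: `json` stands in for a dynamically typed interface, and a fixed `struct` layout stands in for a schema. The names `READING_LAYOUT`, `encode_reading`, and `decode_reading` are made up for this example.

```python
import json
import struct

# Dynamic: the serializer inspects the value at runtime; no schema is declared.
message = {"id": 7, "temp": 21.5, "tags": ["a", "b"]}
dynamic_bytes = json.dumps(message).encode("utf-8")

# Static: a layout agreed on in advance plays the role of the schema.
# "<If" means little-endian: unsigned 32-bit id, then a 32-bit float.
READING_LAYOUT = struct.Struct("<If")


def encode_reading(reading_id, temp):
    """Pack a reading; values that do not fit the layout raise struct.error."""
    return READING_LAYOUT.pack(reading_id, temp)


def decode_reading(data):
    """Unpack a reading; the field names live in the code, not in the bytes."""
    reading_id, temp = READING_LAYOUT.unpack(data)
    return {"id": reading_id, "temp": temp}


static_bytes = encode_reading(7, 21.5)
print(len(dynamic_bytes), len(static_bytes))
```

The dynamic side accepts whatever shape you hand it and carries the field names in every message. The static side rejects anything that does not match the layout, and in exchange the bytes carry only the values.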
Originally from Erlang. Specification written by Tom Preston-Werner, founder of GitHub, and used heavily there. C++, Ruby, and JS libraries available, among others. In some simple Ruby benchmarking, I noticed that BERT was an order of magnitude slower at serialization than BSON or MessagePack. This may not be true in other language ports, however.
MessagePack supports some optional features such as RPC and IDL.
Part of MongoDB. “Lightweight, traversable, and efficient.” BSON has a huge number of implementations. Not as JSON-compatible as MessagePack, but probably much more widely used. The key advantage is its traversability, which makes it suitable for storage purposes, but comes at the cost of over-the-wire encoding size.
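The size-versus-traversability trade-off can be seen by encoding the same tiny document by hand. The sketch below covers only a small subset of each format (BSON string and int32 elements; MessagePack `fixmap` and `fixstr`), and the function names are made up for this example. A real project would use an existing library.

```python
import struct


def bson_document(pairs):
    """Encode a flat dict of str -> str or int32 values as a BSON document."""
    body = b""
    for key, value in pairs.items():
        name = key.encode("utf-8") + b"\x00"  # element names are C strings
        if isinstance(value, str):
            raw = value.encode("utf-8") + b"\x00"
            # Type 0x02: int32 byte count (NUL included), then the bytes.
            body += b"\x02" + name + struct.pack("<i", len(raw)) + raw
        elif isinstance(value, int) and not isinstance(value, bool):
            # Type 0x10: 32-bit signed little-endian integer.
            body += b"\x10" + name + struct.pack("<i", value)
        else:
            raise TypeError("this sketch only handles str and int32 values")
    # The leading int32 counts itself (4 bytes) and the trailing NUL (1 byte).
    return struct.pack("<i", len(body) + 5) + body + b"\x00"


def msgpack_small_str_map(pairs):
    """Encode a small dict of short str -> short str as a MessagePack map."""
    if len(pairs) > 15:
        raise ValueError("fixmap holds at most 15 entries")
    out = bytes([0x80 | len(pairs)])  # fixmap header: 1000xxxx
    for key, value in pairs.items():
        for text in (key, value):
            raw = text.encode("utf-8")
            if len(raw) > 31:
                raise ValueError("fixstr holds at most 31 bytes")
            out += bytes([0xA0 | len(raw)]) + raw  # fixstr header: 101xxxxx
    return out


def skip_bson_document(buffer, offset):
    """Return the offset just past the BSON document starting at `offset`."""
    (size,) = struct.unpack_from("<i", buffer, offset)
    return offset + size


doc = {"hello": "world"}
bson_bytes = bson_document(doc)
msgpack_bytes = msgpack_small_str_map(doc)
print(len(bson_bytes), len(msgpack_bytes))
```

For `{"hello": "world"}` this gives 22 bytes of BSON against 13 bytes of MessagePack. The extra bytes are mostly length prefixes and terminators, which are what let `skip_bson_document` jump over a whole document without parsing its contents.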
Sort of a hybrid, Avro uses schemas but embeds them into messages. Furthermore, schemas are specified using JSON! Comparably speedy, Avro might be a good choice if the rigidity of Protocol Buffers or Thrift is too much for you.
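As a rough idea of what a JSON schema for Avro looks like, here is a sketch of a record schema built as a plain Python dict and dumped to JSON. The record and field names (`DeviceMessage`, `device_id`, `label`, `payload`) are hypothetical, chosen for this example. Only the standard `json` module is used, and handing the schema to an actual Avro library is not shown.

```python
import json

# Hypothetical record schema for a device message; the names are made up.
# "record" is Avro's struct-like type; "long", "string", and "bytes" are
# Avro primitive types.
message_schema = {
    "type": "record",
    "name": "DeviceMessage",
    "fields": [
        {"name": "device_id", "type": "long"},
        {"name": "label", "type": "string"},
        {"name": "payload", "type": "bytes"},
    ],
}

# The schema is ordinary JSON, so any JSON tool can read or generate it.
schema_json = json.dumps(message_schema, indent=2)
print(schema_json)
```

Because the schema is just data, it can travel alongside the messages it describes, which is where the hybrid feel comes from.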
Invented at Google, used heavily there for a decade. Like the majority of items in this list, does not provide an RPC mechanism but instead focuses on interchange protocols. Widely implemented, though not all implementations are of the same quality or completeness.
I was particularly interested to see the embedded-specific C port.
Built by Facebook and now lives at Apache. Thrift’s goal is “to enable efficient and reliable communication across programming languages”. Solving many aspects of cross-platform services, it generates RPC code for clients and servers, providing a compact, deterministic, and versionable interchange protocol.
- bencoding – Simple and widely implemented but pretty specific to BitTorrent’s needs. Seems a little squishy between implementations.
- Gobs – Pretty specific to the Go language. Intro article, C lib
- XDR – somewhat archaic, seems focused on RPC
- ASN.1 – Overcomplicated grammar-grammar, aging toolset. First standardized when I was a toddler. Still used in telecoms and, interestingly, in the Ruby stdlib.
- EBML – Billed as a binary extension to XML, but doesn’t support schema. Few libraries.
- SDXF – couldn’t find any supporting libraries, 16 MB limit
- Boost Serialization – C++-specific
- BISON – Seldom used, doesn’t appear to be a serious project. Critiqued here.
- Binary XML, Fast Infoset, EXI – focused on serializing XML, so too much baggage for my purpose
- OGDL – primarily meant for representation of graphs