Protocol Buffer Design: Principles and Practices for Collaborative Development

22 points

1/20/1970

10 months ago

by benocodes

Comments

fifilura

One of the problems I have had when working with protobuf is that it is not favored by the Apache ecosystem. Apache is built around Avro instead.

So support for protobuf in Apache frameworks such as Flink and Iceberg is always late. In some cases way too late.

Another problem is around null values. I don't have time to do the full investigation, but I remember that a regression in proto3 left us unable to tell whether a field was absent because its value was 0 or because it was null. Someone else may recognize it. Or possibly we misunderstood something; I don't have the full details right now.

10 months ago

jiehong

This optional field issue seems to have been fixed in 3.15 [0], and seems documented [1].

[0]: https://github.com/protocolbuffers/protobuf/releases/tag/v3....

[1]: https://protobuf.dev/programming-guides/proto3/#field-labels
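
To make the presence distinction concrete, here is a minimal pure-Python sketch of how a single integer field at field number 1 would be written on the wire under the two modes. It is not the protobuf library: `serialize_implicit` and `serialize_explicit` are hypothetical helper names, and negative values are left out to keep the varint code short.

```python
from typing import Optional


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_int32_field(field_number: int, value: int) -> bytes:
    """Write the tag (wire type 0 = varint) followed by the value."""
    tag = (field_number << 3) | 0
    return encode_varint(tag) + encode_varint(value)


def serialize_implicit(value: int) -> bytes:
    """Implicit presence (plain proto3 scalar): the default 0 is not written."""
    return b"" if value == 0 else encode_int32_field(1, value)


def serialize_explicit(value: Optional[int]) -> bytes:
    """Explicit presence (`optional`): written whenever set, even when 0."""
    return b"" if value is None else encode_int32_field(1, value)
```

Under implicit presence, a value of 0 and an unset field both produce zero bytes, so a reader cannot tell them apart. With `optional`, the 0 is written as the two bytes 08 00 and an unset field writes nothing.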

10 months ago

fifilura

The issue is not being able to distinguish between a 0 value and a missing value.

It is just a huge regression in a data format. How could it happen?

Imagine building a data format for temperature measuring stations around that. Is it 0 degrees, or did the thermometer just skip reporting?

10 months ago

tveita

You can handle missing values if you plan for it when you design the API. As the article says, you can use the well-known wrapper types, or now mark fields as optional.
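
As a rough illustration of why a wrapper type carries presence, the following pure-Python sketch encodes by hand a field whose type is a wrapper message such as google.protobuf.Int32Value. The helper names are hypothetical, this is not the protobuf library, and only non-negative values are handled.

```python
from typing import Optional


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_wrapper_body(value: int) -> bytes:
    """Body of the wrapper message: one int32 `value` at field 1.

    The inner field has implicit presence, so 0 yields an empty body.
    """
    if value == 0:
        return b""
    return encode_varint((1 << 3) | 0) + encode_varint(value)


def encode_wrapped_field(field_number: int, value: Optional[int]) -> bytes:
    """Write the outer field (wire type 2 = length-delimited) if it is set."""
    if value is None:
        return b""  # wrapper absent: nothing on the wire
    body = encode_wrapper_body(value)
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(body)) + body
```

An absent wrapper writes nothing, while a wrapper holding 0 still writes its tag and a zero length (0a 00), so the two cases stay distinguishable.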

Protocol buffers include a set of rules to follow for backwards and forwards compatibility between servers and clients. I think these should be understood as part of the philosophy of protocol buffers. https://protobuf.dev/programming-guides/proto3/#updating

> Imagine building a data format for temperature measuring stations around that. Is it 0 degrees, or did the thermometer just skip reporting?

What does it mean for a thermometer to "skip reporting"? If that is something you intend to support you will need to define what it means and how the data should be read. Should a reader always check if the field is present? Should it crash?
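
One way to pin down what "skip reporting" means is to decide it in the reader's data model. A small Python sketch, assuming a hypothetical `Reading` record where `None` stands for "the station did not report" and the chosen policy is to exclude such readings rather than crash:

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Reading:
    station_id: str
    # None means the station sent no temperature; 0 is a real measurement.
    temperature_f: Optional[int] = None


def average_temperature(readings: Iterable[Reading]) -> Optional[float]:
    """Average only the reported temperatures; None if nothing was reported."""
    reported = [r.temperature_f for r in readings if r.temperature_f is not None]
    if not reported:
        return None
    return sum(reported) / len(reported)


# Station "b" skipped its report; station "a" really measured 0 degrees.
readings = [Reading("a", 0), Reading("b"), Reading("c", 10)]
average = average_temperature(readings)
```

Had the missing reading been collapsed to 0, the same three readings would average to about 3.3 instead of 5.0.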

What would typically happen is that you start out with a required field, because of course you will always have a temperature:

  required sint32 temperature = 1; // temperature in Fahrenheit

But soon enough you realize you need a richer format.

  required sint32 legacy_temperature = 1 [deprecated=true]; // temperature in Fahrenheit - DO NOT USE
  message Measurement { 
    optional Unit unit = 1;
    optional float value = 2;
  }
  optional Measurement temperature = 2;

- or you decide to allow submitting a batch of temperatures in each message, etc.

You configure new measurement stations to set both the old and the new format so they'll be compatible with old servers - maybe your customers are running servers on-prem. You mark legacy_temperature as optional in your server code to prepare for eventual deprecation (luckily in this case the server never sends temperatures back to measuring stations).

Finally all the customers have upgraded to the new server, and you ship new stations that don't set the previously "required" field. New measurement stations don't show up in the UI. A batch job that doesn't even look at the temperature field was compiled with the old definitions and is now crashing.
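
The failure mode at the end can be mimicked in a few lines of Python. This is a toy model, not the protobuf runtime: messages are plain dicts keyed by field number, and `parse` is a hypothetical stand-in for a proto2-style parser that rejects messages lacking a required field. The unit name and value in the sample message are made up.

```python
from typing import Any, Dict, Iterable, Set


class MissingRequiredField(ValueError):
    """Raised when a message lacks a field the reader's schema requires."""


def parse(fields: Dict[int, Any], required: Set[int]) -> Dict[int, Any]:
    """Accept the message only if every required field number is present."""
    missing = sorted(required.difference(fields))
    if missing:
        raise MissingRequiredField(f"missing required field(s): {missing}")
    return fields


# Field 1 = legacy_temperature, field 2 = temperature (the Measurement).
OLD_SCHEMA_REQUIRED = {1}    # reader compiled with the old definitions
NEW_SCHEMA_REQUIRED = set()  # reader that treats legacy_temperature as optional

# A new station sets only the new field (illustrative values).
new_station_message = {2: {"unit": "FAHRENHEIT", "value": 71.5}}


def count_messages(messages: Iterable[Dict[int, Any]], required: Set[int]) -> int:
    """A batch job that never reads the temperature, yet still has to parse."""
    return sum(1 for m in messages if parse(m, required) is not None)
```

With `NEW_SCHEMA_REQUIRED` the job counts the message; with `OLD_SCHEMA_REQUIRED` it raises `MissingRequiredField`, even though it never looks at the temperature.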

10 months ago

fifilura

Thank you for your thorough reply.

I am sorry I don't have time to consider all aspects of it.

But to me the problem seems to be what optional means; that meaning has been very unstable, in particular in the transition between proto2 and proto3.

Avro just seems like a more stable choice. Also, back to my main point: support for protobuf in Apache products is iffy.

10 months ago
