Avro is a remote procedure call and serialization framework developed as part of the Apache Hadoop project. Avro defines a data format designed to support data-intensive applications, and provides support for this format in a variety of programming languages.
It uses JSON for defining data types and protocols, and serializes data in a compact binary format. Its primary use is in Apache Hadoop, where it can provide both a serialization format for persistent data and a wire format for communication between Hadoop nodes, and from client programs to Hadoop services.
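As an illustration of schemas being plain JSON, a minimal Avro record schema might look like the following (the record and field names here are hypothetical, chosen only for the example):

```json
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "id",   "type": "long"},
    {"name": "name", "type": "string"}
  ]
}
```

Because the schema is ordinary JSON, it can be stored, versioned, and exchanged like any other configuration artifact.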
Avro provides functionality similar to other marshalling systems such as Apache Thrift and Protocol Buffers. The main differentiators of Avro include:
- Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static data types, etc.
- Untagged data: Since the schema is present when data is read, considerably less type information needs to be encoded with the data, resulting in smaller serialized output.
- No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
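The "untagged" point above can be seen in the wire format itself. The following is a minimal Python sketch of the zig-zag/variable-length integer encoding defined in the Avro specification; the two-field record layout (a `long` followed by a `string`) assumes a hypothetical schema chosen only for illustration:

```python
def zigzag(n: int) -> int:
    # Avro maps signed longs to unsigned ints so small magnitudes encode small
    return (n << 1) ^ (n >> 63)

def encode_long(n: int) -> bytes:
    # Variable-length encoding: 7 bits per byte, high bit = continuation
    n = zigzag(n)
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_string(s: str) -> bytes:
    # A string is its length (as a long) followed by UTF-8 bytes
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

# A record is just its fields' encodings concatenated in schema order:
# no field names or type tags appear anywhere in the output.
record = encode_long(42) + encode_string("Alice")  # 7 bytes total
```

Note that the reader recovers the field names and types from the schema, not from the bytes, which is why both sides must agree on (or resolve between) schemas.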
Schema Evolution: Avro's schema resolution lets readers and writers use different versions of a schema, which allows for building more loosely coupled and more robust systems. This flexibility is valuable for rapidly evolving protocols such as OpenRTB.
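For example, a newer schema can add a field with a default value, and records written with the older schema still resolve against it (the schema and field names below are hypothetical):

```json
{
  "type": "record",
  "name": "Bid",
  "fields": [
    {"name": "price",    "type": "double"},
    {"name": "currency", "type": "string", "default": "USD"}
  ]
}
```

When a reader using this schema processes data written without `currency`, schema resolution fills in the declared default rather than failing.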
Untagged Data: Avro supports two encodings when serializing data: binary and JSON. In an Avro data file, the schema is stored in the file header, so readers need no out-of-band information to decode the contents. Another interesting point is that the same schema can also be used to encode and decode the data as JSON.
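As a sketch of what the JSON encoding looks like, a record is rendered as a JSON object keyed by field name; non-null union values are wrapped in an object keyed by the branch's type name (the record, field names, and values here are hypothetical):

```json
{"id": 42, "email": {"string": "alice@example.com"}}
```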
Dynamic Typing: The key abstraction is GenericData.Record, essentially a set of name-value pairs where the name is a field name and the value is one of the Avro-supported value types.
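The name-value abstraction can be modeled in a few lines. The following is a toy Python sketch of the idea behind GenericData.Record, not the real Avro API: field names come from a schema, and values are set and read by name with no generated classes involved:

```python
class ToyRecord:
    """Toy model of Avro's GenericData.Record abstraction:
    name-value pairs validated against a schema's field names.
    (Illustrative only; not the Avro library API.)"""

    def __init__(self, field_names):
        self._fields = set(field_names)
        self._values = {}

    def put(self, name, value):
        # Reject names the schema does not declare
        if name not in self._fields:
            raise KeyError(f"no such field: {name}")
        self._values[name] = value

    def get(self, name):
        return self._values.get(name)

# Field names are supplied at runtime, e.g. parsed from a schema file
rec = ToyRecord(["id", "name"])
rec.put("id", 42)
rec.put("name", "Alice")
```

The real GenericData.Record is constructed from a parsed Schema object in the same spirit, which is what makes full processing possible without code generation.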
In practice, JSON serialization can be used for debugging, when data volume is low, or when we simply want to (ab)use Avro as a general JSON-serialization layer. For large-volume data processing and archival, however, the binary format is the preferred option, because JSON serialization adds a significant size overhead.
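To make that overhead concrete: in JSON the field names are repeated in every serialized record, while Avro's binary encoding carries only the values. A quick stdlib-only measurement (record shape hypothetical):

```python
import json

# Every JSON-encoded record repeats its field names as string keys
record = {"id": 42, "name": "Alice"}
encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
# 24 bytes here, versus a handful of bytes for the values alone in
# Avro's binary encoding, where the schema supplies the names.
```

Across millions of records, that per-record key overhead dominates, which is why binary is the default choice for processing and archival.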