Avro — Data Format
Developers routinely transfer data over a network or write it to persistent storage.
A data format defines how that data is laid out when it is transferred or stored, so that the receiver can parse it correctly.
CSV, XML, and JSON are widely used data formats for data transfer. Java's serialization APIs store data in serialized form and deserialize it on access.
However, plain CSV, XML, and JSON do not by themselves serialize data compactly or validate it, while language-level serialization APIs have a much narrower scope, since they are tied to a single language. Avro is a data format that covers both needs.
Avro —
Apache Avro is a schema-based data serialization system. Avro serializes data into a compact binary format that can be deserialized by any application that has access to the schema.
Avro uses JSON to define its schema. Like a schema in a database, it describes the structure and types of the data being transferred.
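To make this concrete, here is a sketch of what a schema for a hypothetical User record might look like. The record name, namespace, and fields are made up for illustration. The snippet only uses Python's standard json module to load the schema and list its fields; it does not use an Avro library.

```python
import json

# Hypothetical Avro schema for a "User" record (all names are illustrative).
USER_SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "com.example",
  "doc": "A registered user",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""


def describe_fields(schema_json):
    """Return (field name, type, is_optional) for each field in the schema."""
    schema = json.loads(schema_json)
    described = []
    for field in schema["fields"]:
        field_type = field["type"]
        # A union that includes "null" is the usual way to mark a field optional.
        optional = isinstance(field_type, list) and "null" in field_type
        described.append((field["name"], field_type, optional))
    return described


if __name__ == "__main__":
    for name, field_type, optional in describe_fields(USER_SCHEMA_JSON):
        print(name, field_type, "optional" if optional else "required")
```

Because the schema is ordinary JSON, any language with a JSON parser can read it, and the "doc" attribute gives a place for documentation.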
Basic Features-
Avro provides -
* Data Structures (Schema)
* Binary data format
* Container file format to store persistent data
* RPC capabilities
Why Avro?
To understand this, let's go through some of the drawbacks of the widely used data formats.
CSV- CSV is very easy to read and parse, but it leaves the reader to assume the data type of each element and does not even specify whether an element is required.
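A small sketch of that problem, using Python's standard csv module on a made-up payload (the column names and values are assumptions for the example): every value comes back as a string, and nothing in the file says whether an empty email is allowed.

```python
import csv
import io

# Hypothetical CSV payload: nothing in the file says which columns are numbers
# or whether "email" is allowed to be empty.
CSV_TEXT = "id,name,email\n1,Alice,alice@example.com\n2,Bob,\n"


def read_rows(text):
    """Parse CSV text into a list of dicts, using the header row as keys."""
    return list(csv.DictReader(io.StringIO(text)))


rows = read_rows(CSV_TEXT)

if __name__ == "__main__":
    for row in rows:
        # Every value is a str; the reader has to guess that "id" is numeric.
        print({key: (value, type(value).__name__) for key, value in row.items()})
```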
XML- XML supports schemas, but it is verbose and heavyweight, and hence not well suited to data streaming.
JSON- JSON is available in practically every language, each with its own JSON parser. JSON can take any form and is easily shared over a network. However, JSON has no native schema support, and JSON payloads can become large because the same keys are repeated in every record. It also doesn't contain any metadata or documentation.
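The cost of repeated keys can be seen with a rough size comparison. This is only a sketch: the records are generated for the example, and the "values only" layout is not Avro, just the same data with the keys left out.

```python
import json


def make_records(count):
    """Build hypothetical records, used only for the size comparison."""
    return [{"id": i, "name": "user%d" % i, "active": True} for i in range(count)]


def json_sizes(records):
    """Return (bytes with keys repeated per record, bytes with values only)."""
    with_keys = json.dumps(records).encode("utf-8")
    values_only = json.dumps([list(r.values()) for r in records]).encode("utf-8")
    return len(with_keys), len(values_only)


if __name__ == "__main__":
    with_keys, values_only = json_sizes(make_records(1000))
    print("with keys:", with_keys, "bytes")
    print("values only:", values_only, "bytes")
```

The gap grows with the number of records, since each one carries its own copy of every key.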
Hence, none of these is a good choice when it comes to data streaming.
AVRO- Avro is widely used in the Big Data community. It made its name as a fast data serialization format after it was adopted in the Confluent Schema Registry.
Avro owes most of its strengths to its schema.
In the schema, we can define the data type of each element and specify whether it is required or not. The format is lightweight and very compact but not human-readable; Avro tools are needed to render it in a human-readable form.
There is no duplication of keys, and the schema can evolve along with the data.
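To see why the binary form is compact and carries no keys, here is a minimal hand-written encoder for a record with only long and string fields. It follows the Avro binary encoding rules for those two types (zigzag varints for longs, a length prefix plus UTF-8 bytes for strings, fields written in schema order). It is a sketch for illustration, not a replacement for an Avro library, and the function names are my own.

```python
import json


def encode_long(value):
    """Encode an integer as a zigzag varint, as Avro does for int and long."""
    zigzag = (value << 1) ^ (value >> 63)
    out = bytearray()
    while True:
        byte = zigzag & 0x7F
        zigzag >>= 7
        if zigzag:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(text):
    """Encode a string as its byte length (a long) followed by UTF-8 bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data


# Only the two types used in this example are supported.
ENCODERS = {"long": encode_long, "string": encode_string}


def encode_record(schema, record):
    """Concatenate the encoded field values in schema order; no keys are written."""
    return b"".join(
        ENCODERS[field["type"]](record[field["name"]]) for field in schema["fields"]
    )


SCHEMA = {
    "type": "record",
    "name": "Test",
    "fields": [{"name": "a", "type": "long"}, {"name": "b", "type": "string"}],
}
RECORD = {"a": 27, "b": "foo"}

encoded = encode_record(SCHEMA, RECORD)

if __name__ == "__main__":
    print(encoded.hex())  # 3606666f6f
    print(len(encoded), "bytes vs", len(json.dumps(RECORD)), "bytes as JSON")
```

The field names live only in the schema, so the reader needs that schema to make sense of the bytes. This is also why the output is not human-readable on its own.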
These capabilities have made Avro a favourite choice for data streaming and RPC.