
[Hadoop] Avro, the Data Serialization System


Introduction to Avro

Avro is a data serialization system created by Doug Cutting (the father of Hadoop). It was designed to address a shortcoming of Hadoop's Writable types: they are not portable across languages. To support cross-language use, an Avro schema is independent of any particular language's type system. See the official documentation [1] for Avro's full feature set.

Avro files are read and written according to a schema. Typically the schema is written in JSON, while the data itself is encoded in a binary format, optionally compressed to reduce the amount of data transferred.

schema

A schema's field types fall into two groups:

  • Primitive types: null, boolean, int, long, float, double, bytes, and string
  • Complex types: record, enum, array, map, union, and fixed
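
As a quick sketch of how these compose (a made-up schema, not taken from [2]; the union ["null", "string"] makes nickname optional):

{
    "type": "record",
    "name": "Example",
    "fields": [
        {"name": "tags",     "type": {"type": "array", "items": "string"}},
        {"name": "counters", "type": {"type": "map", "values": "long"}},
        {"name": "status",   "type": {"type": "enum", "name": "Status", "symbols": ["OK", "ERROR"]}},
        {"name": "nickname", "type": ["null", "string"], "default": null}
    ]
}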

Of the complex types, record is the most commonly used. Take the twitter.avro file from [2] as an example; opening the file shows the following header:

Objavro.codecnullavro.schemaò{"type":"record","name":"twitter_schema","namespace":"com.miguno.avro","fields":[{"name":"username","type":"string","doc":"Name of the user account on Twitter.com"},{"name":"tweet","type":"string","doc":"The content of the user's Twitter message"},{"name":"timestamp","type":"long","doc":"Unix epoch time in milliseconds"}],"doc":"A basic schema for storing Twitter messages"}

Formatted, the schema reads:

{
    "type": "record",
    "name": "twitter_schema",
    "namespace": "com.miguno.avro",
    "fields": [
        {
            "name": "username", "type": "string",
            "doc": "Name of the user account on Twitter.com"
        },
        {
            "name": "tweet", "type": "string",
            "doc": "The content of the user‘s Twitter message"
        },
        {
            "name": "timestamp", "type": "long",
            "doc": "Unix epoch time in milliseconds"
        }
    ],
    "doc:": "A basic schema fostoring Twitter messages"
}

Here name is the name of the record or field, type specifies its type, and doc is a longer human-readable description of it.
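
These attributes can be read back from Java; a minimal sketch, assuming schemaJson holds the JSON text above:

import org.apache.avro.Schema;

Schema schema = new Schema.Parser().parse(schemaJson);
System.out.println(schema.getName());                                // twitter_schema
System.out.println(schema.getField("timestamp").schema().getType()); // LONG
System.out.println(schema.getField("timestamp").doc());              // Unix epoch time in milliseconds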

File layout

The figure in [3] describes the Avro file format in detail: a file consists of a header followed by a series of data blocks. The header holds the metadata plus a 16-byte sync marker; the metadata includes the codec and the schema. The codec names the compression applied to the data blocks, either null (no compression) or deflate. Deflate is the same algorithm gzip uses; in my own experience the compression ratio is better than 6x, though I have not measured this carefully.

(Figure: Avro file layout, from [3])

In fact every pair of adjacent data blocks is separated by a sync marker; see [4] for details. The sync marker enables file splitting and resynchronization during the MapReduce phase; indeed, Avro itself was designed with MapReduce in mind.
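
As for the codec: compression is chosen on the writer before the file is created. A minimal sketch, assuming schema has already been parsed (the level 6 is an arbitrary choice):

import java.io.File;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
writer.setCodec(CodecFactory.deflateCodec(6)); // deflate levels 1-9, as with gzip
writer.create(schema, new File("twitter-deflate.avro"));
// ... append records ...
writer.close();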

Header and DataBlock declarations

//org.apache.avro.file.DataFileStream.java

  public static final class Header {
    Schema schema;                                           // writer schema, from the metadata
    Map<String,byte[]> meta = new HashMap<String,byte[]>();  // e.g. avro.codec, avro.schema
    private transient List<String> metaKeyList = new ArrayList<String>();
    byte[] sync = new byte[DataFileConstants.SYNC_SIZE];     // byte[16]
    private Header() {}
  }

  static class DataBlock {
    private byte[] data;            // the block bytes as stored in the file
    private long numEntries;        // number of records in the block
    private int blockSize;
    private int offset = 0;
    private boolean flushOnWrite = true;
    private DataBlock(long numEntries, int blockSize) {
      this.data = new byte[blockSize];
      this.numEntries = numEntries;
      this.blockSize = blockSize;
    }
    // ...
  }

Test code

import java.util.List;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

DataFileReader<Void> reader =
        new DataFileReader<Void>(new FsInput(new Path("twitter.avro"), new Configuration()),
                                 new GenericDatumReader<Void>());
// print the schema
System.out.println(reader.getSchema().toString(true));

// print the metadata keys and the two standard entries
List<String> metaKeyList = reader.getMetaKeys();
System.out.println(metaKeyList.toString());
System.out.println(reader.getMetaString("avro.codec"));
System.out.println(reader.getMetaString("avro.schema"));

// print the block count
System.out.println(reader.getBlockCount());

// print the first record in the data blocks
System.out.println(reader.next());

As shown above, the metadata holds avro.codec and avro.schema.

Serialization and deserialization

The official site demonstrates two serialization approaches: specific and generic.

specific

// Serialize user1, user2 and user3 to disk
DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
dataFileWriter.create(user1.getSchema(), new File("users.avro"));
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.append(user3);
dataFileWriter.close();

// Deserialize Users from disk
DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
DataFileReader<User> dataFileReader =
        new DataFileReader<User>(new File("users.avro"), userDatumReader);
User user = null;
while (dataFileReader.hasNext()) {
    // Reuse the user object by passing it to next(). This saves us from
    // allocating and garbage collecting many objects for files with
    // many items.
    user = dataFileReader.next(user);
    System.out.println(user);
}

The specific approach relies on the generated User class: the schema is extracted from the class and used to parse the Avro data.
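
The User class itself is code generated from a schema file. With the avro-tools jar (see the tool list below), generation looks roughly like this; the jar name and paths are placeholders:

java -jar avro-tools.jar compile schema user.avsc ./generated-src/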

generic

// Parse the schema first (as in the official example)
Schema schema = new Schema.Parser().parse(new File("user.avsc"));

GenericRecord user1 = new GenericData.Record(schema);
user1.put("name", "Alyssa");
user1.put("favorite_number", 256);
// Leave favorite color null

GenericRecord user2 = new GenericData.Record(schema);
user2.put("name", "Ben");
user2.put("favorite_number", 7);
user2.put("favorite_color", "red");

// Serialize user1 and user2 to disk
File file = new File("users.avro");
DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
dataFileWriter.create(schema, file);
dataFileWriter.append(user1);
dataFileWriter.append(user2);
dataFileWriter.close();

The generic approach parses data against a schema obtained beforehand. Because an Avro file carries its schema in the header, the generic approach is the more common one in everyday parsing.
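
A minimal sketch of such a generic read: no schema is supplied, so the reader falls back on the one embedded in the file header.

import java.io.File;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(new File("users.avro"),
                                          new GenericDatumReader<GenericRecord>());
while (reader.hasNext()) {
    GenericRecord record = reader.next();
    System.out.println(record.get("name"));
}
reader.close();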

The avro-tools jar provides a rich set of operations on Avro files, including splitting them up, which is handy for producing test data; a few example invocations follow the tool list below.

Available tools:
      compile  Generates Java code for the given schema.
       concat  Concatenates avro files without re-compressing.
   fragtojson  Renders a binary-encoded Avro datum as JSON.
     fromjson  Reads JSON records and writes an Avro data file.
     fromtext  Imports a text file into an avro data file.
      getmeta  Prints out the metadata of an Avro data file.
    getschema  Prints out schema of an Avro data file.
          idl  Generates a JSON schema from an Avro IDL file
       induce  Induce schema/protocol from Java class/interface via reflection.
   jsontofrag  Renders a JSON-encoded Avro datum as binary.
      recodec  Alters the codec of a data file.
  rpcprotocol  Output the protocol of a RPC service
   rpcreceive  Opens an RPC Server and listens for one message.
      rpcsend  Sends a single RPC message.
       tether  Run a tethered mapreduce job.
       tojson  Dumps an Avro data file as JSON, one record per line.
       totext  Converts an Avro data file to a text file.
  trevni_meta  Dumps a Trevni file's metadata as JSON.
trevni_random  Create a Trevni file filled with random instances of a schema.
trevni_tojson  Dumps a Trevni file as JSON.
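
For example (the version number in the jar name is only illustrative):

java -jar avro-tools-1.7.7.jar getschema twitter.avro   # print the schema
java -jar avro-tools-1.7.7.jar getmeta twitter.avro     # print avro.codec, avro.schema
java -jar avro-tools-1.7.7.jar tojson twitter.avro      # dump the records as JSON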

References

Copyright notice: this is an original post by the author; please do not repost without permission.


Original article: http://blog.csdn.net/keyboardlabourer/article/details/48087775
