Apache Avro is a data serialization system: high-performance middleware for exchanging data in a binary format. It provides:

- Rich data structures
- A compact, fast binary data format
- A container file format for persistent data storage
- Remote procedure call (RPC)
- Simple integration with dynamic languages. With dynamic languages, reading and writing data files and using the RPC protocol require no code generation; code generation is an optional optimization, worth implementing only for statically typed languages.
Avro has implementations in many programming languages (C, C++, C#, Java, Python, Ruby, PHP). It offers functionality similar to systems such as Thrift and Protocol Buffers, but differs in some fundamental ways, mainly:

- Dynamic typing: Avro does not require code generation. The schema is stored together with the data, and the schema alone is enough to process the data without generated code or static types. This makes it easier to build generic data-processing systems and language bindings.
- Untagged data: because the schema is known when data is read, very little type information needs to be encoded alongside the data, so the serialized output is smaller.
- No user-assigned field IDs: when a schema changes, both the old and new schemas are known at processing time, so differences are resolved by matching field names.
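The last point can be seen directly in Avro's Java API: a `GenericDatumReader` can be given both the writer's schema and the reader's schema, and it resolves the two by field name, filling fields added in the reader schema from their defaults. A minimal sketch (the schemas and the `age` field are made up for illustration):

```java
import java.io.ByteArrayOutputStream;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class SchemaEvolutionDemo {

    // "Old" schema: only a name field.
    static final Schema WRITER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"}]}");

    // "New" schema: adds an age field with a default value.
    static final Schema READER = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
      + "{\"name\":\"name\",\"type\":\"string\"},"
      + "{\"name\":\"age\",\"type\":\"int\",\"default\":-1}]}");

    public static GenericRecord readWithNewSchema() throws Exception {
        // Write a record using the old schema.
        GenericRecord old = new GenericData.Record(WRITER);
        old.put("name", "Alyssa");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder enc = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(WRITER).write(old, enc);
        enc.flush();

        // Read it back with both schemas: fields are matched by name,
        // and the missing "age" is filled from the reader schema's default.
        GenericDatumReader<GenericRecord> reader =
            new GenericDatumReader<GenericRecord>(WRITER, READER);
        return reader.read(null,
            DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(readWithNewSchema());
    }
}
```

Nothing in the serialized bytes identifies the fields; the resolution happens entirely at read time from the two schemas.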
Avro serializes data for large-volume data exchange, whether remote or local. Because the data is binary-encoded in transit, Avro saves both storage space and network bandwidth.

By analogy: a 100-square-meter house that originally holds 100 items is expected, through some technique, to hold 150 items or more in the same space. Cache memory is like that house: it is precious, so its limited space should be used to hold as much data as possible. Network bandwidth is likewise limited, and we want to push more traffic through the same pipe. This matters especially for transferring and storing structured data, and it is where Avro's value lies.
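Part of that saving comes from how Avro encodes primitive values: per the Avro specification, `int` and `long` values are written with variable-length zig-zag encoding, so small-magnitude numbers take one or two bytes instead of a fixed four or eight. A stdlib-only sketch of that encoding (not Avro's own implementation):

```java
import java.util.Arrays;

public class AvroVarint {
    // Zig-zag maps signed ints to unsigned codes (0, -1, 1, -2 -> 0, 1, 2, 3)
    // so that small-magnitude values get small codes.
    static int zigZag(int n) {
        return (n << 1) ^ (n >> 31);
    }

    // Variable-length base-128 encoding: 7 data bits per byte,
    // high bit set on every byte except the last.
    public static byte[] encodeInt(int n) {
        int v = zigZag(n);
        byte[] buf = new byte[5];
        int pos = 0;
        while ((v & ~0x7F) != 0) {
            buf[pos++] = (byte) ((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        buf[pos++] = (byte) v;
        return Arrays.copyOf(buf, pos);
    }

    public static void main(String[] args) {
        System.out.println(encodeInt(256).length); // 2 bytes instead of 4
        System.out.println(encodeInt(-1).length);  // 1 byte instead of 4
    }
}
```

A text format like JSON would spend extra bytes on the digits plus the field name for every record; Avro spends neither, because the schema travels separately.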
pom.xml
```xml
<dependencies>
  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.7.7</version>
  </dependency>
</dependencies>
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro-maven-plugin</artifactId>
      <version>1.7.7</version>
      <executions>
        <execution>
          <phase>generate-sources</phase>
          <goals>
            <goal>schema</goal>
          </goals>
          <configuration>
            <sourceDirectory>${project.basedir}/src/main/resources/</sourceDirectory>
            <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-compiler-plugin</artifactId>
      <configuration>
        <source>1.7</source>
        <target>1.7</target>
      </configuration>
    </plugin>
  </plugins>
</build>
```
The user.avsc file:

```json
{"namespace": "org.pq.avro",
 "type": "record",
 "name": "User",
 "fields": [
     {"name": "name", "type": "string"},
     {"name": "favorite_number", "type": ["int", "null"]},
     {"name": "favorite_color", "type": ["string", "null"]}
 ]
}
```
In the Maven project directory, run: `$ mvn clean compile`

This generates a User.java class under the org.pq.avro package (note the namespace in user.avsc).
Next, write a test class, Test.java:
```java
package org.pq.avro;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import java.io.File;
import java.io.IOException;

public class Test {
    public static void main(String[] args) throws IOException {
        // 1. Creating Users
        User u1 = new User();
        u1.setName("Alyssa");
        u1.setFavoriteNumber(256);

        User u2 = new User("Ben", 7, "red");

        User u3 = User.newBuilder()
                .setName("Charlie")
                .setFavoriteColor("blue")
                .setFavoriteNumber(null)
                .build();

        // 2. Now let's serialize our Users to disk
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);
        File file = new File("users.avro");
        dataFileWriter.create(u1.getSchema(), file);
        dataFileWriter.append(u1);
        dataFileWriter.append(u2);
        dataFileWriter.append(u3);
        dataFileWriter.close();

        // 3. Deserialize Users from disk
        DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
        DataFileReader<User> dataFileReader = new DataFileReader<User>(file, userDatumReader);
        User user = null;
        while (dataFileReader.hasNext()) {
            // Reuse user object by passing it to next(). This saves us from
            // allocating and garbage collecting many objects for files with
            // many items.
            user = dataFileReader.next(user);
            System.out.println(user);
        }
    }
}
```
Run Result:
```
{"name": "Alyssa", "favorite_number": 256, "favorite_color": null}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
{"name": "Charlie", "favorite_number": null, "favorite_color": "blue"}
```
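The data file above stores the schema and sync markers alongside the records. When the schema is shared out of band (for example, over RPC), a record can instead be serialized straight to a byte array with a `BinaryEncoder` and no file container. A sketch under the assumption that the generated `User` class from this tutorial is on the classpath (the `InMemoryDemo` class name is made up):

```java
package org.pq.avro;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;

public class InMemoryDemo {
    // Serialize a single User to raw Avro binary, with no file container.
    public static byte[] toBytes(User u) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new SpecificDatumWriter<User>(User.class).write(u, encoder);
        encoder.flush();
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        User u = new User("Ben", 7, "red");
        // The record body is just a handful of bytes: string lengths,
        // UTF-8 text, union branch indexes and zig-zag varints.
        System.out.println(toBytes(u).length + " bytes");
    }
}
```

This is the shape of payload Avro sends over the wire; both sides must already agree on the schema, since the bytes themselves carry no field names or type tags.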
Avro can also be used without code generation, working directly with GenericRecord (Test2.java). Note that the schema file and the data file must be kept separate; the original post reused one `File` variable for both, which would overwrite the schema:

```java
package org.pq.avro;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;

import java.io.File;
import java.io.IOException;
import java.net.URISyntaxException;

public class Test2 {
    public static void main(String[] args) throws IOException, URISyntaxException {
        // First, we use a Parser to read our schema definition and create a Schema object.
        File schemaFile = new File(Test2.class.getClassLoader().getResource("user.avsc").toURI());
        Schema schema = new Schema.Parser().parse(schemaFile);

        // Using this schema, let's create some users
        GenericRecord u1 = new GenericData.Record(schema);
        u1.put("name", "Alyssa");
        u1.put("favorite_number", 256);

        GenericRecord u2 = new GenericData.Record(schema);
        u2.put("name", "Ben");
        u2.put("favorite_number", 7);
        u2.put("favorite_color", "red");

        // Serialize u1 and u2 to disk
        File usersFile = new File("users.avro");
        DatumWriter<GenericRecord> datumWriter = new GenericDatumWriter<GenericRecord>(schema);
        DataFileWriter<GenericRecord> dataFileWriter = new DataFileWriter<GenericRecord>(datumWriter);
        dataFileWriter.create(schema, usersFile);
        dataFileWriter.append(u1);
        dataFileWriter.append(u2);
        dataFileWriter.close();

        // Deserialize users from disk
        DatumReader<GenericRecord> datumReader = new GenericDatumReader<GenericRecord>(schema);
        DataFileReader<GenericRecord> dataFileReader = new DataFileReader<GenericRecord>(usersFile, datumReader);
        GenericRecord user = null;
        while (dataFileReader.hasNext()) {
            // Reuse user object by passing it to next(). This saves us from
            // allocating and garbage collecting many objects for files with
            // many items.
            user = dataFileReader.next(user);
            System.out.println(user);
        }
    }
}
```
Run result:
```
{"name": "Alyssa", "favorite_number": 256, "favorite_color": null}
{"name": "Ben", "favorite_number": 7, "favorite_color": "red"}
```
https://avro.apache.org/docs/current/gettingstartedjava.html
http://www.javabloger.com/article/hadoop-avro-rpc-java.html
Original post: http://my.oschina.net/pengqiang/blog/531657