User-Defined Functions (UDFs)

    public abstract class EvalFunc<T> {
        public abstract T exec(Tuple input) throws IOException;
        public List<FuncSpec> getArgToFuncMapping() throws FrontendException;
        public Schema outputSchema(Schema input);
    }

The fields of the input tuple hold the expressions passed to the function, and the return type is generic; for a filter function it is Boolean. Where possible, declare the input and output types in getArgToFuncMapping() and outputSchema(), so that Pig can cast values to the expected types or filter out bad values whose types do not match.

    grunt> REGISTER pig-examples.jar;
    grunt> DEFINE isGood org.hadoopbook.pig.IsGoodQuality();

Load UDFs

    public abstract class LoadFunc {
        public abstract void setLocation(String location, Job job) throws IOException;
        public abstract InputFormat getInputFormat() throws IOException;
        public abstract void prepareToRead(RecordReader reader, PigSplit split) throws IOException;
        public abstract Tuple getNext() throws IOException;
    }

As in Hadoop, data loading in Pig happens before the mappers run, so it is important that the data can be split into parts that each mapper can process independently. Since Pig 0.7, the load and store function interfaces have been substantially reworked so that they align closely with Hadoop's InputFormat and OutputFormat classes.

    grunt> REGISTER loadfunc.jar;
    grunt> DEFINE customLoad org.hadoopbook.pig.loadfunc();
    grunt> records = LOAD 'input/sample.txt' USING customLoad('16-19, 88-92, 93-93') AS (year:int, temperature:int, quality:int);
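The load example above passes the string '16-19, 88-92, 93-93' to the custom loader, which presumably uses it to cut the year, temperature, and quality fields out of each fixed-width input line. The following is a minimal sketch of that cutting step in plain Java, not the author's implementation. The class name ColumnRangeCutter and its methods are hypothetical, and it assumes the ranges are 1-based and inclusive (as with the Unix cut command). It deliberately uses no Pig classes so that it can be compiled and run on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: cuts 1-based, inclusive column ranges out of a line. */
final class ColumnRangeCutter {

    /** Parses a spec such as "16-19, 88-92, 93-93" into {start, end} pairs. */
    static List<int[]> parseRanges(String spec) {
        List<int[]> ranges = new ArrayList<>();
        for (String part : spec.split(",")) {
            String[] bounds = part.trim().split("-");
            if (bounds.length != 2) {
                throw new IllegalArgumentException("Bad range: " + part);
            }
            int start = Integer.parseInt(bounds[0].trim());
            int end = Integer.parseInt(bounds[1].trim());
            if (start < 1 || end < start) {
                throw new IllegalArgumentException("Bad range: " + part);
            }
            ranges.add(new int[] {start, end});
        }
        return ranges;
    }

    /** Returns one field per range; a range past the end of the line yields null. */
    static List<String> cut(String line, String spec) {
        List<String> fields = new ArrayList<>();
        for (int[] range : parseRanges(spec)) {
            if (range[1] > line.length()) {
                fields.add(null); // assumption: treat a too-short line as a missing value
            } else {
                // Convert the 1-based inclusive range to substring's 0-based, end-exclusive form.
                fields.add(line.substring(range[0] - 1, range[1]));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(cut("abcdefghij", "1-3, 5-5, 9-10"));
    }
}
```

In a real LoadFunc, logic of this kind would typically run inside getNext(): read the next line from the RecordReader received in prepareToRead(), cut the fields, and wrap them in a Tuple. Pig then applies the AS (year:int, temperature:int, quality:int) schema to cast each field.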
Original article: http://blog.csdn.net/crxy2014/article/details/46049771