grunt> cat /opt/dataset/input.txt
keyword1 keyword2
keyword2 keyword4
keyword3 keyword1
keyword4 keyword4

A = LOAD '/opt/dataset/input.txt' using PigStorage('\n') as (line:chararray);
B = foreach A generate TOKENIZE((chararray)$0);
C = foreach B generate flatten($0) as word;
D = group C by word;
E = foreach D generate COUNT(C), group;

dump B;
({(keyword1),(keyword2)})
({(keyword2),(keyword4)})
({(keyword3),(keyword1)})
({(keyword4),(keyword4)})

dump C;
(keyword1)
(keyword2)
(keyword2)
(keyword4)
(keyword3)
(keyword1)
(keyword4)
(keyword4)

dump D;
(keyword1,{(keyword1),(keyword1)})
(keyword2,{(keyword2),(keyword2)})
(keyword3,{(keyword3)})
(keyword4,{(keyword4),(keyword4),(keyword4)})

dump E;
(2,keyword1)
(2,keyword2)
(1,keyword3)
(3,keyword4)

store E into './wordcount';
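To make it easier to follow what each relation holds, here is a rough Python model of the same dataflow. It is only an illustration and not how Pig executes the script: the names `A` through `E` mirror the relations above, bags are modelled as plain lists, and the groups are sorted by key only so the output lines up with the `dump E` listing (Pig itself makes no such ordering promise here).

```python
from collections import defaultdict

# Contents of /opt/dataset/input.txt as shown in the session above.
lines = [
    "keyword1 keyword2",
    "keyword2 keyword4",
    "keyword3 keyword1",
    "keyword4 keyword4",
]

# A: one record per input line.
A = list(lines)

# B: TOKENIZE turns each line into a bag of words (modelled as a list;
# whitespace splitting is enough for this particular input).
B = [line.split() for line in A]

# C: FLATTEN un-nests every bag so each word becomes its own record.
C = [word for bag in B for word in bag]

# D: GROUP ... BY word collects identical words under one key.
groups = defaultdict(list)
for word in C:
    groups[word].append(word)
D = sorted(groups.items())  # sorted only to match the listing above

# E: COUNT(C) is the size of each group's bag, emitted before the key.
E = [(len(bag), word) for word, bag in D]

for row in E:
    print(row)
```

The point of the intermediate `B` and `C` steps is that TOKENIZE returns one bag per line, so the bags have to be flattened before grouping can see individual words.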
TOKENIZE

Splits a string and outputs a bag of words.

Syntax

TOKENIZE(expression)

Terms

expression: An expression with data type chararray.

Usage

Use the TOKENIZE function to split a string of words (all words in a single tuple) into a bag of words (each word in a single tuple). The following characters are considered to be word separators: space, double quote ("), comma (,), parentheses (()), star (*).

Example

In this example the strings in each row are split.

A = LOAD 'data' AS (f1:chararray);
DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)

X = FOREACH A GENERATE TOKENIZE(f1);
DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})
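The separator rule is easy to misread, so here is a small Python sketch of it. It assumes that only the five characters listed above act as separators and that runs of them never produce empty words; the helper name `tokenize` is hypothetical and is not part of Pig.

```python
import re

# Separator characters taken from the usage note above:
# space, double quote, comma, parentheses, star.
SEPARATORS = ' ",()*'
_SPLIT = re.compile("[" + re.escape(SEPARATORS) + "]+")


def tokenize(text):
    """Split text on the listed separators and drop empty pieces."""
    return [tok for tok in _SPLIT.split(text) if tok]


print(tokenize("Here is the first string."))
print(tokenize('a,b (c) "d" e*f'))
```

Note that the period is not in the separator list, which is why `string.` keeps its trailing dot in the documentation example.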
pig WordCount analysis
Original article: http://blog.csdn.net/xiewenbo/article/details/25047375