Method 1: DISTINCT first, then count
- DISTINCT removes duplicate tuples across the whole relation
Method 2: group first
- use a nested FOREACH
- apply DISTINCT inside it
First, create a source data file:
[hadoop@hadoop1 ~]$ cat score.txt
James,Network,Tiger,100
James,Database,Tiger,99
James,PDE,Yao,95
Vincent,Network,Tiger,95
Vincent,PDE,Yao,98
Vincent,PDE,
NocWei,PDE,Yao,100
[hadoop@hadoop1 ~]$ hadoop fs -put score.txt /
Method 1: DISTINCT first, then count
[hadoop@hadoop1 ~]$ pig
grunt> A = LOAD '/score.txt' USING PigStorage(',') AS (student,course,teacher,score:int);
grunt> DESCRIBE A;
grunt> B = FOREACH A GENERATE student, teacher; -- keep only student and teacher, drop the rest
grunt> DESCRIBE B; -- B's schema now has only two fields
grunt> C = DISTINCT B; -- remove duplicate (student, teacher) tuples from B
grunt> D = FOREACH ( GROUP C BY student ) GENERATE group AS student , COUNT(C);
grunt> DUMP D; -- result
(James,2)
(NocWei,1)
(Vincent,3)
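To make the logic of Method 1 easier to follow, here is a minimal Python sketch of the same three steps (project, de-duplicate, count per student). It is not Pig code and only mimics the semantics; it assumes the empty teacher and score fields of the short "Vincent,PDE," row are loaded as null, which is modelled here as None. The variable names are made up for illustration.

```python
# Rows mirror score.txt; None stands in for the empty teacher/score fields (assumption).
rows = [
    ("James", "Network", "Tiger", 100),
    ("James", "Database", "Tiger", 99),
    ("James", "PDE", "Yao", 95),
    ("Vincent", "Network", "Tiger", 95),
    ("Vincent", "PDE", "Yao", 98),
    ("Vincent", "PDE", None, None),
    ("NocWei", "PDE", "Yao", 100),
]

# B = FOREACH A GENERATE student, teacher;
pairs = [(student, teacher) for student, _course, teacher, _score in rows]

# C = DISTINCT B;  (de-duplicates whole tuples, not single columns)
distinct_pairs = set(pairs)

# D = FOREACH (GROUP C BY student) GENERATE group AS student, COUNT(C);
counts = {}
for student, _teacher in distinct_pairs:
    counts[student] = counts.get(student, 0) + 1

print(sorted(counts.items()))
# [('James', 2), ('NocWei', 1), ('Vincent', 3)]
```

Note that Vincent's count of 3 includes the (Vincent, null) pair: each counted tuple starts with the student name, so the missing teacher does not stop the tuple from being counted.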
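Method 2: group first, then apply DISTINCT inside a nested FOREACH
grunt> E = GROUP B BY student;
grunt> F = FOREACH E
>> {
>> T = B.teacher; -- project the teacher column of each student's bag
>> uniq = DISTINCT T; -- de-duplicate the teachers within the group
>> GENERATE group AS student, COUNT(uniq) AS cnt;
>> }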
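The nested FOREACH can be pictured as a small function applied to each student's bag. The Python sketch below imitates that, under two assumptions that the original session does not show: the empty teacher field is null (None here), and COUNT skips tuples whose first field is null, as the Pig documentation describes. The helper name count_distinct_teachers is hypothetical.

```python
from collections import defaultdict

# (student, teacher) pairs as in relation B; None models the missing teacher (assumption).
pairs = [
    ("James", "Tiger"), ("James", "Tiger"), ("James", "Yao"),
    ("Vincent", "Tiger"), ("Vincent", "Yao"), ("Vincent", None),
    ("NocWei", "Yao"),
]

# E = GROUP B BY student;
teachers_by_student = defaultdict(list)
for student, teacher in pairs:
    teachers_by_student[student].append(teacher)

def count_distinct_teachers(teachers):
    """Body of the nested FOREACH for one group."""
    uniq = set(teachers)  # uniq = DISTINCT T;
    # COUNT(uniq): null values are not counted (assumed COUNT semantics)
    return sum(1 for teacher in uniq if teacher is not None)

result = {s: count_distinct_teachers(ts) for s, ts in teachers_by_student.items()}
print(sorted(result.items()))
# [('James', 2), ('NocWei', 1), ('Vincent', 2)]
```

If those assumptions hold, this approach would report 2 for Vincent where the DISTINCT-then-count approach reports 3, because here the counted tuples contain only the teacher field. Pig's COUNT_STAR is the built-in to look at if nulls should be counted as well.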
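Step 1: GROUP BY
- groups the records so that a nested FOREACH can work on each group
Step 2: ORDER BY
- used inside the nested FOREACH
Step 3: LIMIT
- used together with ORDER BY
Step 4: FLATTEN
- un-nests the bag (removes the extra brackets)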
grunt> A = LOAD '/score.txt' USING PigStorage(',') AS (student,course,teacher,score:int);
grunt> dump A;
(James,Network,Tiger,100)
(James,Database,Tiger,99)
(James,PDE,Yao,95)
(Vincent,Network,Tiger,95)
(Vincent,PDE,Yao,98)
(Vincent,PDE,,)
(NocWei,PDE,Yao,100)
grunt> B = FOREACH A GENERATE student,course,score;
grunt> dump B;
(James,Network,100)
(James,Database,99)
(James,PDE,95)
(Vincent,Network,95)
(Vincent,PDE,98)
(Vincent,PDE,)
(NocWei,PDE,100)
grunt> C = GROUP B BY course;
grunt> dump C;
(PDE,{(NocWei,PDE,100),(Vincent,PDE,),(Vincent,PDE,98),(James,PDE,95)})
(Network,{(Vincent,Network,95),(James,Network,100)})
(Database,{(James,Database,99)})
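What GROUP produces here is one row per course, holding a bag of all the original tuples for that course. A rough Python equivalent, with None standing in for the missing score (an assumption about how the empty field is loaded), might look like this:

```python
from collections import defaultdict

# Relation B as (student, course, score); None models the empty score field.
records = [
    ("James", "Network", 100),
    ("James", "Database", 99),
    ("James", "PDE", 95),
    ("Vincent", "Network", 95),
    ("Vincent", "PDE", 98),
    ("Vincent", "PDE", None),
    ("NocWei", "PDE", 100),
]

# C = GROUP B BY course;  each key maps to a "bag" of complete tuples
bags = defaultdict(list)
for record in records:
    bags[record[1]].append(record)

for course, bag in bags.items():
    print(course, len(bag))
# Network 2
# Database 1
# PDE 4
```

A bag is unordered, which is why the tuples inside each group in the dump above do not follow the order of the input file.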
grunt> D = FOREACH C
>> {
>> sorted = ORDER B BY score DESC;
>> top = LIMIT sorted 2;
>> GENERATE group AS course, top AS top;
>> }
grunt> dump D;
(Database,{(James,Database,99)})
(Network,{(James,Network,100),(Vincent,Network,95)})
(PDE,{(NocWei,PDE,100),(Vincent,PDE,98)})
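The ORDER plus LIMIT pair inside the nested FOREACH is a per-group "top N". The sketch below reproduces it in Python for the PDE bag only. The helper top_n is hypothetical, and placing the null score last in descending order is an assumption chosen to match the dump above, where the (Vincent,PDE,) tuple does not make the top 2.

```python
# One course's bag as (student, course, score); None marks the missing score.
pde_bag = [
    ("NocWei", "PDE", 100),
    ("Vincent", "PDE", None),
    ("Vincent", "PDE", 98),
    ("James", "PDE", 95),
]

def top_n(bag, n=2):
    # sorted = ORDER B BY score DESC;  null scores are pushed to the end
    ranked = sorted(
        bag,
        key=lambda rec: (rec[2] is not None, rec[2] if rec[2] is not None else 0),
        reverse=True,
    )
    # top = LIMIT sorted 2;
    return ranked[:n]

print(top_n(pde_bag))
# [('NocWei', 'PDE', 100), ('Vincent', 'PDE', 98)]
```

When a bag holds fewer than N tuples, as with the Database course, the slice simply returns everything there is, which matches the single-tuple bag in the dump.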
grunt> E = FOREACH D GENERATE course,FLATTEN(top); -- un-nest the bag so each tuple becomes its own row
grunt> dump E;
(Database,James,Database,99)
(Network,James,Network,100)
(Network,Vincent,Network,95)
(PDE,NocWei,PDE,100)
(PDE,Vincent,PDE,98)
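FLATTEN on a bag behaves like a cross product between the other generated fields and the tuples in the bag: every tuple in the bag yields one output row, prefixed with the course. A small Python sketch of that un-nesting, using the contents of D from the dump above:

```python
# Relation D: course -> bag of its top tuples (copied from the dump of D).
top_by_course = {
    "Database": [("James", "Database", 99)],
    "Network": [("James", "Network", 100), ("Vincent", "Network", 95)],
    "PDE": [("NocWei", "PDE", 100), ("Vincent", "PDE", 98)],
}

# E = FOREACH D GENERATE course, FLATTEN(top);
flattened = [
    (course, *record)
    for course, bag in top_by_course.items()
    for record in bag
]

for row in flattened:
    print(row)
# ('Database', 'James', 'Database', 99)
# ('Network', 'James', 'Network', 100)
# ('Network', 'Vincent', 'Network', 95)
# ('PDE', 'NocWei', 'PDE', 100)
# ('PDE', 'Vincent', 'PDE', 98)
```

The course name appears twice in each row because the tuples inside the bag still carry their own course field; projecting only student and score before grouping would avoid the repetition.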
This article comes from the “晓风残月” blog; please keep this attribution: http://kinda22.blog.51cto.com/2969503/1582569