Pig Experiments

Posted: 2014-11-26 06:49:24

Tags: hadoop pig

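Objectives:

Goal 1: Count how many teachers have taught each student

Method 1: DISTINCT first, then count

- DISTINCT removes duplicate tuples across the whole relation

Method 2: Group first

- Use a nested FOREACH

- Apply DISTINCT inside it

First, create the source data file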

[hadoop@hadoop1 ~]$ cat score.txt 
James,Network,Tiger,100
James,Database,Tiger,99
James,PDE,Yao,95
Vincent,Network,Tiger,95
Vincent,PDE,Yao,98
Vincent,PDE,
NocWei,PDE,Yao,100
[hadoop@hadoop1 ~]$ hadoop fs -put score.txt /

[hadoop@hadoop1 ~]$ pig
grunt> A = LOAD '/score.txt' USING PigStorage(',') AS (student,course,teacher,score:int);
grunt> DESCRIBE A;
grunt> B = FOREACH A GENERATE student, teacher;                      -- keep only student and teacher; drop the other fields
grunt> DESCRIBE B;                                                   -- inspect B's schema: it now has only two fields
grunt> C = DISTINCT B;                                               -- remove duplicate tuples from B
grunt> D = FOREACH ( GROUP C BY student ) GENERATE group AS student , COUNT(C);
grunt> DUMP D;

Result (the Vincent row with an empty teacher survives DISTINCT as its own pair, so Vincent shows 3):
(James,2)
(NocWei,1)
(Vincent,3)
grunt>
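
To double-check the numbers from Method 1 outside of Pig, here is a minimal plain-Python sketch of the same three steps (project, DISTINCT, group and count). The function name `teachers_per_student_distinct_first` and the use of `None` for the empty fields are my own assumptions for illustration; this only imitates the Pig statements, it is not how Pig executes them.

```python
from collections import Counter

# Rows from score.txt as (student, course, teacher, score).
# The incomplete line "Vincent,PDE," is modelled with None for the missing fields.
ROWS = [
    ("James", "Network", "Tiger", 100),
    ("James", "Database", "Tiger", 99),
    ("James", "PDE", "Yao", 95),
    ("Vincent", "Network", "Tiger", 95),
    ("Vincent", "PDE", "Yao", 98),
    ("Vincent", "PDE", None, None),
    ("NocWei", "PDE", "Yao", 100),
]


def teachers_per_student_distinct_first(rows):
    # B = FOREACH A GENERATE student, teacher;
    pairs = [(student, teacher) for student, _course, teacher, _score in rows]
    # C = DISTINCT B;  (a set drops duplicate (student, teacher) pairs)
    unique_pairs = set(pairs)
    # D = FOREACH (GROUP C BY student) GENERATE group AS student, COUNT(C);
    counts = Counter(student for student, _teacher in unique_pairs)
    return sorted(counts.items())


print(teachers_per_student_distinct_first(ROWS))
# [('James', 2), ('NocWei', 1), ('Vincent', 3)]
```

The pair with the missing teacher is kept as a distinct pair, which is where Vincent's third "teacher" comes from.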

grunt> E = group B by student;  
grunt> F = foreach E                                
>> {                                            
>> T = B.teacher;                               
>> uniq = DISTINCT T;                           
>> generate group as student,COUNT(uniq) as cnt;
>> }
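
The post does not show the output of F, so the following plain-Python sketch is only my reading of what the nested FOREACH should produce. It assumes Pig's documented COUNT behaviour, where a tuple whose first field is null is not counted; inside the nested block the bag holds only the teacher field, so the empty teacher would be skipped. The function name and the `None` placeholder are illustrative assumptions, and it is worth confirming against a real `DUMP F;`.

```python
from collections import defaultdict

# Rows from score.txt as (student, course, teacher, score); None marks a missing field.
ROWS = [
    ("James", "Network", "Tiger", 100),
    ("James", "Database", "Tiger", 99),
    ("James", "PDE", "Yao", 95),
    ("Vincent", "Network", "Tiger", 95),
    ("Vincent", "PDE", "Yao", 98),
    ("Vincent", "PDE", None, None),
    ("NocWei", "PDE", "Yao", 100),
]


def teachers_per_student_group_first(rows):
    # E = group B by student;
    teachers_by_student = defaultdict(set)
    for student, _course, teacher, _score in rows:
        # T = B.teacher; uniq = DISTINCT T;  (one set of teachers per student)
        teachers_by_student[student].add(teacher)
    # COUNT(uniq): assumed to skip null teachers, so None is not counted here
    return sorted(
        (student, sum(1 for teacher in teachers if teacher is not None))
        for student, teachers in teachers_by_student.items()
    )


print(teachers_per_student_group_first(ROWS))
# [('James', 2), ('NocWei', 1), ('Vincent', 2)]
```

Under that assumption the two methods disagree for Vincent (3 versus 2), because Method 1 counts (student, teacher) pairs whose first field is never null.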

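Goal 2: Find the top two students in each course

Step 1: GROUP BY

- How GROUP BY nests the records into a bag per course

Step 2: ORDER BY

- Done inside a nested FOREACH

Step 3: LIMIT

- Used together with ORDER BY

Step 4: FLATTEN

- Removes the nesting (strips the bag's brackets)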

grunt> A = LOAD '/score.txt' USING PigStorage(',') as (student,course,teacher,score:int);
grunt> dump A;
(James,Network,Tiger,100)
(James,Database,Tiger,99)
(James,PDE,Yao,95)
(Vincent,Network,Tiger,95)
(Vincent,PDE,Yao,98)
(Vincent,PDE,,)
(NocWei,PDE,Yao,100)
grunt> B = FOREACH A GENERATE student,course,score;
grunt> dump B;
(James,Network,100)
(James,Database,99)
(James,PDE,95)
(Vincent,Network,95)
(Vincent,PDE,98)
(Vincent,PDE,)
(NocWei,PDE,100)
grunt> C = group B by course;
grunt> dump C;
(PDE,{(NocWei,PDE,100),(Vincent,PDE,),(Vincent,PDE,98),(James,PDE,95)})
(Network,{(Vincent,Network,95),(James,Network,100)})
(Database,{(James,Database,99)})
grunt> D = FOREACH C                        
>> {                                    
>> sorted = ORDER B BY score DESC;      
>> top = LIMIT sorted 2;                
>> GENERATE group AS course, top AS top;
>> }
grunt> dump D;
(Database,{(James,Database,99)})
(Network,{(James,Network,100),(Vincent,Network,95)})
(PDE,{(NocWei,PDE,100),(Vincent,PDE,98)})
grunt> E = FOREACH D GENERATE course,FLATTEN(top);                      -- un-nest the bag so each student becomes its own row
grunt> dump E;
(Database,James,Database,99)
(Network,James,Network,100)
(Network,Vincent,Network,95)
(PDE,NocWei,PDE,100)
(PDE,Vincent,PDE,98)
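
The four steps of Goal 2 (group, order, limit, flatten) can be cross-checked with a short plain-Python sketch. The function name `top_two_per_course`, the parameter `n`, and the use of `None` for the missing score are assumptions made for illustration; the sort key places missing scores last, matching what the descending sort in the session appears to do.

```python
from collections import defaultdict

# Rows from score.txt as (student, course, teacher, score); None marks a missing field.
ROWS = [
    ("James", "Network", "Tiger", 100),
    ("James", "Database", "Tiger", 99),
    ("James", "PDE", "Yao", 95),
    ("Vincent", "Network", "Tiger", 95),
    ("Vincent", "PDE", "Yao", 98),
    ("Vincent", "PDE", None, None),
    ("NocWei", "PDE", "Yao", 100),
]


def top_two_per_course(rows, n=2):
    # B = FOREACH A GENERATE student, course, score;  C = group B by course;
    by_course = defaultdict(list)
    for student, course, _teacher, score in rows:
        by_course[course].append((student, course, score))

    flattened = []
    for course in sorted(by_course):
        # sorted = ORDER B BY score DESC;  (rows without a score go last)
        ranked = sorted(
            by_course[course],
            key=lambda row: (row[2] is None, -(row[2] or 0)),
        )
        # top = LIMIT sorted 2;  then FLATTEN(top): one output row per student
        for student, inner_course, score in ranked[:n]:
            flattened.append((course, student, inner_course, score))
    return flattened


for row in top_two_per_course(ROWS):
    print(row)
```

The course appears twice in each output row for the same reason it does in E: once as the group key and once as a field of the flattened tuple.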

This article is from the “晓风残月” blog. Please keep this attribution: http://kinda22.blog.51cto.com/2969503/1582569
