Hive Movie-Rating Exercise

Date: 2018-12-03 14:00:07


Three data files are given:
1. users.dat, with records such as: 2::M::56::16::70072
6040 records in total
Fields: UserID BigInt, Gender String, Age Int, Occupation String, Zipcode String
Meaning: user ID, gender, age, occupation, zip code
2. movies.dat, with records such as: 2::Jumanji (1995)::Adventure|Children's|Fantasy
3883 records in total
Fields: MovieID BigInt, Title String, Genres String
Meaning: movie ID, title, genres
3. ratings.dat, with records such as: 1::1193::5::978300760
1000209 records in total
Fields: UserID BigInt, MovieID BigInt, Rating Double, Timestamped String
Meaning: user ID, movie ID, rating, rating timestamp
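Requirements
  Data requirements:
    (1) Write a shell script to clean the data. (Hive's default delimited row format cannot parse multi-byte delimiters: it can split on ':' but not on '::', so an ordinary table definition will not work and the data needs one simple cleaning pass.)
    (2) Use a form that Hive can parse
  Hive requirements:
    (1) Create the tables correctly, load the data (three tables, three files), and verify the result
    (2) Find the 10 most frequently rated movies, with their rating counts (title, rating count)
    (3) Find the 10 highest-rated movies among male and among female users separately (gender, title, rating)
    (4) For the movie with movieid = 2116, find the average rating per age group (there are only 7 distinct age values, so group by those 7) (age group, rating)
    (5) For the woman who rates the most movies, take the 10 movies she rated highest and find their average ratings (viewer, title, rating)
    (6) Find the 10 best movies of the year that has the most good movies (rating >= 4.0)
    (7) Among movies released in 1997, find the 10 highest-rated Comedy movies
    (8) For each genre in the data set, find the 5 highest-rated movies (genre, title, average rating)
    (9) For each year, find the highest-rated genre (year, genre, rating)
    (10) For each region, find the title of the highest-rated movie and store the result in HDFS (region, title, rating)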
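Data requirement (1) asks for a shell script that cleans the files. A minimal sketch, assuming no field already contains a tab character; the helper name clean_dat and the .tsv output names are made up for illustration. It replaces every :: with a tab rather than a single colon, since a title in movies.dat can itself contain a colon.

```shell
#!/bin/sh
# Sketch: turn the two-character "::" delimiter into a single tab.
# Assumption: no field contains a tab already.

TAB=$(printf '\t')

# clean_dat (hypothetical helper): reads raw lines on stdin,
# writes tab-delimited lines on stdout.
clean_dat() {
    sed "s/::/${TAB}/g"
}

# Usage (input names come from the exercise; output names are illustrative):
#   for f in users movies ratings; do
#       clean_dat < "$f.dat" > "$f.tsv"
#   done
```

A table over the cleaned files could then be declared with row format delimited fields terminated by '\t'.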
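The three tables had previously been merged with a MapReduce program, so the merged data only needs to be loaded into the corresponding tables and queried.
The raw data is delimited by ::, so a SerDe that can parse multi-byte delimiters is required.
RegexSerDe is used here.
It takes two parameters:
input.regex = "(.*)::(.*)::(.*)"
output.format.string = "%1$s %2$s %3$s"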
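Before loading anything, a three-group pattern of this kind can be tried on a sample record outside Hive. The sketch below uses sed -E, on the assumption that its extended regular expressions behave like the Java pattern for this simple case (only greedy .* groups and literal colons); split_record is a hypothetical helper, and the ^ and $ anchors stand in for a whole-line match.

```shell
#!/bin/sh
# split_record (hypothetical helper): applies the three-group pattern to each
# line on stdin and prints the captured groups in brackets.
# Lines that do not match the whole pattern are dropped (-n ... p).
split_record() {
    sed -n -E 's/^(.*)::(.*)::(.*)$/[\1] [\2] [\3]/p'
}

printf '%s\n' "2::Jumanji (1995)::Adventure|Children's|Fantasy" | split_record
# prints: [2] [Jumanji (1995)] [Adventure|Children's|Fantasy]
```

Each bracketed group corresponds to one table column, in order, which is why the users and ratings tables need five and four groups respectively.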

-- one capture group per column, in column order
create table t_user(
userid bigint,
sex string,
age int,
occupation string,
zipcode string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s %5$s')
stored as textfile;
load data local inpath "/root/users.dat" into table t_user;
create table t_movie(
movieid bigint,
moviename string,
movietype string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s')
stored as textfile;
load data local inpath "/root/movies.dat" into table t_movie;
create table t_rating(
userid bigint,
movieid bigint,
rate double,
times string)
row format serde 'org.apache.hadoop.hive.serde2.RegexSerDe'
with serdeproperties('input.regex'='(.*)::(.*)::(.*)::(.*)','output.format.string'='%1$s %2$s %3$s %4$s')
stored as textfile;
load data local inpath "/root/ratings.dat" into table t_rating;
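Requirement (1) also asks to verify the load. One cheap check, sketched here under the assumption that every record occupies exactly one line, is to compare each file's line count with the record counts quoted at the top (6040, 3883, 1000209). check_count is a hypothetical helper.

```shell
#!/bin/sh
# check_count (hypothetical helper): compares the number of lines in a file
# with an expected record count and reports the result.
# Assumption: one record per line, and the last line ends with a newline.
check_count() {
    file=$1
    expected=$2
    # wc -l < file prints only the count; the arithmetic expansion drops padding.
    actual=$(( $(wc -l < "$file") ))
    if [ "$actual" -eq "$expected" ]; then
        echo "OK $file: $actual"
    else
        echo "MISMATCH $file: expected $expected, got $actual"
        return 1
    fi
}

# Usage with the paths from the load statements:
#   check_count /root/users.dat 6040
#   check_count /root/movies.dat 3883
#   check_count /root/ratings.dat 1000209
```

Inside Hive, a select * from t_user limit 5; that shows populated columns is a useful complement, since a line that the regex fails to match is expected to surface as NULL columns rather than as a missing row.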
(2) Find the 10 most frequently rated movies, with their rating counts (title, rating count)
create view v_movie_rate
as
select movieid,rate_count
from (
select movieid,count(1) as rate_count
from t_rating
group by movieid ) tmp
order by rate_count desc
limit 10;

select m.moviename,mr.rate_count
from v_movie_rate mr
join t_movie m on m.movieid=mr.movieid;

Second approach
select a.moviename as moviename,count(a.moviename) as total
from t_movie a join t_rating b on a.movieid=b.movieid
group by a.moviename
order by total desc
limit 10;
(3) Find the 10 highest-rated movies among male and among female users separately (gender, title, rating)

-- movies with fewer than 50 ratings are excluded (having movie_count >= 50)
create view v_muser_rate_top
as
select 'M' as sex,m.moviename,avg(r.rate) as rate_avg,count(r.movieid) movie_count
from t_user u
join t_rating r on u.userid=r.userid
join t_movie m on m.movieid=r.movieid
where u.sex='M'
group by m.moviename
having movie_count >= 50
order by rate_avg desc
limit 10;

create view v_fuser_rate_top
as
select 'F' as sex,m.moviename,avg(r.rate) as rate_avg,count(r.movieid) movie_count
from t_user u
join t_rating r on u.userid=r.userid
join t_movie m on m.movieid=r.movieid
where u.sex='F'
group by m.moviename
having movie_count >= 50
order by rate_avg desc
limit 10;

select * from v_muser_rate_top
union
select * from v_fuser_rate_top;
(4) For the movie with movieid = 2116, find the average rating per age group (there are only 7 distinct age values, so group by those 7) (age group, rating)

select u.age,avg(r.rate) rate_avg
from t_rating r
join t_user u on u.userid=r.userid
where r.movieid=2116
group by u.age;

1 3.2941176470588234
18 3.3580246913580245
25 3.436548223350254
35 3.2278481012658227
45 2.8275862068965516
50 3.32
56 3.5
(5) For the woman who rates the most movies, take the 10 movies she rated highest and find their average ratings (title, rating)
Find the most prolific female rater (userid=1150)
create table rate_max_count_famale
as
select r.userid,count(r.userid) as rate_count
from t_rating r
join t_user u on u.userid=r.userid
where u.sex='F'
group by r.userid
order by rate_count desc
limit 1;

Find the 10 movies that she rated highest
create table t_famale_top10
as
select r.movieid,r.rate
from t_rating r
join rate_max_count_famale f on f.userid=r.userid
order by r.rate desc
limit 10;
Compute the average rating of those 10 movies across all users
select r.movieid,m.moviename,avg(r.rate) rate_avg
from t_famale_top10 f
join t_rating r on r.movieid=f.movieid
join t_movie m on m.movieid=r.movieid
group by r.movieid,m.moviename;
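(6) Find the 10 best movies of the year that has the most good movies (rating >= 4.0)
First find all movies with an average rating of at least 4, extracting each movie's release year from its title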
create table tmp_movie_rateavg_4
as
select m.movieid,m.moviename,substr(m.moviename,-5,4) tyear,avg(r.rate) rate_avg
from t_movie m
join t_rating r on r.movieid=m.movieid
group by m.movieid,m.moviename
having rate_avg>=4;
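substr(m.moviename,-5,4) starts five characters from the end of the title and takes four, so it yields the year only when a title ends in "(YYYY)". A sketch of a pre-check on the raw file, assuming the ::-delimited movies.dat layout described at the top; count_bad_titles is a hypothetical helper that counts titles not ending in a parenthesised four-digit year.

```shell
#!/bin/sh
# count_bad_titles (hypothetical helper): reads movies.dat-style lines on stdin
# and prints how many titles (field 2) do not end in "(YYYY)".
count_bad_titles() {
    awk -F'::' '
        $2 !~ /\([0-9][0-9][0-9][0-9]\)$/ { bad++ }
        END { print bad + 0 }   # + 0 prints 0 instead of an empty string
    '
}

# Usage:
#   count_bad_titles < /root/movies.dat
```

A result of 0 suggests the substr shortcut is safe for the file at hand; anything else means some tyear values would not be years.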
Group by year to find the year with the most good movies (1998)
select tyear,count(tyear) total
from tmp_movie_rateavg_4
group by tyear
order by total desc
limit 1;
Find the 10 best movies of that year
select movieid,moviename,rate_avg
from tmp_movie_rateavg_4
where tyear='1998'
order by rate_avg desc
limit 10;

(7) Among movies released in 1997, find the 10 highest-rated Comedy movies
-- the release year is taken from the "(YYYY)" suffix of the title, so a title that merely contains 1997 elsewhere is not matched
insert overwrite local directory '/root/00movie_rate_top10' row format delimited fields terminated by '\t'
select m.movieid,m.moviename,avg(r.rate) rate_avg
from t_movie m
join t_rating r on r.movieid=m.movieid
where substr(m.moviename,-5,4)='1997' and m.movietype like '%Comedy%'
group by m.movieid,m.moviename
order by rate_avg desc
limit 10;
(8) For each genre in the data set, find the 5 highest-rated movies (genre, title, average rating)
select movietype,count(1) total
from t_movie
group by movietype;

select m.movieid,m.moviename,r.rate,tv.type
from t_movie m
join t_rating r on r.movieid=m.movieid
lateral view explode(split(m.movietype,'\\|')) tv as type;

Compute the average rating of each movie
create table tmp_movie_rateavg_1
as
select m.movieid,m.moviename,m.movietype,avg(r.rate) rate_avg
from t_movie m
join t_rating r on r.movieid=m.movieid
group by m.movieid,m.moviename,m.movietype;
Split the genre column into one row per genre
create table tmp_movie_rateavg_1_1
as
select movieid,moviename,rate_avg,tv.type
from tmp_movie_rateavg_1
lateral view explode(split(movietype,'\\|')) tv as type;
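To make the effect of lateral view explode(split(...)) concrete, the same one-row-per-genre expansion can be mimicked on a raw movies.dat line with awk. This is only an illustration outside Hive, not part of the original solution; explode_genres is a hypothetical helper, and the pipe is written as the bracket expression [|] so that it is taken literally rather than as regex alternation.

```shell
#!/bin/sh
# explode_genres (hypothetical helper): for each movies.dat-style line on stdin,
# prints one "title<TAB>genre" line per genre.
explode_genres() {
    awk -F'::' '{
        n = split($3, genres, "[|]")      # [|] matches a literal pipe
        for (i = 1; i <= n; i++)
            print $2 "\t" genres[i]
    }'
}

printf '%s\n' "2::Jumanji (1995)::Adventure|Children's|Fantasy" | explode_genres
```

The sample line comes out as three rows, one each for Adventure, Children's and Fantasy, all carrying the same title. The exploded table has the same shape, with the movie's other columns repeated on each genre row.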

select type,moviename,rate_avg
from(
select type,moviename,rate_avg,row_number() over(partition by type order by rate_avg desc) rn
from tmp_movie_rateavg_1_1 ) tmp
where tmp.rn <=5;
(9) For each year, find the highest-rated genre (year, genre, rating)
Starting from tmp_movie_rateavg_1, derive the genre and the year
create table tmp_movie_rateavg_1_2
as
select movieid,substr(moviename,-5,4) tyear,moviename ,rate_avg,tv.type
from tmp_movie_rateavg_1
lateral view explode(split(movietype,'\\|')) tv as type;

-- rank genres within each year: the window is partitioned by tyear only, so rn=1 marks the top genre of that year
create table tmp_movie_type_year_top
as
select tyear,type,rate_avg_movietype,row_number() over(partition by tyear order by rate_avg_movietype desc) rn
from(
select tyear,type,avg(rate_avg) rate_avg_movietype
from tmp_movie_rateavg_1_2
group by tyear,type
) tmp
;

select *
from tmp_movie_type_year_top
where rn=1;

(10) For each region, find the title of the highest-rated movie and store the result in HDFS (region, title, rating)


Original article: http://blog.51cto.com/6000734/2325266
