
Hive Learning Notes

Posted: 2019-07-04 00:32:39


1. Basic Hive SQL

Create the test tables.

Article table: a single STRING column holding one sentence per row.

create table article (
  sentence STRING
)
row format delimited fields terminated by '\n';

LOAD DATA LOCAL INPATH '/home/hejunhong/wc.log' OVERWRITE INTO TABLE article;

(1) WordCount with Hive

1.
select word, count(*)
from (
  select explode(split(sentence, '\t')) as word from article b
) t
group by word;

2.
select t.word, count(t.word)
from (
  select word
  from article
  lateral view explode(split(sentence, '\t')) a as word
) t
group by t.word;
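Both queries above rely on split + explode to flatten each sentence into one row per word before counting. A minimal Python sketch of the same logic (the sample sentences are invented for illustration):

```python
# Sketch of Hive's split('\t') + explode + GROUP BY word count,
# using plain Python. The article rows here are made-up samples.
from collections import Counter

article = [
    "hello\tworld\thello",  # one "sentence" per row, tab-separated words
    "hive\tworld",
]

# explode(split(sentence, '\t')): one output row per word
words = [w for sentence in article for w in sentence.split("\t")]

# group by word, count(*)
word_counts = Counter(words)
print(word_counts)  # Counter({'hello': 2, 'world': 2, 'hive': 1})
```

The `Counter` plays the role of `group by word` with `count(*)`; in Hive the same aggregation runs distributed across mappers and reducers.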

(2) Classic row-to-column transformation

Table DDL and sample data (month, valid count, invalid count):
2018-01    211    984
2018-02    333    999
2018-03    111    222

create table rowtocol(
  dt_month string,
  valid_num int,
  unvalid_num int
)
row format delimited fields terminated by '\t';
LOAD DATA LOCAL INPATH '/opt/data/row_col.txt' OVERWRITE INTO TABLE rowtocol;

The requirement is to transform it into the following form:

add_t.type    add_t.num
bene_idno    211
bene_moble    984
bene_idno    333
bene_moble    999
bene_idno    111
bene_moble    222
select add_t.type, add_t.num
from rowtocol a
lateral view explode(
  str_to_map(concat('bene_idno=', valid_num, '&bene_moble=', unvalid_num), '&', '=')
) add_t as type, num;

Tip:
If a row looks like this:
num1 num2 num3 num4 num5 num6
100 2333 111 1223 8990 9000
and you want to turn it into
num1   100
num2   ..
num3   ..
num4
num5

num6 9000

you can try
lateral view explode(str_to_map(concat('num1=', num1, '&num2=', num2), '&', '=')) add_t as field, num
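The str_to_map + explode trick packs several columns into one "k1=v1&k2=v2" string, parses it into a map, then emits one (key, value) row per entry. A hypothetical Python sketch of that pipeline (sample rows invented for illustration):

```python
# Sketch of the str_to_map + explode row-to-column trick in plain Python.
def str_to_map(s, pair_sep="&", kv_sep="="):
    """Mimics Hive's str_to_map: parse 'k1=v1&k2=v2' into a dict."""
    return dict(pair.split(kv_sep, 1) for pair in s.split(pair_sep))

rows = [("2018-01", 211, 984), ("2018-02", 333, 999)]  # sample rowtocol rows

exploded = []
for dt_month, valid_num, unvalid_num in rows:
    # concat('bene_idno=', valid_num, '&bene_moble=', unvalid_num)
    packed = f"bene_idno={valid_num}&bene_moble={unvalid_num}"
    # lateral view explode(...): one output row per map entry
    for key, num in str_to_map(packed).items():
        exploded.append((key, int(num)))

print(exploded)
# [('bene_idno', 211), ('bene_moble', 984), ('bene_idno', 333), ('bene_moble', 999)]
```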



(3) Classic functions: time calculations

Sample data:
user id, item id, user's rating of the item, timestamp

udata.user_id udata.item_id udata.rating udata.timestamp
196 242 3 881250949
186 302 3 891717742
22 377 1 878887116
244 51 2 880606923
166 346 1 886397596
298 474 4 884182806
115 265 2 881171488
253 465 5 891628467
305 451 3 886324817
6 86 3 883603013

create table udata(
  user_id string,
  item_id string,
  rating string,
  `timestamp` string
) row format delimited fields terminated by '\t';
1. In the recommendation data, find the most recent and the earliest timestamps.
select max(`timestamp`) as max_t, min(`timestamp`) as min_t from bigdata.udata;
893286638 874724710
Use the most recent point, 893286638, as the reference time.

2. How many days apart are two time points?
select (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) as diff_days from udata;
3. Using 893286638 as the reference point, collect every user's purchase times so we can inspect their purchase patterns.
select user_id, collect_list(cast(days as int)) as day_list
from (
  select user_id, (cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) as days
  from udata
) t
group by user_id
limit 10;


Observations:
1. If a user's records are heavily concentrated within a single day relative to some reference time, that may indicate fake orders (order brushing).
2. Looking at the time distribution of the data can also inform data-cleaning rules.
100 [22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22,22]
101    [186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186,186]
102    [3,51,51,110,51,110,51,3,3,51,145,51,51,51,51,72,115,51,115,51,110,51,201,51,3,51,51,51,3,51,51,51,51,110,51,51,51,51,164,51,52,177,51,51,51,115,51,3,50,51,51,3,51,51,51,51,201,51,51,51,3,51,51,51,51,110,51,110,51,51,110,3,51,3,3,51,3,51,51,51,51,51,51,115,51,51,51,51,51,51,51,51,51,51,51,51,110,3,3,51,97,51,3,51,72,110,51,51,51,45,51,51,3,201,51,51,3,3,110,51,94,51,51,110,110,51,115,51,51,51,51,3,51,3,51,51,110,51,51,51,115,115,51,51,51,51,51,3,51,164,110,115,51,51,51,3,110,3,51,51,21,201,51,51,3,51,51,3,3,51,72,3,57,3,3,51,51,51,94,115,51,3,51,51,3,51,51,51,51,51,3,51,51,51,3,51,51,160,3,51,51,87,110,51,110,45,59,51,51,51,51,51,110,115,51,51]
103    [148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148,148]
104    [55,56,56,56,55,55,55,55,55,56,55,55,55,56,55,55,55,55,55,55,55,56,55,55,56,55,55,55,56,55,56,56,56,56,55,56,55,55,55,55,55,55,55,56,56,56,56,55,56,55,56,55,56,55,55,55,56,55,55,55,55,55,55,55,55,55,56,56,56,56,55,56,55,55,56,55,55,55,55,55,56,55,56,55,55,56,56,56,56,55,56,55,56,56,55,56,55,55,55,55,55,55,55,55,56,56,56,55,55,55,56]
105    [47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47,47]
106    [136,136,136,136,108,137,136,136,136,137,136,136,108,137,136,108,53,136,136,136,136,136,136,108,136,108,136,136,136,108,136,136,136,136,108,136,136,137,108,136,136,136,136,136,137,108,136,136,136,137,136,108,136,136,136,136,136,136,137,136,136,108,136,136]
107    [23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23,23]
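The day lists above come from subtracting each event timestamp from the reference point and truncating to whole days. A small Python sketch of that grouping (sample timestamps invented; only user 196's real first timestamp is taken from the sample data):

```python
# Sketch of: collect_list(cast(days as int)) ... group by user_id.
# Days are counted back from the reference timestamp 893286638.
from collections import defaultdict

REF = 893286638
udata = [  # (user_id, timestamp) sample rows
    ("196", 881250949),
    ("196", 881250000),  # hypothetical second event for the same user
    ("186", 891717742),
]

day_lists = defaultdict(list)
for user_id, ts in udata:
    days = (REF - ts) // (60 * 60 * 24)  # whole days before the reference point
    day_lists[user_id].append(int(days))

print(dict(day_lists))  # {'196': [139, 139], '186': [18]}
```

Hive's `cast(... as int)` truncates just like Python's floor division does for these non-negative values.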

4. A user has many behavior records; for behavior analysis, the more recent a behavior is, the more useful it is.

Building on the time distribution above:

(1) Why introduce a time-decay function: if t is the age of an event in days, an event from today has t = 0 and weight exp(0) = 1, the peak. The older the event, the less reference value it has, so multiplying its rating by the weight yields a smaller score.

(2) The exp() function is e raised to the x-th power.

(3) exp(-t/2): the larger t is, the smaller the value; it decays exponentially from the peak exp(0) = 1 at t = 0.
(Function curve figure omitted.)

select user_id,
       sum(exp(-(cast(893286638 as bigint) - cast(`timestamp` as bigint)) / (60*60*24) / 2) * rating) as score
from bigdata.udata
group by user_id
limit 10;

Interpreting the result: in exp(-t/2), the smaller t is (the closer the event is to the reference time), the larger the weight, so users with a larger sum have more recent and therefore more valuable behavior. You could, for example, take the top 100 users by this score as recommendation targets; in the output below, user 102 (score 15.15) clearly stands out.

1	3.26938641750186E-8
10	1.3899514053917838E-32
100	0.0028427420371960263
101	4.9919669370351064E-39
102	15.147722144199362
103	4.771115073346258E-31
104	2.15626106001131E-10
105	4.541247668782543E-9
106	1.2297890524212914E-11
107	5.459575349110719E-4
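As a sanity check on the decay logic, here is a hypothetical Python sketch (the events are invented samples) that reproduces the sum(exp(-t/2) * rating) score and confirms that the user with the most recent activity gets the largest score:

```python
# Sketch of the time-decay score: weight each rating by exp(-t/2),
# where t is the event's age in days, then sum per user.
import math
from collections import defaultdict

REF = 893286638
events = [  # (user_id, rating, timestamp) sample rows
    ("1", 3, 881250949),   # old event: weight ~ exp(-70), essentially 0
    ("1", 5, 893200000),   # ~1 day old: weight ~ exp(-0.5), dominates
    ("2", 4, 878887116),   # very old event
]

scores = defaultdict(float)
for user_id, rating, ts in events:
    t = (REF - ts) / (60 * 60 * 24)          # age in days
    scores[user_id] += math.exp(-t / 2) * rating

# The user with the larger score has more recent (more valuable) behavior.
top = max(scores, key=scores.get)
print(top)  # '1'
```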

 

4. Other examples

1. Table and field descriptions
aisles.csv  departments.csv  order_products__prior.csv  order_products__train.csv  orders.csv  products.csv
1) aisles: shelf/aisle IDs (second-level category); dimension table
aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
4,instant foods
5,marinades meat preparation
6,other
7,packaged meat
8,bakery desserts
9,pasta sauce

2) departments: e.g. kitchen-related (first-level category); dimension table
department_id,department
1,frozen
2,other
3,bakery
4,produce
5,alcohol
6,international
7,beverages
8,pets
9,dry goods pasta

3) orders: the orders table (a behavior table in Hive)
eval_set values:
prior: historical behavior;
train: training set (one product that a test user has already bought);
test: the data set we ultimately predict on (which products each user is likely to buy)

order_number: sequence number of this user's orders, reflecting their chronological order
order_dow: day of week (dow) on which the order was placed
order_hour_of_day: hour of the day (0-23)
days_since_prior_order: days between an order and the previous one (note: the first order has no value)

order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
2254736,1,prior,4,4,07,29.0
431534,1,prior,5,4,15,28.0
3367565,1,prior,6,2,07,19.0
550135,1,prior,7,1,09,20.0
3108588,1,prior,8,1,14,14.0
2295261,1,prior,9,1,16,0.0

4) order_products__prior (500 MB) and order_products__train
One order contains multiple order lines (e.g. products 33120 and 28985), which is what explode flattens.
(Behavior tables in Hive.)
add_to_cart_order: the position at which the product was added to the cart
reordered: whether this product was purchased again (boolean)
order_id,product_id,add_to_cart_order,reordered  
2,33120,1,1
2,28985,2,1
2,9327,3,0
2,45918,4,1
2,30035,5,0
2,17794,6,1
2,40141,7,1
2,1819,8,1
2,43668,9,0

5) products (a dimension table once loaded into Hive)
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
4,Smart Ones Classic Favorites Mini Rigatoni With Vodka Cream Sauce,38,1
5,Green Chile Anytime Sauce,5,13
6,Dry Nose Oil,11,11
7,Pure Coconut Water With Orange,98,7
8,Cut Russet Potatoes Steam N Mash,116,1
9,Light Strawberry Blueberry Yogurt,120,16

u.data:
user id | item id | rating | timestamp, delimited by \t

(1) How many orders does each user have? (orders table)

select user_id, count(order_id) as order_cnt from orders group by user_id order by order_cnt desc;

(2) On average, how many products does each of a user's orders contain?
trains: products in an order / number of orders
order_id,product_id

For example, if I placed 2 orders today, one with 10 products and the other with 4,
the average is (10 + 4) / 2 = 7.
  a. First use the prior table to count how many products each order contains.
  
  select order_id,count(1) as prod_cnt 
  from priors 
  group by order_id
  limit 10;
  
  b. Join prior to orders via order_id, attaching each order's product count to its user (so product counts line up with user_id).
  select user_id,prod_cnt
  from orders od
  join (
  select order_id,count(1) as prod_cnt 
  from priors 
  group by order_id
  limit 10000)pro
  on od.order_id=pro.order_id
  limit 10;
  
  c. Sum: how many products each user bought in total.
  select user_id,sum(prod_cnt)as sum_prods
  from orders od
  join (
  select order_id,count(1) as prod_cnt 
  from priors 
  group by order_id
  limit 10000)pro
  on od.order_id=pro.order_id
  group by user_id
  limit 10;
  
  d. Compute the average
  select user_id,
  sum(prod_cnt)/count(1) as sc_prod,
  avg(prod_cnt) as avg_prod 
  from (select * from orders where eval_set='prior') od -- restrict to prior orders; otherwise non-prior orders are counted with 0 products
  join (
  select order_id,count(1) as prod_cnt 
  from priors 
  group by order_id
  limit 10000)pro
  on od.order_id=pro.order_id
  group by user_id
  limit 10;
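Steps a-d above can be sketched in plain Python: count products per order, join to orders on order_id, then average per user. The sample orders and product rows below are invented for illustration:

```python
# Sketch of steps a-d: products per order -> join on order_id -> per-user average.
from collections import Counter, defaultdict

orders = [  # (order_id, user_id) sample rows
    (2539329, "1"), (2398795, "1"), (100, "2"),
]
priors = [  # (order_id, product_id) sample rows
    (2539329, 33120), (2539329, 28985), (2398795, 9327), (100, 45918),
]

# a. count(1) group by order_id: products per order
prod_cnt = Counter(order_id for order_id, _ in priors)

# b + c. join on order_id, then sum product counts per user
sums = defaultdict(int)
order_cnt = defaultdict(int)
for order_id, user_id in orders:
    if order_id in prod_cnt:        # inner join: drop orders with no prior rows
        sums[user_id] += prod_cnt[order_id]
        order_cnt[user_id] += 1

# d. average products per order, per user (avg(prod_cnt))
avg = {u: sums[u] / order_cnt[u] for u in sums}
print(avg)  # {'1': 1.5, '2': 1.0}
```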
 4) Each user's weekly order distribution (pivot: one column per weekday)
  user_id, dow0, dow1, dow2, dow3, dow4 ... dow6
    1        0     0     1     2     2  ...   0
  select 
  user_id,
  sum(case order_dow when 0 then 1 else 0 end) as dow0,
  sum(case order_dow when 1 then 1 else 0 end) as dow1,
  sum(case order_dow when 2 then 1 else 0 end) as dow2,
  sum(case order_dow when 3 then 1 else 0 end) as dow3,
  sum(case order_dow when 4 then 1 else 0 end) as dow4,
  sum(case order_dow when 5 then 1 else 0 end) as dow5,
  sum(case order_dow when 6 then 1 else 0 end) as dow6
  from orders
  group by user_id
  limit 10;
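The seven sum(case ... when) columns above amount to a per-user counter indexed by day of week. A minimal Python sketch with made-up sample rows:

```python
# Sketch of the day-of-week pivot: sum(case order_dow when d then 1 else 0 end)
# for d = 0..6 becomes a length-7 counter per user.
from collections import defaultdict

orders = [  # (user_id, order_dow) sample rows
    ("1", 2), ("1", 3), ("1", 3), ("2", 0),
]

pivot = defaultdict(lambda: [0] * 7)  # indexes 0..6 are dow0..dow6
for user_id, dow in orders:
    pivot[user_id][dow] += 1

print(dict(pivot))  # {'1': [0, 0, 1, 2, 0, 0, 0], '2': [1, 0, 0, 0, 0, 0, 0]}
```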

Original post: https://www.cnblogs.com/hejunhong/p/11117913.html
