
Common Hive Functions

Posted: 2020-06-23 20:54:23


Functions
1. Hive function categories
    Standard functions — a custom one is called a UDF: takes one row of data and returns one result.

    Aggregate functions — a custom one is called a UDAF: takes multiple rows and returns one result, used with GROUP BY, e.g. sum, count.

    Table-generating functions — a custom one is called a UDTF: takes one row and returns multiple rows, e.g. explode.


2. Function help
    SHOW FUNCTIONS;
    Lists all functions available in the current environment.

    DESC FUNCTION length;
    Shows a function's help documentation.

    DESC FUNCTION EXTENDED length;
    Shows extended information about the function.


3. Math functions
    round — rounds to the nearest value

    rand — random number

    conv — base conversion
    conv(num, from_base, to_base): num is the value, from_base its current base, to_base the target base

    cast — type conversion: select cast(id as STRING) from student;  converts the INT column to STRING
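As an illustration of what conv computes (a Python sketch, not Hive's actual implementation; digits limited to bases up to 16 here):

```python
def conv(num, from_base, to_base):
    """Mimic Hive's conv(): reinterpret num from from_base into to_base."""
    n = int(str(num), from_base)      # parse the value in its source base
    if n == 0:
        return "0"
    digits = "0123456789ABCDEF"
    out = []
    while n:
        n, r = divmod(n, to_base)     # peel off digits in the target base
        out.append(digits[r])
    return "".join(reversed(out))

print(conv("FF", 16, 10))   # 255
print(conv("255", 10, 16))  # FF
```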


4. Date functions
    from_unixtime — convert a Unix timestamp to a string
    select from_unixtime(1567995900, 'yyyyMMdd HHmmss');

    unix_timestamp — convert a string to a Unix timestamp
    select unix_timestamp('2011-12-10 13:10:12');
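The same round trip can be sketched in Python (using UTC for determinism; Hive formats in the session time zone, so its output may differ):

```python
from datetime import datetime, timezone

# from_unixtime: seconds since the epoch -> formatted string
ts = 1567995900
s = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%d %H%M%S")
print(s)  # 20190909 022500

# unix_timestamp: formatted string -> seconds since the epoch
back = int(datetime.strptime(s, "%Y%m%d %H%M%S")
           .replace(tzinfo=timezone.utc).timestamp())
print(back == ts)  # True
```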

5. Conditional functions
    IF(expr1, expr2, expr3) !!!
    If expr1 is true, returns expr2; otherwise returns expr3.
    SELECT name, age, if(age > 20, 'A', 'B') as type FROM student;


    coalesce — returns the first non-NULL argument #
    SELECT coalesce(null, 12, 3, 14, null);
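The semantics of both conditionals, sketched in Python (function names chosen to mirror the Hive ones; None stands in for SQL NULL):

```python
def hive_if(cond, a, b):
    """IF(expr1, expr2, expr3): a when cond is true, else b."""
    return a if cond else b

def coalesce(*args):
    """Return the first non-None argument, or None if all are None."""
    return next((a for a in args if a is not None), None)

print(hive_if(25 > 20, "A", "B"))        # A
print(coalesce(None, 12, 3, 14, None))   # 12
```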


    CASE  !!!
    Buckets data into groups.

    SELECT id, CASE
               WHEN id <= 2 THEN 'a'
               WHEN id > 2 AND id <= 10 THEN 'b'
               ELSE 'c'
               END FROM student;


    SELECT mycol, sum(id) FROM (SELECT id, (CASE WHEN id <= 2 THEN 'a' WHEN id > 2 AND id <= 10 THEN 'b' ELSE 'c' END) as mycol FROM student) t GROUP BY mycol;


    Simple CASE form, comparing one expression against values:
    SELECT CASE id WHEN 2 THEN 'a' ELSE 'b' END from student;
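The bucket-then-aggregate pattern above can be sketched in Python (the id values are made-up sample data, not the actual student table):

```python
from collections import defaultdict

def bucket(i):
    # CASE WHEN id <= 2 THEN 'a' WHEN id <= 10 THEN 'b' ELSE 'c' END
    if i <= 2:
        return "a"
    if i <= 10:
        return "b"
    return "c"

ids = [1, 2, 3, 10, 11]          # hypothetical rows
sums = defaultdict(int)
for i in ids:
    sums[bucket(i)] += i         # GROUP BY mycol with SUM(id)

print(dict(sums))  # {'a': 3, 'b': 13, 'c': 11}
```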

6. String functions
    String concatenation
    concat — concatenates values of any type, with no separator
    select id, concat(name, age) from student;

    concat_ws — concatenates strings only, with a specified separator
    select concat_ws('_', 'asdfasd', '18');


    instr(string str, string substr) — returns the 1-based position of substr in str (0 if absent)
    select instr('candle', 'dle');

    length — string length
    lower / lcase — lowercase
    upper / ucase — uppercase
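The basic string functions map onto Python builtins as follows (a sketch of the semantics, not Hive code):

```python
s = "candle"

print("_".join(["asdfasd", "18"]))  # concat_ws('_', 'asdfasd', '18') -> asdfasd_18
print(s.find("dle") + 1)            # instr('candle', 'dle'): 1-based -> 4
print(len(s))                       # length('candle') -> 6
print(s.lower(), s.upper())         # lower/lcase and upper/ucase
```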


    Regular expressions #
    regexp_extract — extract by regular expression
    regexp_extract(str, regexp[, idx])

    Parameters:
    str — the string to process
    regexp — the regular expression used for matching
    idx — group index:
          0 — the whole match
          1 — the first parenthesized group
          2 — the second parenthesized group

     select regexp_extract('x=asdfasdf12345asdf', 'x=([a-z]+)([0-9]+)([a-z]+)', 3);



    regexp_replace — replace every match of a regular expression
     select regexp_replace('x=asdfasdf12345asdf3456345', '([0-9]+)', 'xxxx');
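Both behaviors can be sketched with Python's re module (Hive actually uses Java regular expressions, whose syntax differs in some corners; the function names here just mirror the Hive ones):

```python
import re

def regexp_extract(s, pattern, idx=0):
    """Group idx of the first match, or '' when there is no match."""
    m = re.search(pattern, s)
    return m.group(idx) if m else ""

def regexp_replace(s, pattern, replacement):
    """Replace every match of pattern in s."""
    return re.sub(pattern, replacement, s)

print(regexp_extract("x=asdfasdf12345asdf", r"x=([a-z]+)([0-9]+)([a-z]+)", 3))
# asdf
print(regexp_replace("x=asdfasdf12345asdf3456345", r"[0-9]+", "xxxx"))
# x=asdfasdfxxxxasdfxxxx
```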


    URL #
    parse_url — parses components out of a URL
    http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere

    select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'REF');

    Parts: [HOST — domain, PATH — path, QUERY — query string, REF — anchor/fragment, PROTOCOL — protocol, FILE — path plus query, AUTHORITY — host plus port, USERINFO — user info]

    Example output:
        hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'HOST');
        OK
        _c0
        www.candle.com
        Time taken: 0.07 seconds, Fetched: 1 row(s)

        hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'PATH');
        OK
        _c0
        /dir1/dir2/dir3/file.html
        Time taken: 0.093 seconds, Fetched: 1 row(s)

        hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'QUERY');
        OK
        _c0
        key1=value1&key2=value2
        Time taken: 0.051 seconds, Fetched: 1 row(s)

        hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'QUERY', 'key1');
        OK
        _c0
        value1
        Time taken: 0.105 seconds, Fetched: 1 row(s)

        hive (hadoop)> select parse_url('http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere', 'REF');
        OK
        _c0
        imhere
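For comparison, Python's standard urllib.parse extracts the same components from that URL:

```python
from urllib.parse import urlparse, parse_qs

url = "http://www.candle.com:50070/dir1/dir2/dir3/file.html?key1=value1&key2=value2#imhere"
p = urlparse(url)

print(p.hostname)                 # www.candle.com            (HOST)
print(p.path)                     # /dir1/dir2/dir3/file.html (PATH)
print(p.query)                    # key1=value1&key2=value2   (QUERY)
print(parse_qs(p.query)["key1"])  # ['value1']                (QUERY, 'key1')
print(p.fragment)                 # imhere                    (REF)
```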



    JSON  #

    {"store":{"fruit":\[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}

    
    {
        "store":
            {
                "fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],
                "bicycle":{"price":19.95,"color":"red"}
            } ,


        "email":"amy@only_for_json_udf_test.net",
        "owner":"amy"
    }


    hive (hadoop)> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}', '$.store.bicycle.price');
    OK
    _c0
    19.95


    hive (hadoop)> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"} }, "email":"amy@only_for_json_udf_test.net","owner":"amy"}', '$.store.fruit[0].type');
    OK
    _c0
    apple
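The two JSON paths correspond to plain dictionary/list indexing once the document is parsed, as this Python sketch shows:

```python
import json

doc = json.loads(
    '{"store":{"fruit":[{"weight":8,"type":"apple"},'
    '{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}},'
    '"email":"amy@only_for_json_udf_test.net","owner":"amy"}'
)

print(doc["store"]["bicycle"]["price"])  # $.store.bicycle.price -> 19.95
print(doc["store"]["fruit"][0]["type"])  # $.store.fruit[0].type -> apple
```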




7. Converting between wide and long tables  !!!
    explode — a table-generating function

    hive (hadoop)> select explode(city) from student1;

    Error:
    hive (hadoop)> select id, name, explode(city) from student1;
    FAILED: SemanticException [Error 10081]: UDTFs are not supported outside the SELECT clause, nor nested in expressions

    Arrays
        In a Hive query, a table-generating function cannot be selected alongside ordinary columns directly;
        you first generate a temporary view with LATERAL VIEW and join it to the row:
        select id, name, mycity from student1 lateral view explode(city) tmp_view AS mycity;

        SELECT id, name, mycity FROM student1 LATERAL VIEW explode(city) view_alias AS column_alias


    Key/value pairs (maps)
        For a map, explode generates two columns: key and value.

        > select id, name, mykey, myvalue from student3 LATERAL VIEW explode(pairs) tmp_view AS mykey, myvalue;

        OK
        id    name    mykey    myvalue
        1    candle    k1    v1
        1    candle    k2    v2
        2    jack    k1    v1
        2    jack    k2    v2
        3    tom    k1    v1
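What LATERAL VIEW explode does to arrays and maps can be sketched as nested loops in Python (one made-up row standing in for student1/student3):

```python
rows = [
    {"id": 1, "name": "candle", "city": ["shanghai", "beijing"],
     "pairs": {"k1": "v1", "k2": "v2"}},
]

# LATERAL VIEW explode(city): one output row per array element
array_rows = [(r["id"], r["name"], c) for r in rows for c in r["city"]]

# LATERAL VIEW explode(pairs): one output row per key/value pair
map_rows = [(r["id"], r["name"], k, v)
            for r in rows for k, v in r["pairs"].items()]

print(array_rows)  # [(1, 'candle', 'shanghai'), (1, 'candle', 'beijing')]
print(map_rows)    # [(1, 'candle', 'k1', 'v1'), (1, 'candle', 'k2', 'v2')]
```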

    Structs:
        Multiple columns
        hive (hadoop)> select explode(addr) from student2;
        FAILED: UDFArgumentException explode() takes an array or a map as a parameter

        Note:
        Struct members cannot be turned into rows directly; explode only accepts arrays and maps.
        In practice, turning each struct member into its own row is rarely meaningful anyway.

        Structs are usually accessed field by field instead:
        hive (hadoop)> select id,name, addr.city, addr.area, addr.streetid from student2;
        OK
        id    name    city    area    streetid
        1    candle    shanghai    minhang    35
        2    jack    beijing    haidian    100
        3    tom    hefei    shushan    103



        The following form is generally not used:
        hive (hadoop)> select id, name, info from student2 lateral view explode(array(addr.city, addr.area, addr.streetid)) tmp_view as info;
        OK
        id    name    info
        1    candle    shanghai
        1    candle    minhang
        1    candle    35
        2    jack    beijing
        2    jack    haidian
        2    jack    100
        3    tom    hefei
        3    tom    shushan
        3    tom    103


    Long to wide — aggregate functions with GROUP BY
      collect_list — merges multiple rows into one array; duplicate elements are kept
      select age, collect_list(name), collect_list(id) from student group by age;

      age    _c1    _c2
      18    ["candle","candle"]    [1,1]
      19    ["aa","jack"]    [10,2]
      20    ["c2","tom"]    [1,4]
      100    ["cc"]    [12]
      200    ["dd"]    [13]

  
      collect_set — same, but duplicates are removed

      select age, collect_set(name),  collect_set(id) from student group by age;

      age    _c1    _c2
        18    ["candle"]    [1]
        19    ["aa","jack"]    [10,2]
        20    ["c2","tom"]    [1,4]
        100    ["cc"]    [12]
        200    ["dd"]    [13]
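The difference between collect_list and collect_set can be sketched in Python (the rows are sample data loosely modeled on the output above):

```python
from collections import defaultdict

students = [(1, "candle", 18), (1, "candle", 18),
            (10, "aa", 19), (2, "jack", 19)]   # hypothetical (id, name, age) rows

lists, sets = defaultdict(list), defaultdict(set)
for sid, name, age in students:                # GROUP BY age
    lists[age].append(name)                    # collect_list keeps duplicates
    sets[age].add(name)                        # collect_set drops them

print(lists[18])         # ['candle', 'candle']
print(sorted(sets[18]))  # ['candle']
print(lists[19])         # ['aa', 'jack']
```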


        select map(id, name) from student;          -- build a key/value map
        select struct(id, name, age) from student;  -- build a struct


8. Window functions !!!
    Ranking within groups

    1) Create the table
       CREATE TABLE windows
       (
       id INT,
       class STRING,
       score INT
       )
       ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
       LINES TERMINATED BY '\n'
       STORED AS TEXTFILE;

    2) row_number
        partition by — the grouping column
        order by — the sort column
        SELECT class, score, row_number() over(partition by class order by score desc) as paiming from windows;


        Sequential numbering — tied scores do not share a position:
        class    score    paiming
        class1    78    1
        class1    78    2
        class1    66    3
        class1    66    4
        class1    66    5
        class1    56    6
        class1    56    7
        class1    56    8
        class1    56    9
        class1    56    10


    3) rank

        Tied scores share the same rank, and the next rank skips past the tied positions.

        SELECT class, score, rank() over(partition by class order by score desc) from windows;
        class    score    _wcol0
        class1    78    1
        class1    78    1
        class1    66    3
        class1    66    3
        class1    66    3
        class1    56    6
        class1    56    6
        class1    56    6
        class1    56    6
        class1    56    6


    4) dense_rank

        Tied scores share the same rank, and the next rank does not skip the tied positions.
        SELECT class, score, dense_rank() over(partition by class order by score desc) from windows;

        class    score    _wcol0
        class1    78    1
        class1    78    1
        class1    66    2
        class1    66    2
        class1    66    2
        class1    56    3
        class1    56    3
        class1    56    3
        class1    56    3
        class1    56    3
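The three numbering schemes can be sketched side by side in Python, using the class1 scores from the outputs above (already sorted descending, as within a partition):

```python
scores = [78, 78, 66, 66, 66, 56]

# row_number: strictly increasing position, ties broken arbitrarily
row_number = list(range(1, len(scores) + 1))

# rank: ties share a rank; the next rank skips the gap
rank = [1 + sum(s > x for s in scores) for x in scores]

# dense_rank: ties share a rank; no gaps
order = sorted(set(scores), reverse=True)
dense = [order.index(x) + 1 for x in scores]

print(row_number)  # [1, 2, 3, 4, 5, 6]
print(rank)        # [1, 1, 3, 3, 3, 6]
print(dense)       # [1, 1, 2, 2, 2, 3]
```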

    5) Grouped top-N
    The window column cannot be filtered in the same query directly:

    hive (hadoop)> SELECT class, score, row_number() over(partition by class order by score desc) as paiming from windows WHERE paiming <= 5;


    FAILED: SemanticException [Error 10004]: Line 1:110 Invalid table alias or column reference paiming: (possible column names are: id, class, score)


    Use a subquery instead:
    SELECT class, score from
        (SELECT class, score, row_number() over(partition by class order by score desc) as paiming from windows) t WHERE paiming <= 5;





    

 

Original source: https://www.cnblogs.com/alpha-cat/p/13184139.html