hive开窗开窗函数进阶

成长之路 2020-06-25

概述

开窗函数的理解参见: 理解hive中的开窗函数

over()中除了可以使用partition by选择分组字段外, 还有以下函数

  • order by 排序
  • 指定聚合行的范围, 配合order by使用
    • current row: 当前行
    • n PRECEDING: 往前 n 行数据
    • n FOLLOWING: 往后 n 行数据
    • UNBOUNDED PRECEDING 表示从前面的起点
    • UNBOUNDED FOLLOWING 表示到后面的终点
  • 生成新的列
    • LAG(col,n): 往后第 n 行数据
    • LEAD(col,n): 往前第 n 行数据
    • NTILE(n): 把有序分区中的行分发到指定数据的组中, 各个组有编号, 编号从 1 开始,对于每一行, NTILE 返回此行所属的组的编号。 注意: n 必须为 int 类型。
  • 生成新的列: Rank
    • RANK()排序相同时会重复, 总数不会变
    • DENSE_RANK() 排序相同时会重复, 总数会减少
    • ROW_NUMBER() 会根据顺序计算

在over()中使用排序

order by

排序

  • 未排序
    hive (default)> select name,orderdate,cost, sum(cost) over(partition by name) as sample from business;
    
    name    orderdate    cost    sample
    jack    2017-01-01    10    176
    jack    2017-02-03    23    176
    jack    2017-01-05    46    176
    jack    2017-04-06    42    176
    jack    2017-01-08    55    176
    mart    2017-04-13    94    299
    mart    2017-04-08    62    299
    mart    2017-04-09    68    299
    mart    2017-04-11    75    299
  • 排序
    hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate) as sample3 from business;
    
    name    orderdate    cost    sample3
    jack    2017-01-01    10    10
    jack    2017-01-05    46    56
    jack    2017-01-08    55    111
    jack    2017-02-03    23    134
    jack    2017-04-06    42    176
    mart    2017-04-08    62    62
    mart    2017-04-09    68    130
    mart    2017-04-11    75    205
    mart    2017-04-13    94    299

指定开窗聚合的行

rows between UNBOUNDED PRECEDING and CURRENT ROW

从起行到当前行的聚合, 而不是对行组内的所有行进行聚合, 配合order by使用

hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between UNBOUNDED PRECEDING and current row ) as sample from business;

name    orderdate    cost    sample
jack    2017-01-01    10    10
jack    2017-01-05    46    56
jack    2017-01-08    55    111
jack    2017-02-03    23    134
jack    2017-04-06    42    176
mart    2017-04-08    62    62
mart    2017-04-09    68    130
mart    2017-04-11    75    205
mart    2017-04-13    94    299
neil    2017-05-10    12    12
neil    2017-06-12    80    92
tony    2017-01-02    15    15
tony    2017-01-04    29    44
tony    2017-01-07    50    94

rows between 1 PRECEDING and current row

将上一行与本行聚合, 而不是将行组内所有行聚合

hive (default)> select name,orderdate,cost, sum(cost) over(partition by name order by orderdate rows between 1 PRECEDING and current row) as sample  from business;

jack    2017-01-01    10    10
jack    2017-01-05    46    56
jack    2017-01-08    55    101
jack    2017-02-03    23    78
jack    2017-04-06    42    65
mart    2017-04-08    62    62
mart    2017-04-09    68    130
mart    2017-04-11    75    143
mart    2017-04-13    94    169
neil    2017-05-10    12    12
neil    2017-06-12    80    92
tony    2017-01-02    15    15
tony    2017-01-04    29    44
tony    2017-01-07    50    79

rows between 1 PRECEDING AND 1 FOLLOWING

当前行和前一行及后面一行聚合

rows between current row and UNBOUNDED FOLLOWING

当前行及后面所有行

rows between current row and UNBOUNDED FOLLOWING

当前行及后面所有行

生成新的列: LAG, LEAD, NTILE

LAG(col, n)

第三个参数用于代替"NLL"

select
    name,orderdate,cost,
    lag(orderdate,1,‘1900-01-01‘) over(partition by name order by orderdate ) as time1,
    lag(orderdate,2) over (partition by name order by orderdate) as time2
from
    business;

结果: time1将orderdate列往后移一行, time2往后移两行

name    orderdate    cost    time1    time2
jack    2017-01-01    10    1900-01-01    NULL
jack    2017-01-05    46    2017-01-01    NULL
jack    2017-01-08    55    2017-01-05    2017-01-01
jack    2017-02-03    23    2017-01-08    2017-01-05
jack    2017-04-06    42    2017-02-03    2017-01-08
mart    2017-04-08    62    1900-01-01    NULL
mart    2017-04-09    68    2017-04-08    NULL
mart    2017-04-11    75    2017-04-09    2017-04-08
mart    2017-04-13    94    2017-04-11    2017-04-09
neil    2017-05-10    12    1900-01-01    NULL
neil    2017-06-12    80    2017-05-10    NULL
tony    2017-01-02    15    1900-01-01    NULL
tony    2017-01-04    29    2017-01-02    NULL
tony    2017-01-07    50    2017-01-04    2017-01-02

LEAD(col, n)

同时, 只是往前移指定的行数

NTILE(n)

需要配合order by使用

select * from (
    select
        name,orderdate,cost,
        ntile(5) over() sorted
    from
        business
) t;

结果: ntitle(5), 增加一列(组标签), 用于分组, 这里为5组

t.name    t.orderdate    t.cost    t.sorted
jack    2017-01-01    10    1
tony    2017-01-02    15    1
jack    2017-02-03    23    1
tony    2017-01-04    29    2
jack    2017-01-05    46    2
jack    2017-04-06    42    2
tony    2017-01-07    50    3
jack    2017-01-08    55    3
mart    2017-04-08    62    3
mart    2017-04-09    68    4
neil    2017-05-10    12    4
mart    2017-04-11    75    4
neil    2017-06-12    80    5
mart    2017-04-13    94    5

如果分组是行数数以n有余数, 则从上到下每组增加一行, 直到所以行都有组标签.

生成新的列: Rank

RANK() 排序相同时会重复, 总数不会变
DENSE_RANK() 排序相同时会重复, 总数会减少
ROW_NUMBER() 会根据顺序计算

select name,
subject,
score,
rank() over(partition by subject order by score desc) rp,
dense_rank() over(partition by subject order by score desc) drp,
row_number() over(partition by subject order by score desc) rmp
from score;
name subject score rp drp rmp
sk 数学 95 1 1 1
zs 数学 86 2 2 2
ls 数学 85 3 3 3
ww 数学 56 4 4 4

zs 英语 84 1 1 1
ww 英语 84 1 1 2
ls 英语 78 3 2 3
sk 英语 68 4 3 4

ww 语文 94 1 1 1
sk 语文 87 2 2 2
ls 语文 65 3 3 3
zs 语文 64 4 4 4

相关推荐