victorzhzh 2013-11-10
Hive中在做多表关联时,由于Hive的SQL优化引擎还不够强大,表的关联顺序不同往往导致产生不同数量的MapReduce作业数。这时就需要通过分析执行计划对SQL进行调整,以获得最少的MapReduce作业数。举一个例子(案例基于Hive 0.6.0):
create table ljn1(
k1 bigint,
k2 String,
v1 int
);
create table ljn2(
k1 bigint,
v2 int
);
create table ljn3(
k1 bigint,
v3 int
);
create table ljn4(
k1 bigint,
v4 int
);
create table ljn5(
k1 bigint,
v5 int
);
create table ljn6(
k2 string,
v6 int
);
然后看一下下面这个SQL的执行计划:
explain
select a.v1
from
ljn1 a
left outer join ljn2 b on (a.k1 = b.k1)
left outer join ljn3 c on (a.k1 = c.k1)
left outer join ljn4 d on (a.k1 = d.k1)
left outer join ljn6 e on (a.k2 = e.k2)
left outer join ljn5 f on (a.k1 = f.k1);
STAGE DEPENDENCIES:
Stage-5 is a root stage
Stage-1 depends on stages: Stage-5
Stage-2 depends on stages: Stage-1
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-5
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 0
value expressions:
expr: k1
type: bigint
expr: k2
type: string
expr: v1
type: int
b
TableScan
alias: b
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 1
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
1
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.Hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
$INTNAME
Reduce Output Operator
key expressions:
expr: _col0
type: bigint
sort order: +
Map-reduce partition columns:
expr: _col0
type: bigint
tag: 0
value expressions:
expr: _col1
type: string
expr: _col2
type: int
c
TableScan
alias: c
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 1
d
TableScan
alias: d
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 2
f
TableScan
alias: f
Reduce Output Operator
key expressions:
expr: k1
type: bigint
sort order: +
Map-reduce partition columns:
expr: k1
type: bigint
tag: 3
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
Left Outer Join0 to 2
Left Outer Join0 to 3
condition expressions:
0 {VALUE._col3} {VALUE._col4}
1
2
3
handleSkewJoin: false
outputColumnNames: _col3, _col4
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.SequenceFileInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Stage: Stage-2
Map Reduce
Alias -> Map Operator Tree:
$INTNAME
Reduce Output Operator
key expressions:
expr: _col3
type: string
sort order: +
Map-reduce partition columns:
expr: _col3
type: string
tag: 0
value expressions:
expr: _col4
type: int
e
TableScan
alias: e
Reduce Output Operator
key expressions:
expr: k2
type: string
sort order: +
Map-reduce partition columns:
expr: k2
type: string
tag: 1
Reduce Operator Tree:
Join Operator
condition map:
Left Outer Join0 to 1
condition expressions:
0 {VALUE._col10}
1
handleSkewJoin: false
outputColumnNames: _col10
Select Operator
expressions:
expr: _col10
type: int
outputColumnNames: _col0
File Output Operator
compressed: true
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
常规来讲,这个SQL非常简单,a表是主表,与其他表左外关联用到了k1和k2两个关联键,使用两个MapReduce作业完全可以搞定。但是这个SQL的执行计划却给出了3个作业:(Stage-0用做数据的最终展示,该作业可以忽略不计)第1个作业(Stage-5)是a表与b表关联;第2个作业(Stage-1)是第1个作业的中间结果再与c、d、f三表关联;第3个作业(Stage-2)是第2个作业的中间结果再与e表关联。
Hive 的详细介绍:请点这里
Hive 的下载地址:请点这里
相关阅读: