herryyy 2012-12-18
EXPLAIN [EXTENDED] query
EXTENDED参数:输出执行计划中操作符的额外信息;通常,展示物理信息,如文件名等
hive查询转换为一个 有向无环图 的阶段序列;这些阶段可能是 Map/Reduce阶段 或者是执行元数据与文件操作(例如:重命名,移动); explain 输出包括三部分:
阶段描述信息以操作符和与其相关元数据来显示 操作序列;操作符元数据有以下东西组成,像 FilterOperator 的过滤表达式;SelectOperator 的 选择表达式;FileSinkOperator 的输出文件名
执行计划语法介绍到此结束,下面给出一个例子。
EXPLAIN FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;
ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) (TOK_GROUPBY (TOK_COLREF src key))))
STAGE DEPENDENCIES: Stage-1 is a root stage Stage-2 depends on stages: Stage-1 Stage-0 depends on stages: Stage-2
stage-1 是 root 阶段
stage-2在stage-1执行完后执行
stage-0在stage-2执行结束后执行
STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: src Reduce Output Operator key expressions: expr: key type: string sort order: + Map-reduce partition columns: expr: rand() type: double tag: -1 value expressions: expr: substr(value, 4) type: string Reduce Operator Tree: Group By Operator aggregations: expr: sum(UDFToDouble(VALUE.0)) keys: expr: KEY.0 type: string mode: partial1 File Output Operator compressed: false table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.mapred.SequenceFileOutputFormat name: binary_table Stage: Stage-2 Map Reduce Alias -> Map Operator Tree: /tmp/hive-zshao/67494501/106593589.10001 Reduce Output Operator key expressions: expr: 0 type: string sort order: + Map-reduce partition columns: expr: 0 type: string tag: -1 value expressions: expr: 1 type: double Reduce Operator Tree: Group By Operator aggregations: expr: sum(VALUE.0) keys: expr: KEY.0 type: string mode: final Select Operator expressions: expr: 0 type: string expr: 1 type: double Select Operator expressions: expr: UDFToInteger(0) type: int expr: 1 type: double File Output Operator compressed: false table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe name: dest_g1 Stage: Stage-0 Move Operator tables: replace: true table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe name: dest_g1
上例中,包含两个map/reduce 阶段(stage-1,stage-2),一个文件系统相关阶段(stage-0).stage-0将结果从临时目录移动到dest_g1表相应的目录下;
分析作业执行过程,优化作业执行流程,提升作业执行效率;例如,数据过滤条件从reduce端提前到map端,有效减少map/reduce间shuffle数据量,提升作业执行效率;
提前过滤数据数据集,减少不必要的读取操作;例如: hive join 操作先于 where 条件顾虑,将 分区条件放入 on语句中,能够有效减少 输入数据集;