hive 执行计划

herryyy 2012-12-18

hive执行计划语法

EXPLAIN [EXTENDED] query

EXTENDED参数:输出执行计划中操作符的额外信息;通常,展示物理信息,如文件名等

hive查询转换为一个 有向无环图 的阶段序列;这些阶段可能是 Map/Reduce阶段 或者是执行元数据与文件操作(例如:重命名,移动); explain 输出包括三部分:

  1. 查询语句的抽象语法树
  2. 执行计划不同阶段间的依赖关系
  3. 每个阶段的描述

阶段描述信息以操作符和与其相关元数据来显示 操作序列;操作符元数据有以下东西组成,像 FilterOperator 的过滤表达式;SelectOperator 的 选择表达式;FileSinkOperator 的输出文件名

执行计划语法介绍到此结束,下面给出一个例子。

执行计划示例

EXPLAIN
FROM src INSERT OVERWRITE TABLE dest_g1 SELECT src.key, sum(substr(src.value,4)) GROUP BY src.key;

 

执行计划输出如下:

抽象语法树:

ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF src)) (TOK_INSERT (TOK_DESTINATION (TOK_TAB dest_g1)) (TOK_SELECT (TOK_SELEXPR (TOK_COLREF src key)) (TOK_SELEXPR (TOK_FUNCTION sum (TOK_FUNCTION substr (TOK_COLREF src value) 4)))) (TOK_GROUPBY (TOK_COLREF src key))))

 

阶段依赖关系图:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-0 depends on stages: Stage-2

stage-1 是 root 阶段

stage-2在stage-1执行完后执行

stage-0在stage-2执行结束后执行

各阶段执行计划 

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src
            Reduce Output Operator
              key expressions:
                    expr: key
                    type: string
              sort order: +
              Map-reduce partition columns:
                    expr: rand()
                    type: double
              tag: -1
              value expressions:
                    expr: substr(value, 4)
                    type: string
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(UDFToDouble(VALUE.0))
          keys:
                expr: KEY.0
                type: string
          mode: partial1
          File Output Operator
            compressed: false
            table:
                input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                output format: org.apache.hadoop.mapred.SequenceFileOutputFormat
                name: binary_table

  Stage: Stage-2
    Map Reduce
      Alias -> Map Operator Tree:
        /tmp/hive-zshao/67494501/106593589.10001
          Reduce Output Operator
            key expressions:
                  expr: 0
                  type: string
            sort order: +
            Map-reduce partition columns:
                  expr: 0
                  type: string
            tag: -1
            value expressions:
                  expr: 1
                  type: double
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: sum(VALUE.0)
          keys:
                expr: KEY.0
                type: string
          mode: final
          Select Operator
            expressions:
                  expr: 0
                  type: string
                  expr: 1
                  type: double
            Select Operator
              expressions:
                    expr: UDFToInteger(0)
                    type: int
                    expr: 1
                    type: double
              File Output Operator
                compressed: false
                table:
                    input format: org.apache.hadoop.mapred.TextInputFormat
                    output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                    serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                    name: dest_g1

  Stage: Stage-0
    Move Operator
      tables:
            replace: true
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
                name: dest_g1

 上例中,包含两个map/reduce 阶段(stage-1,stage-2),一个文件系统相关阶段(stage-0).stage-0将结果从临时目录移动到dest_g1表相应的目录下;

一个Map/Reduce阶段有两部分组成:

  1. 从表别名到Map Operator Tree映射:该映射通知Mapper,其操作树被调用来处理特定表的行或者前一个Map/Reduce阶段的输出数据;上例stage-1中,src表的行被以Reduce Output Operator为根的操作符树处理;在stage-2中,stage-1输出行被stage2中以Reduce Output Operator为根的操作符树处理;这两个Reduce Output Operator 根据元数据中展示条件 分区数据到Reducers
  2. Reduce Operator Tree:处理Map/Reduce Job 中Reducers所有数据行;在stage-1中, Reducer Operator Tree执行部分聚合;在stage-2中, Reducer Operator Tree从stage-1的部分聚合结果计算最终聚合结果

hive执行计划作用

分析作业执行过程,优化作业执行流程,提升作业执行效率;例如,数据过滤条件从reduce端提前到map端,有效减少map/reduce间shuffle数据量,提升作业执行效率;

提前过滤数据数据集,减少不必要的读取操作;例如: hive join 操作先于 where 条件顾虑,将 分区条件放入 on语句中,能够有效减少 输入数据集;

相关推荐