hive-1 Introduction to managed tables (partitioned tables, bucketed tables)

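1 Managed tables:

Hive maps each table you create to an actual HDFS directory structure and its files. If dropping a table in Hive also removes the corresponding HDFS directory and files,

the table is called a managed table.

In this post, managed tables include internal tables, partitioned tables, and bucketed tables.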

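2 Partitioned tables:

a) A partitioned table stores its data in separate files on disk. Hive manages and optimizes each partition on its own,

which ultimately speeds up queries.

For example, each year's data can go into its own Hive directory, which often matches a business requirement as well.
b) What a partition means: the data is divided into separate sections, and in Hive each section is a separate subdirectory.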

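c) Why create a partitioned table: given 100 MB of student records, finding the students in class 1 without partitioning
means scanning all 100 MB. With partitioning, Hive only has to look at the files in the HDFS directory for class 1.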

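d) Partition columns:

d.1) A partition column is the identifier used in the directory name.

d.2) In normal use, a partition column behaves like an ordinary column, but it is not stored in the data files;

it exists only as a virtual column.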

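e) The downside of too many partitions:
With too many partitions, Hive has to walk the directory tree level by level when scanning, which raises the scan cost, and at run time the job ends up with more map tasks.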

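f) Which columns to partition by:
f.1) Columns that are frequently used in queries, such as region or time.
f.2) Columns that don't produce too many files after partitioning. Name or ID, for example, should not be used as partition columns because they would create a huge number of partitions.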

g) Operations:

g.1) Creating and querying a table with a single partition column:

Files on the local Linux machine:

/root/test/11/student
Contents:
1,zhangsan
2,lisi
/root/test/12/student1
Contents:
3,wangwu
4,zhaoliu
 

Create the partitioned table in Hive:
create table student(id int, name string) partitioned by(grade int) row format delimited fields terminated by ',';


Load the data into Hive:
hive (default)>load data local inpath 'test/11/' into table student partition(grade=11); 
hive (default)>load data local inpath 'test/12/' into table student partition(grade=12); 

Query the Hive table:
hive (default)> select * from student where grade=12;
OK
id      name    grade
3       wangwu  12
4       zhaoliu 12
Time taken: 1.259 seconds
hive (default)> select * from student where grade=11;
OK
id      name    grade
1       zhangsan        11
2       lisi    11
Time taken: 0.218 seconds

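HDFS layout after partitioning (the original screenshot is not preserved): each partition is stored in its own subdirectory of the table directory, named grade=11 and grade=12.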
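
To make the directory idea concrete, here is a small Python sketch (not Hive code) of how a partition spec maps to a directory name and why a filter on the partition column limits the directories that have to be read. The warehouse path /user/hive/warehouse/student is an assumption based on the default location used later in this post, and the helper names are made up for illustration.

```python
import posixpath


def partition_dir(table_dir, spec):
    """Build a partition directory from ordered (column, value) pairs."""
    parts = ["{}={}".format(col, val) for col, val in spec]
    return posixpath.join(table_dir, *parts)


def dirs_to_scan(table_dir, partitions, wanted):
    """Keep only the partition directories that match the filter."""
    return [
        partition_dir(table_dir, spec)
        for spec in partitions
        if all(dict(spec).get(col) == val for col, val in wanted.items())
    ]


# Assumed warehouse location; the real path depends on the Hive configuration.
table_dir = "/user/hive/warehouse/student"
partitions = [[("grade", 11)], [("grade", 12)]]

# A query with "where grade=12" only needs the grade=12 directory.
print(dirs_to_scan(table_dir, partitions, {"grade": 12}))
```

With two partition columns the same helper yields nested directories such as year=2014/month=1.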

g.2) Loading and querying with multiple partition columns:

create table member(id int, name string) partitioned by(year int, month int) row format delimited fields terminated by '\t';
-- stored in HDFS under member/year=2014/month=1/
load data local inpath '/usr/local/data/user4' into table member partition(year=2014, month=1);
-- stored in HDFS under member/year=2014/month=2/
load data local inpath '/usr/local/data/user5' into table member partition(year=2014, month=2);



hive> select * from member where year=2014 and month=1;
OK  (the partition columns year and month are shown in the output as well)
1       zhangsan        2014    1
2       lisi    2014    1
3       wangwu  2014    1
4       zhaoliu 2014    1

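g.3) Drawback:

Once a partition column is chosen, the data can end up heavily skewed across partitions. Query time is then dominated by the largest partitions, which hurts the efficiency of the whole job.
For example, if Taobao partitioned orders by the user's province, Beijing alone would have far more ordering users than remote provinces such as Qinghai and Tibet combined.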

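g.4) Dynamic partitioning:

Without dynamic partitioning, when table A has many partitions and its data must be loaded into table B with the same partition structure,

you write:

insert into table t8 partition(class="job1",city="beijing") select * from t3 where class="job1" and city="beijing";
One such statement is needed per partition, which is tedious.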

To enable dynamic partitioning:
hive>set hive.exec.dynamic.partition=true;
hive>set hive.exec.dynamic.partition.mode=nonstrict;

hive>insert into table t8 partition(class,city) select keys,class,city from t3;
hive>show partitions t8;
The output shows that the data from t3 has been loaded into t8, partitioned by class and city.
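
The effect of a dynamic-partition insert can be pictured with a short Python sketch (not Hive code): the trailing columns of each selected row decide which partition the row lands in, so a single statement covers every combination. The sample rows for t3 are hypothetical and the function name is made up for illustration.

```python
from collections import defaultdict


def dynamic_partition_insert(rows, num_partition_cols):
    """Group rows by their trailing partition-column values."""
    partitions = defaultdict(list)
    for row in rows:
        data = row[:-num_partition_cols]   # columns stored in the data files
        key = row[-num_partition_cols:]    # values that pick the partition
        partitions[key].append(data)
    return dict(partitions)


# Hypothetical rows from t3 in the order (keys, class, city).
t3 = [
    ("k1", "job1", "beijing"),
    ("k2", "job1", "shanghai"),
    ("k3", "job2", "beijing"),
    ("k4", "job1", "beijing"),
]

t8 = dynamic_partition_insert(t3, num_partition_cols=2)
for (cls, city), data in sorted(t8.items()):
    print("class={}/city={}: {}".format(cls, city, data))
```

This mirrors why the partition columns come last in the select list, in the same order as in the partition clause.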

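Additional notes on partitioned tables:

1) Partition values should be well-formed values that don't need escaping:

Hive doesn't enforce a format on partition values, but in practice the values should follow a clean, consistent format.

If a partition value is something like 2015/11/12, or contains illegal characters, symbols prone to garbling, or Chinese characters, Hive escapes it automatically. To query that partition you then have to look up the escaped name in the HDFS directory

and paste it into the Hive command line; even then, the query may return wrong results or lose precision.

For example:
create table student(id int, name string) partitioned by(day string) row format delimited fields terminated by '\t';

-- problem: the value 2015/11/12 gets escaped
load data local inpath 'test/11/2.txt' into table student partition(day="2015/11/12");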
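
As a rough illustration of the escaping, the Python sketch below percent-encodes characters that cannot appear in a directory name. The character set is an assumption covering only some ASCII characters, not Hive's exact list, and it does not model how non-ASCII text is handled.

```python
# Assumption: an illustrative subset of characters that get escaped in
# partition directory names; Hive's real list may differ.
UNSAFE = set('"#%\'*/:=?\\{[]^')


def escape_partition_value(value):
    """Percent-encode unsafe characters, e.g. '/' becomes '%2F'."""
    out = []
    for ch in value:
        if ch in UNSAFE or ord(ch) < 0x20 or ord(ch) == 0x7F:
            out.append("%{:02X}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


print("day=" + escape_partition_value("2015/11/12"))  # slashes are escaped
print("day=" + escape_partition_value("2015-11-12"))  # left unchanged
```

A value such as 2015-11-12 passes through untouched, which is why dash-separated dates are the safer choice for partition values.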

2) List the partitions of a table:

hive>show partitions student;

3) Query the data in a specific partition:

select * from t2 where day="2015-11-12";

4) Add a new partition:

hdfs dfs -mkdir /user/hive/warehouse/student/day=2015-08-09/
hdfs dfs -put install.log /user/hive/warehouse/student/day=2015-08-09/
hive>alter table student add partition(day="2015-08-09");
The alter table statement registers the new partition with the student table.

Or load directly:
hive (default)>load data local inpath 'install.log' into table student partition(day="2015-08-09");

5) Drop a partition:

hive>alter table student drop partition(day="2015-08-09");
Once the partition is dropped, the data in that partition is gone as well.

6) Protect a partition from being dropped:

hive> alter table student partition(day="2015-08-09") enable no_drop;
Make the partition droppable again:
hive> alter table student partition(day="2015-08-09") disable no_drop;

7) Block queries on a partition:
hive> alter table student partition(day="2015-08-09") enable offline;
At this point only the other partitions can be queried; a query over the whole table fails as well.

Restore access:
hive> alter table student partition(day="2015-08-09") disable offline;

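8) Disallow full table scans: production data is usually large, so this setting prevents pointless I/O and resource usage

hive>set hive.mapred.mode=strict;
The value is either strict or nonstrict. In strict mode a query on a partitioned table must filter on a partition column, so full table scans are rejected.

Taking partitions offline as in 7) can also achieve this.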

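3 Bucketed tables:

a) Concept: a bucketed table hashes the data and stores it in different files. When data is loaded into a bucketed table, Hive hashes one column (the one named in clustered by(xx) when the table is created), takes the hash modulo the number of buckets, and writes each row to the corresponding file.

Hive launches a MapReduce job to produce the data. The number of reduce tasks in that job equals the number of buckets, and each reduce task produces one file.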
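
The hash-and-modulo step can be sketched in Python as follows. This is a simplified model rather than Hive's code: it assumes the hash of an int column is the int itself and that the sign bit is masked off before the modulo. The rows come from the stu example in this post.

```python
def bucket_for(id_value, num_buckets):
    """Pick a bucket index; assumes an int hashes to itself."""
    return (id_value & 0x7FFFFFFF) % num_buckets


stu = [(1, "zhangsan"), (2, "lisi"), (3, "wangwu"), (4, "zhaoliu")]
num_buckets = 4

# One list per bucket stands in for one output file per reduce task.
buckets = {b: [] for b in range(num_buckets)}
for row in stu:
    buckets[bucket_for(row[0], num_buckets)].append(row)

for b in range(num_buckets):
    print(b, buckets[b])
```

Rows with the same id always land in the same bucket, which is what makes bucket-based sampling and joins possible.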

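b) Use cases:

Suitable for: sampling queries and join queries
Not suitable for: looking up data by business criteria (rows are placed by hash, which has nothing to do with the business meaning)

c) Similarities to and differences from partitioned tables:

Similarity: both are ways of dividing up the data
Difference: a partitioned table divides data by business meaning, while a bucketed table ignores the business fields and divides the data purely by hash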

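d) Operations:

Create the table:
	-- 4 buckets: the hash of id modulo 4 decides which file a row is stored in
	create table bucket_table(id int, name string) clustered by(id) into 4 buckets;
Load data:
	-- enable bucketing (off by default)
	set hive.enforce.bucketing = true;
	-- id is hashed and each row is placed in the matching bucket; by contrast, loading into a partitioned table just copies the files into Hive
	insert overwrite table bucket_table select id, name from stu;

Data must reach a bucketed table from another Hive table through a MapReduce job that writes into the already created bucketed table; the steps above are that loading process.

The bucketing column is fixed when the table is created, and tablesample relies on it.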

Result:

Data in stu:
1       zhangsan
2       lisi
3       wangwu
4       zhaoliu
1       zhangsan
2       lisi
3       wangwu
4       zhaoliu
1       zhangsan
2       lisi
3       wangwu
4       zhaoliu

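Note:
Physically, each bucket is one file in the table (or partition) directory.
The number of buckets (output files) a job produces equals the number of reduce tasks.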


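Sampling a bucketed table. The column in "on id" seems to be fixed: it is the column the table was clustered by (Hive also accepts another column or rand() there, but then it has to scan the whole table).
select * from bucket_table tablesample(bucket 1 out of 4 on id);
tablesample is the sampling clause.
Syntax: TABLESAMPLE(BUCKET x OUT OF y)
y must be a multiple or a factor of the table's total number of buckets.
Hive determines the sampling ratio from y.
For example, if the table has 64 buckets, y=32 samples (64/32=) 2 buckets of data, and y=128 samples (64/128=) 1/2 of a bucket. x is the bucket to start sampling from.
For example, if the table has 32 buckets, tablesample(bucket 3 out of 16) samples (32/16=) 2 buckets in total: the 3rd bucket and the (3+16=) 19th bucket.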
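
The arithmetic in these examples can be checked with a small Python sketch. It only models the rule described above (which buckets are picked and what fraction of each is read) and is not how Hive implements sampling; the function name is made up for illustration.

```python
from fractions import Fraction


def sampled_buckets(total, x, y):
    """Return (bucket numbers, fraction of each bucket) for BUCKET x OUT OF y."""
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if total % y != 0 and y % total != 0:
        raise ValueError("y must be a multiple or a factor of the bucket count")
    if y <= total:
        # Whole buckets: start at x and step by y.
        return list(range(x, total + 1, y)), Fraction(1)
    # y larger than the bucket count: only part of a single bucket is read.
    return [((x - 1) % total) + 1], Fraction(total, y)


print(sampled_buckets(32, 3, 16))   # buckets 3 and 19, read in full
print(sampled_buckets(64, 1, 32))   # two whole buckets
print(sampled_buckets(64, 1, 128))  # half of one bucket
```

For the 4-bucket table in this post, bucket 1 out of 4 reads exactly one bucket file.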

(The original screenshot of one bucket file is not preserved.)

The contents of the other three files are not shown here ....

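4 External tables:

You only need to point the table at a directory, which makes external tables quite flexible.

External table:

create external table ext_table(c1 string, c2 string) row format delimited fields terminated by '\t' location '/files';

1 The external keyword marks the table as external. '/files' links the table to the data in the files hello and hello1 under the files directory at the root of HDFS.
2 location specifies where the data lives; it can only point to a directory.
3 Dropping an external table doesn't damage the files in HDFS.

Data after creation (the files are comma-separated while the table expects tabs, so each whole line ends up in c1 and c2 is NULL):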

hive> select * from ext_table;
OK
1,zhangsan      NULL
2,lisi  NULL
3,wangwu        NULL
1,45    NULL
2,56    NULL
3,89    NULL

After the table is dropped, the files hello and hello1 under /files in HDFS still exist (the original screenshot is not preserved).
