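When you create a table in HAWQ, you should decide in advance how the data is distributed, which storage options the table uses, how data is loaded and unloaded, and which other HAWQ features apply, because all of these choices have a major impact on database performance. Understanding what the available options mean and how to use them in the database helps you make the right choices.

db1=# create table t1 (a int) with
db1-# (appendonly=true,
db1(# blocksize=8192,
db1(# orientation=row,
db1(# compresstype=zlib,
db1(# compresslevel=1,
db1(# fillfactor=50,
db1(# oids=false);
CREATE TABLE

Besides specifying storage options at the table level, HAWQ also supports setting storage options on a specific partition or subpartition. The following statement uses a WITH clause on a specific subpartition to specify the storage properties of that partition.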
db1=# create table sales
db1-# (id int, year int, month int, day int,region text)
db1-# distributed by (id)
db1-# partition by range (year)
db1-# subpartition by range (month)
db1-# subpartition template (
db1(# start (1) end (13) every (1),
db1(# default subpartition other_months )
db1-# subpartition by list (region)
db1-# subpartition template (
db1(# subpartition usa values ('usa') with
db1(# (appendonly=true,
db1(# blocksize=8192,
db1(# orientation=row,
db1(# compresstype=zlib,
db1(# compresslevel=1,
db1(# fillfactor=50,
db1(# oids=false),
db1(# subpartition europe values ('europe'),
db1(# subpartition asia values ('asia'),
db1(# default subpartition other_regions)
db1-# ( start (2002) end (2010) every (1),
db1(# default partition outlying_years);
...
CREATE TABLE

The storage options supported by HAWQ are described below.
db1=# create table t1(a int) with (appendonly=true);
CREATE TABLE
db1=# create table t2(a int) with (appendonly=false);
ERROR: tablespace "dfs_default" does not support heap relation
db1=# create table t1(a int) with (blocksize=8192);
ERROR: invalid option 'blocksize' for base relation. Only valid for Append Only relations
db1=# create table t1(a int) with (appendonly=true,blocksize=8192);
CREATE TABLE
db1=# create table t2(a int) with (appendonly=true,blocksize=8192,orientation=parquet);
ERROR: invalid option 'blocksize' for parquet table
db1=# create table t2(a int) with (appendonly=true,blocksize=8192,orientation=row);
CREATE TABLE
db1=# create table t1(a int) with (bucketnum=1) distributed by (a);
CREATE TABLE
db1=# create table t1(a int) with (orientation=parquet);
ERROR: invalid option "orientation" for base relation. Only valid for Append Only relations
db1=# create table t1(a int) with (orientation=parquet,appendonly=true);
CREATE TABLE

Older versions of HAWQ also supported a column format. It is obsolete and no longer supported in 2.1.1, and the Parquet storage format should be used instead.
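db1=# create table t1(a int) with (orientation=column,appendonly=true);
ERROR: Column oriented tables are deprecated. Not support it any more.

The row format is very efficient for reads of the full-table-scan type. Row-oriented storage is mainly suitable when inserts are frequent, when the SELECT list or WHERE clause covers all or most of the table's columns, and when the total length of all columns in a row is relatively small, which fits OLTP-style workloads. The column-oriented parquet format is more efficient for large queries and suits data warehouse applications. Evaluate performance against your actual data and queries and choose the storage model that fits best. Conversion between the row and parquet formats has to be done by the user's application; HAWQ does not perform it.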
db1=# create table t1(a int) with (compresstype=zlib);
ERROR: invalid option 'compresstype' for base relation. Only valid for Append Only relations
db1=# create table t1(a int) with (compresstype=zlib,appendonly=true);
CREATE TABLE
db1=# create table t2(a int) with (compresstype=zlib,appendonly=true,orientation=parquet);
ERROR: parquet table doesn't support compress type: 'zlib'
db1=# create table t2(a int) with (compresstype=snappy,appendonly=true,orientation=parquet);
CREATE TABLE
db1=# create table t1(a int) with (compresstype=snappy,compresslevel=1);
ERROR: invalid option 'compresslevel' for compresstype 'snappy'.
db1=# create table t1(a int) with (compresslevel=1);
ERROR: invalid option 'compresslevel' for base relation. Only valid for Append Only relations
db1=# create table t1(a int) with (compresslevel=1,appendonly=true);
CREATE TABLE
db1=# create table t1(a int) with (fillfactor=100,orientation=parquet);
ERROR: invalid option "orientation" for base relation. Only valid for Append Only relations
db1=# create table t1(a int) with (fillfactor=100);
CREATE TABLE
db1=# create table t1(a int) with (pagesize=1024,rowgroupsize=1024,orientation=parquet);
ERROR: row group size for parquet table must be larger than pagesize. Got rowgroupsize: 1024, pagesize 1024
db1=# create table t1(a int) with (pagesize=1024,rowgroupsize=8096,orientation=parquet);
ERROR: invalid option "orientation" for base relation. Only valid for Append Only relations
db1=# create table t1(a int) with (pagesize=1024,rowgroupsize=8096,orientation=row);
ERROR: invalid option 'pagesize' for non-parquet table
db1=# create table t1(a int) with (pagesize=1024,rowgroupsize=8096,orientation=parquet,appendonly=true);
CREATE TABLE
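The accept/reject behaviour seen in the sessions above can be summarised as a small rule set. The following Python sketch is only a reading aid: the function name `check_storage_options` and the reason strings are made up for illustration, it encodes only the combinations actually exercised above, and it is not HAWQ's real validation logic (the order in which HAWQ reports errors can differ).

```python
def check_storage_options(opts):
    """Return None if the option set would be accepted, else a reason string.

    Hypothetical helper: it mirrors only the WITH-clause combinations shown
    in the psql sessions above, not HAWQ's actual option checking.
    """
    ao = opts.get("appendonly")
    orientation = opts.get("orientation", "row")
    # Options that were rejected unless appendonly=true was given explicitly.
    ao_only = {"blocksize", "orientation", "compresstype", "compresslevel"}

    if ao is False:
        return "heap tables are not supported"
    used = ao_only.intersection(opts)
    if used and ao is not True:
        return f"{sorted(used)[0]} is only valid with appendonly=true"
    if orientation == "column":
        return "column orientation is no longer supported"
    if orientation == "parquet":
        if "blocksize" in opts:
            return "blocksize is invalid for parquet tables"
        if opts.get("compresstype") == "zlib":
            return "parquet tables do not support zlib"
    elif "pagesize" in opts or "rowgroupsize" in opts:
        return "pagesize/rowgroupsize are only valid for parquet tables"
    if opts.get("compresstype") == "snappy" and "compresslevel" in opts:
        return "compresslevel is invalid with snappy"
    if "pagesize" in opts and "rowgroupsize" in opts:
        if opts["rowgroupsize"] <= opts["pagesize"]:
            return "rowgroupsize must be larger than pagesize"
    return None


if __name__ == "__main__":
    print(check_storage_options({"appendonly": True, "blocksize": 8192}))
    print(check_storage_options({"blocksize": 8192}))
```

Combinations that do not appear in the sessions (for example snappy with row orientation) are simply not modelled, so a `None` result for them says nothing about what HAWQ would do.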
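II. Data Distribution Policies
First, note that the data distribution policy discussed here does not directly determine where data is physically stored; the location of data blocks is decided by HDFS. The concept of a distribution policy is inherited from Greenplum. Now that storage has moved to HDFS, the distribution policy determines the rules by which data files are generated on HDFS and, on top of that, the resource allocation strategy.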
Number of nodes | default_hash_table_bucket_number |
<= 85 | 6 * #nodes |
> 85 and <= 102 | 5 * #nodes |
> 102 and <= 128 | 4 * #nodes |
> 128 and <= 170 | 3 * #nodes |
> 170 and <= 256 | 2 * #nodes |
> 256 and <= 512 | 1 * #nodes |
> 512 | 512 |
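The table above is a step function of the node count. A minimal Python sketch of that lookup follows; the function name `default_bucket_number` is made up for illustration, and the tiers are copied from the table rather than taken from HAWQ's source code.

```python
def default_bucket_number(nodes: int) -> int:
    """Default default_hash_table_bucket_number for a given node count.

    Hypothetical helper that reproduces the table above.
    """
    if nodes < 1:
        raise ValueError("nodes must be a positive integer")
    # (upper bound on node count, multiplier), in the order listed in the table
    tiers = [(85, 6), (102, 5), (128, 4), (170, 3), (256, 2), (512, 1)]
    for upper, factor in tiers:
        if nodes <= upper:
            return factor * nodes
    return 512  # more than 512 nodes: the value is capped at 512


if __name__ == "__main__":
    for n in (4, 85, 86, 600):
        print(n, default_bucket_number(n))
```

For a 4-node cluster the function returns 6 × 4 = 24.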
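2. Choosing a data distribution policy
When choosing a distribution policy, consider the characteristics of your actual data and queries. The following example illustrates the two distribution policies. It creates three tables: t1 uses hash distribution on a single column, t2 uses random distribution, and t3 uses hash distribution on multiple columns.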
db1=# create table t1 (a int) distributed by (a);
CREATE TABLE
db1=# create table t2 (a int) distributed randomly;
CREATE TABLE
db1=# create table t3 (a int,b int,c int) distributed by (b,c);
CREATE TABLE

The following statement queries the distribution key columns of the tables.
db1=# select c.relname, sub.attname
db1-# from pg_namespace n
db1-# join pg_class c on n.oid = c.relnamespace
db1-# left join (select p.attrelid, p.attname
db1(# from pg_attribute p
db1(# join (select localoid, unnest(attrnums) as attnum
db1(# from gp_distribution_policy) as g on g.localoid = p.attrelid
db1(# and g.attnum = p.attnum) as sub on c.oid = sub.attrelid
db1-# where n.nspname = 'public'
db1-# and c.relname in ('t1', 't2', 't3')
db1-# and c.relkind = 'r';
 relname | attname
---------+---------
 t1      | a
 t2      |
 t3      | c
 t3      | b
(4 rows)

The concept of a bucket in a hash-distributed table was mentioned earlier. Fundamentally, each hash bucket corresponds to one HDFS file. When the database is initialized, the default_hash_table_bucket_number parameter is set, and its default value is computed with the formula shown in Table 1. My environment has 4 segment nodes, so default_hash_table_bucket_number=24. The tables contain no data yet, so the table directories are empty.
db1=# select c.relname, d.dat2tablespace tablespace_id, d.oid database_id, c.relfilenode table_id
db1-# from pg_database d, pg_class c, pg_namespace n
db1-# where c.relnamespace = n.oid
db1-# and d.datname = current_database()
db1-# and n.nspname = 'public'
db1-# and c.relname in ('t1', 't2');
 relname | tablespace_id | database_id | table_id
---------+---------------+-------------+----------
 t1      |         16385 |       25270 |   156897
 t2      |         16385 |       25270 |   156902
(2 rows)

[gpadmin@hdp3 ~]$ hdfs dfs -ls /hawq_data/16385/25270/156897
[gpadmin@hdp3 ~]$ hdfs dfs -ls /hawq_data/16385/25270/156902
[gpadmin@hdp3 ~]$

After data is inserted, the HDFS directory of the hash-distributed table t1 contains 24 data files (one file per hash bucket), while the randomly distributed table t2 has only one data file.
db1=# insert into t1 values (1),(2),(3);
INSERT 0 3
db1=# insert into t2 values (1),(2),(3);
INSERT 0 3

[gpadmin@hdp3 ~]$ hdfs dfs -ls /hawq_data/16385/25270/156897
Found 24 items
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/1
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/10
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/11
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/12
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/13
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/14
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/15
-rw------- 3 gpadmin gpadmin 16 2017-04-01 14:40 /hawq_data/16385/25270/156897/16
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/17
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/18
-rw------- 3 gpadmin gpadmin 16 2017-04-01 14:40 /hawq_data/16385/25270/156897/19
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/2
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/20
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/21
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/22
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/23
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/24
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/3
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/4
-rw------- 3 gpadmin gpadmin 16 2017-04-01 14:40 /hawq_data/16385/25270/156897/5
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/6
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/7
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/8
-rw------- 3 gpadmin gpadmin 0 2017-04-01 14:40 /hawq_data/16385/25270/156897/9
[gpadmin@hdp3 ~]$ hdfs dfs -ls /hawq_data/16385/25270/156902
Found 1 items
-rw------- 3 gpadmin gpadmin 48 2017-04-01 14:40 /hawq_data/16385/25270/156902/1
[gpadmin@hdp3 ~]$

Once a table has been created, its number of hash buckets is fixed and cannot be changed. A query on t1 is allocated 24 virtual segments, one per file. After the cluster is expanded, a query on t1 still gets 24 virtual segments, but they are now spread over all nodes. For example, after expanding to 8 nodes, the 24 virtual segments are allocated across the 8 nodes. After expanding the cluster, you should adjust default_hash_table_bucket_number to match the number of segment nodes and then recreate t1 so that it gets the correct number of buckets.
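The relationship between rows, buckets, and virtual segments described above can be sketched in a few lines of Python. Everything here is illustrative: `zlib.crc32` is only a stand-in for HAWQ's real hash function, and the round-robin placement of virtual segments on nodes is an assumption (the text only says they are spread over all nodes).

```python
import zlib
from collections import Counter


def bucket_of(key: int, bucketnum: int) -> int:
    """Map a distribution-key value to a bucket index in [0, bucketnum).

    zlib.crc32 is a stand-in; HAWQ uses its own hash function.
    """
    return zlib.crc32(str(key).encode("utf-8")) % bucketnum


def rows_per_bucket(keys, bucketnum: int) -> Counter:
    """Count how many rows land in each bucket (empty buckets are absent)."""
    return Counter(bucket_of(k, bucketnum) for k in keys)


def vsegs_per_node(bucketnum: int, nodes: int) -> list:
    """One virtual segment per bucket, placed round-robin (assumed)."""
    counts = [0] * nodes
    for vseg in range(bucketnum):
        counts[vseg % nodes] += 1
    return counts


if __name__ == "__main__":
    print(rows_per_bucket([1, 2, 3], 24))  # at most three non-empty buckets
    print(vsegs_per_node(24, 4))
    print(vsegs_per_node(24, 8))
```

Three rows can occupy at most three buckets, which matches the listing above: 24 files exist, but only three of them (5, 16, and 19) are non-empty. With an even placement, 24 virtual segments come to 6 per node on 4 nodes and 3 per node on 8 nodes.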
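3. Using data distribution
Although the principle behind data distribution is complex, the syntax of the DISTRIBUTED clause is simple. DISTRIBUTED BY (<column>, [ … ] ) declares one or more columns as the distribution key of a hash-distributed table. DISTRIBUTED RANDOMLY explicitly specifies that the table uses the random distribution policy.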
db1=# create table t1(id int) with (bucketnum=8) distributed by (id);
CREATE TABLE
db1=# create table t2(id int) with (bucketnum=8) distributed randomly;
CREATE TABLE
db1=# create table t3(id int) distributed randomly;
CREATE TABLE

Note table t2: although bucketnum=8 was specified, the table uses random distribution, so bucketnum has no effect. Changing t2's distribution policy to hash raises an error:
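db1=# \d t2
Append-Only Table "public.t2"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed randomly

db1=# alter table t2 set distributed by (id);
ERROR: bucketnum requires a numeric value

The system catalog shows that although bucketnum=8 was set, t2's hash distribution key is empty, which confirms that it is a randomly distributed table. It also shows that, whichever distribution policy is used, the default bucketnum is the value of the default_hash_table_bucket_number parameter; it simply has no effect for randomly distributed tables.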
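db1=# select t1.*,t2.relname from gp_distribution_policy t1,pg_class t2
db1-# where t1.localoid=t2.oid;
 localoid | bucketnum | attrnums | relname
----------+-----------+----------+---------
    40651 |         8 | {1}      | t1
    40656 |         8 |          | t2
    40661 |        24 |          | t3
(3 rows)

A table's distribution policy can be changed after the table has been created. When you change from random to hash distribution, or change the distribution key of a hash-distributed table, the table data is automatically redistributed across all segments. Changing from hash to random distribution does not redistribute the data.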
db1=# create table t1 (a int);
CREATE TABLE
db1=# \d t1
Append-Only Table "public.t1"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed randomly

db1=# alter table t1 set distributed by (a);
ALTER TABLE
db1=# \d t1
Append-Only Table "public.t1"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed by: (a)

db1=# alter table t1 set distributed randomly;
ALTER TABLE

To redistribute the data of a randomly distributed table (or to redistribute data without changing the hash distribution policy), use reorganize=true. This command redistributes the table data across all segments using the current distribution policy.
db1=# alter table t1 set with (reorganize=true);
ALTER TABLE
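The redistribution rules just described amount to a small decision function. The Python sketch below restates them; the function name `triggers_redistribution` is hypothetical, and a distribution key is modelled as a tuple of column names with the empty tuple standing for DISTRIBUTED RANDOMLY.

```python
def triggers_redistribution(old_key, new_key, reorganize=False):
    """Tell whether an ALTER TABLE moves existing rows, per the rules above.

    old_key / new_key: tuples of column names; () means random distribution.
    reorganize: True models SET WITH (reorganize=true).
    """
    if reorganize:
        return True              # forces redistribution under the current policy
    if not new_key:
        return False             # switching to random leaves the data in place
    return tuple(old_key) != tuple(new_key)  # new or changed hash key


if __name__ == "__main__":
    print(triggers_redistribution((), ("a",)))      # random -> hash
    print(triggers_redistribution(("a",), ()))      # hash -> random
    print(triggers_redistribution((), (), True))    # reorganize=true
```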
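One detail deserves attention here: if bucketnum was explicitly specified when the table was created, you can no longer use ALTER TABLE to change the table's distribution policy, nor can you redistribute its data.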
db1=# create table t1(a int) with (bucketnum=10) distributed by (a);
CREATE TABLE
db1=# alter table t1 set distributed by (a);
ERROR: bucketnum requires a numeric value
db1=# alter table t1 set distributed randomly;
ERROR: bucketnum requires a numeric value
db1=# alter table t1 set with (reorganize=true);
ERROR: bucketnum requires a numeric value
db1=# alter table t1 set with (bucketnum=10,reorganize=true);
ERROR: option "bucketnum" not supported

If a table needs a bucketnum other than the default, you can set the default_hash_table_bucket_number system parameter at the session level before creating it. ALTER TABLE can then still be used later to change the table's distribution policy or to reorganize its data.
db1=# set default_hash_table_bucket_number=10;
SET
db1=# create table t1(a int) distributed by (a);
CREATE TABLE
db1=# alter table t1 set distributed randomly;
ALTER TABLE
db1=# alter table t1 set distributed by (a);
ALTER TABLE
db1=# alter table t1 set with (reorganize=true);
ALTER TABLE

This is the recommended way to set bucketnum for a table, rather than specifying it explicitly in CREATE TABLE.
| Syntax |
INHERITS | CREATE TABLE new_table INHERITS (origintable) [WITH(bucketnum=x)] [DISTRIBUTED BY col] |
LIKE | CREATE TABLE new_table (LIKE origintable) [WITH(bucketnum=x)] [DISTRIBUTED BY col] |
AS | CREATE TABLE new_table [WITH(bucketnum=x)] AS SUBQUERY [DISTRIBUTED BY col] |
SELECT INTO | CREATE TABLE origintable [WITH(bucketnum=x)] [DISTRIBUTED BY col]; SELECT * INTO new_table FROM origintable; |
1. INHERITS
The INHERITS clause of CREATE TABLE names one or more parent tables; the new table becomes a child table and automatically inherits all columns of its parents. INHERITS establishes a permanent relationship between the child and the parent tables. Changes to a parent's structure propagate to the child and, by default, rows added to the child are also included when the parent is queried.

db1=# create table t1(a int);
CREATE TABLE
db1=# create table t2(b int);
CREATE TABLE
db1=# create table t3() inherits (t1,t2);
NOTICE: Table has parent, setting distribution columns to match parent table
CREATE TABLE
db1=# \d t3
Append-Only Table "public.t3"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
 b      | integer |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Inherits: t1, t2
Distributed randomly

db1=# alter table t1 alter a type text;
ALTER TABLE
db1=# \d t3
Append-Only Table "public.t3"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | text    |
 b      | integer |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Inherits: t1, t2
Distributed randomly

db1=# insert into t3 values ('a',1);
INSERT 0 1
db1=# select * from t1;
 a
---
 a
(1 row)

db1=# select * from t2;
 b
---
 1
(1 row)

db1=# drop table t1;
NOTICE: append only table t3 depends on append only table t1
ERROR: cannot drop append only table t1 because other objects depend on it
HINT: Use DROP ... CASCADE to drop the dependent objects too.
db1=# drop table t1 cascade;
NOTICE: drop cascades to append only table t3
DROP TABLE
db1=# \dt
List of relations
 Schema | Name | Type  |  Owner  |   Storage
--------+------+-------+---------+-------------
 public | t2   | table | gpadmin | append only
(1 row)

The INHERITS clause cannot be used when creating a partitioned table.
db1=# CREATE TABLE sales (id int, date date, amt decimal(10,2)) inherits (t1)
db1-# DISTRIBUTED BY (id)
db1-# PARTITION BY RANGE (date)
db1-# ( PARTITION Jan08 START (date '2008-01-01') INCLUSIVE ,
db1(# PARTITION Feb08 START (date '2008-02-01') INCLUSIVE
db1(# END (date '2009-01-01') EXCLUSIVE );
ERROR: cannot mix inheritance with partitioning

If several parent tables contain a column with the same name, the columns are merged into a single column in the child table when their data types are also the same; otherwise an error is raised.
db1=# create table t1(a int);
CREATE TABLE
db1=# create table t2(a smallint);
CREATE TABLE
db1=# create table t3 () inherits (t1,t2);
NOTICE: Table has parent, setting distribution columns to match parent table
NOTICE: merging multiple inherited definitions of column "a"
ERROR: inherited column "a" has a type conflict
DETAIL: integer versus smallint
db1=# alter table t2 alter a type int;
ALTER TABLE
db1=# create table t3 () inherits (t1,t2);
NOTICE: Table has parent, setting distribution columns to match parent table
NOTICE: merging multiple inherited definitions of column "a"
CREATE TABLE

If a column name of the new table also exists in a parent table, it is handled in a similar way: the definitions are merged into a single column when the data types match, and an error is raised otherwise.
db1=# create table t1(a int);
CREATE TABLE
db1=# create table t3 (a text) inherits (t1);
NOTICE: Table has parent, setting distribution columns to match parent table
NOTICE: merging column "a" with inherited definition
ERROR: column "a" has a type conflict
DETAIL: integer versus text
db1=# create table t3 (a int) inherits (t1);
NOTICE: Table has parent, setting distribution columns to match parent table
NOTICE: merging column "a" with inherited definition
CREATE TABLE

If the new table specifies a default value for a column, that default overrides the default of the column inherited from the parent table.
db1=# create table t1(a int default 1);
CREATE TABLE
db1=# create table t2(a int default 2) inherits (t1);
NOTICE: Table has parent, setting distribution columns to match parent table
NOTICE: merging column "a" with inherited definition
CREATE TABLE
db1=# \d t2
Append-Only Table "public.t2"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | default 2
...

A child table automatically inherits the distribution policy of its parent table.
db1=# create table t1(a int) with (bucketnum=8) distributed by (a);
CREATE TABLE
db1=# create table t2 () inherits (t1);
NOTICE: Table has parent, setting distribution columns to match parent table
CREATE TABLE
db1=# \d t2
Append-Only Table "public.t2"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Inherits: t1
Distributed by: (a)

db1=# create table t3 () inherits (t1) with (bucketnum=8) distributed by (a);
CREATE TABLE
db1=# create table t4 () inherits (t1) with (bucketnum=16) distributed by (a);
ERROR: distribution policy for "t4" must be the same as that for "t1"
db1=# create table t4 (b int) inherits (t1) with (bucketnum=8) distributed by (b);
ERROR: distribution policy for "t4" must be the same as that for "t1"
db1=# create table t1 (a int) with (bucketnum=8) distributed by (a);
CREATE TABLE
db1=# create table t2 (like t1);
NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE TABLE
db1=# create table t3 (like t1) with (bucketnum=16) distributed by (a);
CREATE TABLE
db1=# select t1.*,t2.relname from gp_distribution_policy t1,pg_class t2
db1-# where t1.localoid=t2.oid and t2.relname in ('t1','t2','t3');
 localoid | bucketnum | attrnums | relname
----------+-----------+----------+---------
    43738 |         8 | {1}      | t1
    43743 |        24 | {1}      | t2
    43748 |        16 | {1}      | t3
(3 rows)

NOT NULL constraints are always copied to the new table. CHECK constraints, however, are copied only when the INCLUDING CONSTRAINTS clause is specified.
db1=# create table t1 (a int not null check (a > 0));
CREATE TABLE
db1=# create table t2 (like t1);
NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE TABLE
db1=# \d t2
Append-Only Table "public.t2"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed randomly

db1=# create table t3 (like t1 including constraints);
NOTICE: Table doesn't have 'distributed by' clause, defaulting to distribution columns from LIKE table
CREATE TABLE
db1=# \d t3
Append-Only Table "public.t3"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer | not null
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Check constraints:
    "t1_a_check" CHECK (a > 0)
Distributed randomly

LIKE differs from INHERITS in one more respect: it does not merge the columns of the new table with those of the original table. Columns cannot be defined explicitly in the new table or in the LIKE clause.
db1=# create table t1 (a int);
CREATE TABLE
db1=# insert into t1 values (100);
INSERT 0 1
db1=# create table t2 (b) with -- only column names may be given; data types cannot be specified
db1-# (bucketnum=8,
db1(# appendonly=true,
db1(# blocksize=8192,
db1(# orientation=row,
db1(# compresstype=zlib,
db1(# compresslevel=1,
db1(# fillfactor=50,
db1(# oids=false)
db1-# as select * from t1
db1-# distributed by (b);
SELECT 1
db1=# select * from t2;
  b
-----
 100
(1 row)
db1=# create table t1 (a int) with
db1-# (bucketnum=8,
db1(# appendonly=true,
db1(# blocksize=8192,
db1(# orientation=row,
db1(# compresstype=zlib,
db1(# compresslevel=1,
db1(# fillfactor=50,
db1(# oids=false)
db1-# distributed by (a);
CREATE TABLE
db1=# insert into t1 values (1);
INSERT 0 1
db1=# select * into t2 from t1;
SELECT 1
db1=# \d t2
Append-Only Table "public.t2"
 Column |  Type   | Modifiers
--------+---------+-----------
 a      | integer |
Compression Type: None
Compression Level: 0
Block Size: 32768
Checksum: f
Distributed randomly

db1=# select * from t2;
 a
---
 1
(1 row)
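The sessions above show that the four ways of creating a table from an existing one treat the distribution policy differently. The Python sketch below condenses those observations. The names `Policy` and `derive_policy` are hypothetical, the behaviour of AS without a DISTRIBUTED BY clause is an assumption (random distribution, as for a plain CREATE TABLE), and the bucketnum reported for SELECT INTO is assumed to be the default because the session does not show it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Policy:
    bucketnum: int
    key: Tuple[str, ...]  # empty tuple means DISTRIBUTED RANDOMLY


def derive_policy(method: str, source: Policy, default_bucketnum: int,
                  bucketnum: Optional[int] = None,
                  key: Optional[Tuple[str, ...]] = None) -> Policy:
    """Policy of a new table created from `source` (illustrative model only)."""
    requested = bucketnum if bucketnum is not None else default_bucketnum
    if method == "inherits":
        # The child takes the parent's policy; an explicit one must match it.
        child = Policy(bucketnum if bucketnum is not None else source.bucketnum,
                       key if key is not None else source.key)
        if child != source:
            raise ValueError("distribution policy must match the parent's")
        return child
    if method == "like":
        # The key is copied from the source; bucketnum is not.
        return Policy(requested, key if key is not None else source.key)
    if method == "as":
        # Assumption: without DISTRIBUTED BY the new table is random.
        return Policy(requested, key if key is not None else ())
    if method == "select into":
        # Observed: random distribution, source storage options not copied.
        return Policy(default_bucketnum, ())
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    t1 = Policy(8, ("a",))
    print(derive_policy("like", t1, 24))
    print(derive_policy("select into", t1, 24))
```

In this model, LIKE on a table with bucketnum=8 and key (a) yields bucketnum 24 with key (a) when the default is 24, which is what the gp_distribution_policy query showed for t2.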
Original article: http://blog.csdn.net/wzy0623/article/details/61196229