Tags: data warehouse, top, one, tin, numbers, data, compact, run, partition
select avg(income) from census_data where state = 'CA';
select * from census_data;
TBD...
Dictionary encoding:...
The COMPUTE STATS statement gathers information about the volume and distribution of data in a table and all associated columns and partitions.
The information is stored in the metastore database and used by Impala to help optimize queries.
For example, if Impala can determine that a table is large or small, or has many or few distinct values, it can organize and parallelize the work appropriately for a join query or insert operation.
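As a minimal sketch of that workflow, assuming the census_data table from the queries at the top of these notes exists, the statistics can be gathered and then inspected as follows. Output is omitted because it depends entirely on the data.

```sql
-- Gather table, column, and partition statistics and store them in the metastore.
COMPUTE STATS census_data;

-- Inspect table-level (and per-partition) statistics such as row counts and sizes.
SHOW TABLE STATS census_data;

-- Inspect per-column statistics such as the number of distinct values.
SHOW COLUMN STATS census_data;
```

Running COMPUTE STATS again after a large load keeps the planner's picture of the table current.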
TBD...
Complex types (also referred to as nested types) let you represent multiple data values within a single row/column position.
Reasons for using them:
You already have data produced by Hive or another non-Impala component that uses the complex type column names.
Your analytic queries involving multiple tables could benefit from greater locality during join processing. By packing more related data items within each HDFS data block, complex types let join queries avoid the network overhead of the traditional Hadoop shuffle or broadcast join techniques.
The ARRAY and MAP types are closely related: they represent collections with arbitrary numbers of elements, where each element is the same type. (Note: the keys of a MAP are not necessarily unique.)
STRUCT groups together a fixed number of items into a single element.
The elements of an ARRAY or MAP, or the fields of a STRUCT, can also be other complex types. This is why complex types are also called nested types.
When visualizing your data model in familiar SQL terms, you can think of each ARRAY or MAP as a miniature table, and each STRUCT as a row within such a table.
By default, the table represented by an ARRAY has two columns, POS to represent ordering of elements, and ITEM representing the value of each element. Likewise, by default, the table represented by a MAP encodes key-value pairs, and therefore has two columns, KEY and VALUE.
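To make the "miniature table" picture concrete, here is a hedged sketch. The table contacts_nested and all of its column and field names are hypothetical and do not come from the original notes; it assumes an Impala version with complex type support and Parquet storage.

```sql
-- Hypothetical table: every name here is illustrative only.
CREATE TABLE contacts_nested (
  id BIGINT,
  name STRING,
  -- STRUCT: a fixed number of named fields packed into one column.
  address STRUCT<street: STRING, city: STRING, zip: STRING>,
  -- ARRAY of STRUCT: one complex type nested inside another.
  phones ARRAY<STRUCT<kind: STRING, phone_number: STRING>>
) STORED AS PARQUET;

-- The ARRAY is referenced in the FROM clause like a small table joined to its
-- parent row; STRUCT fields are reached with dot notation.
SELECT c.id, c.address.city, p.pos, p.item.kind, p.item.phone_number
FROM contacts_nested c, c.phones p;
```

Each row of the result pairs one contact with one element of its phones array, with POS giving the element's position.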
Without complex types, related values are traditionally arranged in one of two ways:
Split across normalized tables, using foreign key columns. This arrangement avoided duplicate data, so the data was compact. But join queries could be expensive because the related data had to be retrieved from separate locations.
Flattened into a single denormalized table. This removed the need for join queries, but values were repeated. The extra data volume could cause performance issues in other parts of the workflow, such as longer ETL cycles or more expensive full-table scans during queries.
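The two traditional layouts can be sketched as follows for a contact list with several phone numbers per person. All table and column names are hypothetical, chosen only for illustration.

```sql
-- Hypothetical tables for illustration only.
-- Normalized: phone numbers live in a separate table keyed by contact_id.
CREATE TABLE contacts_normalized (id BIGINT, name STRING, address STRING) STORED AS PARQUET;
CREATE TABLE phones_normalized (contact_id BIGINT, phone_number STRING) STORED AS PARQUET;

-- Retrieving a contact together with its phone numbers requires a join.
SELECT c.name, p.phone_number
FROM contacts_normalized c JOIN phones_normalized p ON c.id = p.contact_id;

-- Denormalized: no join needed, but name and address repeat once per phone number.
CREATE TABLE contacts_denormalized (
  id BIGINT, name STRING, address STRING, phone_number STRING
) STORED AS PARQUET;
```

Complex types aim for a middle ground: the related values stay with their parent row, so there is neither a cross-table join nor repeated parent columns.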
...
create table contacts_array_of_phones ( id BIGINT, name STRING, address STRING, phone_number ARRAY<STRING> ) stored as parquet;
create table contacts_unlimited_phones ( id BIGINT, name STRING, address STRING, phone_number MAP<STRING, STRING> ) stored as PARQUET;
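A hedged sketch of how the POS/ITEM and KEY/VALUE pseudocolumns might be queried on these two tables. Treating the MAP key as a phone type (for example "home" or "work") is an assumption; the notes do not say what the key holds.

```sql
-- ARRAY: POS is the position of the element, ITEM is its value.
SELECT c.id, c.name, p.pos, p.item AS phone_number
FROM contacts_array_of_phones c, c.phone_number p;

-- MAP: KEY and VALUE hold each key-value pair.
-- Assumption: the key is a label such as a phone type.
SELECT c.id, c.name, p.key AS phone_type, p.value AS phone_number
FROM contacts_unlimited_phones c, c.phone_number p;
```

In both queries the complex column appears in the FROM clause with an alias, and only scalar pseudocolumns appear in the select list.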
select c_orders from customer limit 1;
ERROR: AnalysisException: Expr 'c_orders' in select list returns a complex type 'ARRAY<STRUCT<o_orderkey:BIGINT,o_orderstatus:STRING, ... l_receiptdate:STRING,l_shipinstruct:STRING,l_shipmode:STRING,l_comment:STRING>>>>'.
Only scalar types are allowed in the select list.
-- Only scalar types are allowed in the select list, so add region.r_nations to the FROM clause.
select r_name, r_nations.item.n_name from region, region.r_nations limit 7;
select c.c_name, o.o_orderkey from customer c, c.c_orders o limit 5;
+--------------------+------------+
| c_name             | o_orderkey |
+--------------------+------------+
| Customer#000072578 | 558821     |
| Customer#000072578 | 2079810    |
| Customer#000072578 | 5768068    |
| Customer#000072578 | 1805604    |
| Customer#000072578 | 3436389    |
+--------------------+------------+
select c.c_name, o.o_orderkey from customer c inner join c.c_orders o limit 5;
+--------------------+------------+
| c_name             | o_orderkey |
+--------------------+------------+
| Customer#000072578 | 558821     |
| Customer#000072578 | 2079810    |
| Customer#000072578 | 5768068    |
| Customer#000072578 | 1805604    |
| Customer#000072578 | 3436389    |
+--------------------+------------+
select c.c_custkey, o.o_orderkey from customer c left outer join c.c_orders o limit 5;
+-----------+------------+
| c_custkey | o_orderkey |
+-----------+------------+
| 60210     | NULL       |
| 147873    | NULL       |
| 72578     | 558821     |
| 72578     | 2079810    |
| 72578     | 5768068    |
+-----------+------------+
select c_name, howmany from customer c, (select count(*) howmany from c.c_orders) v limit 5;
+--------------------+---------+
| c_name             | howmany |
+--------------------+---------+
| Customer#000030065 | 15      |
| Customer#000065455 | 18      |
| Customer#000113644 | 21      |
| Customer#000111078 | 0       |
| Customer#000024621 | 0       |
+--------------------+---------+
Original article: http://www.cnblogs.com/wttttt/p/6917481.html