码迷,mamicode.com
首页 > 数据库 > 详细

ClickHouse开源数据库

时间:2017-12-10 22:51:13      阅读:610      评论:0      收藏:0      [点我收藏+]

标签:intern   bytes   格式   rtu   溢出   new   规格   cat   max   

ClickHouse是一个开源的面向列式数据的数据库管理系统,能够使用SQL查询并且生成实时数据报告。

 

优点:

1.并行处理单个查询(利用多核)

2.在多个服务器上分布式处理

3.非常快的扫描,可用于实时查询

4.列存储非常适用于“宽”/“非规格化”表(多列)

5.良好的压缩特性

6.SQL支持(有限的支持)

7.一系列函数的支持,包括对近似计算的支持

8.不同的存储引擎的支持(磁盘存储格式)

9.非常适合结构性日志/事件数据以及时间序列数据(引擎的合并树需要日期字段)

10. 索引支持(仅主键支持,不是所有的存储引擎都支持)

11.漂亮的命令行界面,用户友好的进度条和格式

 

缺点:

1.不支持实时的删除/更新操作,不支持事物(与Spark和大部分大数据系统一样)

2.不支持二级索引(与Spark和大部分大数据系统一样)

3.只支持自己的协议(不支持MySQL协议)

4.有限的SQL支持,join的实现方式不同。如果您正在从MySQL或Spark迁移数据,那么您必须重新编写包含join的所有查询。

5.没有窗口功能

 

Group by: in-memory vs. on-disk

内存不足是在ClickHouse中处理大型数据集时可能遇到的潜在问题之一:

SELECT 
    min(toMonth(date)), 
    max(toMonth(date)), 
    path, 
    count(*), 
    sum(hits), 
    sum(hits) / count(*) AS hit_ratio
FROM wikistat 
WHERE (project = en) 
GROUP BY path
ORDER BY hit_ratio DESC
LIMIT 10
 
↖ Progress: 1.83 billion rows, 85.31 GB (68.80 million rows/s., 3.21 GB/s.) ██████████▋                                                                                                                                                    6%Received exception from server:
Code: 241. DB::Exception: Received from localhost:9000, 127.0.0.1. 
DB::Exception: Memory limit (for query) exceeded: would use 9.31 GiB (attempt to allocate chunk of 1048576 bytes), maximum: 9.31 GiB: 
(while reading column hits): 

默认情况下,ClickHouse会限制group by使用的内存量(它使用 hash table来处理group by)。这很容易解决 - 如果你有空闲的内存,增加这个参数:

SET max_memory_usage = 128000000000; #128G,

 

如果你没有那么多的内存可用,ClickHouse可以通过设置这个“溢出”数据到磁盘:

set max_bytes_before_external_group_by=20000000000; #20G
set max_memory_usage=40000000000; #40G

根据文档,如果需要使用max_bytes_before_external_group_by,建议将max_memory_usage设置为max_bytes_before_external_group_by大小的两倍。

(原因是聚合需要分两个阶段进行:1.查询并且建立中间数据 2.合并中间数据。 数据“溢出”到磁盘一般发生在第一个阶段,如果没有发生数据“溢出”,ClickHouse在阶段1和阶段2可能需要相同数量的内存)

 

 性能对比:

ClickHouse vs. Spark

ClickHouse 和Spark 都是分布式的; 但是为了方便测试我们只进行单节点测试。 测试结果令人震惊

Size / compression Spark v. 2.0.2 ClickHouse
数据存储格式 Parquet, compressed: snappy Internal storage, compressed
Size (uncompressed: 1.2TB) 395G 212G

 

 技术分享图片

 Test  Spark v. 2.0.2 ClickHouse Diff
Query 1: count (warm) 7.37 sec (no disk IO) 6.61 sec  ~same
Query 2: simple group (warm)  792.55 sec (no disk IO) 37.45 sec 21x better
Query 3: complex group by   2522.9 sec  398.55 sec  6.3x better

 

附录:

硬件

  • CPU: 24xIntel(R) Xeon(R) CPU L5639 @ 2.13GHz (physical = 2, cores = 12, virtual = 24, hyperthreading = yes)
  • Disk: 2 consumer grade SSD in software RAID 0 (mdraid)

Query 1

select count(*) from wikistat

ClickHouse:

:)  select count(*) from wikistat;
 
SELECT count(*)
FROM wikistat 
 
┌─────count()─┐
│ 26935251789 │
└─────────────┘
 
1 rows in set. Elapsed: 6.610 sec. Processed 26.88 billion rows, 53.77 GB (4.07 billion rows/s., 8.13 GB/s.) 

 

Spark:

spark-sql> select count(*) from wikistat;
26935251789
Time taken: 7.369 seconds, Fetched 1 row(s)

Query 2

select count(*), month(dt) as mon
from wikistat where year(dt)=2008
and month(dt) between 1 and 10
group by month(dt)
order by month(dt);

ClickHouse:

:) select count(*), toMonth(date) as mon from wikistat 
where toYear(date)=2008 and toMonth(date) between 1 and 10 group by mon;
 
SELECT 
    count(*), 
    toMonth(date) AS mon
FROM wikistat 
WHERE (toYear(date) = 2008) AND ((toMonth(date) >= 1) AND (toMonth(date) <= 10))
GROUP BY mon
 
┌────count()─┬─mon─┐
│ 21001626041 │
│ 19697570692 │
│ 20813715303 │
│ 21568785124 │
│ 24768906215 │
│ 25266628966 │
│ 24897232447 │
│ 24803563588 │
│ 25227465449 │
│ 261437235210 │
└────────────┴─────┘
 
10 rows in set. Elapsed: 37.450 sec. Processed 23.37 billion rows, 46.74 GB (623.97 million rows/s., 1.25 GB/s.) 

Spark:

spark-sql> select count(*), month(dt) as mon from wikistat where year(dt)=2008 and month(dt) between 1 and 10 group by month(dt) order by month(dt);
2100162604      1                                                               
1969757069      2
2081371530      3
2156878512      4
2476890621      5
2526662896      6
2489723244      7
2480356358      8
2522746544      9
2614372352      10
Time taken: 792.552 seconds, Fetched 10 row(s)

 

Query 3

SELECT
path,
count(*),
sum(hits) AS sum_hits,
round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = en
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100;

ClickHouse:

:) SELECT
:-]     path,
:-]     count(*),
:-]     sum(hits) AS sum_hits,
:-]     round(sum(hits) / count(*), 2) AS hit_ratio
:-] FROM wikistat
:-] WHERE (project = en)
:-] GROUP BY path
:-] ORDER BY sum_hits DESC
:-] LIMIT 100;
 
SELECT 
    path,
    count(*),
    sum(hits) AS sum_hits,
    round(sum(hits) / count(*), 2) AS hit_ratio
FROM wikistat
WHERE project = en
GROUP BY path
ORDER BY sum_hits DESC
LIMIT 100
 
┌─path────────────────────────────────────────────────┬─count()─┬───sum_hits─┬─hit_ratio─┐
│ Special:Search                                      │   447954544605711101453.41 │
│ Main_Page                                           │   31930211589697766266.74 │
│ Special:Random                                      │   3015953383053417700.54 │
│ Wiki                                                │   10237404884163955.11 │
│ Special:Watchlist                                   │   3820637200069973.67 │
│ YouTube                                             │    9960343498043448.78 │
│ Special:Randompage                                  │    8085289596243581.9 │
│ Special:AutoLogin                                   │   3441324436845710.11 │
│ Facebook                                            │    7153182633532553.24 │
│ Wikipedia                                           │   2373217848385752.08 │
│ Barack_Obama                                        │   13832169657751226.56 │
│ index.html                                          │    6658169215832541.54 │
…
 
100 rows in set. Elapsed: 398.550 sec. Processed 26.88 billion rows, 1.24 TB (67.45 million rows/s., 3.10 GB/s.)

Spark:

spark-sql> SELECT
         >     path,
         >     count(*),
         >     sum(hits) AS sum_hits,
         >     round(sum(hits) / count(*), 2) AS hit_ratio
         > FROM wikistat
         > WHERE (project = en)
         > GROUP BY path
         > ORDER BY sum_hits DESC
         > LIMIT 100;
...
Time taken: 2522.903 seconds, Fetched 100 row(s)

 

 

参考:

https://www.percona.com/blog/2017/02/13/clickhouse-new-opensource-columnar-database/

https://clickhouse.yandex/docs/en/single/#ClickHouse%20features%20that%20can%20be%20considered%20disadvantages

 

 

ClickHouse开源数据库

标签:intern   bytes   格式   rtu   溢出   new   规格   cat   max   

原文地址:http://www.cnblogs.com/davygeek/p/8018292.html

(0)
(0)
   
举报
评论 一句话评论(0
登录后才能评论!
© 2014 mamicode.com 版权所有  联系我们:gaon5@hotmail.com
迷上了代码!