博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
PostgreSQL sharding : citus 系列6 - count(distinct xx) 加速 (use 估值插件 h...
阅读量:6758 次
发布时间:2019-06-26

本文共 8130 字,大约阅读时间需要 27 分钟。

标签

PostgreSQL , hll , hyperloglog , distinct , 加速 , citus.count_distinct_error_rate


背景

在分布式数据库中,计算count(distinct xxx),需要对distinct 的字段,

1、去重,

2、重分布去重后的数据,(这一步,如果distinct值特别多,那么就会比较耗时)

3、然后再去重,

4、最后count (xxx),

5、求所有节点的count SUM。

例如,以下是Greenplum的执行计划例子

postgres=# explain analyze select count(distinct c_acctbal) from customer;                                                                               QUERY PLAN                                                                               --------------------------------------------------------------------------------------------------------------------------------------------------------------------   Aggregate  (cost=182242.41..182242.42 rows=1 width=8)     Rows out:  1 rows with 0.006 ms to first row, 69 ms to end, start offset by 23 ms.     ->  Gather Motion 16:1  (slice2; segments: 16)  (cost=53392.85..173982.82 rows=660767 width=8)           Rows out:  818834 rows at destination with 3.416 ms to first row, 447 ms to end, start offset by 23 ms.           ->  HashAggregate  (cost=53392.85..61652.43 rows=41298 width=8)                 Group By: customer.c_acctbal                 Rows out:  Avg 51177.1 rows x 16 workers.  Max 51362 rows (seg3) with 0.004 ms to first row, 33 ms to end, start offset by 25 ms.                 ->  Redistribute Motion 16:16  (slice1; segments: 16)  (cost=30266.00..43481.34 rows=41298 width=8)                       Hash Key: customer.c_acctbal                       Rows out:  Avg 89865.6 rows x 16 workers at destination.  Max 90305 rows (seg3) with 18 ms to first row, 120 ms to end, start offset by 25 ms.                       ->  HashAggregate  (cost=30266.00..30266.00 rows=41298 width=8)                             Group By: customer.c_acctbal                             Rows out:  Avg 89865.6 rows x 16 workers.  Max 89929 rows (seg2) with 0.007 ms to first row, 33 ms to end, start offset by 26 ms.                             ->  Append-only Columnar Scan on customer  (cost=0.00..22766.00 rows=93750 width=8)                                   Rows out:  Avg 93750.0 rows x 16 workers.  Max 93751 rows (seg4) with 20 ms to first row, 30 ms to end, start offset by 26 ms.   Slice statistics:     (slice0)    Executor memory: 387K bytes.     (slice1)    Executor memory: 6527K bytes avg x 16 workers, 6527K bytes max (seg0).     (slice2)    Executor memory: 371K bytes avg x 16 workers, 371K bytes max (seg0).   Statement statistics:     Memory used: 1280000K bytes   Optimizer status: legacy query optimizer   Total runtime: 723.143 ms  (23 rows)

以下是citus的例子

postgres=# explain analyze select count(distinct bid) from pgbench_accounts ;                                                                              QUERY PLAN                                                                              ------------------------------------------------------------------------------------------------------------------------------------------------------------------   Aggregate  (cost=0.00..0.00 rows=0 width=0) (actual time=31.748..31.749 rows=1 loops=1)     ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0) (actual time=31.382..31.510 rows=1280 loops=1)           Task Count: 128           Tasks Shown: One of 128           ->  Task                 Node: host=172.24.211.224 port=1921 dbname=postgres                 ->  HashAggregate  (cost=231.85..231.95 rows=10 width=4) (actual time=3.700..3.702 rows=10 loops=1)                       Group Key: bid                       ->  Seq Scan on pgbench_accounts_106812 pgbench_accounts  (cost=0.00..212.48 rows=7748 width=4) (actual time=0.017..2.180 rows=7748 loops=1)                     Planning time: 0.445 ms                     Execution time: 3.781 ms   Planning time: 1.399 ms   Execution time: 32.159 ms  (13 rows)

对于可估值计算的场景,即不需要精确distinct值的场景,PostgreSQL提供了一个名为hll的插件,可以用来估算distinct元素个数。

citus 结合hll,可以实现超高速的count(distinct xxx),即使distinct值非常非常多,也不慢。

SET citus.count_distinct_error_rate to 0.005;    0.005表示失真度

hll加速citus count(distinct xxx)使用举例

部署

1、所有节点(coordinator 与 worker节点),安装hll软件

yum install -y gcc-c++    cd ~/    git clone https://github.com/citusdata/postgresql-hll    cd postgresql-hll    . /var/lib/pgsql/.bash_profile     USE_PGXS=1 make  USE_PGXS=1 make install

2、所有节点(coordinator 与 worker节点),在需要用到HLL的DB中增加插件

su - postgres -c "psql -d postgres -c 'create extension hll;'"    su - postgres -c "psql -d newdb -c 'create extension hll;'"

使用举例

1、创建测试表,128 shard

create table test (id int primary key, a int, b int, c int);    set citus.shard_count =128;     select create_distributed_table('test', 'id');

2、写入10亿测试数据,a字段10唯一值,b字段100唯一值,c字段100万唯一值

insert into test select id, random()*9, random()*99, random()*999999 from generate_series(1,1000000000) t(id);

3、(coordinator节点)设置全局或当前会话级参数,指定失真度,越小失真度越小

SET citus.count_distinct_error_rate to 0.005;    newdb=# explain select count(distinct bid) from pgbench_accounts group by bid;                                                            QUERY PLAN                                                             -------------------------------------------------------------------------------------------------------------------------------   HashAggregate  (cost=0.00..0.00 rows=0 width=0)     Group Key: remote_scan.worker_column_2     ->  Custom Scan (Citus Real-Time)  (cost=0.00..0.00 rows=0 width=0)           Task Count: 128           Tasks Shown: One of 128           ->  Task                 Node: host=172.24.211.224 port=8001 dbname=newdb                 ->  GroupAggregate  (cost=97272.79..105102.29 rows=1000 width=36)                       Group Key: bid                       ->  Sort  (cost=97272.79..99227.04 rows=781700 width=4)                             Sort Key: bid                             ->  Seq Scan on pgbench_accounts_102008 pgbench_accounts  (cost=0.00..20759.00 rows=781700 width=4)  (12 rows)

4、对比是否使用HLL加速(少量唯一值,HLL没有性能提升,因为本身就不存在瓶颈)

4.1、未使用hll
newdb=# set citus.count_distinct_error_rate to 0;  newdb=# select count(distinct bid) from pgbench_accounts;   count   -------    1000  (1 row)    Time: 423.364 ms    postgres=# set citus.count_distinct_error_rate to 0;  postgres=# select count(distinct a) from test;   count   -------      10  (1 row)    Time: 2392.709 ms (00:02.393)
4.2、使用hll
newdb=# set citus.count_distinct_error_rate to 0.005;  newdb=# select count(distinct bid) from pgbench_accounts;   count   -------    1000  (1 row)    Time: 444.287 ms    postgres=# set citus.count_distinct_error_rate to 0.005;  postgres=# select count(distinct a) from test;   count   -------      10  (1 row)    Time: 2375.473 ms (00:02.375)

5、对比是否使用HLL加速(大量唯一值,HLL性能提升显著)

5.1、未使用hll
postgres=# set citus.count_distinct_error_rate to 0;      count   ---------- 10000000(1 row)Time: 5826241.205 ms (01:37:06.241)

128个节点,每个节点最多发送10亿/128条数据给coordinator,慢是可以理解的。另一方面,coordinator可以边接收边去重(postgresql 11增加了parallel gather, merge sort等能力,citus coordinator可以借鉴),没必要等所有数据都收完再去重。

5.2、使用hll
postgres=# set citus.count_distinct_error_rate to 0.005;  postgres=# select count(distinct (a,c)) from test;    count    ---------   9999995  (1 row)    Time: 4468.749 ms (00:04.469)

6、设置不同的精度参数,性能对比

newdb=# set citus.count_distinct_error_rate to 0.1;  newdb=#  select count(distinct (aid,bid)) from pgbench_accounts ;    count     ----------   94778491  (1 row)  Time: 545.301 ms    newdb=# set citus.count_distinct_error_rate to 0.01;  newdb=#  select count(distinct (aid,bid)) from pgbench_accounts ;     count     -----------   100293937  (1 row)  Time: 554.333 ms    -- 推荐设置0.005    newdb=# set citus.count_distinct_error_rate to 0.005;  newdb=#  select count(distinct (aid,bid)) from pgbench_accounts ;     count     -----------   100136086  (1 row)  Time: 1053.070 ms (00:01.053)    newdb=# set citus.count_distinct_error_rate to 0.001;  newdb=#  select count(distinct (aid,bid)) from pgbench_accounts ;     count     -----------   100422107  (1 row)  Time: 9287.934 ms (00:09.288)

小结

hll是应用广泛的PostgreSQL估值插件。

使用hll,大幅提升了citus count(disinct xxx)的性能(特别当distinct结果集很大时,hll大幅降低了重分布开销,性能提升非常明显(本例1000万唯一值,耗时5826秒降低到了4秒))。

唯一值精度可通过参数citus.count_distinct_error_rate进行设置。

参考

转载地址:http://vyzeo.baihongyu.com/

你可能感兴趣的文章
Redis
查看>>
XINS 3.0 正式版发布,远程 API 调用规范
查看>>
sqlserver 2005 64bit express
查看>>
(转)Oracle中For和while及一些应用
查看>>
动态链接导入库与静态链接库
查看>>
jQuery基础及选择器
查看>>
DragonFly BSD 3.2 发布
查看>>
软件开发中常见的十大系统瓶颈(转)
查看>>
Mozilla 发布 Popcorn Maker,在线创作视频
查看>>
C#中为什么需要装箱拆箱操作?
查看>>
PHP类中一般方法与静态方法的疑问
查看>>
[转]PHP花括号变量
查看>>
【Opencv学习】摄像头采集、录像、截图小工具
查看>>
Fedora16安装中文语言包和中文输入法
查看>>
iframe加载完成监控兼容IE/FF/Chrome
查看>>
JS:1.1,消息框(alert,confirm,prompt)
查看>>
quartz
查看>>
Windows 8实用窍门系列:14.windows 8中粘贴板(剪切板)的使用
查看>>
长连接API小心“窜包”问题
查看>>
开发者基础知识游戏,共10关,欢迎挑战
查看>>