PostgreSQL vs MySQL TPC-H 测试
PostgreSQL vs MySQL TPC-H 测试
分析类查询即便是TPC-H中的简单分析查询,PostgreSQL的性能都要远远优于MySQL。 Klustron(原KunlunBase) 继承并扩展了PostgreSQL在分析类SQL查询处理方面的强大能力,可以支持TPC-H和TPC-DS的所有查询。同时,一个Klustron 集群可以管理的数据规模远远大于 一个PostgreSQL 实例。
1 测试环境
**服务器配置:**PostgreSQL 和 MySQL 分别部署在一台:亚马逊 m5.4xlarge(CPU 8cores 16 Threads,内存:64 G,存储 gp3, 通用型 SSD 卷 3000IOPS ,125 MB/s 吞吐量)上。
软件版本:
PostgreSQL:PostgreSQL 12.7 on x86_64-koji-linux-gn
MySQL: percona 8.0.26-16
数据库参数配置:
PostgreSQL:shared_buffers 8192MB
MySQL: innodb_buffer_pool_size 8192MB
**测试背景:**PostgreSQL 和 MySQL 采用默认的安装配置,只调整了内存参数, 整个测试过程 PostgreSQL 没有任何优化行为。
2 表的信息
1张事实表:lineorder
4张维度表:customer,part,supplier,dates
表占用操作系统存储空间:19 G
数据查询:11 条标准 SQL 查询测试语句(统计查询、多表关联、sum、复杂条件、group by、order by 等组合方式)。
具体SQL语句:附录1 TPC-H 测试 SQL。
3 测试结果
**MySQL 错误信息:**ERROR 3 (HY000): Error writing file '/kunlun2/data10/6010/tmp/MYfd=332'(OS errno 28-No space left on device)
**分析及总结:**通过对比同一 SQL 语句在 PostgreSQL 和 MySQL 执行计划,会发现 MySQL 的执行计划没有采用最优的 join 次序及并行操作,导致性能差。
譬如 Q2.1 MySQL 执行计划:
**执行计划分析:**上述执行计划首先将几个维度表做 join(dates 和 supplier 和 part),得到的结果再与事实表 lineorder join,因而得出了一个超级大的中间结果集,数据量达到 10 的 15 次方的数量(查询计划第五行返回的 rows),最终导致临时文件耗尽磁盘空间而未能完成查询。
**优化方案:**通过在 SQL 语句中强制指定表 join 次序:首先与 part 表 join 得到一个最小的数据子集,然后再与 supplier 和 dates join,逐步缩小范围,查询语句及查询计划的效果如下:
explain format=tree selectsum(lo_revenue) as lo_revenue, d_year as year, p_brand from ((lineorderstraight_join part on lo_partkey = p_partkey) straight_join supplier on lo_suppkey = s_suppkey) straight_join dates ON lo_orderdate = d_datekey where p_category ='MFGR#12' and s_region = 'AMERICA' group by year, p_brand order by year,p_brand;
在 MySQL 上指定 join 次序,
同时开启动并行查询:set local innodb_parallel_read_threads=16
执行时间:1 min 53.72 sec
而 PostgreSQL 的执行计划:
执行时间:1133 ms
PostgreSQL 自动采用最优的 join 次序,大大减少了最后排序的数据量,并且启动了 3 个并行分割数据集,因此执行的性能比手动优化后的 MySQL 还快 10 倍。
**结论:**MySQL 执行 TPC-H 的测试,需要手动优化查询语句,即使如此,性能任然远远低于 PostgreSQL,而 PostgreSQL 默认的配置就可以达到相对好的性能。
4 附录:测试 SQL 及查询计划
Q1.1
select sum(lo_revenue) as revenue from lineorder
join dates on lo_orderdate = d_datekey
where d_year = 1993 and lo_discount between 1 and 3 and lo_quantity < 25;
PostgreSQL 执行计划:
MySQL 执行计划:
Q1.2
select sum(lo_revenue) as revenue from lineorder
join dates on lo_orderdate = d_datekey
where d_yearmonthnum = 199401
and lo_discount between 4 and 6
and lo_quantity between 26 and 35;
PostgreSQL 执行计划:
MySQL 执行计划:
Q1.3
select sum(lo_revenue) as revenue from lineorder
join dates on lo_orderdate = d_datekey
where d_weeknuminyear = 6 and year(d_datekey) = 1994
and lo_discount between 5 and 7
and lo_quantity between 26 and 35;
PostgreSQL 执行计划:
MySQL 执行计划:
Q2.1
select sum(lo_revenue) as lo_revenue, d_year as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_category = 'MFGR#12' and s_region = 'AMERICA'
group by year, p_brand
order by year, p_brand;
PostgreSQL 执行计划:
MySQL 执行计划:
Q2.2
select sum(lo_revenue) as lo_revenue, year(d_datekey) as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_brand between 'MFGR#2221' and 'MFGR#2228' and s_region = 'ASIA'
group by year, p_brand
order by year, p_brand;
PostgreSQL 执行计划:
MySQL 执行计划:
Q2.3
select sum(lo_revenue) as lo_revenue, year(d_datekey) as year, p_brand
from lineorder
join dates on lo_orderdate = d_datekey
join part on lo_partkey = p_partkey
join supplier on lo_suppkey = s_suppkey
where p_brand = 'MFGR#2239' and s_region = 'EUROPE'
group by year, p_brand
order by year, p_brand;
PostgreSQL 执行计划:
MySQL 执行计划:
Q3.1
select c_nation, s_nation, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where c_region = 'ASIA' and s_region = 'ASIA' and year(d_datekey) between 1992 and 1997
group by c_nation, s_nation, year
order by year asc, lo_revenue desc;
PostgreSQL 执行计划:
MySQL 执行计划:
Q3.2
select c_city, s_city, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where c_nation = 'UNITED STATES' and s_nation = 'UNITED STATES'
and year(d_datekey) between 1992 and 1997
group by c_city, s_city, year
order by year asc, lo_revenue desc;
PostgreSQL 执行计划:
MySQL 执行计划:
Q3.3
select c_city, s_city, year(d_datekey) as year, sum(lo_revenue) as lo_revenue
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
where (c_city='UNITED KI1' or c_city='UNITED KI5')
and (s_city='UNITED KI1' or s_city='UNITED KI5')
and year(d_datekey) between 1992 and 1997
group by c_city, s_city, year
order by year asc, lo_revenue desc;
PostgreSQL 执行计划:
MySQL 执行计划:
Q4.1
select year(d_datekey) as year, c_nation, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA' and s_region = 'AMERICA' and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by year, c_nation
order by year, c_nation;
PostgreSQL 执行计划:
MySQL 执行计划:
Q4.2
select year(d_datekey) as year, s_nation, p_category, sum(lo_revenue) - sum(lo_supplycost) as profit
from lineorder
join dates on lo_orderdate = d_datekey
join customer on lo_custkey = c_custkey
join supplier on lo_suppkey = s_suppkey
join part on lo_partkey = p_partkey
where c_region = 'AMERICA'and s_region = 'AMERICA'
and (year(d_datekey) = 1997 or year(d_datekey) = 1998)
and (p_mfgr = 'MFGR#1' or p_mfgr = 'MFGR#2')
group by year, s_nation, p_category
order by year, s_nation, p_category;
PostgreSQL 执行计划:
MySQL 执行计划: