PostgreSQLでSQLの勉強

  • このエントリーをはてなブックマークに追加

色々な事を考えていたら、ハマったのでその記録。

Postgresのバージョンは9.5.3です。(正確にはPipelineDBの0.9.5です)
例えば、以下のテーブルがあります。

CREATE TABLE imp   (campaign_id BIGINT, campaign_creative_id BIGINT,   imp_count BIGINT, CONSTRAINT   pk_imp1 PRIMARY KEY (campaign_id, campaign_creative_id));
CREATE TABLE vimp (campaign_id BIGINT, campaign_creative_id BIGINT, vimp_count BIGINT, CONSTRAINT pk_vimp1 PRIMARY KEY (campaign_id, campaign_creative_id));
CREATE TABLE click (campaign_id BIGINT, campaign_creative_id BIGINT, click_count BIGINT, CONSTRAINT pk_click1 PRIMARY KEY (campaign_id, campaign_creative_id));
CREATE TABLE conv (campaign_id BIGINT, campaign_creative_id BIGINT, conv_count BIGINT, CONSTRAINT pk_conv1 PRIMARY KEY (campaign_id, campaign_creative_id));

これを次のようにまとめたいとします。なんとなくやりたい事は理解してもらえると思います。

SELECT
COALESCE(imp.campaign_id, vimp.campaign_id, click.campaign_id, conv.campaign_id) as campaign_id,
COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id, click.campaign_creative_id, conv.campaign_creative_id) as campaign_creative_id,
COALESCE(imp.imp_count, 0) as imp_count,
COALESCE(vimp.vimp_count, 0) as vimp_count,
COALESCE(click.click_count, 0) as click_count,
COALESCE(conv.conv_count, 0) as conv_count
FROM
imp
FULL OUTER JOIN vimp
USING (campaign_id, campaign_creative_id)
FULL OUTER JOIN click
USING (campaign_id, campaign_creative_id)
FULL OUTER JOIN conv
USING (campaign_id, campaign_creative_id)
;

期待する結果としては次のようになります。

INSERT INTO imp VALUES(100, 1000, 111);
INSERT INTO imp VALUES(100, 1001, 221);
INSERT INTO imp VALUES(101, 1000, 333);
INSERT INTO vimp VALUES(100, 1000, 30);
INSERT INTO click VALUES(102, 1004, 5);
INSERT INTO conv VALUES(102, 1004, 1);
INSERT INTO conv VALUES(103, 1005, 2);
 campaign_id | campaign_creative_id | imp_count | vimp_count | click_count | conv_count
-------------+----------------------+-----------+------------+-------------+------------
100 | 1000 | 111 | 30 | 0 | 0
100 | 1001 | 221 | 0 | 0 | 0
101 | 1000 | 333 | 0 | 0 | 0
102 | 1004 | 0 | 0 | 5 | 1
103 | 1005 | 0 | 0 | 0 | 2

ただし欲しい情報は全部ではないので、campaign_idで絞り込みます。
最終的にはFUNCTIONにして可変長引数としてbigint[]を受け取り、それをcampaign_id = ANY(campaign_ids)みたいな事をしますが、問題を簡単にするためにひとまず1件のcampaign_idを指定します。

いくつかのパターンを試す

ダミーデータで実行してもあまり意味が無いので、手元にあるテーブルを使います。
テーブル名が上記のサンプルからがらりと変わりますが、適宜脳内変換をして頂ければ…

愚直にWHERE句に条件を書く

SELECT
COALESCE(imp.campaign_id, vimp.campaign_id, click.campaign_id, conv.campaign_id) as campaign_id,
COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id, click.campaign_creative_id, conv.campaign_creative_id) as campaign_creative_id,
COALESCE(imp.imp_count, 0) as imp_count,
COALESCE(vimp.vimp_count, 0) as vimp_count,
COALESCE(click.click_count, 0) as click_count,
COALESCE(conv.conv_count, 0) as conv_count
FROM
imp
FULL OUTER JOIN vimp
USING (campaign_id, campaign_creative_id)
FULL OUTER JOIN click
USING (campaign_id, campaign_creative_id)
FULL OUTER JOIN conv
USING (campaign_id, campaign_creative_id)
WHERE
campaign_id = 102

imp.campaign_idではなくcampaign_id
WHERE句に書いたフィルタリングは結合の後に実行されます。

 campaign_id | campaign_creative_id | imp_count | vimp_count | click_count | conv_count
-------------+----------------------+-----------+------------+-------------+------------
102 | 1004 | 0 | 0 | 5 | 1
Hash Full Join  (cost=257.71..361.29 rows=8 width=96) (actual time=0.051..0.056 rows=1 loops=1)
Hash Cond: ((COALESCE(COALESCE(imp.campaign_id, vimp.campaign_id), click.campaign_id) = conv.campaign_id) AND (COALESCE(COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id), click.campaign_creative_id) = conv.campaign_creative_id))
Filter: (COALESCE(COALESCE(COALESCE(imp.campaign_id, vimp.campaign_id), click.campaign_id), conv.campaign_id) = 102)
Rows Removed by Filter: 3
-> Merge Full Join (cost=243.22..335.01 rows=1570 width=72) (actual time=0.024..0.028 rows=4 loops=1)
Merge Cond: ((click.campaign_id = (COALESCE(imp.campaign_id, vimp.campaign_id))) AND (click.campaign_creative_id = (COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id))))
-> Index Scan using pk_click1 on click (cost=0.15..71.70 rows=1570 width=24) (actual time=0.003..0.003 rows=1 loops=1)
-> Sort (cost=243.06..246.99 rows=1570 width=48) (actual time=0.019..0.020 rows=3 loops=1)
Sort Key: (COALESCE(imp.campaign_id, vimp.campaign_id)), (COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id))
Sort Method: quicksort Memory: 25kB
-> Merge Full Join (cost=0.30..159.72 rows=1570 width=48) (actual time=0.006..0.010 rows=3 loops=1)
Merge Cond: ((imp.campaign_id = vimp.campaign_id) AND (imp.campaign_creative_id = vimp.campaign_creative_id))
-> Index Scan using pk_imp1 on imp (cost=0.15..71.70 rows=1570 width=24) (actual time=0.002..0.002 rows=3 loops=1)
-> Index Scan using pk_vimp1 on vimp (cost=0.15..71.70 rows=1570 width=24) (actual time=0.002..0.003 rows=1 loops=1)
-> Hash (cost=14.37..14.37 rows=8 width=24) (actual time=0.013..0.013 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Bitmap Heap Scan on conv (cost=4.21..14.37 rows=8 width=24) (actual time=0.011..0.011 rows=1 loops=1)
Recheck Cond: (campaign_id = 102)
Heap Blocks: exact=1
-> Bitmap Index Scan on pk_conv1 (cost=0.00..4.21 rows=8 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (campaign_id = 102)
Planning time: 0.295 ms
Execution time: 0.119 ms

WHERE句に全部条件を書いてみる

SELECT
COALESCE(imp.campaign_id, vimp.campaign_id, click.campaign_id, conv.campaign_id) as campaign_id,
COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id, click.campaign_creative_id, conv.campaign_creative_id) as campaign_creative_id,
COALESCE(imp.imp_count, 0) as imp_count,
COALESCE(vimp.vimp_count, 0) as vimp_count,
COALESCE(click.click_count, 0) as click_count,
COALESCE(conv.conv_count, 0) as conv_count
FROM
imp
FULL OUTER JOIN vimp
USING (campaign_id, campaign_creative_id)
FULL OUTER JOIN click
USING (campaign_id, campaign_creative_id)
FULL OUTER JOIN conv
USING (campaign_id, campaign_creative_id)
WHERE
imp.campaign_id = 102
OR vimp.campaign_id = 102
OR click.campaign_id = 102
OR conv.campaign_id = 102
;

※今回のケースではANDで繋げるのはNG。とてもひどいことになった気がする。

Merge Full Join  (cost=418.50..510.92 rows=31 width=96) (actual time=0.034..0.035 rows=1 loops=1)
Merge Cond: ((conv.campaign_id = (COALESCE(COALESCE(imp.campaign_id, vimp.campaign_id), click.campaign_id))) AND (conv.campaign_creative_id = (COALESCE(COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id), click.campaign_creative_id))))
Filter: ((imp.campaign_id = 102) OR (vimp.campaign_id = 102) OR (click.campaign_id = 102) OR (conv.campaign_id = 102))
Rows Removed by Filter: 4
-> Index Scan using pk_conv1 on conv (cost=0.15..71.70 rows=1570 width=24) (actual time=0.006..0.006 rows=2 loops=1)
-> Sort (cost=418.35..422.28 rows=1570 width=72) (actual time=0.022..0.022 rows=4 loops=1)
Sort Key: (COALESCE(COALESCE(imp.campaign_id, vimp.campaign_id), click.campaign_id)), (COALESCE(COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id), click.campaign_creative_id))
Sort Method: quicksort Memory: 25kB
-> Merge Full Join (cost=243.22..335.01 rows=1570 width=72) (actual time=0.015..0.018 rows=4 loops=1)
Merge Cond: ((click.campaign_id = (COALESCE(imp.campaign_id, vimp.campaign_id))) AND (click.campaign_creative_id = (COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id))))
-> Index Scan using pk_click1 on click (cost=0.15..71.70 rows=1570 width=24) (actual time=0.001..0.001 rows=1 loops=1)
-> Sort (cost=243.06..246.99 rows=1570 width=48) (actual time=0.012..0.012 rows=3 loops=1)
Sort Key: (COALESCE(imp.campaign_id, vimp.campaign_id)), (COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id))
Sort Method: quicksort Memory: 25kB
-> Merge Full Join (cost=0.30..159.72 rows=1570 width=48) (actual time=0.005..0.008 rows=3 loops=1)
Merge Cond: ((imp.campaign_id = vimp.campaign_id) AND (imp.campaign_creative_id = vimp.campaign_creative_id))
-> Index Scan using pk_imp1 on imp (cost=0.15..71.70 rows=1570 width=24) (actual time=0.002..0.003 rows=3 loops=1)
-> Index Scan using pk_vimp1 on vimp (cost=0.15..71.70 rows=1570 width=24) (actual time=0.001..0.002 rows=1 loops=1)
Planning time: 0.194 ms
Execution time: 0.084 ms

JOINにONを使って、そこで絞り込む

https://www.postgresql.jp/document/9.5/html/queries-table-expressions.html

この理由はON句の中の制約は結合の前に処理され、一方WHERE句の中の制約は結合の後に処理されることによります。 これは内部結合には影響がありませんが、外部結合には大きな影響があります。

JOINは結合前にフィルタリングされる。

SELECT
COALESCE(imp.campaign_id, vimp.campaign_id, click.campaign_id, conv.campaign_id) as campaign_id,
COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id, click.campaign_creative_id, conv.campaign_creative_id) as campaign_creative_id,
COALESCE(imp.imp_count, 0) as imp_count,
COALESCE(vimp.vimp_count, 0) as vimp_count,
COALESCE(click.click_count, 0) as click_count,
COALESCE(conv.conv_count, 0) as conv_count
FROM
imp
FULL OUTER JOIN vimp
ON (imp.campaign_id = vimp.campaign_id AND imp.campaign_creative_id = vimp.campaign_creative_id)
FULL OUTER JOIN click
ON (imp.campaign_id = click.campaign_id AND imp.campaign_creative_id = click.campaign_creative_id)
FULL OUTER JOIN conv
ON (imp.campaign_id = conv.campaign_id AND imp.campaign_creative_id = conv.campaign_creative_id)
;

あれ..この時点で結果が違うぞ..

 campaign_id | campaign_creative_id | imp_count | vimp_count | click_count | conv_count
-------------+----------------------+-----------+------------+-------------+------------
100 | 1000 | 111 | 30 | 0 | 0
100 | 1001 | 221 | 0 | 0 | 0
101 | 1000 | 333 | 0 | 0 | 0
102 | 1004 | 0 | 0 | 5 | 0
102 | 1004 | 0 | 0 | 0 | 1
103 | 1005 | 0 | 0 | 0 | 2

こうでした。

SELECT
COALESCE(imp.campaign_id, vimp.campaign_id, click.campaign_id, conv.campaign_id) as campaign_id,
COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id, click.campaign_creative_id, conv.campaign_creative_id) as campaign_creative_id,
COALESCE(imp.imp_count, 0) as imp_count,
COALESCE(vimp.vimp_count, 0) as vimp_count,
COALESCE(click.click_count, 0) as click_count,
COALESCE(conv.conv_count, 0) as conv_count
FROM
imp
FULL OUTER JOIN vimp
ON (imp.campaign_id = vimp.campaign_id AND imp.campaign_creative_id = vimp.campaign_creative_id)
FULL OUTER JOIN click
ON (vimp.campaign_id = click.campaign_id AND vimp.campaign_creative_id = click.campaign_creative_id)
FULL OUTER JOIN conv
ON (click.campaign_id = conv.campaign_id AND click.campaign_creative_id = conv.campaign_creative_id)
;
 campaign_id | campaign_creative_id | imp_count | vimp_count | click_count | conv_count
-------------+----------------------+-----------+------------+-------------+------------
102 | 1004 | 0 | 0 | 5 | 1
100 | 1000 | 111 | 30 | 0 | 0
100 | 1001 | 221 | 0 | 0 | 0
101 | 1000 | 333 | 0 | 0 | 0
103 | 1005 | 0 | 0 | 0 | 2

これを元にフィルタリング。

SELECT
COALESCE(imp.campaign_id, vimp.campaign_id, click.campaign_id, conv.campaign_id) as campaign_id,
COALESCE(imp.campaign_creative_id, vimp.campaign_creative_id, click.campaign_creative_id, conv.campaign_creative_id) as campaign_creative_id,
COALESCE(imp.imp_count, 0) as imp_count,
COALESCE(vimp.vimp_count, 0) as vimp_count,
COALESCE(click.click_count, 0) as click_count,
COALESCE(conv.conv_count, 0) as conv_count
FROM
imp
FULL OUTER JOIN vimp
ON (imp.campaign_id = 102 AND vimp.campaign_id = 102 AND imp.campaign_id = vimp.campaign_id AND imp.campaign_creative_id = vimp.campaign_creative_id)
FULL OUTER JOIN click
ON (vimp.campaign_id = 102 AND click.campaign_id = 102 AND vimp.campaign_id = click.campaign_id AND vimp.campaign_creative_id = click.campaign_creative_id)
FULL OUTER JOIN conv
ON (click.campaign_id = 102 AND conv.campaign_id = 102 AND click.campaign_id = conv.campaign_id AND click.campaign_creative_id = conv.campaign_creative_id)
;

期待する結果にならないな.. FULL OUTER JOINは難しい。
そもそも何をしたいかと言うと、WHERE句に書くとJOINした後のフィルタリングなので、JOINする前に必要な情報だけに削ってからJOINしたい。
という事は、campaign_id=102で絞った各テーブルを結合する事になる。

WITH構文

WITH 
fimp AS (SELECT * FROM imp WHERE campaign_id = 102),
fvimp AS (SELECT * FROM vimp WHERE campaign_id = 102),
fclick AS (SELECT * FROM click WHERE campaign_id = 102),
fconv AS (SELECT * FROM conv WHERE campaign_id = 102)
SELECT
COALESCE(fimp.campaign_id, fvimp.campaign_id, fclick.campaign_id, fconv.campaign_id) as campaign_id,
COALESCE(fimp.campaign_creative_id, fvimp.campaign_creative_id, fclick.campaign_creative_id, fconv.campaign_creative_id) as campaign_creative_id,
COALESCE(fimp.imp_count, 0) as imp_count,
COALESCE(fvimp.vimp_count, 0) as vimp_count,
COALESCE(fclick.click_count, 0) as click_count,
COALESCE(fconv.conv_count, 0) as conv_count
FROM
fimp
FULL OUTER JOIN fvimp
USING (campaign_id, campaign_creative_id)
FULL OUTER JOIN fclick
USING (campaign_id, campaign_creative_id)
FULL OUTER JOIN fconv
USING (campaign_id, campaign_creative_id)
;
Hash Full Join  (cost=58.33..58.70 rows=8 width=96) (actual time=0.071..0.075 rows=1 loops=1)
Hash Cond: ((COALESCE(COALESCE(fimp.campaign_id, fvimp.campaign_id), fclick.campaign_id) = fconv.campaign_id) AND (COALESCE(COALESCE(fimp.campaign_creative_id, fvimp.campaign_creative_id), fclick.campaign_creative_id) = fconv.campaign_creative_id))
CTE fimp
-> Bitmap Heap Scan on imp (cost=4.21..14.37 rows=8 width=24) (actual time=0.004..0.004 rows=0 loops=1)
Recheck Cond: (campaign_id = 102)
-> Bitmap Index Scan on pk_imp1 (cost=0.00..4.21 rows=8 width=0) (actual time=0.003..0.003 rows=0 loops=1)
Index Cond: (campaign_id = 102)
CTE fvimp
-> Bitmap Heap Scan on vimp (cost=4.21..14.37 rows=8 width=24) (actual time=0.014..0.014 rows=0 loops=1)
Recheck Cond: (campaign_id = 102)
-> Bitmap Index Scan on pk_vimp1 (cost=0.00..4.21 rows=8 width=0) (actual time=0.014..0.014 rows=0 loops=1)
Index Cond: (campaign_id = 102)
CTE fclick
-> Bitmap Heap Scan on click (cost=4.21..14.37 rows=8 width=24) (actual time=0.003..0.003 rows=1 loops=1)
Recheck Cond: (campaign_id = 102)
Heap Blocks: exact=1
-> Bitmap Index Scan on pk_click1 (cost=0.00..4.21 rows=8 width=0) (actual time=0.002..0.002 rows=1 loops=1)
Index Cond: (campaign_id = 102)
CTE fconv
-> Bitmap Heap Scan on conv (cost=4.21..14.37 rows=8 width=24) (actual time=0.010..0.010 rows=1 loops=1)
Recheck Cond: (campaign_id = 102)
Heap Blocks: exact=1
-> Bitmap Index Scan on pk_conv1 (cost=0.00..4.21 rows=8 width=0) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (campaign_id = 102)
-> Hash Full Join (cost=0.56..0.86 rows=8 width=72) (actual time=0.042..0.042 rows=1 loops=1)
Hash Cond: ((COALESCE(fimp.campaign_id, fvimp.campaign_id) = fclick.campaign_id) AND (COALESCE(fimp.campaign_creative_id, fvimp.campaign_creative_id) = fclick.campaign_creative_id))
-> Hash Full Join (cost=0.28..0.51 rows=8 width=48) (actual time=0.028..0.028 rows=0 loops=1)
Hash Cond: ((fimp.campaign_id = fvimp.campaign_id) AND (fimp.campaign_creative_id = fvimp.campaign_creative_id))
-> CTE Scan on fimp (cost=0.00..0.16 rows=8 width=24) (actual time=0.004..0.004 rows=0 loops=1)
-> Hash (cost=0.16..0.16 rows=8 width=24) (actual time=0.014..0.014 rows=0 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 8kB
-> CTE Scan on fvimp (cost=0.00..0.16 rows=8 width=24) (actual time=0.014..0.014 rows=0 loops=1)
-> Hash (cost=0.16..0.16 rows=8 width=24) (actual time=0.005..0.005 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> CTE Scan on fclick (cost=0.00..0.16 rows=8 width=24) (actual time=0.003..0.003 rows=1 loops=1)
-> Hash (cost=0.16..0.16 rows=8 width=24) (actual time=0.017..0.017 rows=1 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> CTE Scan on fconv (cost=0.00..0.16 rows=8 width=24) (actual time=0.014..0.015 rows=1 loops=1)
Planning time: 0.513 ms
Execution time: 0.347 ms

あれ..随分遅くなった..
件数が少ないからかしら。

データを多く投入してみる

各テーブルに10万件

WITH RECURSIVE tmp(c1,c2,c3) AS (
SELECT 1001, 10000,floor(random()*10000000)
UNION ALL
SELECT c1+1, c2+2,floor(random()*10000000)
FROM tmp WHERE c1 < 1000 + 100000
)
INSERT INTO imp SELECT * FROM tmp;

WITH RECURSIVE tmp(c1,c2,c3) AS (
SELECT 1001, 10000,floor(random()*10000000)
UNION ALL
SELECT c1+1, c2+2,floor(random()*10000000)
FROM tmp WHERE c1 < 1000 + 100000
)
INSERT INTO vimp SELECT * FROM tmp;

WITH RECURSIVE tmp(c1,c2,c3) AS (
SELECT 1001, 10000,floor(random()*10000000)
UNION ALL
SELECT c1+1, c2+2,floor(random()*10000000)
FROM tmp WHERE c1 < 1000 + 100000
)
INSERT INTO click SELECT * FROM tmp;

WITH RECURSIVE tmp(c1,c2,c3) AS (
SELECT 1001, 10000,floor(random()*10000000)
UNION ALL
SELECT c1+1, c2+2,floor(random()*10000000)
FROM tmp WHERE c1 < 1000 + 100000
)
INSERT INTO conv SELECT * FROM tmp;

なんだかんだでWITH構文(CTE)使う方法が一番良いっぽい。
最終的には、WITHで絞り込んだ情報をJOINすると、240msくらいから0.1msくらいまで高速にできました。

とはいえ、やりたいことはこれではなく、この続きでまた躓くことになるのです。
(PostgresではなくPipelineDBの不具合によって..)

参照