Our analytics database uses a clustered shared-nothing architecture with many sophisticated internal mechanisms delivering outstanding performance. All without the need for much admin. But getting the distribution of rows between cluster nodes right is one of the few critical tasks left.
Let’s say there are two tables – t1 and t2:
The two tables are joined on the JoinCol column. WHERE conditions for filtering are done with the WhereCol column. Other columns aren’t shown to keep the sketches small and simple.
Say these two tables are stored on a three-node cluster. Again, for simplicity only active nodes are displayed – reserve or license nodes aren’t shown. Let’s also ignore the fact that small tables will be replicated across all active nodes.
What happens if no distribution key is specified?
Without specifying a distribution key, the table rows are distributed at random across the nodes:
Global joins in the absence of proper distribution keys
The tables are then joined:
SELECT FROM t1 JOIN t2 ON t1.JoinCol = t2.JoinCol;
This is processed internally as a global join. This means network communication is needed between the nodes on behalf of the join because some rows don’t find local join partners on the same node:
Distribution on join columns: local joins
If the two tables were distributed on their join columns with statements such as:
ALTER TABLE t1 DISTRIBUTE BY JoinCol; ALTER TABLE t2 DISTRIBUTE BY JoinCol;
the same query can be processed internally as a local join:
Every row finds a local join partner on the same node. This requires no network communication between the nodes on behalf of the join. The performance with a local join is much better than with a global join although the statement remains the same.
Why shouldn’t you distribute on WHERE columns?
It’s generally a good idea to distribute on Join columns. However, it’s a bad idea to distribute on columns used for filtering with WHERE conditions. If both tables had been distributed on WhereCol columns, you’d see this:
The distribution is worse than the initial random distribution. Not only does this cause global joins between the two tables as explained, but statements like:
<Any DQL or DML> WHERE t2.WhereCol='A';
will use only one node – the first with this WHERE condition. This effectively disables one of our analytics database’s strengths, the massive parallel processing (MPP) functionality. This distribution leads to poor performance because all other nodes in the cluster stand by idle while one node does all the work.
Focus: existing distribution with iproc()
The function iproc() helps investigate the existing distribution of rows across cluster nodes. This statement shows the distribution of the t1 table:
SELECT iproc(),COUNT(*) FROM t1 GROUP BY 1 ORDER BY 1;
The effect of distribution keys with value2proc()
Use the function value2proc() to display the effect of a – new – distribution key:
SELECT home_node,COUNT(*) FROM (SELECT value2proc(JoinCol) AS home_node FROM t1) GROUP BY 1 ORDER BY 1;
Distribution on Join columns leads to local joins that perform better than global ones – keep doing it.
Distribution on Where columns leads to global joins and disable the MPP functionality, both delivering poor performance – avoid doing it.