Uploaded image for project: 'Solution Center'
  1. Solution Center
  2. SOL-251

Bad data distribution after cluster enlargement

    XMLWordPrintable

    Details

    • Type: Explanation
    • Status: Published
    • Affects Version/s: EXASolution 4.0.0, EXASOL 6.0.0, Exasol 6.1.0, EXASolution 5.0.0
    • Fix Version/s: None
    • Component/s: EXASolution
    • Labels:
      None
    • Symptoms:
      • After a cluster enlargement, performance may not increase as expected, or even decrease.
      • Profile of affected queries shows strong data imbalance (NODE SYNC) and possibly disc access
    • Explanation:
      Hide

      This may be caused by the semantics of the cluster enlargement (REORGANIZE TABLE):

      Prerequisites:

      1. Database running on N nodes
      2. Fact tables have no distribution keys
      3. Data is inserted in a mostly sorted way (ie. daily data with daily timestamps)
      4. Data is queried using strong date filters

      Steps causing the problem

      1. Cluster enlargement to N+X nodes, including REORGANIZE of the fact tables

      Result
      Reorganize is content-agnostic and tries to move as little data as possible. In fact, on each of the N nodes the data inserted last is taken and transmitted to the X new nodes until data is balanced across all nodes.
      With chronologically sorted data, data inserted last equals latest data... this means that data will be split across the cluster, with N nodes storing 'old data' and X nodes storing 'new data'.
      In worst case, X==1 and any query asking for the latest data is performed on one node only...

      Show
      This may be caused by the semantics of the cluster enlargement (REORGANIZE TABLE): Prerequisites : Database running on N nodes Fact tables have no distribution keys Data is inserted in a mostly sorted way (ie. daily data with daily timestamps) Data is queried using strong date filters Steps causing the problem Cluster enlargement to N+X nodes, including REORGANIZE of the fact tables Result Reorganize is content-agnostic and tries to move as little data as possible. In fact, on each of the N nodes the data inserted last is taken and transmitted to the X new nodes until data is balanced across all nodes. With chronologically sorted data, data inserted last equals latest data ... this means that data will be split across the cluster, with N nodes storing 'old data' and X nodes storing 'new data'. In worst case, X==1 and any query asking for the latest data is performed on one node only...
    • Solution:
      Hide

      To avoid this, always put a distribution key on your fact tables, but not on a date column.

      Show
      To avoid this, always put a distribution key on your fact tables, but not on a date column.
    • Category 1:
      Database Design
    • Category 2:
      Database Administration - Data Organization

      Attachments

        Issue Links

          Activity

            People

            • Assignee:
              CaptainEXA Captain EXASOL
              Reporter:
              CaptainEXA Captain EXASOL
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated: