This article gives a detailed overview of the Exasol cluster architecture.
Exasol cluster nodes: hardware
Our clusters are built with commodity Intel servers without particularly expensive components. SAS hard drives and Ethernet cards are sufficient. There’s no need for an additional storage layer like a storage area network (SAN).
The cluster nodes’ hard drives are configured as RAID 1 pairs as best practice. Each cluster node holds four different areas:
- OS (50GB) contains CentOS Linux, EXAClusterOS and our database executables
- Swap (4GB)
- Data (50GB) contains log files, core dumps and BucketFS
- Storage uses the remaining hard-drive capacity for data and archive volumes
The first three can be stored on dedicated disks. These disks are also configured as RAID 1 pairs and are usually smaller than the disks that contain the volumes.
It’s also common for servers to have only one disk type rather than dedicated disks. In that case, the disks are configured as hardware RAID 1 pairs, and software RAID 0 partitions are added across all disks to contain the OS, swap and data partitions.
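As a rough illustration of how the partition sizes above translate into space for volumes, the following sketch estimates the storage area of a single node. It assumes identical disks grouped into RAID 1 pairs and ignores file system and RAID overhead; the function name and the example disk sizes are hypothetical.

```python
# Sizes in GB, taken from the partition list above.
OS_GB = 50
SWAP_GB = 4
DATA_GB = 50


def storage_capacity_gb(disk_count, disk_size_gb):
    """Estimate the space left for data and archive volumes on one node.

    Assumption: identical disks in RAID 1 pairs, so each pair contributes
    the capacity of a single disk. Overhead is ignored.
    """
    if disk_count < 2 or disk_count % 2 != 0:
        raise ValueError("RAID 1 pairs need an even number of disks (at least 2)")
    usable_gb = (disk_count // 2) * disk_size_gb
    reserved_gb = OS_GB + SWAP_GB + DATA_GB
    if usable_gb <= reserved_gb:
        raise ValueError("disks are too small for the OS, swap and data areas")
    return usable_gb - reserved_gb


# Hypothetical node with eight 1000 GB disks: 4 pairs minus 104 GB reserved.
print(storage_capacity_gb(disk_count=8, disk_size_gb=1000))
```

The result is only an upper bound for planning purposes; the real figure depends on the actual disk layout.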
Exasol 4+1 cluster: software layers
This popular multi-node cluster serves as an example to illustrate the concepts. It’s called a 4+1 cluster because it has four active nodes and one reserve node. Active and reserve nodes have the same layers of software available.
Once a cluster is installed, the license server copies these layers as tarballs across the private network to the other nodes. The license server is the only node in the cluster that boots directly from disk. When the cluster starts up, the license server provides the required software layers to the other cluster nodes.
Exasol license essentials
There are three types of licenses available:
Database RAM license
This most commonly used model specifies the total RAM that can be assigned to databases in a cluster.
Raw data license
This specifies the maximum raw data size you can store across databases in the cluster.
Memory data license
This specifies the maximum amount of compressed data you can store across all databases.
For RAM licenses, we check the RAM assignment when the database starts. If the RAM usage exceeds the maximum RAM specified by the license, the database won’t start.
For data licenses such as raw data and memory data, we periodically check the data size. If it exceeds the value specified in the license, the database doesn’t allow any further data insertion until the usage drops below that specified value.
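The two checks described above can be summarized in a short sketch. This is not Exasol's implementation; the `License` class, the license kind strings and both function names are made up for illustration, and sizes are assumed to be in GB.

```python
from dataclasses import dataclass


@dataclass
class License:
    kind: str        # hypothetical labels: "ram", "raw_data" or "memory_data"
    limit_gb: float  # maximum value specified by the license


def can_start_database(lic, requested_ram_gb, ram_already_assigned_gb=0.0):
    """RAM licenses are checked once, when a database starts."""
    if lic.kind != "ram":
        return True
    return ram_already_assigned_gb + requested_ram_gb <= lic.limit_gb


def inserts_allowed(lic, current_size_gb):
    """Data licenses are checked periodically against the current data size."""
    if lic.kind == "ram":
        return True
    # Inserts stay blocked for as long as the size exceeds the limit.
    return current_size_gb <= lic.limit_gb


ram_license = License(kind="ram", limit_gb=256)
print(can_start_database(ram_license, requested_ram_gb=200))   # True
print(can_start_database(ram_license, requested_ram_gb=300))   # False

raw_license = License(kind="raw_data", limit_gb=1000)
print(inserts_allowed(raw_license, current_size_gb=1200))      # False
```

The key difference is timing: a RAM license can prevent a database from starting, while a data license only restricts inserts on a database that is already running.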
Customers receive their license as a separate file. To activate the license, upload the license file to the license server using EXAOperation.
Storage volumes are created with EXAOperation on specified nodes. EXAStorage offers two volume types:
Data volumes
Each database needs one volume for persistent data and one for temporary data. The temporary volume is automatically created by a database process. The persistent data volume has to be created by an Exasol administrator when the database is created.
Archive volumes
These store backup files of the database.
Exasol 4+1 cluster: data and archive volume distribution
The hard drives of the cluster’s active nodes host data and archive volumes. They consume most of the capacity of these drives. The license server usually hosts EXAOperation.
EXAOperation is the main management graphical user interface (GUI) for the clusters, consisting of an application server and a small configuration database. Both are generally located on the license server.
You can access EXAOperation from all cluster nodes via HTTPS. Should the license server go down, EXAOperation will fail over to another node. This doesn’t affect the availability of the database at all.
Shared-nothing architecture (MPP processing)
We developed Exasol as a parallel system. It’s constructed based on the shared-nothing principle. This means data is distributed across all nodes in a cluster. When responding to queries, all nodes co-operate and special parallel algorithms ensure that most data is processed locally in each individual node’s main memory.
When a query is sent to the system, it’s accepted first by the node the client is connected to. The query is then distributed to all nodes. Intelligent algorithms optimize the query, determine the best plan of action and generate the needed indexes on the fly. The system then processes the partial results based on the local datasets. This processing paradigm is known as single program multiple data (SPMD). All cluster nodes operate on an equal basis; there’s no master node. The user receives the global query result through the original connection.
The previous graphic shows a cluster with four data nodes and one reserve node. The license server provides the OS used by the other nodes over the network.
The letters A, B, C and D symbolize the data stored in the database. They indicate that each node contains a different data set. Each of the active nodes n11 to n14 hosts database instances. They operate locally on their part of the database in a massively parallel processing (MPP) way. The instances communicate and co-ordinate over the private network.
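The SPMD idea can be illustrated with a toy aggregation: every node runs the same function on its own rows, and the partial results are then combined. This is a simplified sketch, not how Exasol is implemented. The row distribution, the function names and the use of threads to stand in for nodes are all assumptions, and the merge step is only a stand-in for the communication between nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical distribution of one table (region, amount) across four nodes.
NODE_DATA = {
    "n11": [("north", 120), ("south", 80)],
    "n12": [("north", 30), ("east", 50)],
    "n13": [("south", 20), ("west", 70)],
    "n14": [("east", 10), ("north", 50)],
}


def local_partial_sums(rows):
    """Single program: the same code runs on every node, on local rows only."""
    sums = {}
    for region, amount in rows:
        sums[region] = sums.get(region, 0) + amount
    return sums


def merge_partials(partials):
    """Combine the per-node partial results into the global result."""
    total = {}
    for partial in partials:
        for region, amount in partial.items():
            total[region] = total.get(region, 0) + amount
    return total


def run_query(node_data):
    # One worker per node; each worker sees only that node's rows.
    with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
        partials = list(pool.map(local_partial_sums, node_data.values()))
    return merge_partials(partials)


result = run_query(NODE_DATA)
print(result)
```

The example corresponds to a grouped sum over the whole table. Most of the work happens in `local_partial_sums`, which never touches another node's data.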
Exasol network essentials
Each cluster node needs at least two network connections – one for the public and one for the private network.
The public network is used for client connections; 1 Gbit Ethernet is usually sufficient.
The private network is used for the cluster interconnect between the nodes; 10 Gbit Ethernet or faster is recommended. The private network can also be split into a database network, over which the database instances communicate, and a storage network, over which mirrored segments are synchronized.
Exasol redundancy essentials
Redundancy is an attribute that can be set when the EXAStorage volume is created. It specifies the number of data copies hosted on active cluster nodes. In practice, it’s either redundancy 1 or redundancy 2.
Redundancy 1 means there is no redundancy. So if a node fails, the volume with that redundancy is no longer available. Typically, it’s only seen with one-node clusters.
Redundancy 2 means that each node holds a copy of the data that is operated on by a neighboring node. The volume remains available if one node fails.
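To make the neighbor relationship concrete, the following sketch maps each node to the node holding its mirror and checks whether every piece of data still has a surviving copy after a failure. It assumes the neighbors form a ring, so the last node mirrors to the first; the source only states that a neighboring node holds the copy. The node names and both function names are illustrative.

```python
def mirror_placement(active_nodes):
    """Map each node to the neighbor that holds the mirror of its data.

    Assumption: neighbors form a ring, so the last node mirrors to the first.
    """
    if len(active_nodes) < 2:
        raise ValueError("redundancy 2 needs at least two active nodes")
    count = len(active_nodes)
    return {
        node: active_nodes[(index + 1) % count]
        for index, node in enumerate(active_nodes)
    }


def volume_available(placement, failed_nodes):
    """The volume is available while every data set has a surviving copy."""
    failed = set(failed_nodes)
    return all(
        master not in failed or mirror not in failed
        for master, mirror in placement.items()
    )


placement = mirror_placement(["n11", "n12", "n13", "n14"])
print(placement["n11"])                              # n12
print(volume_available(placement, ["n12"]))          # True
print(volume_available(placement, ["n12", "n13"]))   # False
```

With redundancy 1 there is no mirror at all, which is why a single node failure makes such a volume unavailable.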
Exasol 4+1 cluster: redundancy 2
Best practice is to configure volumes with redundancy 2. Each node then holds a mirror of the data operated on by a neighboring node. If, for example, node n11 modifies A, its mirror A’ on n12 is synchronized over the private network. Should an active node fail, the reserve node steps in and starts an instance.
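The reserve node's role in a 4+1 cluster can be sketched as a simple reassignment: when an active node fails, the instance for that node's part of the data is started on the reserve node instead. This is a conceptual model only. The reserve node name `n15` and the function `fail_over` are assumptions, and how the reserve node actually gets access to the data is left out.

```python
def fail_over(instances, failed_node, reserve_nodes):
    """Return the instance map and remaining reserves after a node failure.

    `instances` maps an active node name to the data set its instance
    operates on. The inputs are not modified.
    """
    if failed_node not in instances:
        raise ValueError(f"{failed_node} is not an active node")
    if not reserve_nodes:
        raise RuntimeError("no reserve node left to take over")
    updated = dict(instances)
    data_set = updated.pop(failed_node)
    replacement = reserve_nodes[0]
    # The reserve node starts an instance for the failed node's data set.
    updated[replacement] = data_set
    return updated, reserve_nodes[1:]


instances = {"n11": "A", "n12": "B", "n13": "C", "n14": "D"}
updated, remaining = fail_over(instances, "n12", ["n15"])
print(updated)     # n15 now operates on B
print(remaining)   # no reserve nodes left
```

After the failover in this model, the cluster again has four active instances but no spare, so a second failure could not be absorbed the same way.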