Distributed Database Topologies

eXtremeDB extension modules provide the tools to implement a variety of distributed data solutions. A distributed database system allows applications to access data from local and remote databases.

The terms "distributed database" and "distributed processing" are closely related, yet have distinct meanings. A distributed database is a set of databases in a distributed system that appear to applications as a single database. A distributed processing solution distributes its tasks among different computers in a network. In the context of database systems and applications, it means that database tasks are distributed amongst two or more nodes in a cluster that cooperate to maintain a single logical database instance.

Similarly, the terms "distributed database system" and "database replication" are related, yet distinct. In a “pure” (i.e., not replicated) distributed database, the system manages a single copy of all database objects, and a given database object exists in only one database instance. Typically, distributed database applications use distributed transactions to access both local and remote data and modify the global database in real time. The term replication refers to the operation of copying and maintaining database objects in multiple database instances belonging to a distributed system. While replication relies on distributed database technology, database replication offers benefits that are not possible within a pure distributed database environment.

Replication is used either to improve local database performance (eXtremeDB Cluster) or to improve the availability of the database (eXtremeDB High Availability). For example, with eXtremeDB Cluster, each process on a node of a cluster works with a local database to minimize network traffic and achieve maximum performance. With eXtremeDB High Availability, a standby system can continue to function if the primary server fails.

eXtremeDB Distributed Database Solutions

The eXtremeDB extension modules offer three distributed database solution architectures:

High Availability

The purpose of eXtremeDB High Availability is to preserve the availability of a mission-critical system. To accomplish this, a master database instance is replicated (distributed) to one or more standby instances, usually on separate physical devices. In an eXtremeDB High Availability deployment, there are also master and standby instances of the processes that use the database. In the event the device that hosts the master database fails, the main replica task initiates the failover process (becoming the new master) and directs all other replica tasks to resume normal processing.

eXtremeDB High Availability provides a “master-slave” architecture where the master application has READ_WRITE capability and replica applications have READ_ONLY capability. Accordingly, only the master database instance can be modified, but all transactions are replicated to the slave (read-only) copies of the database. Various communication channels are available, and replication can be synchronous or asynchronous. eXtremeDB High Availability provides extremely fast failover where a replica application takes over as master in the event of an unexpected failure of the node hosting the original master database instance. (Please refer to the eXtremeDB High Availability User's Guide for further details.)
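
As a minimal sketch of this division of labor, the fragment below uses the standard eXtremeDB core transaction API (mco_trans_start and friends) to show the master and replica sides. How each process attaches to the HA channel is omitted here because that setup is specific to the HA runtime; see the eXtremeDB High Availability User's Guide.

    #include "mco.h"

    /* On the master: only the master may open READ_WRITE transactions;
       the HA runtime replicates committed changes to the standby
       instance(s), synchronously or asynchronously as configured. */
    void master_update(mco_db_h db)
    {
        mco_trans_h t;
        if (mco_trans_start(db, MCO_READ_WRITE, MCO_TRANS_FOREGROUND, &t) == MCO_S_OK) {
            /* ... insert / update / delete database objects ... */
            mco_trans_commit(t);
        }
    }

    /* On a replica: the database is READ_ONLY, so queries execute
       against the local standby copy of the data. */
    void replica_query(mco_db_h db)
    {
        mco_trans_h t;
        if (mco_trans_start(db, MCO_READ_ONLY, MCO_TRANS_FOREGROUND, &t) == MCO_S_OK) {
            /* ... read database objects ... */
            mco_trans_rollback(t);  /* read-only: nothing to commit */
        }
    }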

Cluster

The purpose of eXtremeDB Cluster is to provide scalability for a group of devices that work cooperatively on a single logical database. Unlike High Availability, each node in a cluster is a peer of every other node, i.e. there is no master-standby relationship. Like eXtremeDB High Availability, eXtremeDB Cluster replicates (distributes) the physical database to each node in the cluster, but the copies still represent a single logical database.

Because each node has its own replicated copy of the database, read requests (queries) execute very fast: no network communication is involved. Insert, update and delete operations, however, must be replicated to every node in the cluster, which makes them slower than the same operations against a single (not distributed) local database. In the aggregate, though, the insert, update and delete operations on all nodes in the cluster taken together can exceed the performance of any single node. For example, a node in a cluster may perform at only 40% of the speed of a stand-alone node, but three nodes each operating at 40% of the theoretical maximum deliver 120% of a single node's throughput. This, combined with the fact that queries are always local, creates a platform for scaling processing with the addition of cost-effective commodity hardware. Scalability is limited by replication, however: replicating to 2, 3, 4, … nodes in a cluster introduces incremental latency to READ_WRITE transactions.

eXtremeDB Cluster provides a “node-to-node” replication architecture where all nodes in a cluster are peers, i.e. each node can perform both READ_WRITE and READ_ONLY operations. READ_ONLY transactions are always local (no network access, so very fast), while READ_WRITE transactions are, by default, distributed by the eXtremeDB Cluster runtime to all nodes in the cluster. The Cluster implementation automatically monitors the availability of nodes using configurable timeouts and keep-alive messages. There are a number of ways to improve scalability with eXtremeDB Cluster, such as using “local tables” or a scatter/gather API in lieu of automatic (default) replication. (Please refer to the eXtremeDB Cluster User's Guide for further details.)
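
The read/write asymmetry can be made concrete with the same core transaction API; the sketch below omits cluster attachment and node configuration, which are specific to the Cluster runtime (see the eXtremeDB Cluster User's Guide). Unlike the High Availability case, this code may run unchanged on any peer node.

    #include "mco.h"

    /* READ_ONLY transactions never leave the node: they execute against
       the local replica with no network round trip. */
    void local_query(mco_db_h db)
    {
        mco_trans_h t;
        if (mco_trans_start(db, MCO_READ_ONLY, MCO_TRANS_FOREGROUND, &t) == MCO_S_OK) {
            /* ... run the query against local data ... */
            mco_trans_rollback(t);
        }
    }

    /* READ_WRITE transactions are distributed by the cluster runtime to
       every peer, so commit cost grows with the number of nodes. */
    void replicated_update(mco_db_h db)
    {
        mco_trans_h t;
        if (mco_trans_start(db, MCO_READ_WRITE, MCO_TRANS_FOREGROUND, &t) == MCO_S_OK) {
            /* ... insert / update / delete; the commit is replicated ... */
            mco_trans_commit(t);
        }
    }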

Sharding (Horizontal Partitioning) and Distributed Query Processing

Whereas eXtremeDB High Availability and eXtremeDB Cluster provide replication-based distributed processing solutions, data distribution (sharding) is supported through eXtremeSQL by the DistributedSql Engine.

Distributed databases are often implemented through “Sharding”. The concept of database sharding has been gaining popularity over the past several years due to the enormous growth in transaction volume and size of business application databases for services such as:

  • Online service providers
  • Software as a Service (SaaS) companies
  • Social networking Web sites
  • Capital Market applications

Sharding can be simply defined as “shared-nothing” horizontal partitioning of a database into a number of smaller database instances that collectively comprise one logical database. What drives the need for sharding is that, as the size and transaction volume of a database grow linearly, response times tend to grow logarithmically. Distributed queries allow far faster processing by executing in parallel on each shard.
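
As a purely generic illustration of the idea (not eXtremeDB's actual distribution policy, which is configured as described in the eXtremeSQL User's Guide), hash-based horizontal partitioning maps each row's shard key deterministically to one of N physical database instances:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    #define N_SHARDS 5u

    /* FNV-1a: a simple, well-known hash over the shard-key bytes. */
    static uint32_t fnv1a(const void *key, size_t len)
    {
        const unsigned char *p = (const unsigned char *)key;
        uint32_t h = 2166136261u;
        for (size_t i = 0; i < len; ++i) {
            h ^= p[i];
            h *= 16777619u;
        }
        return h;
    }

    int main(void)
    {
        const char key[] = "account-42";  /* illustrative shard key */
        unsigned shard = fnv1a(key, sizeof key - 1) % N_SHARDS;
        printf("key %s -> shard %u of %u\n", key, shard, N_SHARDS);
        return 0;
    }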

It is well known that database performance drops as database size grows. This is due in large part to the increasing size (depth) of index structures such as b-trees. When a database resides on a spinning disk, mechanical artifacts exacerbate the problem: greater “rotational latency” waiting for the platter to spin, and longer seeks as the disk head slews to the proper track, because the database occupies more of the disk.

Partitioning a logical database into, for example, five physical databases means that each physical database is only 20% of the size the logical database would have as a single physical database. Consequently, index structures are shallower and the effects of spinning disk geometry (if any) are mitigated. Furthermore, instead of a single database server providing query execution for a single physical database, five server processes can execute queries in parallel.

This is called distributed query processing, and it is transparent to a database client application. The client application merely opens the logical database; the database configuration (described in a JSON document) determines the physical makeup of the database, i.e. whether logical and physical mean the same thing, or whether the logical database consists of 2, 5 or 100 physical partitions. Once connected, the client application begins/commits/aborts its transactions and queries in the normal way. If the database is physically partitioned, the eXtremeSQL query engine takes care of distributing the query on behalf of the client application and gathering (appending) the result sets from each partition to present a single (logical) view to the client application.
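
The scatter/gather behavior the engine performs on the client's behalf can be sketched as follows. Here shard_exec() and the result_set type are hypothetical stand-ins invented for illustration, not eXtremeSQL API calls, and the table name in the query is likewise made up; the real engine issues the per-shard queries in parallel, whereas a loop is used here for brevity.

    #include <stdio.h>

    #define N_SHARDS 5

    /* Stub result container; a real result set would hold rows. */
    typedef struct { int rows; } result_set;

    /* Hypothetical helper: execute the query on one shard (stubbed). */
    static result_set shard_exec(int shard, const char *sql)
    {
        result_set part = { 0 };
        printf("shard %d executes: %s\n", shard, sql);  /* stand-in */
        return part;
    }

    /* Scatter the same query to every physical partition, then gather
       (append) the partial results into one logical result set. */
    static result_set distributed_query(const char *sql)
    {
        result_set all = { 0 };
        for (int shard = 0; shard < N_SHARDS; ++shard) {
            result_set part = shard_exec(shard, sql);
            all.rows += part.rows;  /* append partial result */
        }
        return all;
    }

    int main(void)
    {
        distributed_query("SELECT * FROM quotes WHERE symbol = 'XYZ'");
        return 0;
    }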

Sharding and eXtremeDB High Availability can be combined. In this configuration, a single logical database is horizontally partitioned into two or more physical database instances. Each shard (physical database instance) then has a master and a replica server process, and the replica can be promoted to master in the event of an unexpected termination of the original master. In this way, the single logical database remains available even if one of the servers processing the shards fails. (Please refer to the eXtremeSQL User's Guide for further details.)