eXtremeDB HA Overview

When employing synchronous replication, eXtremeDB High Availability implements a time-cognizant two-phase commit replication protocol (synchronous replication) that provides the means to instantiate one or more standby database instances and bring the standby to a state of synchronization with the main database instance. From the point that a standby database is synchronized, and as long as the connection between the main and standby instance is maintained, the two-phase commit protocol ensures that the standby database is always an exact copy of the main database instance.

The eXtremeDB High Availability runtime is a context-less library; it does not create any tasks or run any processes. Instead it provides an API for applications to implement database high availability through both synchronous and asynchronous replication. Synchronous replication is implemented via a time-cognizant two-phase commit replication protocol that causes the master application to block at the transaction commit until the transaction has been propagated to, and committed by, the replica.

A simplified description of a transaction using time-cognizant synchronous replication follows.

Alternatively, replication can be achieved via asynchronous replication. Whereas synchronous replication causes the master application to block at the transaction commit call until the transaction has also been committed by the replica(s), asynchronous replication does not block at the transaction commit, and thus update transactions process faster. Replica databases are available to the replica applications for read-only access; the consequence of asynchronous replication is that replica applications may read ‘stale’ data if transactions were committed at the master but not yet propagated to the replica.

It is possible to run the master and the replica within the same process against the same database. If replica mode is enabled (regardless of the fact that the master mode is enabled), database modifications are not allowed. For example, a “cascading replication” scenario might be implemented with the following three applications:

1. master: acts as master only; can perform database modifications

2. rplmst: acts as first level replica connected to master and also the master for the replica

3. replica: second level replica connected to rplmst

Note: After an HA system has been running, if it is necessary to shut it down for any reason, an in-memory database can be saved via the C/C++ function mco_db_save() or the C# or Java MasterConnection or ReplicaConnection class SaveSnapshot() method, and subsequently loaded via mco_db_load(), or the C#/Java Database.Open() method, by the master and each of the replicas. The databases will contain a transaction sequencer that will be examined by the HA runtimes and, if the sequencer is identical, the synchronization step will be bypassed. This significantly accelerates the restart process, since the master and replicas will load their databases in parallel, versus the inherently serial nature of the synchronization step.

Communication Channel

One of the main challenges in the design of an efficient high availability embedded database management system is to enable the HA interface to be integrated into a wide variety of embedded real-time applications. Embedded applications require considerably different qualities of service in terms of expected throughput and delay, acceptable error levels, and the ability to adjust resource requirements. Embedded systems use a great variety of media access protocols and transports. While some high-end embedded systems communicate over a VME backplane or similar architecture, there are many that use multiple physical CPUs and require a LAN-based communications bus. A variety of industry-standard and proprietary media access protocols serve as foundations for LANs.

The eXtremeDB High Availability replication protocol adopts the communication protocol used by any given embedded application. To achieve this, eXtremeDB High Availability utilizes a network (transport) level abstraction called a communication channel to support the requirements of real-time and embedded applications. The channel represents a simple end-to-end communication between a master and a replica. The channel is configured by a set of specific performance and other parameters. The required performance characteristics of a channel are specified in terms of timeouts, thus the primary attribute of the channel is its on-time reliability.

The communication channel abstraction allows eXtremeDB High Availability to be independent of the underlying media and the operating environment. This generic approach, however, requires the application to implement its communication layer. The eXtremeDB High Availability distribution includes communication channel implementations for TCP, UDP, Named Pipes and QNX Messaging.

HA performance

eXtremeDB transactions are optimized for commits. eXtremeDB keeps “before image” pages during the transaction. In the event the transaction is aborted, the “before image” pages will be returned to their original location, restoring the database to its state at the start of the transaction. In the case of a commit, the before images are simply discarded (returned to the free database memory pool), thus making commits much more efficient than aborts.

So, as an application is modifying an eXtremeDB database (adding, deleting, updating), the data is effectively committed as the transaction progresses. When the transaction is committed, the indexes are updated. If there are no errors (e.g. uniqueness violations) then the “before image” pages are returned to the free memory pool and the transaction is complete. In essence, then, a transaction commit involves updating the indexes.

In eager replication mode, an additional step is inserted in the process. When the transaction is committed (i.e. when the C/C++ function mco_trans_commit() or the C# or Java MasterConnection.CommitTransaction() method, is called by a master) the commit data (the write set) is sent to the replicas. In parallel, the master updates the indexes. When the indexes are updated and the replica has returned the result code of the replica commit, the master runtime returns from the transaction commit call.

Note that updating the indexes can cause a uniqueness constraint violation. But, since the transaction has already been forwarded to the replica(s), the commit will fail on the master and each replica. This is okay because, just as is reasonable to assume that an aborted transaction is a rare occurrence, it is reasonable to assume that uniqueness constraint violations are also a rare occurrence and the benefit obtained from updating the indexes in parallel with the replication outweighs the occasional inefficiency of transmitting a transaction that is certain to fail.

 

The duration of the synchronous transaction commit in HA operation will be T+C(d)+C(i)+R+C(d) where:

T = the time to transmit the commit data to the replica application

C(d) = the time to commit the data in the master and replica database

C(i) = the time to commit the indexes in the master and replica database

R = the time for the replica runtime to transmit the return code to the master

And the difference in transaction commit time in HA synchronous replication mode versus non-HA mode will be T+C(d)+R. In any database, the majority of time is spent updating indexes, so because the index updates execute in parallel in the master and replica, the incremental increase in transaction commit time is mostly T+R, which puts a premium on very fast communication, such as over a high-speed bus.

Note: Explicit transaction rollbacks (e.g. the C/C++ function mco_trans_rollback() or the C# or Java RollbackTransaction method) are always handled locally by the master runtime. Because the updates are never committed, there is never any commit data transmitted to the replica, so HA operation has no impact on the performance of aborted transaction.

Read-only Transactions and Load-balancing

eXtremeDB High Availability replica databases may be used for concurrent read-only access by other processes and/or threads. This makes it possible for application developers to implement a load-balancing scheme, distributing database read requests across the master and standby instances. Any MCO_READ_ONLY database transactions performed on the master will not be affected by HA operation, i.e. no write-set is sent to replicas when a read-only transaction is executed.

It is important to note that the replica commit phase will wait for all read-only transactions to complete before processing commit data from the master when using the MURSIW transaction manager (not with MVCC. See the eXtremeDB Concurrency Management topic for a complete discussion of eXtremeDB Transaction Managers). This could adversely impact the duration of the two-phase commit. In an extreme case, it could exceed the timeout specified for replication To mitigate this, the replica transaction commit has a higher priority than any read request, so the maximum wait time will be the time it takes for the longest currently running read-only transaction to complete. For this reason, it is best to keep read transactions short.

What can and cannot be done

eXtremeDB High Availability can replicate between persistent and transient databases, between different schema layouts (using the Binary Schema Evolution feature), and between different operating systems, even between a little-endian and a big-endian machines (for example x86 to PowerPC), as long as the hardware architecture on the master and the replica nodes are the same or binary compatible (for example Linux / x86 and VxWorks / x86). However eXtremeDB High Availability does not support replication between architectures with different integer sizes (x32 to x64) .