The WANdisco Failover Agent uses a heartbeat mechanism to detect whether a replicator node has died. At a configurable heartbeat interval (default: 1 second), the WANdisco Failover Agent sends a heartbeat to each replicator in the replication group over a DConeNet connection. Each replicator in turn sends an "I am alive" reply back to the WANdisco Failover Agent. If the WANdisco Failover Agent receives no reply to a configurable number of heartbeats, it marks the replicator node as dead. The actual failover happens lazily, when a request is received from an SCM client. This reduces false alarms when a WANdisco Replicator node is restarted.
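The heartbeat bookkeeping described above can be sketched in Python as follows. This is only an illustrative model, not the Failover Agent's implementation: the class name `HeartbeatMonitor`, the `max_missed` default of 3, and the choice to clear the dead mark when a reply arrives again are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical model: counts consecutive unanswered heartbeats per node."""

    max_missed: int = 3  # assumed value; the real threshold is configurable
    missed: dict = field(default_factory=dict)
    dead: set = field(default_factory=set)

    def record(self, node: str, replied: bool) -> None:
        """Record the outcome of one heartbeat sent to `node`."""
        if replied:
            # Assumption: an "I am alive" reply clears any earlier dead mark,
            # so a node that restarts quickly never causes a failover.
            self.missed[node] = 0
            self.dead.discard(node)
            return
        self.missed[node] = self.missed.get(node, 0) + 1
        if self.missed[node] >= self.max_missed:
            self.dead.add(node)

    def is_dead(self, node: str) -> bool:
        """Consulted lazily, only when an SCM client request arrives."""
        return node in self.dead
```

Because `is_dead` is only consulted when a client request arrives, a node that misses a few heartbeats during a restart and then replies again is never failed away from in this model.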
The WANdisco Failover Agent simply relays data between the SCM clients and the current active primary. The current active primary is elected based on a priority assigned to each replicator. The replicator with priority 1 is also known as the designated primary. If the primary replicator is unavailable, the available replicator with the next-highest priority (for example, priority 2) is elected as the current active primary.
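As a minimal sketch of the priority rule above, assuming that a lower number means a higher priority (priority 1 being the designated primary), the election could be modelled like this. The function name and the dictionary layout are hypothetical.

```python
def elect_active_primary(priorities, dead):
    """Return the live replicator with the best (lowest-numbered) priority.

    priorities: mapping of replicator name -> priority (1 = designated primary)
    dead: set of replicator names currently marked dead
    Returns None when no replicator is available.
    """
    live = {name: prio for name, prio in priorities.items() if name not in dead}
    if not live:
        return None
    return min(live, key=live.get)
```

For example, with priorities `{"a": 1, "b": 2, "c": 3}`, the function returns `"a"` while `a` is alive and `"b"` once `a` is marked dead.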
The WANdisco HADR guarantees zero data loss when a site dies. This is achieved through the quorum configuration, which depends on the number of replicators:
- With 3 or more replicators in the group, majority quorum is used to commit a transaction. As long as a majority of replicators are alive, failover can be done without any data loss. This is the recommended configuration.
- With 2 replicators in the group, singleton quorum is used, with the backup (priority 2) replicator acting as the distinguished node. This ensures the data is always available at the backup if the primary fails. This is the minimal configuration. To handle rolling failure scenarios, a deployment with 3 or more replicators is recommended.
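The two quorum rules above can be illustrated with a short Python sketch. It models only the commit condition as the text states it and ignores the automatic exclusion of a failed backup; the function name and parameters are hypothetical, and the real product's quorum logic is not shown in this document.

```python
def can_commit(replicators, alive, distinguished=None):
    """Decide whether a transaction can be committed under the stated rules.

    replicators: list of all replicator names in the group
    alive: set of names currently reachable
    distinguished: the distinguished node for singleton quorum
                   (the priority 2 backup in a two-replicator group)
    """
    alive_members = [r for r in replicators if r in alive]
    if len(replicators) >= 3:
        # Majority quorum: strictly more than half of the group must be alive.
        return len(alive_members) > len(replicators) // 2
    # Singleton quorum: the distinguished node must take part in the commit,
    # so every committed transaction is present at the backup.
    return distinguished in alive_members
```

With three replicators, a commit succeeds when any two are alive. With two replicators, this sketch allows a commit only while the distinguished backup is reachable, which is why committed data is always present at the backup.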
The WANdisco HADR can support 2 or more replicators in the replication group. If there are only 2 replicators in the group, special considerations apply to the failover mechanism:
- The main objective of a two-replicator deployment is to deal with a single replicator node failure.
- If the designated primary (priority 1) dies, failover to the backup is triggered. Once failover to the backup has happened, the backup cannot be excluded from the replication group automatically if it dies, until that capability is restored by an administrative action as described below.
- If the backup dies and no failover to it has ever occurred, the WANdisco Failover Agent automatically excludes the backup and runs with just the primary replicator.
- After the backup has been excluded as above, an administrative action is required to re-include the backup in the group.
- A two-replicator deployment cannot automatically deal with a rolling failure scenario (nodes repeatedly going up and down). For maximum availability under rolling failures, use 3 or more replicators in the group.
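The two-replicator rules listed above amount to a small state machine, sketched here in Python. The class, method names, and returned strings are hypothetical labels for illustration. The behaviour when the primary dies after the backup has already been excluded is not stated in the text and is treated here, as an assumption, as requiring administrative action.

```python
from dataclasses import dataclass


@dataclass
class TwoReplicatorGroup:
    """Hypothetical model of failover state in a two-replicator deployment."""

    failed_over: bool = False      # set once failover to the backup has happened
    backup_excluded: bool = False  # set once the backup is auto-excluded

    def on_primary_dead(self) -> str:
        if self.backup_excluded:
            # Assumption: with the backup excluded there is nothing to fail to.
            return "admin_action_required"
        self.failed_over = True
        return "failover_to_backup"

    def on_backup_dead(self) -> str:
        if self.failed_over:
            # The backup cannot be auto-excluded after a failover to it.
            return "admin_action_required"
        self.backup_excluded = True
        return "backup_excluded"

    def admin_reset(self) -> None:
        """Models the administrative action that restores normal operation."""
        self.failed_over = False
        self.backup_excluded = False
```

The model makes the limitation visible: after either event, a second failure cannot be handled automatically until `admin_reset` is called, which is why rolling failures call for 3 or more replicators.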
If there are only 2 replicators in the group, some failure scenarios (documented below) require administrative action. The Web administration console displays an alert for the administrator. Email alerts can also be configured.
As noted above, once failover to the backup has happened, the backup cannot be excluded from the replication group automatically if it dies, until an administrative action is taken.
The required administrative action involves the following steps:
- Stop new SCM client connections to the Failover Agent
- Ensure both Replicator nodes are up
- Wait until all submitted transactions are executed at both nodes
- Reset the flag using the WANdisco HADR's Web console
- Re-enable new SCM client connections
Note: The above applies only if two WANdisco Replicators are configured with the WANdisco Failover Agent.
As noted above, when the backup fails, the WANdisco Failover Agent will run with just the primary replicator by automatically excluding the backup. After the backup has been excluded, an administrative action is required to re-include the backup in the group.
The required administrative action involves the following steps:
- Stop new SCM client connections to the Failover Agent
- When there are no pending transactions remaining at the Primary, run reset to clean up the system database at the Primary and the Backup
- Rsync FROM the Primary TO the Backup
- Restart the Primary and Backup
- Reset the flag
- Enable new SCM client connections
Note: The above applies only if two WANdisco Replicators are configured with the WANdisco Failover Agent.