HADR Admin Guide
WANdisco

1. Introduction

WANdisco HADR provides high availability (HA) and disaster recovery (DR) for a software configuration management (SCM) source code repository such as CVS or Subversion. If the designated primary SCM server fails, SCM users fail over transparently to the next available replica. This is achieved using the WANdisco Failover Agent. SCM users connect to the WANdisco Failover Agent on the standard (configurable) SCM port, for example 2401 for CVS, 3690 for Subversion, or 80 for Subversion-HTTP. The Failover Agent in turn connects to one of the available WANdisco Replicators. WANdisco HADR guarantees a Recovery Point Objective (RPO) of 0, i.e. zero data loss, even if the failure happens in the middle of a transaction.

In this administration guide, you will learn how to set up WANdisco HADR as part of a normal WANdisco Replicator install.

1.1. Definitions

SCM Repository
Software Configuration Management repository like CVS or Subversion.
SCM Server
A network server that provides remote access to an SCM Repository.
Replica
A repository that is an exact equivalent/copy of another repository.
Replicator
The intermediary that acts as an application proxy/gateway between the Failover Agent and a given SCM server. Each Replica has an associated Replicator. The Replicator coordinates with its peer Replicators to ensure that all replicas of the SCM repositories stay in sync with each other.
Failover Agent
The intermediary that acts as an application proxy/gateway between the SCM client and the Replicators. The Failover Agent keeps track of which Replicas are available and proxies the SCM client's requests to one of them.
Replication Group
A collection of Replicators that work together to keep replicas of an SCM Repository in sync.
[replicator] directory
The base directory under which WANdisco HADR is installed.
GUID
Globally Unique Identifier. WANdisco HADR and the underlying Distributed Agreement Engine use 16-byte DCE UUIDs.
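To make the GUID definition concrete, Python's standard uuid module can generate a version 1 (time-based) UUID, which uses the DCE/RFC 4122 layout. This sketch only illustrates the 16-byte size and the variant. It does not show how WANdisco HADR or the Distributed Agreement Engine generates its own GUIDs.

```python
import uuid

# Generate a time-based (version 1) UUID using the DCE / RFC 4122 layout.
node_guid = uuid.uuid1()

print(node_guid)                           # 36-character text form
print(len(node_guid.bytes))                # 16 (bytes, i.e. 128 bits)
print(node_guid.variant == uuid.RFC_4122)  # True
print(node_guid.version)                   # 1
```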

1.2. Pre-requisites

This guide is intended for an SCM administrator or a user who is reasonably comfortable with:

If you do not meet these prerequisites, you may want to contact your SCM administrator or ask WANdisco to perform a professional install for you.

2. Understanding the deployment architecture

The diagram below illustrates a typical deployment architecture for a Subversion-based back end. A similar architecture applies to CVS or Subversion-HTTP. As the diagram shows, each replica uses WANdisco replication technology to keep the primary and backups in sync with each other. Although WANdisco replication technology supports active-active replication, with WANdisco HADR the license manager allows only one replica node to be active at any given time.

As shown in the diagram above, WANdisco HADR exchanges heartbeats with the replicas over a configurable control port (default 6444). The heartbeats are used to test the liveness of each replica and to mark a node as the active primary. They are not used for replication. The WANdisco Replicators communicate with each other over the DConeNet control port (default 6444) for replication. The control ports are multi-protocol. In other words, they can speak HTTP, DConeNet, DFTP and other protocols. As a result, the administrator does not have to open and manage a separate port for each protocol used by WANdisco HADR or the WANdisco Replicator.

Here is an explanation of various TCP ports used in the above deployment:

Port 3690 (WANdisco Failover Agent)
Used by Subversion clients for normal Subversion request processing
Port 6444 (WANdisco Failover Agent)
Multi-protocol port, used by the administrator for web management and for exchanging heartbeats with the WANdisco Replicators
Port 4690 (WANdisco Replicator)
The WANdisco Failover Agent forwards Subversion client requests to WANdisco Replicator on this port
Port 6444 (WANdisco Replicator)
Multi-protocol port, used by WANdisco Replicators for replicating data as well as web administration
Port 5690 (Subversion)
The Subversion server listens on this port for normal Subversion client requests. It is configured to accept connections only from the WANdisco Replicator. This ensures the WANdisco Replicators cannot be bypassed accidentally, which would cause the Subversion repository to evolve independently.
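After installation, it can help to confirm that the ports listed above are reachable from the WANdisco Failover Agent host. The following is a minimal sketch that only attempts a TCP connection to each port. The host names are hypothetical placeholders, and the ports are the Subversion defaults from the list. Port 5690 is deliberately left out, because the Subversion server should accept connections only from its WANdisco Replicator.

```python
import socket

# Hypothetical host names; ports are the defaults listed above (Subversion back end).
EXPECTED_PORTS = {
    ("failover-agent.example.com", 3690): "Failover Agent: Subversion clients",
    ("failover-agent.example.com", 6444): "Failover Agent: web console / heartbeats",
    ("replicator1.example.com", 4690): "Replicator: requests from the Failover Agent",
    ("replicator1.example.com", 6444): "Replicator: DConeNet replication / web console",
}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts and DNS failures
        return False


def check_ports(expected: dict) -> dict:
    """Probe every (host, port) pair and return {(host, port): reachable}."""
    return {addr: port_is_open(*addr) for addr in expected}


def report(expected: dict) -> None:
    """Print one status line per (host, port) pair."""
    for (host, port), ok in check_ports(expected).items():
        status = "open" if ok else "UNREACHABLE"
        print(f"{host}:{port:<5} {status:<12} {expected[(host, port)]}")
```

After replacing the placeholder host names with your own, calling `report(EXPECTED_PORTS)` on the Failover Agent host prints one line per port. A successful connection shows only that something is listening on the port, not that it is the expected WANdisco process.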

3. Key concepts

The WANdisco Failover Agent uses a heartbeat mechanism to detect whether a replicator node has died. At a configurable heartbeat interval (default 1 second), the WANdisco Failover Agent sends a heartbeat to each replicator in the replication group over a DConeNet connection. Each replicator in turn sends an "I am alive" reply back to the WANdisco Failover Agent. If the WANdisco Failover Agent receives no reply to a configurable number of heartbeats, it marks the replicator node as dead. The actual failover happens lazily, when the next request is received from an SCM client. This reduces false alarms when a WANdisco Replicator node is restarted.
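The detection rule above can be sketched as a small missed-heartbeat counter. This is a simulation under stated assumptions, not the Failover Agent's implementation. It assumes that misses are counted consecutively and reset by any reply, and that a node marked dead becomes alive again as soon as it replies. The node names are hypothetical. The default threshold of 4 comes from the post-install tuning section.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Illustrative missed-heartbeat failure detector (not WANdisco's code)."""

    max_missed: int = 4                         # misses before a node is marked dead
    missed: dict = field(default_factory=dict)  # node -> consecutive missed heartbeats
    dead: set = field(default_factory=set)      # nodes currently marked dead

    def tick(self, replies: set, nodes: list) -> None:
        """Process one heartbeat round; `replies` holds the nodes that answered."""
        for node in nodes:
            if node in replies:
                # Assumption: any "I am alive" reply resets the counter.
                self.missed[node] = 0
                self.dead.discard(node)
            else:
                self.missed[node] = self.missed.get(node, 0) + 1
                if self.missed[node] >= self.max_missed:
                    self.dead.add(node)


mon = HeartbeatMonitor()
nodes = ["rep1", "rep2"]
for _ in range(4):             # rep1 stays silent for four rounds
    mon.tick({"rep2"}, nodes)
print(mon.dead)                # {'rep1'}
```

Marking a node dead does not move any clients by itself. As described above, the switch to another replicator happens lazily, when the next SCM client request arrives.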

The WANdisco Failover Agent simply relays data between the SCM clients and the current active primary. The current active primary is elected based on a priority assigned to each replicator. The replicator with priority 1 is also known as the designated primary. If the designated primary is unavailable, the alive replicator with the next highest priority (the next smallest priority number) is elected as the current active primary.
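The election rule fits in a few lines. This sketch assumes the Failover Agent knows each replicator's priority and which replicators are currently alive (for example, from a heartbeat detector). The replicator names are hypothetical.

```python
from typing import Optional


def elect_active_primary(priorities: dict, alive: set) -> Optional[str]:
    """Return the alive replicator with the smallest priority number, or None.

    priorities maps replicator name -> failover priority (1 = designated primary).
    """
    candidates = [node for node in priorities if node in alive]
    if not candidates:
        return None
    return min(candidates, key=lambda node: priorities[node])


prio = {"tokyo": 1, "london": 2, "boston": 3}
print(elect_active_primary(prio, {"tokyo", "london", "boston"}))  # tokyo
print(elect_active_primary(prio, {"london", "boston"}))           # london
print(elect_active_primary(prio, set()))                          # None
```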

WANdisco HADR guarantees zero data loss when a site dies. This is achieved by using:

3.1. Two replicator based failover

WANdisco HADR supports 2 or more replicators in the replication group. If there are only 2 replicators in the group, special considerations apply to the failover mechanism:

With only 2 replicators in the group, some failure scenarios (documented below) require administrative action. The Web administration console displays an alert for the administrator, and email alerts can also be configured.

3.1.1. Failure of Primary

As noted above, once failover to the backup has happened, the backup cannot be excluded from the replication group automatically if it dies. Administrative action is required, and it involves the following steps:

  1. Stop new SCM client connections to the Failover Agent
  2. Ensure both Replicator nodes are up
  3. Wait until all submitted transactions have been executed at both nodes
  4. Reset the flag using the WANdisco HADR Web console
  5. Re-enable new SCM client connections.

Note: The above applies only if two WANdisco Replicators are configured with the WANdisco Failover Agent.

3.1.2. Failure of Backup

As noted above, when the backup fails, the WANdisco Failover Agent automatically excludes it and continues with just the primary replicator. After the backup has been excluded, administrative action is required to re-include it in the group. The required administrative action involves the following steps:

  1. Stop new SCM client connections to the Failover Agent
  2. When there are no remaining pending transactions at the Primary, run reset to clean up the system database at the Primary and the Backup
  3. Rsync FROM the Primary TO the Backup
  4. Restart the Primary and the Backup
  5. Reset the flag
  6. Re-enable new SCM client connections.
Note: The above applies only if two WANdisco Replicators are configured with the WANdisco Failover Agent.

4. Installation

4.1. Requirements

Before running the WANdisco HADR, please ensure:

4.2. Installing the bits

Important: WANdisco HADR is installed as part of a WANdisco Replicator install. Consult the WANdisco Replicator install guide for further information on installing the WANdisco Replicator. This guide only highlights the steps relevant to installing and configuring the WANdisco Failover Agent.

Untar the WANdisco Replicator package (tar.gz) into the intended directory. On Windows, you can unzip it with a tool such as WinZip. You should see the following directory structure:

  $ cd [replicator]
  $ ls
  config  lib  logs bin docs systemdb
bin
Contains scripts such as failoveragent, setup and shutdown.
config
Contains the [replicator]/config/prefs.xml file used to configure the WANdisco HADR.
lib
Contains the jar files and DLLs that are required to run the product.
docs
Contains the administration guide in various formats: PDF, HTML and UNIX man page.
logs
Contains the pid file, log files and other temporary files. WANdisco HADR's log file is named FailoverAgent-prefs.log.0.
systemdb
Contains the system database with its transaction journal. Warning: Deleting or modifying files from systemdb will likely corrupt your installation.

5. Setup

If the installation requirements specified in the previous section have been met, the express setup should take 20 minutes or less to configure a basic WANdisco HADR environment.

The express setup option can be used to quickly create the prefs*.xml configuration files used by WANdisco HADR. This is accomplished by running the bundled program, [replicator]/bin/setup. The text console based UI will guide you through basic configuration options.

At the end of the setup program:

Copy [replicator]/config/prefs-(hostname).xml to [replicator]/config/prefs.xml on the corresponding replicator host. Copy prefs-failoveragent.xml to [replicator]/config/prefs.xml on the HADR/WANdisco Failover Agent host.
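If setup was run on a single machine, the copying described above can be scripted. The sketch below only builds scp command lines and does not run them. It makes three assumptions: that (hostname) in each generated file name is the same name you use to reach that host over ssh, that [replicator] is the same path on every host, and that the example host names and the /opt/replicator path are placeholders.

```python
import shlex


def copy_plan(replicator_hosts: list, agent_host: str, install_dir: str) -> list:
    """Return scp commands that install each generated file as config/prefs.xml."""
    cmds = []
    for host in replicator_hosts:
        # Assumption: setup wrote prefs-<hostname>.xml for each replicator host.
        src = f"{install_dir}/config/prefs-{host}.xml"
        dest = f"{host}:{install_dir}/config/prefs.xml"
        cmds.append(shlex.join(["scp", src, dest]))
    src = f"{install_dir}/config/prefs-failoveragent.xml"
    dest = f"{agent_host}:{install_dir}/config/prefs.xml"
    cmds.append(shlex.join(["scp", src, dest]))
    return cmds


for cmd in copy_plan(["rep1", "rep2"], "tao", "/opt/replicator"):
    print(cmd)
```

The first line printed is `scp /opt/replicator/config/prefs-rep1.xml rep1:/opt/replicator/config/prefs.xml`. Review the commands before running them, because an existing prefs.xml on the target host will be overwritten.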

We recommend installing the WANdisco Failover Agent on a separate machine from the WANdisco Replicators. This ensures the WANdisco Failover Agent remains available even when a WANdisco Replicator machine fails.

We will now walk you step by step through the setup screens for the WANdisco Failover Agent. For the rest of the installation, please follow the WANdisco Replicator administration guide.

The setup screens below presume a Subversion deployment, but they also apply (with different default ports) to CVS and Subversion-HTTP deployments.

  1. Start the setup program:
      $ [replicator]/bin/setup
    
    Go through the initial WANdisco Replicator setup screens as documented in the WANdisco Replicator administration guide.

  2. Select how many replicas (including the primary) you need

    We recommend 3 replicas (1 primary plus 2 backups), as that ensures maximum availability under rolling failure scenarios (multiple node failures or communication link/network failures).

    You need a minimum of 2 replicas with the WANdisco HADR.

      Now you will specify the number of replicators/replicas that
      are being setup. Each replicator will act as a proxy for the
      local repository replica. After specifying the number of replicas,
      you will be prompted for the network settings for each replica.
      
      How many replicas do you want? [2] : 2
    

  3. Select the option to install the WANdisco Failover Agent.
      
      If you have a license for HADR product, you can choose to configure
      the failoveragent. The failoveragent will then act as a proxy for the
      SCM clients. If the failoveragent detects the primary replicator
      node is down, it can automatically failover svn clients to one
      of the configured backups. In this interview you will define the priority
      for each replicator. A replicator node with priority 1 will act as
      the primary. When failing over to a backup, an alive node with the
      smallest priority number will be chosen. So for example a replicator
      node with priority 2 will be chosen over a node with priority 3 if
      both are alive.
      
      
      Is Failover Agent needed? Y/N [N] :
      
    
  4. Specify the Ethernet MAC address of the machine where WANdisco HADR will run.
      
              Setting up Failover Agent ....
              _________________________________
      
      Now you will specify the Ethernet MAC address of the host on which
      Failover-Agent would be running. It is required that you specify a unique
      MAC address for each host on which Failover-Agent would be running.
      The MAC address on UNIX can be obtained via "ifconfig" command
      and on Windows via "ipconfig /all" command. The MAC Address looks like
      this - 00-02-A5-C1-7A-2F (Windows) or 00:02:A5:C1:7A:2F (UNIX). If you
      don't have all the MAC addresses handy, now would be a good time to
      get them before proceeding further.
      
      Enter the MAC Address : 00:12:E5:C1:7A:2A
    
  5. Specify the WANdisco Failover Agent host and port. SCM clients (for instance CVS or Subversion clients) will connect to this host and port.
      
              Setting up Failover Agent ....
              _________________________________
      
      Now you will specify the host:port used by Subversion clients to
      connect with the Failover-Agent. Setting the port to 3690,
      would be the most transparent option from the Subversion client
      perspective. Note you can NOT specify 0.0.0.0 or localhost
      as the host on which Failover-Agent would be running. The
      hostname needs to be the DNS hostname or the valid IP address to
      which remote Subversion clients as well as remote Subversion Failover-Agents can connect.
      
      For example, let us say on a subnet  192.168.1 in Tokyo, the LAN address of
      Failover-Agent machine  is 192.168.1.29 and the external WAN address is
      203.23.12.129 (DNS hostname is tokyo.svnrus.org). The Failover-Agent address
      should be specified as 203.23.12.129 or tokyo.svnrus.org and NOT 192.168.1.29.
      
      Enter the hostname or IP address of the Failover-Agent : tao
      
      Enter the TCP port for the Failover-Agent [3690] :
      
    
  6. Specify the DConeNet port that WANdisco HADR uses to communicate with the Replicators. The same port also serves the Web Administration console. You can also specify a friendly ("nice") name for the WANdisco Failover Agent node. The web console will use this name for the node instead of the default host:port name.

              Setting up Failover Agent ....
              _________________________________
      
      
      Now you will specify the DConeNet port used by the Failover-Agent to
      communicate with other Replicators. This is not visible to Subversion
      clients but used for actual data transfer between the Replicators
      and/or Failover-Agents.
      
      Enter the TCP port for DConeNet [6444] : 6444
      Enter a nice name for the node, for e.g. "Tokyo Site" [tao:6444] :
    
  7. For each replicator you install, choose an appropriate failover priority. Priority 1 is used for the primary WANdisco Replicator. When the primary WANdisco Replicator node fails, the alive node with the smallest priority number is selected as the current primary. Choose the priorities based on the preferred failover order for the WANdisco Replicator nodes.
      
              Setting up replicator instance .... #1
              ______________________________________
      
      Since you have elected to configure a Failover Agent, you will need to
      specify the priority for each replicator. The priority order determines
      which replicator instance is picked when failing over. Choose priority
      of 1 for the Primary replicator. Smaller numbers indicate higher priority.
      
      
      Enter a failover priority [1-2] [1] :
    

  8. Complete the rest of the WANdisco Replicator installation by following the WANdisco Replicator administration guide.

  9. Copy the generated files to the config directory of WANdisco HADR and of the WANdisco Replicator on each host, and rename them to prefs.xml. You are now ready to run WANdisco HADR.
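Two of the setup answers above have format rules that are easy to check in advance, for example before editing a recorded answers file for a silent install. These are the MAC address in step 4 and the Failover Agent host in step 5. The following is a hypothetical pre-flight helper, not part of the setup program. It normalizes a MAC address to the colon-separated form shown in the sample session, and it rejects 0.0.0.0 and localhost-style addresses as the step 5 screen requires.

```python
import ipaddress
import re

# Six hex pairs joined by one consistent separator: "-" (Windows) or ":" (UNIX).
MAC_RE = re.compile(r"^[0-9A-Fa-f]{2}([-:])(?:[0-9A-Fa-f]{2}\1){4}[0-9A-Fa-f]{2}$")


def normalize_mac(mac: str) -> str:
    """Accept 00-02-A5-C1-7A-2F or 00:02:A5:C1:7A:2F; return the colon form."""
    mac = mac.strip()
    if not MAC_RE.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    return mac.replace("-", ":").upper()


def check_agent_host(host: str) -> str:
    """Reject the wildcard and loopback addresses the setup screen disallows."""
    host = host.strip()
    if host.lower() == "localhost":
        raise ValueError("localhost is not allowed for the Failover Agent")
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return host  # a DNS name; whether remote clients can resolve it is not checked
    if ip.is_unspecified or ip.is_loopback:
        raise ValueError(f"{host} is not reachable by remote clients")
    return host


print(normalize_mac("00-02-a5-c1-7a-2f"))  # 00:02:A5:C1:7A:2F
print(check_agent_host("203.23.12.129"))   # 203.23.12.129
```

The helper does not try to tell a LAN address from a WAN address. As the Tokyo example in step 5 shows, that choice depends on where your remote clients and Failover Agents connect from.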

5.1. Post-install

5.1.1. Tune the heartbeat frequency

By default, the WANdisco Failover Agent sends one heartbeat every second. This is appropriate for a LAN deployment. If you are deploying WANdisco HADR over a WAN, you may want to change the heartbeat frequency based on WAN latency. The heartbeat interval should be 2-3 times the expected WAN latency between the WANdisco Failover Agent and the replicator site, so that missing heartbeats can be distinguished from a slow network link.

You can also tune the number of missed heartbeats after which the WANdisco Failover Agent marks a replicator node as dead and triggers failover. The default is 4.
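The tuning rule above reduces to simple arithmetic. This sketch picks an interval of 2-3 times the expected WAN latency (2.5 by default, an arbitrary midpoint). It then estimates roughly how long a silent replicator goes unnoticed: about one interval per missed heartbeat, which is an approximation that ignores where in the interval the last reply arrived. The function names are hypothetical.

```python
def heartbeat_interval(wan_latency_s: float, factor: float = 2.5) -> float:
    """Heartbeat interval of 2-3x the expected WAN latency, in seconds."""
    if not 2.0 <= factor <= 3.0:
        raise ValueError("the guide recommends a factor between 2 and 3")
    return wan_latency_s * factor


def detection_time(interval_s: float, missed: int = 4) -> float:
    """Approximate seconds before a silent replicator is marked dead."""
    return interval_s * missed


interval = heartbeat_interval(0.4)  # expected WAN latency of 400 ms
print(f"interval ~ {interval:.1f} s, detection ~ {detection_time(interval):.1f} s")
```

With 400 ms of latency, this gives an interval of about 1 second. With the default of 4 missed heartbeats, a dead replicator would be marked after roughly 4 seconds. Raising either value makes false alarms less likely but delays detection.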

These values can be dynamically tuned from the WANdisco Failover Agent web console:

5.1.2. Setup startup commands

The WANdisco Failover Agent can start the WANdisco Replicator nodes directly from the web console, provided you enter their startup commands in the web console. It can also use ssh to launch WANdisco Replicators on remote machines, as depicted in the following screenshot:

5.1.3. Setup an email for alerts

The WANdisco Failover Agent can generate email alerts, provided an email address is specified via the web console. Email alerts are generated whenever the WANdisco Failover Agent detects a failover-related event:

These values can be dynamically set from the WANdisco Failover Agent web console:

5.2. Silent install option

The express setup tool supports -silent and -record options, which let an admin perform a silent install without being prompted for input on the console.

An admin can start the setup program in record mode and later replay the recorded answers file to perform a silent install. The admin can also edit the recorded answers in a text editor and then use -silent to create new configuration files. For example:

  $ ./svn-replicator/bin/setup  -record my-answers
  $ vi my-answers
  $ ./svn-replicator/bin/setup  -silent my-answers
  $ ./svn-replicator/bin/setup  -silent old-ans -record new-ans

The answers are recorded continuously, so if you restart setup you can also use the recorded file to pick up from where you left off, without having to re-enter the answers.

For more information look at the usage of the setup command:

  $ ./replicator/bin/setup -h
  
  setup [-silent recorded-setup-file]  [-record file-to-record-to]
  
  -silent recorded-setup-file :
                  Silent install will use the supplied "recorded-setup-file" to
                  automatically answer the setup interview questions. If all the
                  answers are not supplied, it will prompt on the console.
  
  -record file-to-record-to
                  Will record all the valid interview answers to  the
                  "file-to-record-to". Can latter be used for silent install.
  
                  Both options can also be used at the same time. For
                  example to continue an install from where you last
                  left off you could do:
  
                  setup -silent prev-silent-file -record new-silent-file
  

6. Running

That's it: you are now ready to run WANdisco HADR.

Note: Start all the WANdisco Replicators before starting the WANdisco Failover Agent. See the WANdisco Replicator administration guide for how to launch the WANdisco Replicator. The WANdisco Replicator must be started in watchdog mode using the -wdog option.

Use the provided startup script to run WANdisco HADR from the command line:

  $ [replicator]/bin/failoveragent
  $ tail -f [replicator]/logs/FailoverAgent-prefs.log.0
  ....
  INFO:  [main] Failover Agent listener is now turned ON at port : ....

When you see the last line, WANdisco HADR has started successfully. WANdisco HADR also starts the remote replicators if SSH-based startup commands have been provided. When you start it for the first time, no SSH startup commands exist yet, so the WANdisco Failover Agent starts without any replicators running. In that situation, you can either start the remote replicators manually or use the web console to specify the SSH startup commands and start the replicators from the web console itself.
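When WANdisco HADR is started from a script, the script can wait for the listener line instead of relying on someone watching tail -f. This is a minimal polling sketch. It assumes the default log location under [replicator]/logs and the message text shown above. The timeout and poll values are arbitrary.

```python
import time
from pathlib import Path

READY_MARKER = "Failover Agent listener is now turned ON"


def wait_until_ready(log_path: str, timeout_s: float = 60.0, poll_s: float = 1.0) -> bool:
    """Poll the Failover Agent log until the listener line appears or timeout expires."""
    path = Path(log_path)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if path.exists() and READY_MARKER in path.read_text(errors="replace"):
            return True
        time.sleep(poll_s)
    return False
```

For example, `wait_until_ready("[replicator]/logs/FailoverAgent-prefs.log.0")`, with [replicator] replaced by your install path. If the log still contains the line from a previous run, the check returns immediately. It is therefore only reliable if the log starts fresh on each startup, which is an assumption about log rotation that you should verify.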

Alternatively, you can go to the web console and check the status.

To shut down WANdisco HADR, run:

  $ [replicator]/bin/shutdown

This also shuts down the replicators.

Caution: We recommend taking all possible precautions to prevent direct access to the SCM server that bypasses WANdisco HADR. For example, you could set up the SCM server to accept connections only from the IP address of the host on which the WANdisco Failover Agent is running. You could also limit shell access to the replicator and SCM repository machines.

7. Web Administration

WANdisco HADR has a built-in web server that can be used to monitor and dynamically configure the WANdisco Failover Agent and the WANdisco Replicator. You can connect to the web server on the WANdisco Failover Agent's DConeNet control port (default 6444). Here is a screenshot of the WANdisco Failover Agent's web console:

8. License Management

Please ensure your license.key file specifies a valid number of sites and the IP addresses on which WANdisco HADR is allowed to run. If you have an unlimited license, there are no restrictions on the number of sites or IP addresses.

9. Replicator Management

Please follow the replicator administration guide to manage the WANdisco Replicator.