
Redundancy and Failover Operation

1. The Concept of Failover

Pharos provides the failover/dual database configuration as a means of ensuring:
• Resilience
• A secure way of performing software or hardware upgrades.

Failover is based on Oracle's Data Guard technology. As described by Oracle, Data Guard technology:
• Ensures high availability, data protection, and disaster recovery for enterprise data.
• Provides a comprehensive set of services that create, maintain, manage and monitor one or more Standby databases to enable Production databases to survive disasters and data corruptions.
• Protects data, the core asset of any enterprise, and makes it available on a 24 x 7 basis despite disasters and other outages.

Further information on Data Guard can be found at: http://download.oracle.com/docs/cd/B19306_01/server.102/b14239/concepts.htm

2. A Brief Description of How Failover Works

Server Monitor, on the Pharos Information Centre, is the Pharos software responsible for performing the failover. This software is installed on both Core blades. By making SSH connections between the X and Y side databases, Server Monitor is able to determine which side is Active and which one is Standby. Oracle Data Guard in turn keeps the Standby database synchronised by using the Redo Log and Standby Redo Log infrastructure.

If Server Monitor determines that the Standby side is not in sync, it changes that side to "In error", which safely stops you from performing a failover. The side should recover by itself within 5 minutes. If it remains in error for more than 10 minutes, it is likely that something has failed.
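The two time limits above can be expressed as a small decision rule. The sketch below is illustrative only and is not part of Server Monitor: the function name and the three labels ("wait", "watch", "investigate") are hypothetical, and only the 5 and 10 minute thresholds come from the text.

```python
from datetime import timedelta

# Thresholds taken from the text: self-recovery is expected within 5 minutes,
# and more than 10 minutes "In error" suggests that something has failed.
EXPECTED_RECOVERY = timedelta(minutes=5)
LIKELY_FAILURE = timedelta(minutes=10)


def assess_standby_error(time_in_error: timedelta) -> str:
    """Classify how long the Standby side has been "In error".

    Hypothetical helper; the returned labels are assumptions for illustration.
    """
    if time_in_error <= EXPECTED_RECOVERY:
        return "wait"         # within the normal self-recovery window
    if time_in_error <= LIKELY_FAILURE:
        return "watch"        # slower than expected, keep monitoring
    return "investigate"      # beyond 10 minutes, a failure is likely
```

For example, `assess_standby_error(timedelta(minutes=12))` falls in the last branch, which is the point at which the text suggests a genuine fault.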


3. Types of Failover

3.1 Scheduled/Controlled Failover

This type of failover occurs due to deliberate action.

Characteristics:
• Planned for a particular day/time
• Could be due to scheduled software/hardware upgrades
• A site could deliberately fail over to its Standby system to verify that it is robust and in a usable state, as part of disaster preparedness.

3.2 Unscheduled/Abrupt Failover

Characteristics:
• Occurs due to a disruption, for example a power loss or a crash of the active Core blade.
• In this case, the Standby side takes over automatically.
• This failover should last between 30 seconds and a few minutes.

3.3 Points to Note

During any failover, whether scheduled or unscheduled:
• All transfers are aborted if they have not been stopped already.
• User Interface access is seamless and is not affected.
• Non-committed transactions will be lost.
• At the point of failover, API calls will fail and return an error. The likelihood of an API call happening exactly at the point of failover is minimal.
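Because an API call made at the moment of failover returns an error, a client may want to retry it once the other side has taken over. The following is a minimal sketch of such a retry wrapper, not part of the Pharos API: the function name, the default of 6 attempts and the 30 second delay are assumptions, chosen only to span the "30 seconds to a few minutes" window mentioned earlier. Whether a given call is safe to repeat depends on the call itself.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(call: Callable[[], T],
                    attempts: int = 6,
                    delay_s: float = 30.0,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying after a fixed delay if it raises.

    Hypothetical helper: re-raises the last error if every attempt fails.
    `sleep` is injectable so the wrapper can be exercised without waiting.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == attempts:
                raise             # out of attempts: surface the error
            sleep(delay_s)        # give the failover time to complete
    raise AssertionError("unreachable")
```

A caller would wrap its own request function, for example `call_with_retry(lambda: my_api_request())`, where `my_api_request` stands for whatever client call the site already uses.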

3.4 Description of States

State Name | Description
Active | This side is responsible for sending messages and performing all operations.
Standby | This side is ready for operation should a problem occur on the Active side.
Halted | This side has been manually stopped by the user at the Server Monitor.
Error | Either an error has occurred on this side, or Server Monitor is waiting for the database copy to start, or the copy is in progress.

Table 3.1 Description of Failover States
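The four states can be modelled as an enumeration, which makes it easy to express when a controlled failover is reasonable to attempt. This is a sketch under assumptions rather than Server Monitor's actual logic: the class and function names are hypothetical, and the rule that both Error and Halted block a failover is inferred from the descriptions above.

```python
from enum import Enum


class SideState(Enum):
    """States from Table 3.1 (names as shown in Server Monitor)."""
    ACTIVE = "Active"
    STANDBY = "Standby"
    HALTED = "Halted"
    ERROR = "Error"


def controlled_failover_allowed(x: SideState, y: SideState) -> bool:
    """Return True only when one side is Active and the other is Standby.

    Assumption: a side that is Halted or in Error cannot take over.
    """
    return {x, y} == {SideState.ACTIVE, SideState.STANDBY}
```

With this rule, `controlled_failover_allowed(SideState.ACTIVE, SideState.ERROR)` is false, which matches the intent of the "In error" state: it stops a failover onto a side that is not in sync.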

Author: Pharos Communications Version: 1.1 Date: 4th March 2011



4. Scheduled/Controlled Failover Procedure

During a controlled failover, the following steps should be taken to minimise data loss.

4.1 Preparation

• Before the failover, stop operations such as ingests and transfers. It is not advisable to start any ingests or transfers during this time.
• It is also not advisable to perform a failover when off-air recordings are about to start or stop.
• Note down all processes that require manual failover and halt them on both sides prior to the failover. Then start them on the active side once it is up and running.

4.2 Halting the Active Server

Once all the preparation is complete, the Active database can be halted.

• Using a browser, navigate to the Information Centre at http://192.168.53.11:8080/info/ or http://192.168.53.31:8080/info/

Fig 4.1: Pharos Information Centre


• Under X-Core or Y-Core (whichever is your active side), expand the (+) tab, then locate and click on Server Monitor.
• Log in using the username "pharos" and password "pharos1".
• Server Monitor displays the current Active and Standby systems.

Fig 4.2: Server Monitor indicating the Active and Standby side.

• To halt the Active side, enter the password "haltnow" and click the Submit button.
• The Active side now appears with a red "In error" icon. This is expected during a scheduled/controlled failover while the side goes into Standby, and simply indicates that the copy is in progress.
• After a brief period, the side previously designated as Standby will change its status to Active.
• Navigate to Server Monitor on the previously Active side, which was halted and is now "In error".
• When asked to start this side, enter the password "startupnow" and click Submit.
• After this command is submitted, the system takes a brief period to synchronise the two sides and bring the previously halted side to Standby.
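The state changes described in this procedure can be summarised as a short timeline, which is handy when writing down what to expect at each step. The sketch below only restates the sequence from the text; the function name is hypothetical and it does not talk to Server Monitor.

```python
def controlled_failover_timeline(active: str, standby: str) -> list[dict[str, str]]:
    """Return the expected Server Monitor states at each stage of Section 4.2.

    `active` and `standby` are the side names before the failover, e.g. "X" and "Y".
    """
    return [
        {active: "Active", standby: "Standby"},     # before the failover
        {active: "In error", standby: "Standby"},   # after submitting "haltnow"
        {active: "In error", standby: "Active"},    # former Standby takes over
        {active: "Standby", standby: "Active"},     # after "startupnow" and resync
    ]
```

Calling `controlled_failover_timeline("X", "Y")` gives four snapshots, ending with X as Standby and Y as Active. If the halted side stays "In error" well after the final step, the timings in Section 2 suggest treating it as a fault.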



5. Configuring Failover

The X/Y configuration is undertaken by Pharos.

• In a properly synchronised system, X and Y are mirror images of each other. There is no need to prefer either the X side or the Y side, as both are identical.
• Ideally, a customer site should be able to operate on either X or Y seamlessly.
• RTL should ensure that all connections between the Active and Standby chassis are in place at all times.
• For the HP blades, the system side (X or Y) is always specified in the configuration.
• For future expansion, other blades can be added and designated as universal (U) sided.
• If there are multiple PCPs, they can be aligned to a side, i.e. X or Y. In RTL's case, where there is one PCP, it will be made universal, i.e. aligned to the entire system.



6. Testing Failover

This can be achieved by:
a) Writing some test checks to perform before the failover. A few examples are given below in Table 6.1, but these can be modified as required.
b) Running the failover as explained in Section 4.
c) Performing the post checks after the failover.
d) Attempting this several times before the system goes live.

ID | Date | Required Check | Additional Detail | Checked (Y/N) | Checked by | Comments | Required Action

Pre Check
1 | 04/03/11 | Note down all processes running on both sides | | Y | | All processes noted | None
2 | 04/03/11 | Advise users of planned maintenance | | Y | | All users have been advised of pending failover. | None
3 | 04/03/11 | Transfers halted using info centre | | Y | | All transfers halted. | None
4 | 04/03/11 | Stop all processes noted down in (1) above | | Y | | All processes halted using Info Centre. | None
4 | 04/03/11 | Ensure all ingests have completed | | Y | | All ingests through Mediator completed. | None
5 | 04/03/11 | Perform a manual system failover from X to Y using info centre. | | Y | | Manual failover complete | None
6 | 04/03/11 | Ensure all services start up as expected | | Y | | Operation continues as expected. | None

Post Check
7 | 04/03/11 | Restart all processes noted down in (1) above. | | Y | | All processes have been restarted using Sentinel in info centre. | None
8 | 04/03/11 | Ensure that all API calls work on the now active side | | Y | | All API calls that previously worked on X work on Y as well | None
9 | 04/03/11 | Ensure that all ingests through Mediator are working | | Y | | |

Table 6.1: Sample Specifications for Testing Failover
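A test specification like Table 6.1 can also be kept as structured data, so that outstanding checks are easy to list before and after a failover. The following is a minimal sketch: the `Check` class, its field names and the `outstanding` helper are hypothetical, and the sample entries are a shortened subset of the checks in the table.

```python
from dataclasses import dataclass


@dataclass
class Check:
    check_id: int
    phase: str          # "pre" or "post"
    description: str
    checked: bool = False


def outstanding(checks: list[Check], phase: str) -> list[Check]:
    """Return the checks of the given phase that have not been ticked off."""
    return [c for c in checks if c.phase == phase and not c.checked]


# Shortened sample based on Table 6.1.
sample_checks = [
    Check(1, "pre", "Note down all processes running on both sides", checked=True),
    Check(2, "pre", "Advise users of planned maintenance", checked=True),
    Check(3, "pre", "Transfers halted using info centre"),
    Check(8, "post", "Ensure that all API calls work on the now active side"),
]
```

Here `outstanding(sample_checks, "pre")` returns only check 3, so the failover in step b) would wait until transfers have been halted.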


7. Preparing for Secured Operation

Attempt to fail over between both sides several times before the system goes live. Ensure that both sides are in sync with each other and work according to your test specification, as set out in Table 6.1 above.

As good maintenance practice, the Pharos Support team recommends that a scheduled failover be performed at least once every 6 months. Before this scheduled failover, reboot the Standby side to ensure it is "clean". This helps prevent hardware or software inconsistencies.
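To keep track of the six-month recommendation, the latest date for the next scheduled failover can be computed from the date of the last one. This is a small sketch using only the Python standard library; the function names are hypothetical, and the end-of-month handling (clamping to the last day of a shorter month) is an assumption about how a site would want to count months.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping the day to the end of a shorter month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def next_failover_due(last_failover: date) -> date:
    """Latest date for the next scheduled failover (6 months after the last)."""
    return add_months(last_failover, 6)
```

For a failover performed on 4 March 2011, `next_failover_due(date(2011, 3, 4))` gives 4 September 2011 as the latest date for the next one.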
