Ensure that proper paging space is available on both the control workstation and each of the SP nodes. Also ensure that the switch configuration, along with any other modifiable configuration parameters, has been considered and implemented. In addition, put in place any SP monitoring Perspectives you want to use. Ensure the SP dsh, pcp, and pexec commands work. Design your database layout. You should also consider who the main DB2 instance owner will be and what access authorization this and other users will require.
One example provided corrects a paging space shortage by shutting down a database partition and forcing a transaction abort to free paging space. Another common example is process death: you may want to restart a DB2 database partition, or you may want failover to occur if a process dies on a given node.
Each event in the file is made up of nine lines, which are:

1. Event name. Each event name must be unique.
2. State. This is the qualifier for the event. The event name and state are the rule triggers.
3. Resource Program Path. This is a full-path specification of the xxx.
4. Recovery Type. This is reserved for future use.
5. Recovery Level.
6. Resource Variable Name. This is used for Event Manager events.
7. Instance Vector. The values uniquely identify the copy of the resource in the system and, by extension, the copy of the resource variable.
8. Predicate. Within Event Management, this is the relational expression between a resource variable and other elements. When it is true, the Event Management subsystem generates an event to notify Cluster Manager and the appropriate application.
9. Rearm Predicate. Within Event Management, this is a predicate used to generate an event that alternates the status of the primary predicate. This predicate is typically the inverse of the primary predicate. It can also be used with the event predicate to establish an upper and a lower boundary for a condition of interest.

Each object requires one line in the event definition even if the line is not used. Omitting a line may cause the system to hang. Any line beginning with " " is treated as a comment line and is not treated as part of the event definition.
Note: The rules file requires exactly nine lines for each event definition not counting any comment lines. When adding a user-defined event at the bottom of the rules file, it is important to remove the unnecessary empty line at the end of the file, or the node will hang.
According to the rules, the proper values are specified in the state, recovery type, and recovery level lines in the definition. There are four empty lines, one each for the resource variable, instance vector, predicate, and rearm predicate.
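The nine-line layout lends itself to a simple sanity check before the cluster is restarted. The following Python sketch is illustrative only and is not part of HACMP or PSSP: the function name, the field names, the sample event, and the default comment prefix are all assumptions made for the example. It splits rules-file text into nine-line event definitions and rejects a file whose line count is not a multiple of nine, which also catches a stray empty line at the end.

```python
# Field names are hypothetical labels for the nine lines described in the text.
FIELDS = (
    "name", "state", "resource_program_path", "recovery_type",
    "recovery_level", "resource_variable_name", "instance_vector",
    "predicate", "rearm_predicate",
)


def parse_events(text, comment_prefix="#"):
    """Split rules-file text into event definitions of nine lines each.

    comment_prefix is an assumption; pass whatever character your rules
    file actually uses to mark comment lines.
    """
    # Empty lines are kept: an unused field still occupies one line.
    lines = [ln for ln in text.splitlines() if not ln.startswith(comment_prefix)]
    if len(lines) % len(FIELDS) != 0:
        raise ValueError(
            "expected a multiple of %d lines, found %d" % (len(FIELDS), len(lines))
        )
    events = []
    for start in range(0, len(lines), len(FIELDS)):
        chunk = lines[start:start + len(FIELDS)]
        events.append(dict(zip(FIELDS, chunk)))
    names = [event["name"] for event in events]
    if len(names) != len(set(names)):
        raise ValueError("event names must be unique")
    return events


# Hypothetical event: values on the first five lines, four empty lines after.
sample = "\n".join(
    ["# sample comment", "UE_SAMPLE", "0", "/usr/local/sample_recovery",
     "2", "0", "", "", "", ""]
) + "\n"

events = parse_events(sample)
print(events[0]["name"], len(events))
```

Appending one more newline to `sample` makes `parse_events` raise `ValueError`, which mirrors the warning about an unnecessary empty line at the end of the file.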
These events can be exploited within user-defined events. To do so, proceed as follows: Stop the cluster. Edit the rules. Back up the file before modifying it. Add the predefined PSSP event manually.
If you need synchronizing points across all nodes in the cluster, use the barrier command in the recovery program. Restart the cluster. The rules. To accurately implement the changes, restart all the clusters. There should not be any inconsistent rules in a cluster. Cluster Manager uses all events in the rules.
The PSSP Event Management subsystem provides comprehensive event detection by monitoring various hardware and software resources. Resource states are represented by resource variables. Resource conditions are represented as expressions called predicates. Event Management receives resource variables from the Resource Monitor, which observes the state of specific system resources and transforms this state into several resource variables.
These variables are periodically passed to Event Management. When the predicate evaluates to true, an event is generated and sent to the Cluster Manager. Cluster Manager initiates the voting protocol and the recovery program file xxx.
The recovery program file xxx. Three types of relationship are supported: All. The specified command or program is executed on all nodes. Event. The specified command or program is executed only on the nodes where the event occurred.
Other. The specified command or program is executed on all nodes where the event did not occur. With other scripts or programs, the full-path definition must be used even if these programs are located in the same directory as the HACMP event scripts. The expected return code is an integer value or an "x". If "x" is used, Cluster Manager does not care about the return code.
For all other values, the actual return code must equal the expected return code. If it does not, Cluster Manager detects the event failure. The handling of this event "hangs" the process until the problem is resolved through manual intervention.
Without manual intervention, the node does not hit the barrier to synchronize with the other nodes. Synchronization across all nodes is a requirement for the Cluster Manager to control all the nodes.
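The return-code rule can be expressed in a few lines. This Python sketch is only an illustration of the comparison described above; the function name is hypothetical and it does not reproduce Cluster Manager's actual implementation.

```python
def command_succeeded(expected, actual):
    """Return True when a recovery command's exit status is acceptable.

    expected is the value from the recovery program line: either the
    string "x" (any return code is accepted) or an integer written as text.
    actual is the integer return code the command produced.
    """
    if expected == "x":
        return True
    return int(expected) == actual


print(command_succeeded("x", 5))   # "x" accepts any return code
print(command_succeeded("0", 0))   # codes match
print(command_succeeded("0", 1))   # mismatch: treated as an event failure
```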
The word "NULL" must appear at the end of each line except the barrier line. If you specify multiple recovery commands between two barrier commands, or before the first one, the recovery commands are executed in parallel on the node itself and between the nodes.
The barrier command is used to synchronize all the commands across all the cluster nodes. When a node hits the barrier statement in the recovery program, Cluster Manager initiates the barrier protocol on this node. The barrier protocol is a two-phase protocol: once all nodes have met the barrier in the recovery program and "voted" to approve the protocol, all nodes are notified that both phases have completed. The recovery program executes the recovery commands, which may be shell scripts or binary commands.
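To make the effect of barrier lines concrete, the following Python sketch groups the commands of a recovery program into phases separated by barriers. It rests on several assumptions that are not confirmed by the text: that each command line holds four whitespace-separated fields in the order relationship, command, expected return code, NULL; that commands contain no embedded spaces; and that the relationship keywords and paths in the sample are purely hypothetical.

```python
def split_into_phases(program_text):
    """Group recovery commands into phases delimited by barrier lines.

    Commands within one phase may run in parallel; a barrier means every
    node must finish the phase before any node starts the next one.
    The four-field line layout is an assumption made for this sketch.
    """
    phases = []
    current = []
    for raw in program_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "barrier":
            # The barrier line is the only one that does not end in NULL.
            phases.append(current)
            current = []
            continue
        fields = line.split()
        if len(fields) != 4 or fields[-1] != "NULL":
            raise ValueError("malformed recovery line: %r" % line)
        relationship, command, expected = fields[0], fields[1], fields[2]
        current.append((relationship, command, expected))
    if current:
        phases.append(current)
    return phases


# Hypothetical recovery program with one synchronization point.
program = (
    "event /usr/local/bin/stop_partition 0 NULL\n"
    "all /usr/local/bin/notify_operator x NULL\n"
    "barrier\n"
    "all /usr/local/bin/start_partition 0 NULL\n"
)

phases = split_into_phases(program)
for number, phase in enumerate(phases, start=1):
    print(number, [command for _, command, _ in phase])
```

The first two commands land in the same phase and so could run in parallel, while the third runs only after every node has reached the barrier.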
The scripts will work "as is" or you can customize or change the recovery action. DB2 database partition recovery script rc. Six default events are included: one for process recovery, two for paging space, and three for NFS and automounter recovery.
DB2 instance NFS fileserver failover. This script provides for failover recovery of the server of the filesystem for a DB2 instance to a backup.
Network failover. Monitoring of failover and user-defined recovery is possible through the Event and Hardware Perspectives. The recovery scripts need to be installed on each node that will run recovery. The script files can be centrally installed from the SP control workstation or other designated SP node. The pcp and pexec commands are required for the install so ensure that you have the ability to run them. Customize the reg. Typically for mutual takeover configurations, your failure settings will be adjusted lower to one-half the size of your regular settings or less.
The three retries and the single failover settings should be adequate for almost all implementations. You should specify this setting if you wish rc. Also, edit pwq to change this to the DB2 instance owner. If this does occur, the DB2 database partition is stopped. Modify the script if additional recovery actions are required.
The reg. Additional configuration and customization in Perspectives is needed. Exiting by using a ctrl-C interrupt, or by killing the process, may re-enable failover recovery prematurely. As a result, not all database partitions would be stopped. Note: rc. The old takeover node releases volume groups, logical volumes, filesystems, and IP addresses specified in resource groups to be owned by the reintegrating node.
HACMP will re-acquire volume groups, logical volumes, filesystems, and IP addresses specified in the resource group now owned by the reintegrating node. HACMP releases volume groups, logical volumes, filesystems, and IP addresses specified in resource groups now owned by the node. The node which had the failure restarts the correct DB2 database partition.
If a node has more than seventy (70) percent of paging space filled, a wall command is issued. If a node has more than ninety (90) percent of paging space filled, then DB2 database partitions on this physical node are stopped and restarted. If an NFS process stops running, then it is restarted. In a similar way, if the automount process stops running, then it is restarted. This verifies that the network is the SP switch network and that it is down.
If so, it waits a user-defined time interval. The default time interval is one hundred seconds. HACMP must be stopped and restarted.
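The paging-space thresholds described above amount to a two-level decision. The Python sketch below is a minimal illustration of that logic, not the shipped recovery script; the function name and the action labels it returns are hypothetical, and the 70 and 90 percent defaults are taken from the text.

```python
def paging_action(percent_used, warn_threshold=70.0, restart_threshold=90.0):
    """Map paging-space usage on a node to the default recovery action.

    Returns one of three hypothetical labels:
      "restart_db2_partitions"  usage is above the restart threshold
      "wall"                    usage is above the warning threshold
      "none"                    no action is needed
    """
    if percent_used > restart_threshold:
        return "restart_db2_partitions"
    if percent_used > warn_threshold:
        return "wall"
    return "none"


for usage in (50.0, 75.0, 95.0):
    print(usage, paging_action(usage))
```

Both comparisons are strict, matching the "more than" wording, so a node at exactly 70 percent triggers no action.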