During a failback process on an HA cluster based on Heartbeat 3.x, I ran into an issue with moving resources from one node to another. Let me explain what happened.
There are three nodes in the cluster:
- Node_Alpha – Apache with VIP 10.10.10.10
- Node_Beta – Apache with VIP 10.10.10.11
- Node_Gamma – Apache with VIP 10.10.10.12
Each node should run a web resource based on Apache, and each node can fail over to any other node in the cluster.
Heartbeat is configured not to fail back automatically. To manage the Apache resource I used the ocf:heartbeat:apache RA.
Here is the template I used to configure the web resource and group on each node:
primitive VIP_10.10.10.X ocf:heartbeat:IPaddr \
    params ip="10.10.10.X" nic="eth0" cidr_netmask="255.255.255.0" \
    op monitor interval="5s" enabled="true" role="Started" timeout="20s" start-delay="1s"
primitive apache_xxx ocf:heartbeat:apache \
    params port="80" configfile="/etc/httpd/conf/httpd.conf" statusurl="http://10.10.10.X/server-status" \
    testurl="http://10.10.10.X/test.php" testregex10="ok" \
    op monitor interval="15" timeout="20" on-fail="standby" start-delay="15" OCF_CHECK_LEVEL="10"
group gr_apache_xxx VIP_10.10.10.X apache_xxx \
    meta target-role="Started" ordered="true" collocated="true" is-managed="true"
Note that with this configuration, if Apache fails the whole node goes into standby. Let's name the group for each node:
- Node_Alpha – gr_apache_alpha
- Node_Beta – gr_apache_beta
- Node_Gamma – gr_apache_gamma
For some reason Node_Alpha went down and gr_apache_alpha failed over to Node_Beta. Now two web resources are running on the same node: gr_apache_alpha and gr_apache_beta. After a while Node_Alpha came back online and is able to run its resource again, so we try to move gr_apache_alpha back:
crm resource move gr_apache_alpha Node_Alpha
But then gr_apache_beta failed and Node_Beta went into standby. The cause was that both resources were using the same Apache instance. When gr_apache_alpha was moved back to Node_Alpha, Apache was stopped on Node_Beta; the monitor of gr_apache_beta saw that Apache was not running (failed) and put the node into standby. This will happen every time I try to move a gr_apache_* group away from a node that runs two or more of them. There is a workaround: make every gr_apache_* group except the one you want to move unmanaged, restart Apache after the move, and then make the remaining gr_apache_* groups managed again. But this workaround cannot guarantee that a human will not make a mistake (for example, forget to unmanage a group).
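The manual workaround can be sketched as a small dry-run helper that only prints the commands in the right order, so the sequence can be reviewed before anyone runs it. The function name plan_failback is hypothetical, and the Apache restart command (service httpd restart) is an assumption based on the /etc/httpd paths used above; adjust it to your distribution.

```shell
#!/bin/sh
# Dry-run planner for the manual failback workaround.
# It prints the commands; it does not execute anything.
# Usage: plan_failback GROUP_TO_MOVE TARGET_NODE OTHER_GROUP...
plan_failback() {
    pf_group=$1
    pf_node=$2
    shift 2

    # 1. Unmanage the groups that stay behind, so that a stopped
    #    Apache is not treated as their failure.
    for pf_other in "$@"; do
        echo "crm resource unmanage $pf_other"
    done

    # 2. Move the group back to its home node.
    echo "crm resource move $pf_group $pf_node"

    # 3. Restart Apache on the node the group has left. Run this on that
    #    node once the move has finished (the command is an assumption).
    echo "service httpd restart"

    # 4. Hand the remaining groups back to the cluster.
    for pf_other in "$@"; do
        echo "crm resource manage $pf_other"
    done
}
```

For the scenario above, plan_failback gr_apache_alpha Node_Alpha gr_apache_beta prints the four steps for Node_Beta. A script like this reduces the chance of forgetting a group, but it does not remove the underlying problem of two groups sharing one Apache instance.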
The only solution I see is to make the Apache resource a clone instead of a part of the group:
primitive apache ocf:heartbeat:apache \
    params port="80" configfile="/etc/httpd/conf/httpd.conf" statusurl="http://localhost/server-status" \
    op monitor interval="15" timeout="20" on-fail="standby" start-delay="15"
clone clone_apache apache \
    meta globally-unique="false" clone-max="4" clone-node-max="1" target-role="Started"
Then define an RA that is used only for monitoring a resource and nothing more. There was no such RA, so I had to develop it myself, taking ocf:heartbeat:apache as the base script. You can find it in my Heartbeat resources repository: https://github.com/dotNox/heartbeat_resources . Its name is httpmon (as suggested on the Linux-HA mailing list) :). This RA can do almost the same as ocf:heartbeat:apache, but I added the ability to specify a user and password for HTTP authentication (in ocf:heartbeat:apache this is done via an external configuration file, which I consider not very practical…). Here are the parameters that are available at the time of writing this article:
- url – URL to check
- http_user – User for HTTP Auth (if any)
- http_password – Password for HTTP Auth (if any); it should be specified together with http_user, otherwise the default password “password” is used
- client – Client to use (curl, wget or other); for curl and wget there are predefined client options that are used by default
- client_opts – Additional client options
- match – Regular expression that the output must match
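To illustrate how these parameters fit together, here is a minimal sketch of such a monitor check written with curl. It is not the actual httpmon code: the function name httpmon_check and the curl options are assumptions made for illustration, and a real resource agent would return OCF status codes rather than a plain exit status.

```shell
#!/bin/sh
# Sketch of an HTTP monitor check (hypothetical, not the real httpmon RA).
# $1 = url, $2 = match (extended regex),
# $3 = http_user (optional), $4 = http_password (optional)
# Returns 0 when the response body matches the regex, non-zero otherwise.
httpmon_check() {
    if [ -n "$3" ]; then
        # With a user but no password, fall back to the default "password".
        curl -s -f -u "$3:${4:-password}" "$1"
    else
        curl -s -f "$1"
    fi | grep -E -q -- "$2"
}
```

A call such as httpmon_check "http://10.10.10.X/test.php" 'test string' succeeds only when the page is reachable and its body contains the expected text. If the request fails, curl prints nothing, so the match fails as well.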
The primitive definition template looks like this:
primitive web_xxx ocf:heartbeat:httpmon \
    params url="http://10.10.10.X/test.php" match='test string' \
    op monitor interval="5s" enabled="true" role="Started" timeout="20s" start-delay="5s"
And of course the group that consists of the web_xxx and VIP resources (shown here together with the NFS client primitive that mounts /var/www):
primitive fs_var_www_nfs_client ocf:heartbeat:fs_nfs_client \
    params server="10.10.1.10" server_directory="/var/www" directory="/var/www" \
    statusfile_prefix="nfs/.fs_nfs_client_ha" \
    options="rw,soft,async,retrans=5,noatime,rsize=32768,wsize=32768,proto=udp" \
    op monitor interval="30" timeout="60" on-fail="restart" start-delay="10" OCF_CHECK_LEVEL="20" \
    meta is-managed="true"
group gr_apache_xxx web_xxx VIP_10.10.10.X \
    meta target-role="Started" ordered="true" collocated="true" is-managed="true"