Thursday, November 12, 2009

Clustered SQL Servers - lessons learned

We have an active/passive, clustered in our production environment. We have two SQL instances on the passive node, we do this so we don’t have a perfectly good server sitting there doing nothing. Our main application used to report off our production database, but we replicate the databases to the secondary as well as to the warehouse, so the application on the primary can report off the secondary. I feel I have to point out that it was not designed by the company the source code was purchased and we continue to enhance and expand it. The other instance on the secondary instance is used as a “warm stand by” backup Instance in case another application database fails. Note this warm backup is not mirroring. So we have our main database server on primary, a reporting instance on the secondary and an additional instance that just needs to be up and ready.

Active/Active Cluster
This is not an Active/Active cluster; an Active/Active cluster is really two clusters, each with its own shared storage accessible only from the primary node. Each server is a primary on its cluster and secondary on the other cluster, assuming only two nodes in the cluster. If one server fails the other will run all instances.

Active/Passive Cluster
In Active/Passive cluster the primary can fail over to the secondary, but if the secondary server fails then you have to run around and find a replacement server or servers for what resided on it.  So whatever you install on the secondary on an Active\Passive cluster make sure it is not critical. In our case we have a plan if the secondary goes down. Of the two instances on the secondary, one is just a warm stand by it has an up to date copy of the databases to take over the application when we decide, so its only critical if the main database server goes down. The instance for reporting is for the application and we also replicate to our data warehouse server for reporting servers and for feeds for the data warehouse nightly process. We have two options here one is point the application reporting to the main database on the primary, this is not desirable for an extended period, but it is the easiest to do. The second option is to point the application reporting to our warehouse/reporting services copy of the data. This is the desirable solution, but it does take some work. We actually have two projects; one to point the application reporting to our data warehouse copy of the data and the second is rewrite the reports in SSRS. But with all projects there are always others that are higher priority and have revenue attached to it

Memory and Clusters
When the primary fails over all three instances reside on the secondary. One thing you have to be careful of in cases like this is the memory allocation, you don’t want to over allocate the memory. For example our secondary and servers each have 6 GB of RAM, you want to leave 1 GB for the OS and then divide the remaining 5GB up between the instances. In our case we gave 4GB to the main instance in the cluster, 1 GB of the reporting instance and 50 MB to the warm standby. If instead I gave each instance 3GB and set that as the minimum memory limit for each instance, so the total allocation of 9 GB of RAM when there is only 6 GB of RAM the memory will be over allocated. Um yeah that would not be good.

Applications and Clusters
We have several SSIS packages on our primary database server that encrypt data and FTP/FTPS the data to and from other sites. We use GNU pgp for the encryption and a third party FTP software to send and retrieve data. With applications, like these, that are not cluster aware, they have to be loaded on all nodes in a cluster. This became painfully obvious this past week when our primary server failed. What we had on the secondary was an outdated GNU pgp files and an evaluation copy of the FTP software that was outdated. The GNU pgp was easy to resolve, since its file based I just copied the directory over. Also GNU pgp could have been set up on a non C: drive. My best guess is that when it was developed on local machines there was only a C: and was set up that way in SSIS packages. We’re manually sending retrieving files with the FTP on another server for now until we can install a valid version of the FTP software on the secondary node.

Conclusion
As you plan out your cluster, design it with these lessons learned; For an Active/Passive Cluster keep the secondary free of other SQL instances (and other software that SQL Server does not need) and accept the fact that it will just sit there “unused” waiting for the day for it to take over and rule the cluster’s world. For either Active/Active or Active/Passive that has a secondary with SQL instances residing on it allocate memory accordingly and do not over allocate the memory. And lastly install software that is not cluster aware on all nodes, and UPDATE IT!!