MS Cluster/MS SQL Server/SAN (HP MSA1000) troubleshooting

Executive Summary
A client of Digital Edge started experiencing delays on SAP operations that accessed SAN storage. Initial conversations with the software and hardware vendors yielded no results, as each blamed the other. Digital Edge took full responsibility for troubleshooting and correcting the performance problems.

Original Configuration:
The client infrastructure consists of two MS SQL clusters, each running on two HP DL380 servers attached to an MSA1000 SAN through dual-path fiber channel cables.



Troubleshooting process:
After initial information gathering, we were convinced that the sluggishness occurred because of MS SQL Server I/O problems accessing SAN storage. We found the following errors in the MS SQL Server event log:

SQL Server has encountered 1 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [x:\USER\SAP_Reports_Data.MDF] in database [SAP_Reports] (10). The OS file handle is 0x000004E4. The offset of the latest long IO is: 0x00000004ca0000

Error: 823, Severity:24, State:2 I/O error (stale read) detected during read at offset 0x00000048c86000 in file
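Messages like the two above can be tallied from an exported log to see which files are affected and how often. The following is a minimal Python sketch, not the tooling Digital Edge used; the function name `summarize_io_errors` and the regular expressions are assumptions modeled only on the two messages quoted here, so other SQL Server versions may word them differently.

```python
import re
from collections import Counter

# Pattern modeled on the long-I/O warning quoted above (wording may vary by version).
LONG_IO = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of I/?O requests taking longer than "
    r"(?P<seconds>\d+) seconds to complete on file \[(?P<file>[^\]]+)\] "
    r"in database \[(?P<db>[^\]]+)\]"
)
# Pattern modeled on the error 823 line quoted above.
ERROR_823 = re.compile(
    r"Error:\s*823,\s*Severity:\s*(?P<severity>\d+),\s*State:\s*(?P<state>\d+)"
)

def summarize_io_errors(lines):
    """Return (long-I/O occurrences per file, number of error 823 lines)."""
    long_io = Counter()
    hard_errors = 0
    for line in lines:
        match = LONG_IO.search(line)
        if match:
            long_io[match.group("file")] += int(match.group("count"))
        elif ERROR_823.search(line):
            hard_errors += 1
    return dict(long_io), hard_errors

# The two log lines from this case study, used as sample input.
SAMPLE_LINES = [
    r"SQL Server has encountered 1 occurrence(s) of IO requests taking longer than "
    r"15 seconds to complete on file [x:\USER\SAP_Reports_Data.MDF] in database "
    r"[SAP_Reports] (10).",
    "Error: 823, Severity:24, State:2 I/O error (stale read) detected during read",
]

if __name__ == "__main__":
    per_file, error_823_count = summarize_io_errors(SAMPLE_LINES)
    print(per_file, error_823_count)
```

A per-file count that keeps growing on files hosted on the SAN, while local disks stay clean, supports the suspicion that the storage path rather than the database engine is at fault.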


After enabling debugging flags in MS SQL Server and collecting and analyzing MS SQL Server crash dumps, Digital Edge was sure that the problem resided on the SAN side.

On behalf of the client, Digital Edge opened a support case with HP providing full information that showed problems with HP MSA1000 I/O. HP concurred that the problem was on their end. After further research, HP found a problem in MSA1000 firmware and suggested a hot fix.

Repair process
Digital Edge received the hot fix and installation instructions from HP. We scheduled downtime with the client and notified its worldwide regions about the maintenance work.

Our engineers followed all the instructions, including shutting down the cluster software, powering off and disconnecting the cluster nodes, and upgrading the firmware on the MSA1000 and the fiber switch. Both the backplane and the fiber switch showed the correct version of their software. We completed this task in less than one hour and still had an hour and a half of planned downtime ahead of us.

1st Challenge
We began bringing the cluster nodes up and received a blue screen on all of them. We repeated the process and received the same failure. We contacted HP and were told that it was a Microsoft issue and that HP had nothing to do with it.

Emergency recovery
Digital Edge executed its standard Microsoft Server recovery procedure, which involves replacing the registry with a "first boot" registry backup, and brought up the first server with the "first boot" configuration. We copied the original registry, and one engineer started analyzing the boot process by going through the original registry. The second engineer started analyzing hardware and fiber settings on the "first boot" registry configuration. We had one hour of the planned downtime ahead of us.

Problem Discovery
By disabling boot drivers one by one, we identified that the "blue screen" was caused by the "Secure-Path" driver, which is failover software for the dual-path fiber connection to the SAN. The software checks fabric and SAN availability on the primary fiber path and, if it is not available, disables the primary fabric and enables the secondary.

We called HP again with more information on the failure, rejecting their claim that Microsoft cluster software was to blame. After some brief research, HP notified us that the SAN and fiber switch firmware version was not compatible with the version of the Secure-Path software that we were using, and gave us a link to download an upgrade for the version that was installed on the servers.

We disabled the Secure-Path driver at boot, disabled the backup fiber path, and were able to start both clusters without any problems. Both clusters worked fine with the exception of the dual fiber path failover. We started all the servers in production mode, tested Microsoft Cluster failover, and reported to the client that both clusters were operational except for dual-path failover. At least one level of protection (Microsoft Cluster failover) existed at that point.
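The failover rule that Secure-Path applies, as described in this section, can be modeled in a few lines. This is an illustrative Python sketch only, not HP's implementation; `FiberPath`, `fail_over`, and the `is_available` callback are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class FiberPath:
    """Hypothetical stand-in for one fabric path to the SAN."""
    name: str
    enabled: bool = False

def fail_over(
    primary: FiberPath,
    secondary: FiberPath,
    is_available: Callable[[FiberPath], bool],
) -> FiberPath:
    """Return the path that should carry I/O after one health check."""
    if is_available(primary):
        primary.enabled = True
        return primary
    # Primary fabric or SAN unreachable: disable it and switch to the secondary.
    primary.enabled = False
    secondary.enabled = True
    return secondary

if __name__ == "__main__":
    primary = FiberPath("primary", enabled=True)
    secondary = FiberPath("secondary")
    # Simulate a primary-path outage: only the secondary path answers.
    active = fail_over(primary, secondary, lambda p: p.name == "secondary")
    print(active.name)
```

With the driver disabled at boot and the backup path switched off, as in the interim configuration above, no such check runs, which is why dual-path failover was the one protection missing until the full recovery.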

Full recovery
Digital Edge scheduled a second planned downtime with the client to re-enable dual-path failover. A Digital Edge engineer started the installation of the newer Secure-Path software. It failed the first time, indicating that the QLogic fiber card software was not compatible with the newest Secure-Path software. We called HP again and got the required fiber card driver. After upgrading all servers with the latest fiber card driver and Secure-Path software, we enabled the Secure-Path driver at boot. After reboot, all the servers showed both fabrics, and Microsoft Cluster server started without any problem.
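The dependency chain uncovered during this recovery (fiber card driver before Secure-Path, Secure-Path before re-enabling its boot driver) can be written down and ordered mechanically before a maintenance window. The sketch below is an assumption about how one might record it, using Python's standard `graphlib` module (Python 3.9 or later); the step names are paraphrased from this section and the `upgrade_order` helper is hypothetical.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must be completed first.
PREREQUISITES = {
    "upgrade QLogic fiber card driver": set(),
    "upgrade Secure-Path software": {"upgrade QLogic fiber card driver"},
    "enable Secure-Path boot driver": {"upgrade Secure-Path software"},
    "reboot and verify both fabrics": {"enable Secure-Path boot driver"},
}

def upgrade_order(prerequisites):
    """Return the steps in an order that respects every prerequisite."""
    return list(TopologicalSorter(prerequisites).static_order())

if __name__ == "__main__":
    for number, step in enumerate(upgrade_order(PREREQUISITES), start=1):
        print(number, step)
```

Recording each component's compatibility requirements this way before the first firmware upgrade would have flagged the Secure-Path and fiber card driver upgrades as part of the same change.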

We tested multiple failover scenarios on each cluster and reported to the client that both clusters were in production mode and that all MS SQL Server errors were gone from the log.

Conclusion
Digital Edge has vast experience solving the most difficult problems that lie on the boundary between multiple vendors. We take ownership of complicated projects and challenging issues. We act as your liaison with all vendors involved to ensure a quick and satisfactory result, becoming your single point of contact from inception to conclusion.