On Sunday, November 1, 2009, Digital Edge’s Oracle DBA team received a complaint that one of the Oracle PL/SQL batch jobs was failing with an ORA-00600 error. The team began troubleshooting immediately and found that the client’s custom program had crashed the secondary RAC node. Fortunately, Digital Edge always provisions databases with high availability, which in this case ensured that the entire database did not go down.
The PL/SQL batch was running against the following environment:
· 2-node Oracle RAC
· Linux version 2.6.18-92.el5 (Red Hat 4.1.2-41)
· Oracle Database 10g Enterprise Edition Release 10.2.0.4.0 - 64bit
· EMC SAN as the shared storage resource
· Multiple RAC services for load balancing and failover
Once Digital Edge had found the problem and tried to recover, a second issue appeared: the instances that were supposed to be running on the node did not start. The attempt to start the service on the node produced an error:
· $ srvctl start service -d DB_NAME -s SERVICE_NAME -i INSTANCE_NAME
· PRKP-1030 : Failed to start the service S01.
· CRS-0215: Could not start resource 'ora.DB_NAME.SERVICE_NAME.INSTANCE_NAME.srv'
After reviewing the RAC log file, we found the reason for the failure:
· 2009-11-01 06:59:58.350: [ RACG] [ora.SDPRD_SI1
· .SDPRD2.inst]: ORA-44305: service SDPRD_SI1 is running
· ORA-06512: at "SYS.DBMS_SYS_ERROR", line 86
· ORA-06512: at "SYS.DBMS_SERVICE", line 444
· ORA-06512: at "SYS.DBMS_SERVICE", line 365
· ORA-06512: at line 1
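The log review above can be partly scripted. The following is a minimal sketch, not the procedure Digital Edge used: it scans a log for ORA-44305 entries and prints each service name that RAC reports as already running. The function name `find_stuck_services` is hypothetical, and the log file location is left to the caller because the original text does not give it.

```shell
#!/bin/sh
# Print the unique service names reported as already running (ORA-44305).
# Reads the log files named as arguments, or standard input when none are given.
find_stuck_services() {
    # Keep only the service name from lines such as:
    #   ... ORA-44305: service SDPRD_SI1 is running
    sed -n 's/.*ORA-44305: service \([^ ]*\) is running.*/\1/p' "$@" | sort -u
}
```

Called with the path of the RAC log file as its argument, it would print SDPRD_SI1 for the excerpt shown above.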
The ORA-44305 error showed that RAC believed the service was already running. To force the service to stop, we ran:
· exec DBMS_SERVICE.STOP_SERVICE('SERVICE_NAME', 'INSTANCE_NAME')
After the forced shutdown, the service started working normally.
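The two recovery steps (force the stop inside the database, then retry the start through `srvctl`) could be wrapped in a small script. This is a sketch under assumptions rather than the team's actual tooling: the function name `restart_stuck_service` and the `DRY_RUN` switch are hypothetical, and it assumes the script runs as the Oracle software owner on a node where `sqlplus` can connect locally with "/ as sysdba". By default it only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: clear a service that RAC believes is already running, then restart it.
# DRY_RUN defaults to 1, so the commands are only printed unless DRY_RUN=0 is set.
# Assumption: sqlplus can connect locally with "/ as sysdba" on this node.
restart_stuck_service() {
    db=$1
    service=$2
    instance=$3
    stop_sql="exec DBMS_SERVICE.STOP_SERVICE('$service', '$instance')"

    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Print the planned steps without touching the database.
        printf '%s\n' "SQL> $stop_sql"
        printf '%s\n' "\$ srvctl start service -d $db -s $service -i $instance"
        return 0
    fi

    # Step 1: force the service to stop inside the database.
    printf '%s\nexit\n' "$stop_sql" | sqlplus -S "/ as sysdba" || return 1
    # Step 2: retry the start command that originally failed.
    srvctl start service -d "$db" -s "$service" -i "$instance"
}
```

Running `restart_stuck_service DB_NAME SERVICE_NAME INSTANCE_NAME` prints the two commands; setting `DRY_RUN=0` executes them instead.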