10/8/2009

Apache-TomCat-MS SQL Server performance troubleshooting and tuning

Emergency troubleshooting, system stabilization and performance tuning for Apache, TomCat, Oracle, MS SQL server high volume web system.

One of Digital Edge clients infrastructure utilizes multiple Apache servers running on Linux as a front end web farm , multiple TomCat servers load balancing requests from Apache and connecting to multiple SQL and Oracle servers.

Digital Edge Knightwatch Monitoring system consistently detected that one of our customer’s applications was timing out almost every weekday at around noon and again mid-afternoon. The customer engaged DigitalEdge to quickly investigate the problem and bring stability to their environment as quickly as possible. Fine-tuning every possible parameter was not the objective. Bringing a revenue-generating web application back online was. This type of consulting can be thought of as firefighting. When firefighting, the crisis at hand often prohibits the luxury of reviewing network diagrams and application flow charts, interviewing developers, and doing extensive code review. A combination of experience and instinct leads the consultant to dissect the application as needed on the fly. We’d like to share an example of one such firefighting engagement. IP addresses, hostnames, and other miscellaneous information have been changed to protect our client’s application internals.

# netstat -anp | grep 80 | grep LISTEN

tcp 0 0 :::80 :::* LISTEN 8565/httpd

tcp 0 0 :::38039 :::* LISTEN 8565/httpd

Netstat’s “-p” flag reports the program associated with a socket. In the example above, we see that the port for HTTP traffic, 80, is owned by the process identifier (PID) 8565 and belongs to the binary ‘httpd’.

Now let’s learn more about httpd.

# ps auxww | grep 8565 | grep -v grep

apache 8565 0.0 0.1 86684 7884 ? S 14:02 0:00 /usr/local/apache-2.2.10/bin/httpd -k start

Now we know that Apache is the web server in use and its install directory is /usr/local/apache-2.2.10.

Let’s see how many error log directives are found in Apache’s configuration file.

# grep ErrorLog /usr/local/apache-2.2.10/conf/httpd.conf | wc -l

This is not a trivial setup. We manually reviewed httpd.conf to pinpoint settings for the sites suffering timeouts and also looked for any non-site specific logging. The server error log contained entries such as this:

[Mon Sep 21 20:03 :28 2009] [warn] (2)No such file or directory: disk_cache: rename tempfile to hdrsfile failed: /cache/aptmpXv5ETN -

> /cache/WzE/gnK/lgx/SHi/4on/e_e_76w.header

[Mon Sep 21 20:03:28 2009] [warn] (2)No such file or directory: disk_cache: rename tempfile to hdrsfile failed: /cache/aptmpnBNEO7 -

> /cache/bTc/thq/qdK/13i/ZwZ/vQsTZaA.header

As well as

[Fri Oct 02 14:57:24 2009] [notice] child pid 10940 exit signal Segmentation fault (11)

[Fri Oct 02 14:57:24 2009] [notice] child pid 11952 exit signal Segmentation fault (11)

[Fri Oct 02 14:57:25 2009] [notice] child pid 12888 exit signal Segmentation fault (11)

[Fri Oct 02 14:57:33 2009] [notice] child pid 12625 exit signal Segmentation fault (11)

[Fri Oct 02 15:05:10 2009] [notice] child pid 13459 exit signal Segmentation fault (11)

[Fri Oct 02 15:06:43 2009] [notice] child pid 13446 exit signal Segmentation fault (11)

[Fri Oct 02 15:06:44 2009] [notice] child pid 12785 exit signal Segmentation fault (11)

These are both potentially issues, especially the segmentation faults, but are they THE issue? Fixing the mod_cache errors should be as easy as making a small change to httpd.conf and restarting Apache. Exploring the source of the segmentation fault is more involved. We’ll need to updated httpd.conf to tell Apache where to dump core files then wait for another segmentation fault and finally examine the core dump with GDB. These log entries frequently happened, but not only at the time of application outages. This may not be the prime suspect, but will need to be sorted out and fixed. Let’s do more exploring before investing resources in that direction.

httpd.conf showed that Apache was using mod_jk to communicate with the application servers. The Apache directive JkLogFile specifies where mod_jk specific errors are logged. We noticed a steady stream of errors in mod_jk that coincided with the time of site outages. We’re getting closer to the source of the problem!

/usr/local/apache-2.2.10/logs/mod_jk.log:

[Mon Sep 21 16:51:29.746 2009] [30396:2510284000] [error] ajp_send_request::jk_ajp_common.c (1507): (appsrv2) connecting to b

ackend failed. Tomcat is probably not started or is listening on the wrong port (errno=111)

[Mon Sep 21 16:51:29.746 2009] [30396:2510284000] [error] ajp_service::jk_ajp_common.c (2466): (appsrv2) connecting to tomcat

failed.

[Mon Sep 21 16:51:29.845 2009] [30396:2510284000] [error] service::jk_lb_worker.c (1469): All tomcat instances failed, no more worke

rs left

[Mon Sep 21 16:51:30.918 2009] [20133:2510284000] [error] ajp_send_request::jk_ajp_common.c (1507): (appsrv2) connecting to b

ackend failed. Tomcat is probably not started or is listening on the wrong port (errno=111)

[Mon Sep 21 16:51:31.018 2009] [20133:2510284000] [error] ajp_send_request::jk_ajp_common.c (1507): (appsrv2) connecting to b

ackend failed. Tomcat is probably not started or is listening on the wrong port (errno=111)

[Mon Sep 21 16:51:31.018 2009] [20133:2510284000] [error] ajp_service::jk_ajp_common.c (2466): (appsrv2) connecting to tomcat

failed.

Let’s learn more about the connections to the application servers.

appsrv1 and appsrv2 are both defined in /usr/local/apache-2.2.10/conf/workers.properites:

---

worker.appsrv1.type=ajp13

worker.appsrv1.host=10.10.10.10

worker.appsrv1.port=60009

worker.appsrv1.lbfactor=1

#worker.appsrv1.cachesize=10

#worker.appsrv1.cache_timeout=600

worker.appsrv1.connect_timeout=15000

worker.appsrv1.prepost_timeout=10000

worker.appsrv1.connection_pool_timeout=600

worker.appsrv1.socket_keepalive=true

#worker.appsrv1.recycle_timeout=300

worker.appsrv1.domain=webapp-cluster1

worker.appsrv1.ping_mode=A

worker.appsrv2.type=ajp13

worker.appsrv2.host=10.10.10.10

worker.appsrv2.port=60009

worker.appsrv2.lbfactor=1

#worker.appsrv2.cachesize=10

#worker.appsrv2.cache_timeout=600

worker.appsrv2.connect_timeout=15000

worker.appsrv2.prepost_timeout=10000

worker.appsrv2.connection_pool_timeout=600

worker.appsrv2.socket_keepalive=true

#worker.appsrv2.recycle_timeout=300

worker.appsrv2.domain=webapp-cluster1

worker.appsrv2.ping_mode=A

----

So we’re getting a feel for the complexity of the environment. We’ve already realized we’re on load balanced web servers that are each hosting more than seventy sites. The web servers are connecting to application servers via mod_jk. The application servers are also pooled. It’s becoming clear that this is a three-tiered architecture with each tier having pooled resources.

The mod_jk errors indicated connection failures from the web servers to the application servers. Now that we know the application server particulars lets see if we can establish a connection. Let’s use telnet to manually connect to those IP addresses on port 60009.

# telnet 10.10.10.10 60009

Trying 10.10.10.10...

No connection established! The web server cannot communicate with the application server, but why? We logged into the second application server to see if we could connect to the first from there. Nope! Is this a firewall issue? Are there ACLs somewhere preventing this connection from succeeding? Let’s login directly to the application server so we can narrow down the problem as much as possible.

# telnet 127.0.0.1 6009

Trying 127.0.0.1…

We can’t connect from locally either. That means this isn’t merely a firewall or network issue, something is going on with the program responsible for listening on port 60009. Maybe not every connection fails? We ran the telnet command above a dozen or so times and noticed that about half of the connections failed, but some were successful. Let’s use the same technique we used earlier to find what program is listening on the port we’re interested in:

# netstat -anp | grep 60009 |grep ISTEN

tcp 0 0 :::60009 :::* LISTEN 27255/java

Let’s get more information on PID 27255

appsrv1# ps auxww|grep 27255

jboss 27255 101 13.2 1878776 542612 ? Sl 16:51 172:56 /usr/java/jdk1.5.0_21/bin/java -Dprogram.name=run.sh -server -Xms1024

m -Xmx1024m -XX:MaxPermSize=256m -Dlog4j.configuration=file:///opt/WebApp/app/conf/current/log4j.properties -DStaticTemplateElements

.properties=file:///opt/WebApp/app/conf/current/StaticTemplateElements.properties -DStaticTemplateElements_fr.properties=file:///opt

/WebApp/app/conf/current/StaticTemplateElements_fr.properties -DStaticTemplateElements_sp.properties=file:///opt/WebApp/app/conf/cur

rent/StaticTemplateElements_sp.properties -Djava.endorsed.dirs=/opt/WebApp/jboss/lib/endorsed -classpath /opt/WebApp/jboss/bin/run.j

ar:/usr/java/jdk1.5.0_21/lib/tools.jar org.jboss.Main

Got it, jboss. Now to find where port 60009 is defined

appsrv1# grep -r 60009 /opt/WebApp/jboss-4.0.5.GA/server/

/opt/WebApp/jboss-4.0.5.GA/server/default/deploy/jbossweb-tomcat55.sar/server.xml: <Connector port="60009" address="${jboss.bind.address}"

Netstat showed a lot of activity on port 6009.

appsrv1# netstat -an | grep 60009| wc -l

724

Is there anything interesting in the jboss logs? We noticed noticed frequent Java exceptions in the standard out log for the application (appsrv1:/opt/WebApp/jboss-4.0.5.GA/logs/ms_prod_stdout.log) but those seemed to be irrelevant and routine. Mismatching data types from user-provided inputs in incoming requests - that

sort of thing. We did notice some SQL connection exceptions, but not from the time frame of interest and that seemed to be an isolated issue only on one app server and during a single short span of time. Every single SQL query made by the application is written to appsrv1:/opt/WebApp/jboss-4.0.5.GA/logs/webapp.log which could have a performance impact, but the objective was not to write a ‘best practices’ report, but to fight fires!

At this point we took a few minutes to review server.xml and noticed that most application connections defined included a parameter called “maxThreads”, but the connection definition for 60009 did not. We quickly checked the Tomcat documentation and saw that the default maximum thread count is 200 (http://tomcat.apache.org/tomcat-5.5-doc/config/http.html). The netstat command output above shows that the server is seeing a lot more connection activity than 200 threads can handle. We modified server.xml and explicitly set maxThreads to 1000 and bounced the application. The mod_jk errors immediately ceased and the site became functional again.

The gut reaction is to celebrate, enjoy a cup of tea, and explain how you saved the day with a big grin on your face. But there are still unanswered questions. The site has been functional for years, why did this problem just recently begin? A review of traffic doesn’t show any usage spikes. Why would Tomcat be using more than 700 threads when there are only a couple of hundred client connections? Why is Apache segfaulting? Why is Apache sometimes hitting MaxClients when the visitor volume is no where near that number?

Earlier I mentioned that it had become clear we were dealing with a three-tier architecture, but so far, we’ve only examined two of the three tiers. What about the database connections? We dug through JBoss configuration files and pinpointed several database connections used. Some connections go to Oracle and others go to MSSQL. We wrote a quick shell script that made a series of connections to Oracle from the application server, executed a query, and logged the results.

We took another step and our Oracle DBA checked health and performance of Oracle server and didn't find anything that caught his attention. Oracle was definetely under utilized.

During DBA audit of SQL servers we noticed that one SQL instance producing huge disk I/O on one of the installed volume. The server had attached storage configured as 2 RAID 10 containers. One container was configured as disk E: and another one disk F:. Disk E: was used to hold SQL Server data files and F: for SQL Server logs. During performance troubleshooting we noticed huge (>1000 requests) queuing on disk E: when disk F: was not queuing at all.

First recommendation was to put some database devices on disk F:. Our client's DBA was against this solution as the insisted that log should be separated. We still moved some database device files to F: drive and saw some improvement.

The client purchased additional disks and formed 3 more volumes to split the I/O.

After moving database device files and log files around and splitting I/O through all available disk volumes we achieved almost perfect I/O spread. We observed 0 requests in disk queue on all volumes.

After resolving disk I/O bottleneck time outs stopped happening.

Michael Petrov

Founder, Chief Executive Officer

Michael brings 30 years of experience as an information architect, optimization specialist and operations’ advisor. His experience includes extensive high-profile project expertise, such as mainframe and client server integration for Mellon Bank, extranet systems for Sumitomo Bank, architecture and processing workflow for alternative investment division of US Bank. Michael possesses advanced knowledge of security standards such as ISO 27001, NIST, SOC and PCI that brings into any solutions delivered by Digital Edge. Security solutions and standards are expended into public cloud such as AWS and Azure.