Issue
Many custom applications call web services from AEM, using Apache Commons HTTP Client or other libraries. When the back-end systems perform poorly, AEM response times suffer as well. In addition, if too many threads hang, this can lead to slow JVM garbage collection, out-of-memory errors, OS thread exhaustion, and similar problems.
When capturing thread dumps or heap dumps from AEM, you observe many threads waiting on the web service calls.
Example Scenario:
Below [1] is an example stack trace from a JVM thread dump. It was captured from an AEM instance whose application was calling a badly performing back-end service.
The thread dump showed a few issues:
- The request thread below was stuck waiting on a web service that wasn't responding. No socket read timeout was set, so the thread would wait indefinitely. See the Resolution section for the solution.
- Hundreds of threads were waiting on the single request thread. This was because the thread pool was configured to be single-threaded, with requests queued first in, first out. Having so many pending threads caused high memory utilization. See the Resolution section for the solution.
- The web service that was being called was taking too long to respond. This caused the aforementioned thread pile-up.
Note the highlighted stack lines:
- In yellow, you can see that the custom application is using Spring Framework's RestTemplate to make a web service call.
- In orange, you can see that Spring Framework uses Apache Commons HTTP Client for its web service calls.
- In red, you can see that the thread was stuck in SocketInputStream.socketRead, which means it was waiting on the web service for a response.
[1]
Cause
Below are some common causes:
- The web service host is unreachable and the socket "connection timeout" is not set or is set too high (for example, 10 minutes).
- The web service is active, but is responding with an error that the application does not handle.
- The web service is active, but is responding too slowly or hanging during the response. If the socket "read timeout" is not set, threads wait indefinitely for a response. If it is set too high, they wait far longer than necessary.
- The thread pool is configured to allow one request at a time. This causes concurrent request threads to wait.
Resolution
Solutions to these problems are the following:
- Set the "connection timeout" to a reasonable value, for example, 3 seconds.
- Set the "read timeout" to a reasonable value. This value depends on how long you expect the responses to take. For most small web services, 10 seconds is reasonable. However, some web services that do a lot of processing or send and/or receive large files require a higher read timeout setting.
- Refer to the documentation of your HTTP client or web service client library on how to manage multi-threading efficiently. For example, for Apache Commons HTTP Client 3.x, see the threading guide in the HttpClient 3.x documentation.
- Cache the responses from service calls. This applies when you are calling the service using the same parameters and receive the same result more than once.
- If the data is shared (not user-specific), consider calling the web service ahead of time in a background thread and storing the result for later use. For example, use a Sling job, Sling scheduler, or AEM polling importer to manage the background thread.
- If you have a high volume of outgoing calls, implement a back-off algorithm to deal with failures. When calls to the service fail X times in a row, the application stops trying for a certain amount of time and immediately reports errors instead. It then tries up to X times again. If the calls fail or time out again, it waits for a longer period, and so on. The wait time usually increases exponentially up to a maximum cap (a "truncated exponential back-off" algorithm).
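
The two timeout items above can be illustrated with a minimal sketch. It uses the JDK's built-in java.net.HttpURLConnection rather than Apache Commons HTTP Client so that it is self-contained. Other client libraries expose equivalent settings under different names. The class name TimeoutExample, the method name open, and the 3-second and 10-second values are assumptions chosen for illustration, so adjust them to your service.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper class; the names are illustrative only.
final class TimeoutExample {
    static final int CONNECT_TIMEOUT_MS = 3_000;   // "connection timeout": 3 seconds
    static final int READ_TIMEOUT_MS = 10_000;     // "read timeout": 10 seconds

    // Prepares a connection with both timeouts set.
    // No network I/O happens here; the socket is opened later,
    // on connect() or when the response is first read.
    static HttpURLConnection open(String serviceUrl) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(serviceUrl).toURL().openConnection();
        conn.setConnectTimeout(CONNECT_TIMEOUT_MS);
        conn.setReadTimeout(READ_TIMEOUT_MS);
        return conn;
    }
}
```

With both values set, a connect or read that exceeds its limit raises java.net.SocketTimeoutException instead of blocking the request thread indefinitely. The calling code should catch that exception and report an error to the user.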
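
The back-off item above could be sketched roughly as follows. This is one possible reading of the description, not a reference implementation. The class name BackoffGate, its method names, and the injectable clock are assumptions introduced for illustration.

```java
import java.util.function.LongSupplier;

// Hypothetical class; names and thresholds are illustrative assumptions.
final class BackoffGate {
    private final int maxFailures;     // "X": consecutive failures before pausing
    private final long baseDelayMs;    // length of the first pause
    private final long maxDelayMs;     // cap on the pause length
    private final LongSupplier clock;  // millisecond clock, injectable for testing

    private int failures;              // consecutive failures in the current round
    private int rounds;                // failed rounds since the last success
    private long blockedUntilMs;       // calls are rejected before this time

    BackoffGate(int maxFailures, long baseDelayMs, long maxDelayMs, LongSupplier clock) {
        this.maxFailures = maxFailures;
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.clock = clock;
    }

    // True if the caller may attempt the web service call now.
    // When false, the caller should report an error immediately.
    synchronized boolean allowRequest() {
        return clock.getAsLong() >= blockedUntilMs;
    }

    // A successful call resets all back-off state.
    synchronized void recordSuccess() {
        failures = 0;
        rounds = 0;
        blockedUntilMs = 0;
    }

    // After maxFailures consecutive failures, pause for
    // baseDelayMs * 2^rounds, truncated at maxDelayMs.
    synchronized void recordFailure() {
        failures++;
        if (failures < maxFailures) {
            return;
        }
        failures = 0;
        long delay = Math.min(maxDelayMs, baseDelayMs << Math.min(rounds, 30));
        rounds++;
        blockedUntilMs = clock.getAsLong() + delay;
    }
}
```

In production code, the gate could be created with System::currentTimeMillis as the clock. A caller would check allowRequest() before each outgoing call, then invoke recordSuccess() or recordFailure() (including on timeouts) depending on the outcome.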