Many IBM Connections 4.5 admins may have seen this kind of thing in the past:
Users see the page structure in their browser but with formatting badly mangled.
Often clearing the cache and refreshing the page, or else just waiting a few minutes and trying again, appears to resolve it. The issue is browser- and platform-independent, and is very intermittent, though it seems most common first thing in the morning or after server restarts. The browser and server logs seem to indicate that a number of files that should be downloaded to the client could not be accessed.
Seen it? I had, in several environments, but after much diagnosis and in the absence of anything more concrete, I had put it down to server or network load.
Well, it turns out that an inherent caching problem in the Connections 4.5 platform is causing it:
Major inconsistencies and distortions were seen in the UI on Production. Only a restart of the production servers resolved the issue, and it then recurred every 2-3 hours. Each time, the UI became distorted and all users were affected.
Root Cause for Situation
There is therefore a window of 3 hours 30 minutes in which the application server’s dynacache misses the cached resource, because the 30 minutes cache period elapsed, and requests now hit the application server directly. In these circumstances, if the client sends an If-Modified-Since header that matches the last modified time of the generated resource in the “in-memory cache” (which lasts for another 3 hours and 30 minutes), the Common application responds with a 304 Not Modified and an empty response body. This is fine for the client that issues the request, which reuses the response cached by the browser.
The flip side here is that the 304 Not Modified response is cached by the application server in the dynacache for the next 30 minutes. The application server will serve this stale response to every client that requests the same resource with the same values of components declared in the cachespec.xml.
Ideally, 304 responses should never be cached by dynacache, which is exactly the opposite of what the customer observed. At that point, the only way to clear the application server cache was to restart the Common application, which is what the customer had to do every time the issue occurred.
Specific Action carried out to Fix
Apply the custom JVM property com.ibm.ws.cache.CacheConfig.filteredStatusCodes, and set it to “304 404 500 502”. This instructs the application server not to cache responses whose status is included in the space-separated list, so the 304 responses are no longer cached. This custom property required iFix PM54521 on WAS, which needed to be backported to the WAS 7.0.0 FP21 that the customer’s environment had. (SM: this iFix is in WAS 8 and thus is not required in Connections 4.5 environments)
Once this was carried out, the issue was fixed and the customer’s production system is now stable.
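To make the mechanism concrete, here is a toy Python model of the behaviour described above. It is only a sketch and not WebSphere or Connections code: the class names, the single-entry cache and the timestamps are all invented for illustration, and the only details taken from the description are the 30-minute dynacache lifetime and the filtered status list.

```python
# Toy model of the cached-304 problem. All names here are hypothetical.
DYNACACHE_TTL = 30 * 60  # dynacache entry lifetime in seconds (30 minutes)


class Origin:
    """Stands in for the Common application and its in-memory resource."""
    last_modified = 1000
    body = "body { color: black; }"

    def handle(self, if_modified_since=None):
        # A matching If-Modified-Since gets an empty 304 response.
        if if_modified_since == self.last_modified:
            return 304, ""
        return 200, self.body


class FrontCache:
    """Stands in for dynacache: one entry, keyed without the request headers."""

    def __init__(self, origin, filtered_status_codes=()):
        self.origin = origin
        self.filtered = set(filtered_status_codes)
        self.entry = None  # (expires_at, status, body)

    def get(self, now, if_modified_since=None):
        if self.entry is not None and now < self.entry[0]:
            return self.entry[1], self.entry[2]  # cache hit
        status, body = self.origin.handle(if_modified_since)
        if status not in self.filtered:
            self.entry = (now + DYNACACHE_TTL, status, body)
        return status, body


def simulate(filtered_status_codes):
    cache = FrontCache(Origin(), filtered_status_codes)
    cache.get(now=0)  # first visitor warms the cache with a 200
    # 31 minutes later the entry has expired and a returning visitor
    # sends a conditional request.
    returning = cache.get(now=31 * 60, if_modified_since=1000)
    # One minute after that, a visitor with an empty browser cache arrives.
    fresh = cache.get(now=32 * 60)
    return returning, fresh


if __name__ == "__main__":
    print("no filter:  ", simulate(()))
    print("with filter:", simulate((304, 404, 500, 502)))
```

In this model the returning visitor receives a 304 in both runs, which is correct for them. Without the filter, the second visitor is then served the cached, empty 304 and has nothing to render. With the filter, the 304 is never stored and the second visitor gets the full 200 response.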
From a more permanent standpoint, below are the approaches considered:
1. Include the JVM parameter and the WAS iFix in the IC installer, so that 304 responses are excluded from caching.
2. Change the code to fix the failing comparison between the ETag value sent by the client in the If-None-Match header, which is enclosed in double quotes, and the ETag that is not enclosed in double quotes. We suspect that the performance fix (RTC 60458), which enclosed ETag headers in double quotes to accommodate Edge server caching, was incomplete. Once coded, this change would reduce the chance of incorrectly generating 304 responses.
3. Understand whether it is safe to not use dynacache to cache responses from Common, and look to handle caching entirely in the application space without relying on dynacache for caching responses.
So just in case that isn’t entirely clear, the fix is as follows:
- In the ISC, navigate to Application servers > *server* > Process definition > Java Virtual Machine > Custom properties
- Click ‘New’. Set Name to be ‘com.ibm.ws.cache.CacheConfig.filteredStatusCodes’ and Value to be ‘304 404 500 502’. Add a description as appropriate. Click OK
- Repeat for all your Connections-related servers
- Save the configuration
- Resync your nodes
- Restart the WAS servers
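If you have many servers to update, the console steps for creating the property could probably be scripted with wsadmin instead. The following is an untested sketch in Jython, assuming a distributed (non-z/OS) platform with one JVM definition per server. The function name and the node and server names in the usage comment are hypothetical, and it does not check whether the property already exists, so run it only once per server.

```python
# Sketch only: adds the JVM custom property to one server using wsadmin's
# AdminConfig object. The function name and its arguments are hypothetical.
PROP_NAME = "com.ibm.ws.cache.CacheConfig.filteredStatusCodes"
PROP_VALUE = "304 404 500 502"


def set_filtered_status_codes(admin_config, node_name, server_name):
    """Create the custom property on the first JVM definition of a server."""
    server_id = admin_config.getid(
        "/Node:%s/Server:%s/" % (node_name, server_name))
    # Assumption: the first JavaVirtualMachine entry is the one to change.
    jvm_id = admin_config.list("JavaVirtualMachine", server_id).splitlines()[0]
    attrs = [["name", PROP_NAME], ["value", PROP_VALUE]]
    return admin_config.create("Property", jvm_id, attrs)


# Inside wsadmin (started with -lang jython), the calls would look like
# this, with your own node and server names in place of these made-up ones:
#
#   set_filtered_status_codes(AdminConfig, "ExampleNode01", "ExampleServer1")
#   AdminConfig.save()
```

Saving the configuration is left to the caller so that several servers can be changed before a single save. Resyncing the nodes and restarting the servers are still needed afterwards, as in the steps above.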
In my experience this has resolved the issue for all environments that have experienced it (there have been 4 or 5 so far), and thus I am setting it for all new installs. It may well be that IBM will take steps to prevent this using one of the methods discussed in the quote above, but for now this is still the best solution we have.
A huge thank you to the brilliant David McCarthy in IBM Dublin for helping us track down this nasty issue!