This document will explain how dispatcher caching happens and how it can be configured.

Caching Directories

We use the following default cache directories in our baseline installations

  • Author
    • /mnt/var/www/author
  • Publisher
    • /mnt/var/www/html

When each request traverses the dispatcher the requests follow the configured rules to keep a locally cached version to response of eligible items.

Note:

We intentionally keep published workload seperate from the author workload because when Apache looks for a file in the DocumentRoot it doesn't know which AEM instance it came from.  So even if you have cache disabled in the author farm, if the author's DocumentRoot is the same as publisher it will serve files from the cache when present.  Meaning you'll serve author files for from published cache and make for a really awful mix match experience for your visitors.

 

Keeping seperate DocumentRoot directories for different published content is also a very bad idea.  You'll have to create multiple re-cached items that don't differ between sites like clientlibs as well as having to setup a replication flush agent for each DocumentRoot you setup.  Increasing the amount of flush over head with each page activation.  Rely on namespace of files and their full cached paths and avoid mutliple DocumentRoot's for published sites.

Configuration Files

Dispatcher controls what qualifies as cacheable in the /cache { section of any farm file.  In the AMS baseline configuration farms you'll find our includes like shown below:

/cache {
    /rules {
        $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any"
    }

When creating the rules for what to cache or not please refer to the documentation here

Caching Author

There are a lot of implementations we've seen where people don't cache author content.  They are missing out on a huge upgrade in performance and responsiveness to their authors.

Let's talk about the strategy taken in configuring our author farm to cache properly.

Here is a base author /cache { section of our author farm file:

/cache {
    /docroot "/mnt/var/www/author"
    /statfileslevel "2"
    /allowAuthorized "1"
    /rules {
        $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any"
    }
    /invalidate {
        /0000 {
            /glob "*"
            /type "allow"
        }
    }
    /allowedClients {
        /0000 {
            /glob "*.*.*.*"
            /type "deny"
        }
        $include "/etc/httpd/conf.dispatcher.d/cache/ams_author_invalidate_allowed.any"
    }
}

The important things to note here are that the /docroot is set to us the cache directory for author.

Note:

Make sure your DocumentRoot in the author's .vhost file matches the farms /docroot parameter.

The cache rules include statement includes the file /etc/httpd/conf.dispatcher.d/cache/ams_author_cache.any which contains these rules:

/0000 {
	/glob "*"
	/type "deny"
}
/0001 {
	/glob "/libs/*"
	/type "allow"
}
/0002 {
	/glob "/libs/*.html"
	/type "deny"
}
/0003 {
	/glob "/libs/granite/csrf/token.json"
	/type "deny"
}
/0004 {
	/glob "/apps/*"
	/type "allow"
}
/0005 {
	/glob "/apps/*.html"
	/type "deny"
}
/0006 {
	/glob "/libs/cq/core/content/welcome.*"
	/type "deny"
}

In an author scenario content is changing all the time and on purpose.  You only want to cache items that are not going to change frequently.  We have rules to cache /libs because they are part of the baseline AEM install and would change until you have installed a Service Pack, Cumulative Fix Pack, Upgrade, or Hotfix.  So caching these elements make a ton of sense and really have huge benefits of the author experience of end users who use the site.

Note:

Keep in mind that these rules also cache /apps this is where custom application code lives.  If you're developing your code on this instance then it will prove to be very confusing when you save your file and don't see if reflect in the UI due to it serving up a cached copy.  The intention here is that if you do a deployment of your code into AEM it too would be infrequent and part of your deployment steps should be to clear the author cache.  Again the benefit is huge making your cacheable code run faster for the end users.

ServeOnStale (AKA Serve on Stale / SOS )

This is one of those gems of a feature of the dispatcher.  If the publisher is under load or has become unresponsive it will typically throw a 502 or 503 http response code.  If that happens and this feature is enabled the dispatcher will be instructed to still serve what ever content is still in the cache as a best effort even if it's not a fresh copy.  It's better to serve something if you've got it rather than just showing an error message that offers no functionality.

Note:

Keep in mind that if the publisher renderer is having a socket timeout or 500 error message this feature will not trigger.  If AEM is just unreachable this feature does nothing

This setting can be set in any farm but only makes sense to apply it on the publish farm files.  Here is a syntax example of the feature enabled in a farm file:

/cache {
    /serveStaleOnError "1"

Caching Pages with Query Params / Arguments

Note:

One of the normal behaviors of the Dispatcher module is that if a request has a query parameter in the URI (typically shown like /content/page.html?myquery=value) it will skip caching the file and go directly to the AEM instance.  It's considering this request a dynamic page and shouldn't be cached.  This can cause i'll effects on cache efficiency

If you have pages in AEM that take GET arguments / query parameters in the address line that help the page operate but doesn't render different html are good candidates for this configuration element.

You can tell dispatcher which arguments to ignore and still cache the page.

As an example someone has built a social media deep link reference mechanism that uses the argument reference in the URI to know where the person came from.

Example Usage:

https://www.weretail.com/home.html?reference=android

https://www.weretail.com/home.html?reference=facebook

The page is 100% cacheable but doesn't cache because the arguments are present.  To work around this we add the following section to our farm configuration file:

/cache {
    /ignoreUrlParams {
        /0001 { /glob "*" /type "deny" }
        /0002 { /glob "reference" /type "allow" }
    }

Now when the dispatcher sees the request it will ignore the fact that the request has the query parameter of ?reference and still cache the page

Caching Response Headers

It's pretty obvious that the dispatcher caches .html pages and clientlibs, but did you know it can also cache particular response headers along side the content in a file with the same name but a .h file extension.  This allows the next response to not only the content but the response headers that should go with it from cache. 

AEM can handle more than just UTF-8 encoding (big grin)

Sometimes items have special headers that help control cache TTL's encoding details, and last modified timestamps.

These values when cached are stripped by default and the apache httpd webserver will do it's own job of processing the asset with it's normal file handling methods, which normally is limited to mime type guessing based on file extensions.

If you have the dispatcher cache the asset and the desired headers you can expose the proper experience and assure the all the details make it to the clients browser.

Here is an example of a farm with the headers to cache specified:

/cache {
	/headers {
		"Cache-Control"
		"Content-Disposition"
		"Content-Type"
		"Expires"
		"Last-Modified"
		"X-Content-Type-Options"
	}
}

In the example they have configured AEM to serve up headers the CDN looks for to know when to invalidate it's cache.  Meaning now AEM can properly dictate which files get invalidated based on headers.

Note:

Keep in mind you cannot use regular expressions or glob matching.  It's a literal list of the headers to cache.  Only put in a list of the literal headers you want it to cache.

Auto-Invalidate Grace Period

On AEM systems that have a lot of activity from authors that do alot of page activations you can have a race condition where repeat invalidations occur.  Heavily repeated flush requests are un-necessary and you can build in some tolerance to not repeat a flush until the grace period has cleared.

Example of how this works:

If you have 5 request to invalidate /content/exampleco/en/ all happen within a 3 second period.

With this feature off you'd invalidate the cache directory /content/exampleco/en/ 5 times

With this feature on and set to 5 seconds it would invalidate the cache directory /content/exampleco/en/ once

Here is an example syntax of this feature being configured for 5 second grace period:

/cache {
    /gracePeriod "5"

TTL Based Invalidation

A newer feature of the dispatcher module was Time To Live (TTL) based invalidation options for cached items.  When an item gets cached it looks for the presence of cache control headers and generates a file in the cache directory with the same name and a .ttl extension. 

Here is an example of the feature being configured in the farm configuration file:

/cache {
    /enableTTL "1"

Note:

Keep in mind that AEM still needs to be configured to send TTL headers for dispatcher to honor them.  Toggling this feature only enables the dispatcher to know when to remove the files that AEM has send cache control headers for.  If AEM doesn't start sending TTL headers then dispatcher won't do anything special here.

Cache Filter Rules

Here is an example of a baseline configuration for which elements to cache on a publisher:

/cache{
    /0000 {
        /glob "*"
        /type "allow"
    }
    /0001 {
        /glob "/libs/granite/csrf/token.json"
        /type "deny"
    }

We want to make our published site greedy as possible and cache everything.

If there is elements that break the experience when cached you can add rules to remove the option to cache that item.  As you see in the example above the csrf tokens shouldn't ever be cached and have been excluded.  Further details on writing these rules can be found here