Sunday, May 16, 2010

Link Header-based Invalidation of Caches

This post is an introduction to LHIC, which is something I've been working on recently.




A gateway cache (also known as a reverse proxy cache, web accelerator, etc.) is a specific type of shared cache that fronts all clients, and is generally within the same organisational boundary as the origin server.

It's a layer, and as such it may actually consist of one or many peered 'instances'. This research does not define how such a peering mechanism would work between instances; that is definitely an area for future work. For our purposes here we'll consider the gateway cache layer to be one complete component.

The primary objective for a gateway cache is to minimize demand on origin servers.


All caches work by leveraging one or more of the three principal caching mechanisms, illustrated below:

  • Expiration
  • Validation
  • Invalidation
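As a rough illustration (the header values below are invented), this is how each mechanism tends to surface on the wire:

```python
# Expiration: the origin stamps a freshness lifetime onto the response.
expiration_headers = {
    "Cache-Control": "max-age=3600",                   # reuse for up to an hour
    "Expires": "Sun, 16 May 2010 13:00:00 GMT",
}

# Validation: the origin attaches a validator the cache can later check cheaply.
validation_headers = {
    "ETag": '"v42"',                                   # revalidate with If-None-Match
    "Last-Modified": "Sat, 15 May 2010 09:30:00 GMT",  # or If-Modified-Since
}

# Invalidation: there is no standard response header that drives it. Instead the
# cache watches for unsafe requests passing through it and evicts what they affect.
UNSAFE_METHODS = {"POST", "PUT", "DELETE"}
```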


The downsides to expiration are primarily that it's inefficient and difficult to manage.

It's inefficient because the expiry period is always limited by the resource's greatest potential volatility: you can cache for no longer than the shortest interval in which the resource might change. This is particularly inefficient for resources which depend on human interaction and have periodic and dynamic volatility.

It's difficult to manage because the more efficiency you try to squeeze out of it, the greater the risk to the integrity of the cached information. This puts pressure on questions that are already very difficult to answer, such as: what should the rules be? Where are those rules stored? How are those rules governed over time?


The validation mechanism does ensure freshness, which is a significant benefit; however, this comes at the expense of the server side, which still handles each request and incurs processing and I/O costs. Validation alone is therefore not very useful for gateway caching, since the primary objective is minimising demand on the server.
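For reference, here's a minimal sketch of the validation round trip: a conditional GET answered with 304 Not Modified. The `fetch` callable and the shape of the cached entry are assumptions made for this example only:

```python
def revalidate(cached, fetch):
    """Revalidate a stored response with a conditional GET (sketch only).

    `fetch(method, uri, headers)` is assumed to perform the HTTP request and
    return a (status, headers, body) tuple.
    """
    conditional = {}
    if "ETag" in cached["headers"]:
        conditional["If-None-Match"] = cached["headers"]["ETag"]
    if "Last-Modified" in cached["headers"]:
        conditional["If-Modified-Since"] = cached["headers"]["Last-Modified"]

    status, headers, body = fetch("GET", cached["uri"], conditional)

    if status == 304:
        # Still fresh: no entity bytes were transferred, but the origin still
        # had to handle the request and regenerate/compare the validator.
        return cached
    # Otherwise a full new representation came back; replace the cache entry.
    return {"uri": cached["uri"], "headers": headers, "body": body}
```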


Combining expiration and validation effectively gives you the best of both worlds, but you still inherit the problems that come with the expiration mechanism.


Invalidation-based caching works by keeping responses cached right up until their resource's state is altered by a client request. To rely on invalidation exclusively, the cache must therefore intermediate interactions from all clients, which makes it a mechanism that is only really suited to gateway caches, and not to normal shared (i.e. forward proxy) or client-side caches.

In HTTP terms, an invalidating request is any unsafe request that receives a 2xx or 3xx response. This sequence diagram demonstrates how invalidation can work in practice.
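Here's a minimal sketch of that rule as a gateway cache might apply it; the `cache.purge()` API and helper names are assumptions of the example, not part of any spec:

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

def maybe_invalidate(cache, method, request_uri, status, response_headers):
    """Invalidate cached entries affected by a successful unsafe request (sketch)."""
    if method in SAFE_METHODS:
        return
    if not (200 <= status < 400):
        return  # only 2xx/3xx responses trigger invalidation

    # The effective request URI is always affected...
    cache.purge(request_uri)

    # ...and so are any URIs the response points at via Location or
    # Content-Location (real caches also check these are on the same host).
    for header in ("Location", "Content-Location"):
        if header in response_headers:
            cache.purge(response_headers[header])
```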


All REST constraints play important roles in enabling cacheability in general, but the main enabler of invalidation is the uniform interface, and specifically self-descriptive messages. The uniform interface and the self-descriptiveness of messages empower layering in REST and, crucially, enable intermediaries like caches to make assertions about client-server interactions. It is these assertions that are the key to the invalidation mechanism.


What are the benefits of invalidation-based caching?

Invalidation-based caches are self-governing; they are regulated naturally and dynamically by client-server interaction. This makes them much easier to manage, ensures freshness, and yields best-case efficiency in which responses are only invalidated when absolutely necessary. That best-case efficiency means responses stay cached for the longest possible period, minimising contact with the origin server and the bandwidth consumed.

So... why isn't this really used? Because there are common problems when building systems with HTTP that cause the mechanism to fail.


There are other types of problem, and variations on these two; they just happen to be the most common.



In a perfect world resources are granular and don't share state. At all. So, in the perfect-world example above, the collection is simply a series of links. This does, however, require the client to make a subsequent request for each item resource. That behaviour is generally considered overly 'chatty' and inefficient, and so in the real world clear identification of resources and their state is traded away for network efficiency gains.

This trade-off has consequences for invalidation...
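To make the trade-off concrete, here's an invented example of the two styles of collection representation (URIs and fields are purely illustrative):

```python
# 'Perfect world': the collection is just links; item state lives only at the item URIs.
collection_as_links = {
    "items": [
        {"href": "/items/1"},
        {"href": "/items/2"},
    ]
}

# 'Real world': the collection embeds each item's state to save round trips, so a
# cached copy of this response now silently shares state with /items/1 and /items/2.
collection_as_composite = {
    "items": [
        {"href": "/items/1", "name": "Foo", "price": 10},
        {"href": "/items/2", "name": "Bar", "price": 12},
    ]
}
```

Now suppose a client sends a successful PUT to /items/1: which cached responses are affected?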


How would an intermediary answer that question?

The right answer should be "None." or at least "I don't know."

Given that URIs are essentially opaque, it should, as far as intermediaries are concerned, have no effect. Making assumptions based on a perceived URI hierarchy is brittle and restrictive: what happens if this item belongs to more than one composite collection?


It's common practice to treat representations as resources in their own right, and expose them with their own URIs. This creates the same kind of situation as the composite resource problem in which these 'representation resources' share state invisibly.

You could propose to solve this by making assumptions based on 'dot notation' (e.g. guessing that /report.pdf and /report.xml share the state of /report); however, this again ignores the opacity of URIs and is equally brittle and restrictive.

There are other examples in which the existing approach to invalidation is made impossible, but they all revolve around the same core problem:


Composite and split resources are problems because they reduce visibility.

Resources share state and are therefore dependent on one another, but the uniform interface lacks the capability to express this as control data, so the dependency is not visible to intermediaries.


Link headers can be used to "beef up" the uniform interface by expressing these invisible dependencies as link relations.

Standardising the link relations allows these links to be used as control data within the uniform interface, thus increasing the self-descriptiveness of messages and improving visibility.
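As an illustration of what that control data might look like, a response could carry dependency links in its Link header. The `invalidates` relation name here is hypothetical rather than a registered relation, and the URIs are made up:

```python
# Headers on a 200 response to PUT /items/1. The Link header points out other
# cached resources whose state was just changed by this interaction.
response_headers = {
    "Content-Type": "application/xml",
    "Link": '</collections/all>; rel="invalidates", '
            '</items/1.xml>; rel="invalidates"',
}
```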

This is named "Link Header-based Invalidation of Caches" (LHIC). There are two types of LHIC:


LHIC-I is a simple mechanism that can be thought of as "pointing out" affected resources in the response. This gives the origin server dynamic control over the invalidation.

In order to secure the LHIC-I mechanism from DoS attacks in which any/all cached objects could be indicated for invalidation, it is likely that a same-domain policy would have to be adopted.

The purpose of this type of link relation is to simply increase visibility of the invalidating interaction itself.
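A minimal sketch of what the gateway cache might do with such a header as the invalidating response passes through; the relation name, the deliberately naive parser, and the `cache.purge()` API are all assumptions of the example:

```python
from urllib.parse import urlsplit

def parse_link_header(value):
    """Very naive Link header parser, good enough for this sketch."""
    links = []
    for part in value.split(","):
        target, _, params = part.partition(";")
        uri = target.strip().strip("<>")
        rel = params.split('rel="')[1].split('"')[0] if 'rel="' in params else ""
        links.append((uri, rel))
    return links

def apply_lhic1(cache, origin_host, response_headers):
    """LHIC-I sketch: purge every resource the response 'points out'."""
    for uri, rel in parse_link_header(response_headers.get("Link", "")):
        if rel != "invalidates":              # hypothetical relation name
            continue
        host = urlsplit(uri).netloc
        if host and host != origin_host:
            continue                          # same-domain policy against DoS
        cache.purge(uri)
```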


LHIC-II is a more complex mechanism that can be thought of as dependent resources "latching on" to one another. This effectively creates a 'dependency graph' within a gateway cache which can be queried against each invalidating request. LHIC-II is therefore capable of allowing invalidation to cascade along a chain of dependencies, whereas LHIC-I is only capable of handling first level dependencies.

The purpose of this type of link relation is to increase overall visibility in the system, ahead of an invalidation taking place.
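Here's a sketch of the LHIC-II idea under the same assumptions: GET responses declare the resources they depend on (say via a hypothetical rel="depends-on" link), the cache records those edges, and an invalidation then cascades along them:

```python
from collections import defaultdict

class DependencyGraph:
    """Sketch of 'latching on': dependants[x] holds the cached URIs that depend on x."""

    def __init__(self):
        self.dependants = defaultdict(set)

    def record(self, cached_uri, depends_on_uris):
        # Called when a GET response for `cached_uri` is stored, using its
        # (hypothetical) rel="depends-on" links to declare its dependencies.
        for target in depends_on_uris:
            self.dependants[target].add(cached_uri)

    def cascade(self, invalidated_uri, cache):
        # Purge the invalidated resource and everything that transitively depends on it.
        pending, seen = [invalidated_uri], set()
        while pending:
            uri = pending.pop()
            if uri in seen:
                continue
            seen.add(uri)
            cache.purge(uri)
            pending.extend(self.dependants[uri])
```

On an invalidating request for /items/1, `graph.cascade("/items/1", cache)` would purge the item itself, any collection response that declared a dependency on it, and anything that in turn depends on that collection.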



LHIC-II does not suffer the drawbacks of LHIC-I; however, it is more complex to implement and does not allow dynamic control of invalidation by origin servers.

The optimal approach is to implement both methods; this allows for both dynamic control by origin servers, and cascading invalidation.

Conclusion..

LHIC injects lost visibility back into the web

The resulting caching mechanism:
  • Is very efficient
  • Ensures freshness
  • Is easily managed
  • Leverages existing specs
Thoughts, comments, suggestions all welcome! :)

5 comments:

  1. I think your coverage of the 3 mechanisms for cache consistency is great. I will note, however, that invalidation doesn't work well if the gateway cache is distributed--whether that is a layer of Squid reverse proxy caches (peered or not), but especially for geographically distributed origins (say in multiple data centers).

    We are in this setting, and hence have had to rely on the expiration and validation mechanisms exclusively, which are ultimately more scalable due to the lack of required coordination between caches.

    In an eventually consistent world, some amount of staleness is to be expected; this is where the optimistic locking of conditional PUT or DELETE is a big win; getting a 412 in this case will let a client detect the staleness when it really matters.

    I do agree, though, that this model makes you think very hard about proper Cache-Control settings, and that's not always easy.

    Thanks for the article, though -- good food for thought.

  2. Great content, and very interesting. I was wondering what the advantages are over the Cache Channels style (http://www.mnot.net/blog/2008/01/04/cache_channels)?

  3. The interesting thing about this approach, for me, is that the invalidations are generated by client traffic itself, and client traffic is usually geographically load balanced, so that a user will see the effects of their actions take place immediately, since they're (usually) using the same set of reverse proxies.

    Regarding the relationship to Cache Channels -- to me they're very different mechanisms. Cache channels takes a bounded amount of time to propagate the invalidation, and is designed to be very reliable in doing so. Link-based invalidation can be nearly real-time, but doesn't have the reliability built into it (it depends on how you implement it).

    Also, with Cache Channels the source of invalidations is the origin server, not client traffic.

    I see the typical use case for Cache Channels as a large body of content that doesn't change often (thereby benefitting by long TTLs), but when it does change, it needs to take effect in a guaranteed and relatively short amount of time. E.g., when you have a corpus of news articles and legal needs the ability to pull an article within a minute or two.

    Meanwhile, the typical case for link-based invalidation is user-generated content, where the stream of requests affects the content itself, and you want blog entries, ratings, etc. to be noticed quickly, so that the user sees the effects of their activity while they're interacting with the site. However, the iron-clad guarantees aren't really necessary in this use case.

  4. So we're saying the 'down side' to this is in the difficulty of achieving the mechanism reliably when the gateway layer is distributed?

    Is this because of an inherent limitation of the approach, or simply because the technologies either don't exist or do exist but haven't yet been leveraged for this purpose?

    Maybe an approach that uses HTCP over a reliable multicast protocol e.g. PGM (http://tools.ietf.org/html/rfc3208) might be a feasible, scalable solution? Obviously geographical distribution would require a unicast bridge and retransmission at the other end(s).

    Any thoughts?

  5. Well, it's one aspect. Keep in mind that reliability isn't just making sure the message gets through (whether that's TCP or whatever), it's also making sure that it gets through even if the peer is down (because of network segments, multi-day maintenance, etc.).

    My experience is that multicast is difficult to get people to deploy (although that is changing a bit), and using reliable delivery mechanisms (e.g., the various message queue products) often scare people away. That's the beauty of cache channels; it manages reliable delivery without a ton of overhead (at least deployment-wise).
