Tuesday, June 8, 2010

Please Accept: application/hal+xml

Here's an example of something I'm calling the 'Hypermedia Application Language' (hal):

http://gist.github.com/430011

Hal just defines a standard way to express hyperlinks in xml via a simple <link> element. The link element has the following attributes: @rel @href @name
  • Simple links can be written as solo/self-closing tags.
  • Links used to indicate embedded representations from other resources should be written with open and close tags, with the embedded representation contained within.
  • The root element must always be a link with an @rel of self and an appropriate @href value.
  • @name must be unique between all links in a document with the same @rel value, but is not unique within the entire document. i.e. a link element cannot be referred to by @name alone (thanks to Darrel for this)
  • @href value may contain a URI template
Some questions that have arisen for me:

What can/can't you do with this media type?

Did I just reinvent RDF/XML?

Is it enough to implement a system with this media type and to provide documentation to clients as "application specs" or "flow specs" describing the various ways link relations can be traversed to get stuff done?

As usual - all thoughts, comments, suggestions welcome! :)

Sunday, May 16, 2010

Link Header-based Invalidation of Caches

This post is an introduction to LHIC, which is something I've been working on recently.




Also known as a reverse proxy cache, web accelerator, etc – it's  a specific type of shared cache that is shared by all clients, and is generally within the same organisational boundary as the origin server.

It's a layer, and as such it may actually consist of one or many peered 'instances', this research does not define how such a peering mechanism would work between instances, and this is definitely an area for future work. For our purposes here we'll consider the gateway cache layer to be one complete component.

The primary objective for a gateway cache is to minimize demand on origin servers.


All caches work by leveraging one or more of the 3 principal caching mechanisms:

  • Expiration
  • Validation
  • Invalidation


The downsides to expiration are primarily that it's inefficient, and difficult to manage.

It's inefficient because the expiry period is always limited in length to the resource's greatest potential volatility. This is particularly inefficient for resources which depend on human interaction and have periodic and dynamic volatility.

It's difficult to manage because the more efficiency you try to squeeze out of it, the greater the risk to the integrity of the cached information - this puts pressure on questions that are already very difficult to answer such as; What should the rules be? Where are those rules stored? How are those rules governed over time?


Ensuring freshness is a property of the validation mechanism which does have significant benefits, however this is at the expense of the server side which will still handle each request and incur processing and I/O costs. This is therefore not useful for gateway caching since the primary objective is minimizing demands on the server.


Using a combination of both expiration and validation will effectively give you the best of both worlds, but you will still inherit the problems that come with the expiration mechanism.


Invalidation-based caching works by keeping responses cached right up until their resource's state is altered by a client request. Therefore, in order to rely on invalidation exclusively, the cache must intermediate interactions from all clients, and is therefore a mechanism that is only really suited to gateway caches, and not to normal shared (i.e. forward proxy) or client-side caches.

In HTTP terms an invalidating request is any non-safe request that receives a 200 or 300 response, this sequence diagram demonstrates how invalidation can work in practice.


All REST constraints play important roles in enabling cache-ability in general, however the main enabler of invalidation is the uniform interface, and specifically self-descriptive messages. The uniform interface and self-descriptiveness of messages empower layering in REST and crucially it enables intermediaries like caches to make assertions about client-server interactions. It is these assertions that are the key to the invalidation mechanism.


What are the benefits of invalidation-based caching?

Invalidation-based caches have self control; they are governed naturally and dynamically by client-server interaction. This makes them much easier to manage, ensures freshness, and operates with best-case efficiency in which responses are only invalidated when absolutely necessary. This best-case efficiency results in responses being cached for the longest possible period minimising contact with origin server, and bandwidth consumed.

So.. why isn't this really used? Because there are common problems when building systems with HTTP, that cause the mechanism to fail.


There are other types of problem and variations on these two, they just happen to be the most common.



In a perfect world resources are granular and don't share state. At all. So - in the perfect world example above, the collection is simply a series of links. This does, however, require any client to make several subsequent requests for each item resource. This behaviour is generally considered overly 'chatty' and inefficient  and therefore in the real world clear identification of resources and their state is traded-away for network efficiency gains.

This trade-off has consequences for invalidation..


How would an intermediary answer that question?

The right answer should be "None." or at least "I don't know."

Given that URI's are essentially opaque it should, as far as intermediaries are concerned, have no effect. Using assumptions of perceived URI hierarchy is brittle and restrictive - what happens if this item belongs to more than one composite collection?


It's common practice to treat representations as resources in their own right, and expose them with their own URIs. This creates the same kind of situation as the composite resource problem in which these 'representation resources' share state invisibly.

You could propose to solve this by making assumptions using 'dot notation' however this is again ignoring the opacity of URIs and is brittle and restrictive.

There are other examples in which the existing approach to invalidation is made impossible, but they all revolve around the same core problem:


Composite and split resources are problems because they reduce visibility.

Resources share state, and are therefore dependent, but the uniform interface lacks the capability to express this as control-data; and it is therefore not visible to intermediaries.


Link headers can be used to "beef up" the uniform interface by expressing these invisible dependencies as link relations.

Standardising the link relations allows these links to be used as control data within the uniform interface; thus increasing self-descriptiveness of messages and visibility.

This is named "Link Header-based Invalidation of Caches" (LHIC). There are two types of LHIC:


LHIC-I is a simple mechanism that can be thought of as "pointing out" affected resources in the response. This gives the origin server dynamic control over the invalidation.

In order to secure the LHIC-I mechanism from DoS attacks in which any/all cached objects could be indicated for invalidation, it is likely that a same-domain policy would have to be adopted.

The purpose of this type of link relation is to simply increase visibility of the invalidating interaction itself.


LHIC-II is a more complex mechanism that can be thought of as dependent resources "latching on" to one another. This effectively creates a 'dependency graph' within a gateway cache which can be queried against each invalidating request. LHIC-II is therefore capable of allowing invalidation to cascade along a chain of dependencies, whereas LHIC-I is only capable of handling first level dependencies.

The purpose of this type of link relation is to increase overall visibility in the system, ahead of an invalidation taking place.



LHIC-II does not suffer the drawbacks of LHIC-I, however it is more complex to implement and does not allow dynamic control of invalidation by origin servers.

The optimal approach is to implement both methods; this allows for both dynamic control by origin servers, and cascading invalidation.

Conclusion..

LHIC injects lost visibility back into the web

The resulting caching mechanism is
  • Very efficient
  • Ensures freshness
  • Easily managed
  • Leverages existing specs
Thoughts, comments, suggestions all welcome! :)

Wednesday, April 28, 2010

My slides from #wsrest2010

On Monday, I gave a presentation on the accepted paper myself and Michael Hausenblas wrote for ws://rest.2010 @ WWW2010.

Here are the slides:


I'll write a proper post about our paper once I'm back in the UK

Sunday, November 22, 2009

RESTful PHP Applications Despite Zend Framework


At the moment there’s a couple of blog posts doing the rounds explaining and promoting the new REST components in Zend Framework.

Here’s the problem:



Why do I say that?

REST isn't just collection/item CRUD over HTTP, and it's not about URI templating. Frameworks that don’t agree with this are needlessly restricting and/or confusing developers; when the only (debatable) ‘benefit’ of the approach is that collection and item resources can be handled by one controller and one route.

It bothers me so much I made an extension for Zend Framework that approaches things a bit better – it’s called Resauce:

Resauce is a small extension to Zend Framework. It gives you a lightweight framework for building RESTful HTTP interfaces with PHP.

The main idea is to treat each controller as a Resource; i.e. a controller is responsible for one resource. This is distinct from other solutions (such as the REST components included in ZF itself) where a controller is responsible for both collection and item resources.

Example routes in a Resauce application:

‘/blog’ >> BlogController
‘/blog/:post’ >> BlogPostController
‘/blog/:post/comments’ >> BlogPostCommentsController

A route in a Resauce application simply maps a URI pattern to a Resauce controller – no controller action should ever be specified in a route. Instead – controller actions are ‘routed’ according to the HTTP method of a given request:

GET >> getAction()
PUT >> putAction()
POST >> postAction()
DELETE >> deleteAction()
HEAD >> headAction()
OPTIONS >> optionsAction()

If a request is routed to a Resauce controller which does not have an action implemented for the given HTTP method, then the Resauce will automatically respond with a 405 Method Not Allowed response code.

.. More on github

The Resauce approach to MVC provides much more flexibility over how resources can be exposed than other alternatives – you have complete control of URI patterns and each resource’s HTTP methods

Check it out, and let me know what you think :)