Sunday, November 14, 2010

Cookies are gross

Alpha-geek Jeff Atwood wrote a post yesterday called Breaking the Web's Cookie Jar. It's about MitM and session hijacking attacks on HTTP (à la Firesheep).

Jeff says that instead of us using HTTPS for everything, he'd "rather see a better, more secure identity protocol than ye olde HTTP cookies".

The thing is.. it already exists and has done since 1997, see:

http://en.wikipedia.org/wiki/Digest_access_authentication
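For anyone who hasn't looked at Digest in a while, here's a rough sketch of how the client's response value is calculated with qop=auth, as I understand RFC 2617. The function names and every example value (user, realm, password, nonces, URI) are made up for illustration. The point is that what goes over the wire is a hash bound to the server's nonce and to the request itself, not a reusable token like a session cookie.

```python
import hashlib


def md5_hex(text: str) -> str:
    """Hex MD5 of a string, the hash Digest uses by default."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """Compute the Digest 'response' value for qop=auth (sketch only)."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")  # secret part, never sent
    ha2 = md5_hex(f"{method}:{uri}")                 # binds the hash to this request
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")


# Hypothetical values, purely for illustration.
first = digest_response("alice", "example", "s3cret", "GET", "/inbox",
                        nonce="abc123", nc="00000001", cnonce="xyz")
other_uri = digest_response("alice", "example", "s3cret", "GET", "/outbox",
                            nonce="abc123", nc="00000001", cnonce="xyz")
other_nonce = digest_response("alice", "example", "s3cret", "GET", "/inbox",
                              nonce="def456", nc="00000001", cnonce="xyz")
```

A value sniffed for one request and nonce doesn't match a different URI or a fresh nonce, which is the property a plain session cookie lacks.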

There is a more fundamental reason to hate cookies and avoid using them in your application:

Cookies are crap because they are used to create shared state over a protocol that is deliberately stateless, and session hijacking is just one example of the problems that causes.

They were a hack designed to overcome the historical limitations of browsers as a client; browsers sucked at managing and presenting their client-side state to users, so cookies were created as a way to offload that responsibility to the server which would maintain the shared state and render that should-be-client-side state into the web pages. This approach to application design is not scalable and is horribly inefficient. You can't cache pages that vary by the client accessing them. This led to applications requiring things like ESI just to stay performant (I'm looking at you, Magento).

Fortunately, with 'HTML5 & co.' browsers are now developing more sophisticated features which are eliminating the need for shared state altogether.

Back to the point then, I suppose.. Auth is, as far as I can tell, relatively well covered by HTTP Digest but for three things:

  1. No control over the login box gives designers a heart attack
  2. There's no way to log out other than closing the browser (which people don't do all that much anymore)
  3. A feasible subset of Digest features should be identified to keep things simple and help interop

Hopefully that will lead to "a better online identity solution than creaky old HTTP cookies" and, in the process, help us to move away from cookies entirely.

p.s: WebID looks interesting, too.

Monday, October 4, 2010

Evolving HAL

I've decided to revise and update HAL after some feedback and pondering.

HAL is a hypertext format for m2m interaction. It provides the following hypermedia factors:

  • Embedded Links (LE)
  • Out-bound navigational links (LO)
  • Templated Queries (LT)
  • Link relations (CL)

HAL specifies two elements:
  • link
  • resource

Both elements share the following attributes:

  • @rel
  • @href
  • @name

The link and resource elements differ in the following ways:

Link elements..
  • The link element is intended for representing out-bound links and should be written with solo/self-closing tags.
  • @href value of a link element may contain a URI template to express a templated query link.

Resource elements..
  • The resource element is intended for representing the embedded state of other resources and should be written with open and close tags, with the embedded representation contained within.
  • The root element must always be a resource with an @rel of self and an appropriate @href value.

Other rules:
  • @name must be unique between all HAL elements (link + resource) with the same @rel value in a document, but should not be considered unique within the entire document. This means a link element cannot be referred to by @name alone (thanks to Darrel for this)
  • The subject of an @rel is always directed at the closest parent resource element.
    e.g. A link that appears within an embedded resource relates to the embedded resource, and not the root resource.
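To make the "closest parent resource" rule a bit more concrete, here's a minimal sketch of a client walking a HAL document and working out the subject of each @rel. The sample document, its URIs and the function name are all invented for illustration, and this is just one way a client might do it rather than anything the format mandates.

```python
import xml.etree.ElementTree as ET

# Hypothetical HAL document: a root resource, a templated query link,
# and one embedded resource with its own out-bound link.
HAL_DOC = """
<resource rel="self" href="/orders">
  <link rel="search" href="/orders?status={status}" name="by-status" />
  <resource rel="item" href="/orders/1" name="first">
    <link rel="customer" href="/customers/7" />
  </resource>
</resource>
"""


def link_subjects(xml_text):
    """Return (subject href, rel, target href) for every link/resource element."""
    root = ET.fromstring(xml_text.strip())
    # The root element must be a resource with rel="self" and an href.
    if root.tag != "resource" or root.get("rel") != "self" or not root.get("href"):
        raise ValueError("root must be <resource rel='self' href='...'>")

    results = []

    def walk(element, subject):
        for child in element:
            if child.tag in ("link", "resource"):
                results.append((subject, child.get("rel"), child.get("href")))
            # Anything inside an embedded resource relates to that resource;
            # anything else keeps the current subject.
            walk(child, child.get("href") if child.tag == "resource" else subject)

    walk(root, root.get("href"))
    return results


subjects = link_subjects(HAL_DOC)
```

Here the customer link is attributed to /orders/1, the embedded resource, and not to the root /orders resource.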

Here is an example:



And here's how that might look in json:

Thursday, September 30, 2010

LCI and The Mechanism Formerly Known as LHIC

Yesterday, Mark Nottingham released LCI for the Squid cache (http://github.com/mnot/squid-lci).

LCI (Linked Cache Invalidation) is virtually identical to the LHIC stuff I laid out in the WS-REST 2010 paper earlier this year. Mark was actually working on his LCI implementation even before I started writing the paper - which I didn't know at the time. That's a cool thing to find out about, though :)

The only real difference between LCI and LHIC is the name of the link relations used i.e:

invalidates vs dependant
invalidated-by vs dependsOn


I think I prefer the names Mark picked, and given that they are exactly the same approach it makes sense to give up one set of terms. This is why LHIC will now simply be referred to as LCI or The Mechanism Formerly Known As LHIC.

It's great to see an actual implementation getting released into the wild.. definitely looking forward to playing around with it over the next few days and seeing what people make of it.

Cheers,
Mike

Tuesday, June 8, 2010

Please Accept: application/hal+xml

Here's an example of something I'm calling the 'Hypermedia Application Language' (hal):

http://gist.github.com/430011

HAL just defines a standard way to express hyperlinks in XML via a simple <link> element. The link element has the following attributes: @rel, @href and @name.
  • Simple links can be written as solo/self-closing tags.
  • Links used to indicate embedded representations from other resources should be written with open and close tags, with the embedded representation contained within.
  • The root element must always be a link with an @rel of self and an appropriate @href value.
  • @name must be unique between all links in a document with the same @rel value, but is not unique within the entire document. i.e. a link element cannot be referred to by @name alone (thanks to Darrel for this)
  • @href value may contain a URI template

Some questions that have arisen for me:

What can/can't you do with this media type?

Did I just reinvent RDF/XML?

Is it enough to implement a system with this media type and to provide documentation to clients as "application specs" or "flow specs" describing the various ways link relations can be traversed to get stuff done?

As usual - all thoughts, comments, suggestions welcome! :)

Sunday, May 16, 2010

Link Header-based Invalidation of Caches

This post is an introduction to LHIC, which is something I've been working on recently.




A gateway cache, also known as a reverse proxy cache, web accelerator, etc., is a specific type of shared cache that is shared by all clients and generally sits within the same organisational boundary as the origin server.

It's a layer, and as such it may actually consist of one or many peered 'instances'. This research does not define how such a peering mechanism would work between instances; this is definitely an area for future work. For our purposes here we'll consider the gateway cache layer to be one complete component.

The primary objective for a gateway cache is to minimize demand on origin servers.


All caches work by leveraging one or more of the 3 principal caching mechanisms:

  • Expiration
  • Validation
  • Invalidation


The downsides to expiration are primarily that it's inefficient, and difficult to manage.

It's inefficient because the expiry period is always limited in length to the resource's greatest potential volatility. This is particularly inefficient for resources which depend on human interaction and have periodic and dynamic volatility.

It's difficult to manage because the more efficiency you try to squeeze out of it, the greater the risk to the integrity of the cached information. This puts pressure on questions that are already very difficult to answer, such as: What should the rules be? Where are those rules stored? How are those rules governed over time?


The validation mechanism ensures freshness, which is a significant benefit. However, this comes at the expense of the server side, which still handles each request and incurs processing and I/O costs. Validation on its own is therefore not useful for gateway caching, since the primary objective is minimizing demand on the origin server.
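A tiny simulation shows why validation alone doesn't take load off the origin. This is a sketch under assumptions: the Origin class, its hit counter and the SHA-1 based ETag are hypothetical stand-ins for a real server handling If-None-Match.

```python
import hashlib


class Origin:
    """Toy origin server that supports ETag validation."""

    def __init__(self, body):
        self.body = body
        self.requests_handled = 0  # every request costs the origin something

    def get(self, if_none_match=None):
        self.requests_handled += 1
        # The origin still has to load state and compute the validator.
        etag = '"' + hashlib.sha1(self.body.encode("utf-8")).hexdigest() + '"'
        if if_none_match == etag:
            return 304, etag, None      # Not Modified: no body sent
        return 200, etag, self.body


origin = Origin("hello")
status, etag, body = origin.get()                      # first fetch
revalidations = [origin.get(if_none_match=etag)[0]     # three revalidations
                 for _ in range(3)]
```

Every revalidation comes back as a 304, so bandwidth is saved, but the origin has still handled all four requests.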


Using a combination of both expiration and validation will effectively give you the best of both worlds, but you will still inherit the problems that come with the expiration mechanism.


Invalidation-based caching works by keeping responses cached right up until their resource's state is altered by a client request. In order to rely on invalidation exclusively, the cache must intermediate interactions from all clients. It is therefore a mechanism that is only really suited to gateway caches, and not to normal shared (i.e. forward proxy) or client-side caches.

In HTTP terms, an invalidating request is any non-safe request that receives a 2xx or 3xx response. This sequence diagram demonstrates how invalidation can work in practice.
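In code, the basic mechanism might look something like this. It's a minimal sketch, not a real cache: ToyOrigin, GatewayCache and the /items/1 resource are all hypothetical, and it assumes "2xx or 3xx" means any status from 200 to 399.

```python
SAFE_METHODS = {"GET", "HEAD"}


class ToyOrigin:
    """Hypothetical origin holding one resource and counting requests."""

    def __init__(self):
        self.state = {"/items/1": "v1"}
        self.hits = 0

    def __call__(self, method, uri, body=None):
        self.hits += 1
        if method == "GET":
            return (200, self.state[uri]) if uri in self.state else (404, None)
        if method == "PUT":
            self.state[uri] = body
            return 204, None
        return 405, None


class GatewayCache:
    """Caches GET responses until a successful non-safe request hits the same URI."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def handle(self, method, uri, body=None):
        if method == "GET" and uri in self.store:
            return self.store[uri]                  # served without touching origin
        status, response_body = self.origin(method, uri, body)
        if method == "GET" and status == 200:
            self.store[uri] = (status, response_body)
        elif method not in SAFE_METHODS and 200 <= status < 400:
            self.store.pop(uri, None)               # invalidating request
        return status, response_body


origin = ToyOrigin()
cache = GatewayCache(origin)
cache.handle("GET", "/items/1")
cache.handle("GET", "/items/1")
hits_after_two_gets = origin.hits
cache.handle("PUT", "/items/1", "v2")
after_put = cache.handle("GET", "/items/1")
```

The second GET never reaches the origin, and the GET after the PUT picks up the new state without any expiry period being involved.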


All REST constraints play important roles in enabling cache-ability in general, however the main enabler of invalidation is the uniform interface, and specifically self-descriptive messages. The uniform interface and self-descriptiveness of messages empower layering in REST and, crucially, they enable intermediaries like caches to make assertions about client-server interactions. It is these assertions that are the key to the invalidation mechanism.


What are the benefits of invalidation-based caching?

Invalidation-based caches have self control; they are governed naturally and dynamically by client-server interaction. This makes them much easier to manage, ensures freshness, and gives best-case efficiency, in which responses are only invalidated when absolutely necessary. As a result, responses stay cached for the longest possible period, minimising both contact with the origin server and bandwidth consumed.
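To put some rough numbers on that, here's a back-of-envelope comparison of origin fetches under expiration and under invalidation. The timeline is entirely hypothetical: one request every 10 seconds for 10 minutes, three state changes, and a 5 second TTL chosen because that is the shortest gap between two changes (the resource's greatest potential volatility).

```python
def origin_fetches_with_ttl(request_times, ttl):
    """Count origin fetches when cached responses simply expire after `ttl`."""
    fetches = 0
    fresh_until = None
    for t in request_times:
        if fresh_until is None or t >= fresh_until:
            fetches += 1
            fresh_until = t + ttl
    return fetches


def origin_fetches_with_invalidation(request_times, change_times):
    """Count origin fetches when a response stays cached until the state changes."""
    fetches = 0
    cached_at = None
    for t in request_times:
        changed = cached_at is not None and any(
            cached_at < c <= t for c in change_times)
        if cached_at is None or changed:
            fetches += 1
            cached_at = t
    return fetches


# Hypothetical timeline (seconds).
requests = list(range(0, 600, 10))   # 60 requests
changes = [25, 30, 400]              # state changes at the origin

ttl_fetches = origin_fetches_with_ttl(requests, ttl=5)
invalidation_fetches = origin_fetches_with_invalidation(requests, changes)
```

With these made-up numbers the TTL cache goes back to the origin for all 60 requests, while the invalidation-based cache only fetches 3 times.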

So.. why isn't this really used? Because there are common problems when building systems with HTTP that cause the mechanism to fail.


There are other types of problem, and variations on these two; they just happen to be the most common.



In a perfect world resources are granular and don't share state. At all. So - in the perfect world example above, the collection is simply a series of links. This does, however, require any client to make several subsequent requests, one for each item resource. This behaviour is generally considered overly 'chatty' and inefficient, and therefore in the real world clear identification of resources and their state is traded away for network efficiency gains.

This trade-off has consequences for invalidation..


How would an intermediary answer that question?

The right answer should be "None." or at least "I don't know."

Given that URIs are essentially opaque it should, as far as intermediaries are concerned, have no effect. Using assumptions of perceived URI hierarchy is brittle and restrictive - what happens if this item belongs to more than one composite collection?


It's common practice to treat representations as resources in their own right, and expose them with their own URIs. This creates the same kind of situation as the composite resource problem in which these 'representation resources' share state invisibly.

You could propose to solve this by making assumptions using 'dot notation'; however, this again ignores the opacity of URIs and is brittle and restrictive.

There are other examples in which the existing approach to invalidation is made impossible, but they all revolve around the same core problem:


Composite and split resources are problems because they reduce visibility.

Resources share state, and are therefore dependent, but the uniform interface lacks the capability to express this as control-data; and it is therefore not visible to intermediaries.


Link headers can be used to "beef up" the uniform interface by expressing these invisible dependencies as link relations.

Standardising the link relations allows these links to be used as control data within the uniform interface; thus increasing self-descriptiveness of messages and visibility.

This is named "Link Header-based Invalidation of Caches" (LHIC). There are two types of LHIC:


LHIC-I is a simple mechanism that can be thought of as "pointing out" affected resources in the response. This gives the origin server dynamic control over the invalidation.

In order to secure the LHIC-I mechanism from DoS attacks in which any/all cached objects could be indicated for invalidation, it is likely that a same-domain policy would have to be adopted.

The purpose of this type of link relation is to simply increase visibility of the invalidating interaction itself.
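Here's a rough sketch of how a gateway cache might apply LHIC-I, including the same-domain policy. Several things here are assumptions: the relation name "invalidates" (the post doesn't pin down which relation belongs to LHIC-I), the deliberately simplified Link header regex (not a full parser), and the example.org URLs and function names.

```python
import re
from urllib.parse import urljoin, urlsplit

# Simplified matcher for  <target>; rel="name"  pairs in a Link header.
LINK_RE = re.compile(r'<([^>]*)>\s*;\s*rel="?([^";,]+)"?')


def linked_targets(link_header, rel):
    """Return the targets of all links carrying the given relation."""
    return [target for target, rels in LINK_RE.findall(link_header)
            if rel in rels.split()]


def invalidate_pointed_out(store, request_url, status, link_header,
                           rel="invalidates"):
    """Drop cached entries pointed out by the response to a non-safe request."""
    if not (200 <= status < 400):
        return []                                  # not an invalidating response
    origin = urlsplit(request_url)[:2]             # (scheme, host)
    candidates = [request_url] + [
        urljoin(request_url, t) for t in linked_targets(link_header, rel)]
    dropped = []
    for target in candidates:
        if urlsplit(target)[:2] != origin:
            continue                               # same-domain policy
        if store.pop(target, None) is not None:
            dropped.append(target)
    return dropped


store = {
    "http://example.org/orders": "collection",
    "http://example.org/orders/1": "item",
    "http://other.example/home": "someone else's page",
}
dropped = invalidate_pointed_out(
    store, "http://example.org/orders/1", 200,
    '</orders>; rel="invalidates", <http://other.example/home>; rel="invalidates"')
```

The item and the collection it was pointed at are both dropped, while the cross-domain target is ignored.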


LHIC-II is a more complex mechanism that can be thought of as dependent resources "latching on" to one another. This effectively creates a 'dependency graph' within a gateway cache which can be queried against each invalidating request. LHIC-II is therefore capable of allowing invalidation to cascade along a chain of dependencies, whereas LHIC-I is only capable of handling first level dependencies.

The purpose of this type of link relation is to increase overall visibility in the system, ahead of an invalidation taking place.
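The 'dependency graph' idea can be sketched like so. Again, this is only an illustration under assumptions: the DependencyGraph class, its method names and the URIs are hypothetical, and it assumes the cache records each response's declared dependencies (for example from a Link header with a relation such as "invalidated-by") at the time the response is stored.

```python
from collections import defaultdict, deque


class DependencyGraph:
    """Tracks which cached URIs depend on which others."""

    def __init__(self):
        # uri -> set of URIs that have "latched on" to it
        self.dependants = defaultdict(set)

    def record(self, uri, depends_on):
        """Note that the cached response for `uri` depends on each given URI."""
        for target in depends_on:
            self.dependants[target].add(uri)

    def affected_by(self, uri):
        """Everything to invalidate when `uri` changes, following the chain."""
        seen = {uri}
        queue = deque([uri])
        while queue:
            current = queue.popleft()
            for dependant in self.dependants.get(current, ()):
                if dependant not in seen:          # also guards against cycles
                    seen.add(dependant)
                    queue.append(dependant)
        return seen


graph = DependencyGraph()
graph.record("/orders", ["/orders/1", "/orders/2"])   # collection embeds items
graph.record("/dashboard", ["/orders"])               # dashboard embeds collection

affected = graph.affected_by("/orders/1")
```

A change to /orders/1 cascades to /orders and on to /dashboard, which is the second-level dependency that pointing out targets in a single response can't reach.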



LHIC-II does not suffer the drawbacks of LHIC-I, however it is more complex to implement and does not allow dynamic control of invalidation by origin servers.

The optimal approach is to implement both methods; this allows for both dynamic control by origin servers, and cascading invalidation.

Conclusion..

LHIC injects lost visibility back into the web.

The resulting caching mechanism:
  • Is very efficient
  • Ensures freshness
  • Is easily managed
  • Leverages existing specs

Thoughts, comments, suggestions all welcome! :)

Wednesday, April 28, 2010

My slides from #wsrest2010

On Monday, I gave a presentation on the accepted paper that Michael Hausenblas and I wrote for ws://rest.2010 @ WWW2010.

Here are the slides:


I'll write a proper post about our paper once I'm back in the UK.