« Smerity.com

Where did all the HTTP referrers go?

Good (and bad news): the general consensus in the web developer community is that any and every website should be HTTPS by default. Why? HTTP by itself isn't encrypted, leaving it open to eavesdropping, message tampering, and man-in-the-middle attacks. HTTPS, if you use it consistently, prevents these issues.

So how can that possibly be bad news? HTTPS is confusing one of the core metadata tools of the Internet: HTTP Referrers. HTTP Referrers disappear when going from HTTPS to HTTP, but, more worryingly, sensitive HTTPS Referrers still get carried when going from HTTPS to HTTPS. Most secure applications aren't aware of where their HTTP Referrers do or don't go. Don't worry though: there's hope. Or at least meta hope.

define(HTTP Referrers)

Imagine you're on Reddit and click on a link to homakov.blogspot.ru. The server at homakov.blogspot.ru knows that you came from reddit.com. It even knows what specific webpage you came from -- in this case, a Reddit search page on /r/netsec searching for homakov.

So what does this look like? When your browser wants a page, it sends a GET request with a number of HTTP headers. One of these headers is the HTTP Referrer and informs server B that you came from a link at server A.

GET http://homakov.blogspot.ru/2013/05/csrf-tool.html HTTP/1.1
Referer: http://www.reddit.com/r/netsec/search?q=homakov
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3
User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.31 (KHTML, like Gecko) Chrome/26.0.1410.63 Safari/537.31
Connection: keep-alive
Cookie: session="x1yc3OWM2ZTdhNNzk1YWY5NDk0MTczNTEKc=="; csrftoken="6QjAl18WY3NyZgpNsHpEKotZNfEtzSLocHRm";
Host: homakov.blogspot.ru
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.8
Accept-Encoding: gzip,deflate,sdch

The second line, Referer: http://www.reddit.com/r/netsec/search?q=homakov, is the HTTP Referrer. It's these HTTP Referrer fields that are tracked by analytics tools such as Google Analytics and ChartBeat.

HTTPS kills HTTP Referrers

HTTPS turns off HTTP Referrers to HTTP websites. Why? According to RFC 2616 (HTTP 1.1), this is due to the possibility of sensitive information being encoded in a referring URL:

Clients SHOULD NOT include a Referer header field in a (non-secure) HTTP request if the referring page was transferred with a secure protocol.

This seems quite reasonable at first glance. Unfortunately, this hard set rule doesn't transition well into a future where almost every website uses HTTPS.

Why? The HTTPS Referrer, which may contain sensitive data, will be sent from any HTTPS website to any other HTTPS website by default. It's only when a connection goes from HTTPS to HTTP that the referrer is dropped.

This leaves two problematic situations:

  • HTTP websites don't receive referrers from HTTPS websites -- all traffic appears as direct traffic
  • HTTPS websites will send referrers to any other HTTPS website even if it contains sensitive information

The first situation means we lose any understanding of where traffic is coming from, the second situation leads potentially to security vulnerabilities or information leaks. Essentially, if a HTTP website links to another HTTP website, the author of the secure page is lending extra trust just as it's HTTPS. In most cases, this is not what was intended.

Fixing Referrers in HTTPS: The Meta Referrer

So what are the possible situations with our HTTP Referrers? If we could somehow tell the HTTP Referrer to act in a particular way, what different behaviours would you like? What do we want to be possible? We'll label these different situations with names.

  • Never: Our URLs are sensitive and we don't want to tell the world about them. We want the HTTP Referrer dropped (i.e. default behaviour for HTTPS).
  • Origin: I run a search engine and want people to know the traffic came from me, but not which search queries were used to reach your site.
  • Default: I run Facebook and want to pass the HTTP referrer for all requests unless it goes from HTTPS to HTTP (i.e. the situation where previously hidden information might be disclosed).
  • Always: A blog hosted over HTTPS might wish to link to a blog hosted over HTTP and receive trackback links.

These cases are covered under a new HTML5 called the meta referrer. Now a simple tag can be used, such as <meta name="referrer" content="always">, to specify the exact behaviour of the HTTP Referrer regardless of whether we're using HTTP or HTTPS.

With that said, not all web browsers support this new fangled HTML5 meta referrer. If a web browser doesn't support this flag, it'll just fall back to the standard for HTTP or HTTPS. Which browsers support these?

Web Browser Meta Referrer Support
Google Chrome Yes
Safari Yes
Firefox No (In Progress)
Internet Explorer No

Well, okay, that's fine, the world improves. Let's say in a year they'll get it all sorted. What popular websites are using meta referrer?

Website Meta Referrer
https://www.facebook.com Yes: Default
https://www.google.com Yes: Origin
Reddit No: HTTP and no meta referrer
Hacker News No: HTTPS with no meta referrer

Add the metatag, save the world

So, here's my little request.

If you run a website over HTTPS, add in an appropriate meta referrer. If it's a secure application, nix the referrers by setting it to _Never_. If the Internet would benefit from knowing you sent them traffic, allow those referrers for everyone.


Appendix

Tangent: Google Knows

Considering Google Analytics is used for 57.23% of the top million websites on the Internet (572,300 in case your math failed you), they're actually in an excellent position to track and understand the flow of the Internet even without these referrers in place. Assuming that a user is only browsing amongst the top million websites, not only do they know how a user interacts on a website with Google Analytics but they're likely able to track when a user goes from site A to site B as site B likely has Google Analytics. Even if it didn't, they know the user didn't go to one of the other Google Analytics backed websites, so the user either ended up closing the tab or going to one of the links on the page that isn't backed by Google Analytics. This is all valuable information they could use as it's a form of limited PageRank -- in PageRank you assume the user is a random link clicking bot, whilst in this situation you've removed a great deal of the ambiguity.

Combine this with Google's stranglehold on how people reach web pages (Google Search) and their access to the valuable search queries, they really are in an enviable position.

Tangent: Referrer vs Referer

The correct spelling is referrer (two arrs) but you'll see it everywhere as referer (minimal arrs). Why? A small accident of history. Every reference to referrer is spelt referer in RFC 1945 (HTTP 1.0). By the time it was picked up, it was a little too late.

Useful Links