For years I had Apache serving up Status Shuffle. It wasn’t perfect, but it worked. In fact, it worked so well that it handled a million users a day on one box with plenty of room to spare. However, in late 2011, Facebook started requiring HTTPS support from all its publishers, us included. We bought an SSL certificate, made the necessary configuration changes, then restarted Apache. It all seemed to work as planned.
Over the next few days we watched our server logs closely and discovered that our error rate had gone up. It seemed that the extra overhead caused by the SSL negotiation step was enough to dramatically increase the failure rate for some of our users (probably those with unstable Internet connections). Ideally, we would use HTTP keep-alives to allow everyone to open the connection once and make multiple requests through it, thereby offsetting the SSL negotiation overhead. Alas, trying to hold thousands of connections open with Apache’s prefork MPM was a sure way to eat up all the RAM in our box. Instead, I decided to put up an nginx box in front of Apache in a reverse proxy setup.
The idea is simple and well documented. You use nginx to handle connections with the end users. nginx in turn calls Apache to actually do the work, then finally hands off the data to the end user. nginx can hold thousands of connections open, so keep-alives are no problem. (Note: Apache 2.4 was just released and it can natively do all of this using the “event” MPM. Alas, “event” doesn’t support SSL connections, so we’ll be sticking with this nginx setup for now.)
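To make the setup concrete, here is a minimal sketch of such a reverse proxy. It is not our production configuration: the server name, certificate paths, and the local port Apache listens on are placeholders chosen for illustration.

```nginx
# Sketch only: server name, certificate paths, and backend port are placeholders.
server {
    listen 443 ssl;
    server_name example.com;

    ssl_certificate     /etc/nginx/ssl/example.com.crt;
    ssl_certificate_key /etc/nginx/ssl/example.com.key;

    # Keep client connections open so the SSL handshake is paid once
    # and then reused for several requests.
    keepalive_timeout 65s;

    location / {
        # Apache listens on a local port and does the actual work.
        proxy_pass http://127.0.0.1:8080;

        # Pass the original host and client details through to Apache.
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
    }
}
```

With a layout like this, nginx holds the cheap idle connections while Apache only sees short-lived local requests.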
Redmond, We Have a Problem
At first we didn’t realize anything was wrong. As I mentioned above, the number of connection issues we were logging went down dramatically, so we were very happy. But then the complaints started: “It doesn’t work” and “When I load Status Shuffle it looks funny”. When more and more complaints started piling up, we couldn’t ignore them anymore, even though the app worked perfectly on every machine and browser we could get our hands on.
A Google search regarding the “Invalid character” error failed to add clarity. Some said this error would be returned if IE failed to download the file at all. Some implicated an old (now fixed) bug in IE 6 where it wasn’t properly decoding gzip’ed web pages. But that couldn’t be it, because we were seeing the issue in Internet Explorer versions up to 9 (the latest version at the time of writing).
User Visible Symptoms
We finally caught a break when a user who reported the issue agreed to a remote session with us. Over the course of an hour we poked around her computer, trying to figure out what was wrong. The strangest thing of all was that she shared the computer with her husband, and when she logged in through his account, the problem didn’t exist. I concluded that there must be a corruption somewhere in her local user registry settings, but couldn’t tell what it was in the time I had with her.
Piecing It Together
I had a theory that nginx’s proxy was causing the issue. We quickly confirmed this by temporarily taking nginx out of the loop and hitting Apache directly. The errors stopped coming in.
Scanning the nginx documentation for the proxy_cache directive revealed this innocuous-looking sentence:
nginx does not handle “Vary” headers when caching.
The Vary header is used to tell caching proxies that a response is tied to the value of a particular request header. For example, when your browser requests a web page, it will tell the web server that it will accept (understand) compressed results by sending a request header such as:
Accept-Encoding: gzip, deflate
The web server will then happily compress the page and return it with (at least) the following two response headers:
Content-Encoding: gzip
Vary: Accept-Encoding
The first means that the response is compressed using gzip. The second means: This response is only valid for requests that have the same Accept-Encoding value that you just sent me.
As stated in the documentation, nginx doesn’t handle the Vary response header. The sensible thing to expect from nginx is that it would not cache responses that contain Vary. Instead, what nginx does is cache the result of the first request and serve it to everyone, even if they don’t have the same request header.
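The behaviour is easier to see with the cache key written out. The sketch below is an illustration rather than our actual configuration: the zone name, cache path, sizes, and backend address are placeholders. The key shown is nginx's documented default for proxy_cache_key.

```nginx
# Sketch only: zone name, cache path, sizes, and backend address are placeholders.
# proxy_cache_path belongs in the http block.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=app_cache:10m max_size=1g;

server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_cache app_cache;
        proxy_cache_valid 200 5m;

        # This is the default key, spelled out for clarity. Nothing in it
        # depends on Accept-Encoding, so a request that accepts gzip and one
        # that does not map to the same cache entry.
        proxy_cache_key $scheme$proxy_host$request_uri;
    }
}
```

If the first request for a URL comes from a client that accepts gzip, the compressed body from Apache is what gets stored under that key, and every later client asking for the same URL receives it.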
I assume this is what was happening with IE for all those people. I estimate that 0.5%-1% of our Internet Explorer users were affected. If Status Shuffle didn’t have the massive request volume that it does, we would have probably never caught on to this.
I definitely consider this to be an nginx bug. But at least there is an easy fix. We moved all compression away from Apache and into nginx.
This means that in Apache, we removed the directives that enabled compression.
And in nginx we added:
gzip_disable "MSIE [1-6]\.";
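For context, a fuller version of the nginx side might look like the sketch below. Only the gzip_disable line is taken from the configuration shown above; the MIME type list, compression level, and backend address are assumptions for illustration, as is the proxy_set_header line that stops the backend from ever seeing Accept-Encoding.

```nginx
# Sketch only: everything except gzip_disable is an illustrative assumption.
gzip on;
gzip_comp_level 5;
# text/html is always compressed by nginx, so it is not listed here.
gzip_types text/plain text/css text/xml application/json application/javascript application/x-javascript;
gzip_disable "MSIE [1-6]\.";

server {
    listen 80;

    location / {
        proxy_pass http://127.0.0.1:8080;

        # An empty value means the header is not passed to Apache, so the
        # backend always answers uncompressed. The cache then holds one plain
        # copy and nginx compresses it per request, based on what each
        # client says it accepts.
        proxy_set_header Accept-Encoding "";
    }
}
```

The design point is that compression happens after the cache rather than before it, so the stored copy is valid for every client.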
We now have no more errors and no more angry users.
Update Feb 29, 2012: I filed a bug against nginx 1.0.12 on the nginx bug tracker. Maxim Dounin of nginx offered two interesting ways of working around the issue, so please follow the bug report link to read his comment.