Pinterest failing miserably on the backend

Yesterday I was messing around with Pinterest to see how they do their Facebook integration.  I set up a test account and went about my way.  This morning, I started getting emails from them, so I clicked on “Change your email preferences” with the intention of unsubscribing.

What greeted me was the 502 error you see to your right.  502 is the HTTP status code for “Bad Gateway”.  Strictly, it means a gateway or proxy received an invalid response from the server behind it, which suggests some sort of networking problem on their side.  But that’s not the case.  Having seen similar failure scenarios years ago when I coded Status Shuffle, I had a pretty good guess of what the problem was.

Let me give you a hint: I switched to my other browser, loaded the same link, and it worked like a charm.

Here’s another hint: It has nothing to do with the browser itself; this was a failure on their back end.

Let me tell you a little story.  Towards the end of 2009, Status Shuffle was growing like crazy.  It grew from 100K to 1M daily users in just a little over a year, and a lot of crazy problems started popping up.  When you have so many users, “rare” problems become not so rare anymore.  I added a lot of instrumentation to the error logs to try to figure out where things were going wrong.

One error that kept appearing in the logs was this:

Conflict in Facebook UID: user_id=xxxx  FB user_id (live)=yyyy  FB user_id (db)=zzzz

Meaning: Our user xxxx, who we think is Facebook user zzzz, is actually being reported by Facebook as being yyyy.

“Obviously,” I thought, “we have a bug”.  But the code looked good and I couldn’t initially reproduce this problem.

Then it dawned on me: Some people share their computers.  And if they share their computer, then the following is likely:

  1. Alice logs on to Facebook.
  2. Alice installs our app.
  3. Our app creates a user for Alice, and stores her Facebook ID.
  4. Bob wants to use Facebook.  He logs Alice out, then logs in.
  5. Bob follows a link to our app.
  6. We have a session cookie, left over from before, saying “this is Alice’s browser”.  We ask Facebook to confirm, but they say: “No, this is Bob”.

The solution was simple: If you detect such a conflict, delete the backend session and start over.  Reauthorize the user if necessary.
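Here is a minimal sketch of that check in Python.  It is not the actual Status Shuffle code: the `sessions` dictionary, the `check_session` function and its return values are all hypothetical stand-ins for whatever session store and auth flow you really use.

```python
from typing import Dict


def check_session(sessions: Dict[str, str], session_id: str, live_fb_uid: str) -> str:
    """Compare the Facebook UID stored for a session with the one Facebook reports now.

    `sessions` is a hypothetical in-memory store mapping session id -> Facebook UID.
    Returns "ok", "no_session", or "conflict".
    """
    stored_fb_uid = sessions.get(session_id)
    if stored_fb_uid is None:
        # Nothing on record for this cookie: the caller should run the normal login flow.
        return "no_session"
    if stored_fb_uid != live_fb_uid:
        # The cookie says Alice, Facebook says Bob: drop the stale session and start over.
        del sessions[session_id]
        return "conflict"
    return "ok"


if __name__ == "__main__":
    sessions = {"cookie-123": "alice_fb_uid"}
    print(check_session(sessions, "cookie-123", "bob_fb_uid"))
```

On "conflict" the caller would treat the request as if it came from a brand-new visitor and reauthorize if needed, rather than trusting either ID.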

And back to Pinterest: I’m hoping you figured out by now that the key to the 502 error was the fact that I was using a test account the day before.  Their email link contained some sort of identification code for that test account.  But when it opened in my default browser (where my “real” account was logged in), there was a back-end conflict and a spectacular failure.


Strange Errors to Expect at Scale

Let me sum up a few strange errors you’d expect if you hit this kind of scale.

  1. Multiple users on the same computer: As described above, if not coded correctly, this could cause your authentication to get out of whack.  Pro tip: This also happens on mobile devices (e.g. a family will share an iPad and log in and out of their respective Facebook accounts).
  2. Blocking Gateways: Some people work in friendly places with strict policies about what kind of content you can view.  Their IT departments go so far as to automatically block anything they deem inappropriate.  In our particular case, AJAX calls that returned user content with swear words were being blocked by the gateway.  Instead of our data, the gateway returned an HTML page stating that the content was blocked.  This was obviously not what the app was expecting and chaos ensued.  (BTW, we found this out using interesting logging techniques that I’ll hopefully have time to discuss in a future post.)  A possible solution is to use HTTPS so the gateway can’t read your content, though it’s not unreasonable to assume that “helpful” IT departments with gateways like this are not above installing enterprise certificates so they can parse (read: spy on) all incoming and outgoing traffic.  In our case, we used good ol’ rot13 to scramble the content.  Problem gone.
  3. Random Network Failures:  Even though we used a CDN, it was very common for a user’s browser to randomly fail to load resources.  For example, it would load only 3 out of 4 JavaScript files.  This problem was made worse when we switched all content to HTTPS, as the SSL handshake adds a few more network round-trips.  My solution was to detect if all critical resources loaded, and if not, automatically reload the page.  Just make sure not to get in an infinite reload loop!
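
To make the rot13 trick from item 2 concrete, here is a rough sketch in Python.  The function names and the JSON payload shape are my assumptions for illustration, not what we actually shipped.  Keep in mind that rot13 is obfuscation, not encryption; it only stops a keyword filter from recognizing the words.

```python
import codecs
import json
from typing import Any, Optional


def encode_payload(data: Any) -> str:
    """Serialize to JSON, then rot13 the text so keyword filters don't match it."""
    return codecs.encode(json.dumps(data), "rot_13")


def decode_payload(body: str) -> Optional[Any]:
    """Reverse encode_payload; return None if the body isn't our scrambled JSON.

    A gateway's "content blocked" HTML page ends up here as None instead of
    crashing the caller.
    """
    try:
        return json.loads(codecs.decode(body, "rot_13"))
    except ValueError:
        return None


if __name__ == "__main__":
    wire = encode_payload({"status": "some user text"})
    print(wire)
    print(decode_payload(wire))
    print(decode_payload("<html>Blocked</html>"))
```

Returning `None` for anything that doesn’t decode gives the app one place to notice that a response was swapped out in transit, instead of failing somewhere further down.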

I’ll update this list as I recall more.