Major Google App Engine hiccup reveals weaknesses
Google’s App Engine suffered from increased data access latency and errors yesterday, including problems serving applications. According to TechCrunch, the problems lasted for approximately six hours.
From the App Engine status page:
On July 2nd, all applications experienced increased error rate and latency with read and write Datastore and memcache operations, as well as some serving errors. Datastore access and serving have been fully restored as of 12:25 PM PDT.
In a longer explanation posted later by the App Engine team, the problem was apparently due to an issue with GFS (the Google File System) in one of App Engine’s datacenters. This in turn broke Bigtable, which App Engine’s Datastore depends on and also caused the application serving problems (in plain English: causing actual application downtime, not just slowdown).
The increased latency is clearly visible on the status page:

In Google’s defense, system problems like this do happen, and App Engine is no exception. Still, six hours is a significant period of time and is sure to have been very frustrating to those who host their applications on the service (and equally frustrating to the people trying to use those applications!).
The cloud weakness
Whenever something like this happens, it clearly reveals one of the big drawbacks with cloud computing: Cloud computing services become the single point of failure for all applications depending on that service. Therefore any service downtime will have a wide impact.
From many companies’ perspective another drawback is also relying on an external service for something that may be business critical, but that is a discussion for another day.
But now on to something that really surprised us:
The App Engine status page went missing in action
The App Engine issue revealed a weakness in the way Google has set up its system status page.
It’s good that they have one, but while yesterday’s problems lasted, the App Engine status page was unavailable, something that surprised us here at Pingdom a great deal.
Why? Because normally it’s good practice to have the status page for a service completely separated from the service’s infrastructure and placed in a different datacenter, which apparently wasn’t the case here. Considering the scale of Google’s operation, we’re not sure why Google hasn’t done this.
Here is what Google had to say about it:
Many users noted that the System Status site was also down. The System Status site is hosted separately from App Engine applications, and is not typically affected by availability problems. However, due to the low level problem with GFS in this case, the System Status site was also affected.
The team ended up posting status updates via Google Groups and Twitter.
Note that they said that the status page is “hosted separately from App Engine applications,” which is good, but it shouldn’t depend on any of the infrastructure that the App Engine uses. Google really should think about adding another level of separation.

The New England Patriots held what seemed to be a commanding lead (17-15) with five minutes left of Super Bowl XLVI last night. But the New York Giants came back and managed to win with 21-17.
As Super Bowl 46 is approaching, fans will flock to the Lucas Oil Stadium in Indianapolis, Indiana, and to TV sets around the world to follow the New York Giants battle it out with the New England Patriots.
Every Friday we bring you a collection of links to places on the web that we find particularly newsworthy, interesting, entertaining, and topical. We try to focus on some particular area or topic each week, but in general we will cover Internet, web development, networking, performance, and other geeky topics.h
Out of the
Pingdom’s Mobile Podcast is a weekly show about Internet, web, and mobile stuff. 


Judy
July 8th, 2009 at 1:49 pm
I am tying to imagine how some people might have been frustrated by that