Major Google App Engine hiccup reveals weaknesses

Google’s App Engine suffered from increased data access latency and errors yesterday, including problems serving applications. According to TechCrunch, the problems lasted for approximately six hours.

From the App Engine status page:

On July 2nd, all applications experienced increased error rate and latency with read and write Datastore and memcache operations, as well as some serving errors. Datastore access and serving have been fully restored as of 12:25 PM PDT.

In a longer explanation posted later by the App Engine team, the problem was apparently due to an issue with GFS (the Google File System) in one of App Engine’s datacenters. This in turn broke Bigtable, which App Engine’s Datastore depends on and also caused the application serving problems (in plain English: causing actual application downtime, not just slowdown).

The increased latency is clearly visible on the status page:

In Google’s defense, system problems like this do happen, and App Engine is no exception. Still, six hours is a significant period of time and is sure to have been very frustrating to those who host their applications on the service (and equally frustrating to the people trying to use those applications!).

The cloud weakness

Whenever something like this happens, it clearly reveals one of the big drawbacks with cloud computing: Cloud computing services become the single point of failure for all applications depending on that service. Therefore any service downtime will have a wide impact.

From many companies’ perspective another drawback is also relying on an external service for something that may be business critical, but that is a discussion for another day.

But now on to something that really surprised us:

The App Engine status page went missing in action

The App Engine issue revealed a weakness in the way Google has set up its system status page.

It’s good that they have one, but while yesterday’s problems lasted, the App Engine status page was unavailable, something that surprised us here at Pingdom a great deal.

Why? Because normally it’s good practice to have the status page for a service completely separated from the service’s infrastructure and placed in a different datacenter, which apparently wasn’t the case here. Considering the scale of Google’s operation, we’re not sure why Google hasn’t done this.

Here is what Google had to say about it:

Many users noted that the System Status site was also down. The System Status site is hosted separately from App Engine applications, and is not typically affected by availability problems. However, due to the low level problem with GFS in this case, the System Status site was also affected.

The team ended up posting status updates via Google Groups and Twitter.

Note that they said that the status page is “hosted separately from App Engine applications,” which is good, but it shouldn’t depend on any of the infrastructure that the App Engine uses. Google really should think about adding another level of separation.

1 comment

Leave a Reply

Comments are moderated and not published in real time. All comments that are not related to the post will be removed.