Royal Pingdom

Ramblings from the Pingdom team about the Internet and web tech

The major Internet outages so far in 2008

Every day brings a new set of outages on the Internet. Websites go down, online services run into trouble, networks have glitches, and so on. When a lot of users are affected, these outages make the news and set the blogosphere abuzz. We here at Pingdom work with downtime-related issues every day and probably spend more time reading about these things than most, so we decided to sum up the year so far for your convenience, and add some analysis of our own in the process.

These are 14 (not 13, that would be bad luck! ;) ) of the more notable Web- and Internet-related outages and incidents so far in 2008. We chose outages that have either affected a lot of people, or have other implications that we deemed important to highlight.

One thing that the following examples clearly show is that no one is immune to downtime. Not Google, not Microsoft, and not Apple. In addition to this, sometimes whole parts of the Internet itself simply break.

SaaS availability problems

  • Gmail trouble (Mar 1, Mar 12, Mar 27, Aug 11) – Google has had numerous difficulties with its Gmail service this year. Often only a subset of the users have been affected, but they have made themselves heard, and there has been plenty of big news coverage which should not be surprising considering how many people use the service.
  • Google Apps trouble (Jul 8, Aug 7) – Problems with Google Apps (including Google Docs) caused grievances for users on at least two occasions. The noise generated by these outages, although they were relatively short-lived, gives an indication of what would happen if Google Apps went down for a longer period of time, say a day. We suspect the roar would be heard all over the Internet.

Glitches in the cloud

  • Amazon S3 outages (Feb 15, Jul 20) – These outages are significant because AWS has become somewhat of a poster boy for cloud computing, so every time S3 (or EC2) has a problem, “the cloud” is called into question. And of course, another reason it gets a lot of attention is that a lot of services use Amazon S3, so just as when a hosting company or data center has an outage, a lot of sites are affected. The one on July 20 lasted eight hours.
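
To put an outage like the eight-hour one on July 20 into perspective, it can help to express it as availability over a period. The snippet below is only a back-of-the-envelope sketch: the function name is our own, and it assumes the eight hours were the only downtime in a 31-day July, which the list above does not claim.

```python
def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Return availability over a period as a percentage."""
    if period_hours <= 0:
        raise ValueError("period_hours must be positive")
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime_hours must be between 0 and period_hours")
    return 100.0 * (period_hours - downtime_hours) / period_hours


# Assumption: a 31-day month with a single eight-hour outage.
JULY_HOURS = 31 * 24  # 744 hours

july = availability_percent(8, JULY_HOURS)
print(f"{july:.2f}%")  # prints 98.92%
```

Under those assumptions, a single eight-hour incident is enough to drop a month below 99% availability.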

Launch issues

  • Apple’s MobileMe outages (Jul 10-11) – When Apple was migrating .Mac accounts to the new MobileMe, things did not go as smoothly as the company would have wished. Steve Jobs later admitted (in a leaked email) that Apple made a mistake launching MobileMe, the iPhone 3G, the iPhone 2.0 software and App Store all on the same day, and that MobileMe should have been given more time and testing.
  • iPhone activation woes (Jul 11) – Not only did Apple have problems with the MobileMe launch, but many who bought the brand new iPhone 3G couldn’t activate their phones because the iTunes activation servers were overloaded.
  • Cuil launch trouble (Jul 28) – When Cuil launched its much-hyped “Google killer,” it crashed. They had built up a big buzz, and then weren’t able to handle the number of visitors they got. Now there are also rumors that Cuil itself is causing downtime for some sites due to excessive crawling by its indexing bot.
  • Microsoft Photosynth launch trouble (Aug 21) – The website of Microsoft’s new 3D photo-stitching service couldn’t handle the load on its launch day. Microsoft seems to consider the initial downtime as a badge of honor and promised to quickly add more horsepower, but as Michael Arrington pointed out over at TechCrunch: “We see similar optimistic responses to server failure all the time from startups. Except they’re startups.”

Internet and data center issues

  • Mediterranean submarine cable cuts (Jan 30) – A pair of cut submarine telecom cables in the Mediterranean just north of Egypt caused severe Internet outages and disruptions in the Middle East, Pakistan and India. This incident reminded us all that ultimately the Internet is greatly dependent on the physical cables that interconnect the various networks that make up the Internet. (Renesys has a detailed description of the effects of the outage that you may want to check out, including a map of the most affected areas.) Further cable incidents in the same region followed, sparking various conspiracy theories.
  • Explosion and fire at The Planet (Jun 1) – Probably the most massive data center outage of the year, an explosion and electrical fire at one of The Planet’s data centers in Houston affected thousands of sites (around 9,000 servers), some for several days. The fire department’s initial refusal to let The Planet activate its backup power generators didn’t exactly help. In addition to this, services that depended on DNS servers located in that data center were also affected.

Service interruptions

  • Blackberry service outages (Feb 11, Jun 18) – The Blackberry addicts (some have given it the nickname “Crackberry”) have had two major service outages so far this year to fret about. Considering that many Blackberry users are business users, this could have had a real effect on some businesses’ ability to communicate.
  • Netflix problems (Mar 24, Aug 11-15) – In March the online video-rental service (with seven million subscribers) suffered from a technical glitch that took out its website and logistics system for about 12 hours. In August a system problem prevented the service from delivering DVDs for several days.

Third-party “sabotage”

  • The YouTube IP hijacking (Feb 24) – YouTube was unavailable for roughly two hours because an ISP, Pakistan Telecom, had mistakenly claimed part of YouTube’s IP address space (including the IP addresses used by YouTube’s DNS servers). This effectively took YouTube offline in a matter of minutes. This is interesting because it proved that a single ISP can, under the right (or wrong!) circumstances, inadvertently sabotage parts of the entire Internet.
  • Revision3 taken down by MediaDefender DDoS attack (May 24-26) – Anti-piracy company MediaDefender crippled the Revision3 Internet TV service for most of the Memorial Day weekend with an attack on the service’s (legal) BitTorrent tracking server, bombarding Revision3’s network with up to 8,000 SYN packets a second. Revision3 has posted a lengthy explanation of the incident on their blog.
  • SiteMeter crashing blogs (Aug 2) – An update to SiteMeter’s script (websites can have it included on their pages to get visitor statistics) started crashing popular blogs like Gawker, Lifehacker, Gizmodo and Valleywag for Internet Explorer users. Presumably every single website using SiteMeter had this problem. This is significant because it shows that third-party apps and scripts can quite easily stop a whole site from working.

Summary

This is a summary of some key points that we feel were highlighted by the various outages we have listed above.

  • Capacity problems during launches – There seems to be a trend involving new services building up hype and then launching a new product only to be unable to handle the amount of traffic the service gets. These are often pure scaling issues, for example when people couldn’t activate their new iPhones or when Microsoft’s Photosynth proved to be “more popular than expected.” That this happens for small startups that have limited resources is not surprising, but for big companies like Apple to stumble like they did with MobileMe and the iPhone activation is a bit unexpected.
  • The importance of DNS – Several of these issues also underline how critical DNS servers are on the Internet. Lose control of your DNS servers, and you lose control of your site, so spread them out. Don’t have them all in one place! For example, after the YouTube incident described above, YouTube has added additional DNS servers on another network to prevent something similar from happening to them again.
  • Infrastructure problems are affecting “the cloud” – It may not be fair to single out Amazon, but their S3 service is the most-used “cloud” utility service out there so they always come up when cloud computing is discussed. The problems they have had so far have been various issues with their infrastructure, but hopefully these problems will diminish as the technology matures. Arguably you could also cite the problems that Google has had with Gmail and Google Apps as “issues with the cloud”.
  • Third-party interference – Several of these outages show how vulnerable services and websites are to “external influence”. The most recent incident with the SiteMeter script is highly interesting, because in the mashup world of Web 2.0, websites are stacked with third-party scripts and applications. Any one of these may be the link that breaks the chain. The more you have, the higher the risk of them affecting your site in some way (performance or downtime).
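
The advice about spreading out DNS servers can be turned into a quick sanity check. The sketch below groups a site’s name servers by the network their IPv4 addresses fall into and flags the case where they all share one. The host names and addresses are hypothetical, and grouping by /24 is only a crude stand-in for “a different network”; real diversity also depends on the provider and the physical location.

```python
import ipaddress
from collections import defaultdict


def group_by_network(nameservers: dict[str, str], prefix_len: int = 24) -> dict[str, list[str]]:
    """Group name server host names by the IPv4 network their address belongs to."""
    groups = defaultdict(list)
    for host, ip in nameservers.items():
        # strict=False lets us derive the enclosing network from a host address.
        network = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        groups[str(network)].append(host)
    return dict(groups)


def all_in_one_network(nameservers: dict[str, str], prefix_len: int = 24) -> bool:
    """Return True if every name server sits in the same network."""
    return len(group_by_network(nameservers, prefix_len)) < 2


# Hypothetical name servers using documentation address ranges.
risky = {"ns1.example.com": "192.0.2.10", "ns2.example.com": "192.0.2.20"}
spread = {**risky, "ns3.example.com": "198.51.100.5"}

print(all_in_one_network(risky))   # True: both servers share 192.0.2.0/24
print(all_in_one_network(spread))  # False: ns3 is in a second network
```

A check like this will not catch every shared point of failure, but it does catch the “all in one place” case described above.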

Downtime happens to everyone sooner or later. It’s simply a matter of fact on the Internet. However, by learning from the mistakes (or bad luck) of others and being well aware of what can go wrong, it’s possible to take a proactive approach and minimize the risk of future downtime, and when it happens, at least keep it short.

By the way, if you feel that we missed some major (or interesting) outages, please let us know in the comments!

3 Comments

You forgot the poster boy of downtime. Twitter.
Also the problem of Firefox3 launch

Don’t forget the Cogent/Telia peering dispute, although this wasn’t an outage affecting everyone then this was a serious issue for a lot of people single-homed to either Cogent or Telia.

that little yellow icon should be the pigdom iphone app icon!
