
About yesterday’s Pingdom outage

As you may have noticed if you’re a Pingdom user, we had problems yesterday. In this post we will do our best to explain what happened and why, and how we will learn from this going forward.

And let us also take this opportunity to sincerely apologize for any inconvenience this may have caused you. It’s super-important for us to provide you with a quality service, and those of you who have been with us for a long time hopefully know that we deliver on that promise. This was an extreme, highly unusual scenario.

What happened

Yesterday at 3:29 p.m. CET (GMT+1) we had a critical hardware failure at our main data center, located in Stockholm, Sweden. This is where the Pingdom website, control panel and backend are hosted, as well as the Stockholm monitoring location. The hardware failure effectively cut off those servers from the Internet.

We determined the root cause relatively quickly, but it took time to get replacement hardware into place and get it back up and running properly. In the end we were offline for just over four hours.

That doesn’t mean monitoring stopped working during this time. The actual monitoring from our 30+ monitoring locations is designed to continue working independently of our backend whenever the backend can’t be reached.

Once the backend was reachable again, the system started catching up on the monitoring results from those past few hours, pulling them in from our 30+ (32 right now, to be exact) monitoring locations. Each location caches its results until they can be sent to and processed by the backend, at which point they show up in reports and so on.
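To make that store-and-forward behavior a bit more concrete, here is a minimal sketch of probe-side buffering, assuming a simple HTTP handoff from a monitoring location to the backend. The class, function names and endpoint are our own illustration, not Pingdom's actual code.

    # Minimal sketch of probe-side result buffering (hypothetical, not Pingdom's code).
    # Results stay in a local queue until the backend has accepted them.
    import collections
    import time

    import requests  # third-party HTTP client

    BACKEND_URL = "https://backend.example.com/results"  # placeholder endpoint

    class Probe:
        def __init__(self):
            self.buffer = collections.deque()  # results waiting to be delivered

        def record(self, check_id, status, timestamp):
            # Called after every check; nothing is sent yet, only queued.
            self.buffer.append({"check": check_id, "status": status, "ts": timestamp})

        def flush(self):
            # Drain the queue in order. On any failure we stop and retry on the
            # next cycle, so a backend outage delays delivery but loses nothing.
            while self.buffer:
                result = self.buffer[0]
                try:
                    resp = requests.post(BACKEND_URL, json=result, timeout=5)
                    resp.raise_for_status()
                except requests.RequestException:
                    return  # backend unreachable; keep the result and try later
                self.buffer.popleft()

    if __name__ == "__main__":
        probe = Probe()
        probe.record("check-42", "up", time.time())
        probe.flush()  # delivery attempt; a failure simply leaves the result queued

The key property is that a result only leaves the local buffer once the backend has acknowledged it, so a multi-hour backend outage delays reports but does not lose data.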

The consequences

Aside from the site being unavailable for four hours, which was bad enough, there were some other unfortunate consequences of the outage.

  • Delayed data in reports. Since our backend was unreachable for such a long time, it took a couple of hours before all the monitoring data from our monitoring network had been sent in and processed and the report data was 100% correct and up to date again.
  • Delayed alerts. Those who had outages that started during the time our backend was unreachable got their alerts late, which of course is unacceptable. We plan to implement a distributed solution for alerts that will completely eliminate any risk of this in the future.
  • Some incorrect alerts. Unfortunately, in addition to the regular but delayed alerts, a number of incorrect ones were triggered as the system came back up. We did not discover this until we had time to investigate the system in more detail. We had designed a fail-safe against exactly this scenario, but it turned out to be circumvented in this case. After the fact we can only apologize. We have found the problem and will correct it so it can’t happen again (see the sketch after this list for the general idea).
  • Overloaded site. To make matters worse, hours’ worth of delayed alerts were sent out in a short amount of time as the results flooded back into the system. A ton of people naturally tried to connect to our site and control panel to investigate what was going on, creating something close to a DDoS attack that made the site and control panel slow and at times unavailable.
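As a general illustration of the kind of fail-safe mentioned under “Some incorrect alerts”, here is a minimal sketch of filtering an alert backlog so that stale alerts, and “down” alerts for checks that have since recovered, are suppressed instead of paged out. The names and the one-hour cutoff are our own assumptions, not Pingdom’s actual alerting code.

    # Hypothetical backlog filter: suppress alerts that are too old or already superseded.
    from datetime import datetime, timedelta

    MAX_ALERT_AGE = timedelta(hours=1)  # assumed cutoff for "too old to page someone about"

    def alerts_to_send(backlog, now=None):
        # backlog: list of dicts with 'check', 'state' ('down'/'up') and 'ts' (datetime).
        now = now or datetime.utcnow()
        ordered = sorted(backlog, key=lambda e: e["ts"])
        # Most recent known state per check, used to spot already-recovered outages.
        latest_state = {e["check"]: e["state"] for e in ordered}
        for event in ordered:
            too_old = now - event["ts"] > MAX_ALERT_AGE
            already_recovered = event["state"] == "down" and latest_state[event["check"]] == "up"
            if too_old or already_recovered:
                continue  # suppress stale or superseded alerts
            yield event

    if __name__ == "__main__":
        now = datetime.utcnow()
        backlog = [
            {"check": "web-1", "state": "down", "ts": now - timedelta(hours=3)},
            {"check": "web-1", "state": "up", "ts": now - timedelta(hours=2)},
            {"check": "web-2", "state": "down", "ts": now - timedelta(minutes=5)},
        ]
        print(list(alerts_to_send(backlog, now)))  # only the recent web-2 alert survives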

Note regarding SMS refunds: Any lost SMS credits due to late or incorrect alerts in connection with yesterday’s outage will of course be refunded.

What we have learned going forward

We tend to say that no service is immune to downtime, and that includes us. What matters is how you resolve it and learn from it.

This four-hour downtime was, by quite some margin, the single largest issue we’ve ever had with the Pingdom service during the four years we’ve been around. Basically every single thing that could possibly go wrong, did. To name just one example, getting the replacement hardware into place easily took an hour longer than necessary simply because roadwork had created a huge traffic jam on the freeway to our data center.

We’ll use the experience and knowledge we’ve gained from this to make our system stronger and better, with additional fail-safes and redundancy. We’re using an excellent data center, but we need to add more, especially for backend-related functionality.

What we plan to do to avoid similar incidents in the future:

  • Add more locations with backend functionality to handle alerts and other time-critical resources (a rough sketch of this idea follows the list).
  • Add more redundant hardware, and better failover capability.
  • Have more spare parts on location.
  • Add more fail-safes in the system to gracefully recover from longer site issues.
  • Make alert code modifications to better handle extreme situations without side effects.
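As a rough sketch of the first item on that list, alert dispatch could try several backend locations in turn instead of depending on a single site. The endpoints and function name below are hypothetical; this is only meant to illustrate the failover idea, not to describe Pingdom’s implementation.

    # Hypothetical multi-location alert dispatch with simple failover.
    import requests

    ALERT_ENDPOINTS = [
        "https://alerts-eu1.example.com/send",  # primary backend location
        "https://alerts-us1.example.com/send",  # used only if the primary fails
    ]

    def dispatch_alert(payload):
        for endpoint in ALERT_ENDPOINTS:
            try:
                resp = requests.post(endpoint, json=payload, timeout=5)
                resp.raise_for_status()
                return True  # delivered by this location
            except requests.RequestException:
                continue  # unreachable or erroring; try the next location
        return False  # every location failed; the caller should queue and retry

In a setup like this, losing one backend location costs a failover attempt instead of hours of silence.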

That’s actually the only upside to this incident: it will make Pingdom an even better service. You should be able to trust us 100%, and yesterday, during and for a while after the outage, we stumbled in that regard. We hope your trust in us has not been too damaged by this incident.

And please remember, if you have any questions you can always get in touch with the Pingdom team at support (at) pingdom.com.



18 comments
netking17

Getting an offline alert and seeing my site online was just, hey, good news... I would worry more if my site was down and Pingdom wouldn't tell me. Your service is one of the most reliable I know, always there whenever there's the slightest glitch on our servers. Everybody has a chance to screw up sometimes (even S3) and I'm sure you'll make the service even better after this. Keep up the good work ;-)

pingdom

It goes without saying that we really appreciate your understanding and all the supportive comments. But we'll say it anyway: Thanks everyone. :)

Abbey

Seems we didn't learn anything from the Amazon S3 downtime a few months ago. We can't have any single point of failure in our systems or applications, and a monitoring service should be a perfect example of this. Of course, we are only human, and even machines are created by us, thus not perfect either. I trust that the Pingdom team will carry on with their redundancy plan and make this service only better in the long run :)

John Dalton

Seconding Gavin's point #1 - I was woken at 5AM by obviously incorrect alerts ("down" alerts which didn't match my internal monitoring systems, followed by "up" alerts reporting an end to outages that had apparently gone on for multiple days). When I checked the Pingdom twitter feed (later in the day) for some sort of explanation, I saw the responses denying any false alerts were sent. As soon as I had determined that my alerts were incorrect, I went back to sleep - so I wasn't too badly inconvenienced. ;) I also know how hard it is to deal with integrating and understanding the data coming in from customers in the field when your primary concern is service restoration. However if a large number of customers are telling you your analysis is incorrect, then it's probably a good idea to acknowledge the possibility (even if you don't believe them yet ;) ). All that said, I'm still very happy with Pingdom's service. These things happen, and the response as outlined above looks good.

Adrian

I think that this post with apologies is OK for customers who need to know what happened.

Brian

Nice post, thanks; good to see you're addressing the causes from a number of points of view and this sort of transparent and thoughtful response only serves to increase my confidence in pingdom. Of course, the reliability you are seeking to provide can only be accomplished by having a well distributed architecture with no single points of failure; you have to protect against an entire data centre going off the air as well as machine failure.

Cam Jackson

It happens. You guys are still awesome. Big hugs for pingdom from me :)

Gabriel

I was playing a concert in Paris and I had to leave and go get my computer to see what was going on. I'll send you the taxi bill ;)

Mike

Well done for being honest in this blog post about your issues yesterday, and for creating an action plan very quickly. All websites go through similar problems, even huge services like Skype and Twitter. This event was unfortunate but will only make Pingdom an even better service in the future.

Remco

"getting the replacement hardware into place easily took an hour longer than necessary simply because roadwork had created a huge traffic jam on the freeway to our data center" What about remote hands in the datacenter? I am a bit flabbergasted that you with such an important service have a single point of failure, and no people/spare parts at location. Did you just not think about it, or where you just to busy with drawing the new office plans ;).

Dennis

Thanks for fixing the false alert problem. At first I panicked when I started getting text message alerts that a bunch of our servers (even at different data centers) were down, but I quickly realized that the problem was with Pingdom. Anyway, thanks for explaining the problem, for refunding all the text message credits, and for the extra ones provided. As far as I am concerned you provide a valuable service at an excellent price.

Scott

An incorrect alert got me out of bed at around 4am and rushing to check why DNS was down, when it wasn't! These things happen though! I love Pingdom :)

Mark

What was the hardware failure? Was it a server, a power strip, a switch or ?

Gavin Pearce

Great post, and great handling of what could be classed as a catastrophic situation. We were kept informed and updated throughout, and no major trust lost in Pingdom as a result. On the topic of learning for the future, two areas for improvement: 1) Conflicting information - straight away a large number of users, ourselves included, noticed that many of the reports were incorrect, not just delayed. However Pingdom's response was that "no" incorrect reports had been sent out, only wrong time stamps. Lesson: Trust your users more when they point out errors. 2) Single point of failure - the only question left unanswered. What item of kit was it, and why did the system have a single point of failure if so critical? No need to expand further on this one. Lesson: Obvious. Otherwise, great handling and great work on keeping us customers informed throughout. Much appreciated, even if it did interrupt our meal! :)

Pete

This is a good first step towards regaining your customers' trust. I'd suggest that you need to go a bit further to knock the ball out of the park. Fix the problems, and then firedrill this exact outage in production. If you're willing to do that, then you don't have any doubt that you really fixed the problem, do you? Demonstrate that the problem is solved, and then blog about it. Showing your customers that you actually follow through with your intentions is important, and good marketing, as well.

Roberto Valerio

We did not receive a single warning email that Pingdom services were offline! Why was there no notification to your customers? This would have helped a lot. I do not care about late apologies, but I do care about timely communication! Even a mass mailing to all customers would have helped. Don't tell me you do not have an offsite backup of your customers' contact data. Best, Roberto

pingdom

@Remco: Getting someone on location wasn't the big problem (i.e. remote hands were available), the big problem was that we unfortunately didn't have the specific replacement hardware we needed available at the site.

pingdom

@Mark. Firewall hardware breakdown.