In regard to the recent outages

In less than a week, we’ve experienced two outages. In fact, these two outages combined have been the worst since the company was founded in 2007.

We wanted to take this opportunity to give you an update on the situation and tell you where we go from here. The current status is that all our core websites and services are up and running as they should. This includes the monitoring you have set up for your websites, alerts, our API, etc.

The only services remaining to be started up again are the non-core Ping and Traceroute services, parts of our Tools, as well as Report Banners.

When this most recent outage started yesterday, we were of course alerted right away through the various ways we monitor our own services. Despite this, the severity of this incident and the challenges we faced meant it took longer than anticipated to address the problems. During both incidents, we have worked closely with consultants as well as suppliers.

We are in the website monitoring business, and we think you will agree with us that everyone will, at some point, be struck by unscheduled downtime. To put things in perspective, our systems handle around 280,000 customer accounts and store almost 300 million monitoring results each day. We invest millions every year in hardware, software, services, and more, to provide the best possible monitoring solution available today. Unfortunately, even with the best of intentions and the most thorough preparations, things sometimes go wrong.

We’re not trying to play the blame game, but we want to be as upfront as we can with you, our customers. Next week we will present a detailed plan for how we will fix the points of failure we have today, including a specific timeline for when different things will be done. Following that plan, we will continuously keep you updated and informed about the progress we make.

While the necessary long-term fixes are being implemented, we’ve already put measures in place that we hope will prevent any similar incident from happening again. We anticipate that we will be completely done with the first step of the coming changes and updates in 5-10 days from now.

Everyone at Pingdom is now dedicated to fixing these issues, and all other development has been put on hold. Even after what’s now happened, we’re very proud of the services we provide, and we’ll work exclusively on making our systems as reliable as they can possibly be.

Even though this is just the start of what will be an intensive process for us, we’d like to think we learn from our mistakes. This is now our chance to prove to you, our customers, that we do just that.

If you have any questions or concerns, please send an email to support@pingdom.com.

As always, it would be very helpful for us if you could provide as much information as possible when you contact us, including your account information and what checks are affected. Also, wherever possible, sending us a screenshot of the exact issue will help us help you faster.



28 comments
jasonjwwilliams

I've heard nothing in Pingdom's replies here to indicate they agree their silence was unacceptable. In fact the paragraph containing "we're not trying to blame someone else" did exactly that...we're not unintelligent individuals. Bad things happen, being honest/communicative about them and making sure that permutation never happens again is all I expect of my vendors.

I opened a support ticket on Friday when it was clear they were having issues...no response to that ticket and it's now Tuesday. Not so much as a customer-wide mass mailing admitting the issue and its nature (much less an RFO with a resolution). We moved to Pingdom from Alertra because it looked like you got more for your money. Guess not...in 6 years of using Alertra for the same purpose (and check types) we had not one false positive or critical outage. In 2 years of using Pingdom this is the 5th significant issue. Short of a public mea culpa indicating Pingdom values its customers, we will be moving again...

While we're on the subject of valuing your customers...if you're generating enough revenue to justify millions of $$ in "hardware, software, and services" why the heck don't you spend some of that on a couple of guys to have 24 hour email support? The lack of it is just another indication that all that is valued about Pingdom's customers are their payments.

Pingdom

Thanks for the suggestion Dan, that's certainly something that's in our plans.

Dan Plaskon

I would recommend setting up a dedicated site for status updates during service disruptions. Host it under a subdomain (e.g. status.pingdom.com) and serve it from somewhere completely disconnected from your infrastructure to ensure it stays up if your other services go down. I know of many companies that take this approach; one example is BeanstalkApp (who we use for source control and deployments) - their status page is here for reference: http://status.beanstalkapp.com/

Pingdom

This incident was certainly a learning experience for all of us at Pingdom, when it comes to technology as well as in other ways, including communications. We have taken everything that our customers have said to heart and will do better in the future, that much we can promise. As we said in the blog post from a few days ago, we will this week publish our plan for what we're doing to get to grips with the points of failure that led to the incident this past weekend. Please keep the comments coming - everything you've said is greatly appreciated and we *will* do better.

Joe Toomey

I agree with several of the above comments. We all know that downtime can happen -- that's why we pay for your service. But if you want to build a trust-based relationship with your customers, you need to communicate when there are problems. You have almost 65,000 Twitter followers, many of whom were asking direct questions on Twitter during these outages. Important questions like "will I still get notifications if my site goes down?" Based on your above blog post, the answer to that appears to have been "maybe, but it might be significantly delayed." That was critical information for us to have *then*, not after the problems have been fixed. If you want to recover from this PR disaster, I suggest you publish an outage policy that includes communication details (i.e. if service(s) are down, here's where we will publish details, including what is affected and information on projected uptime). And then adhere to it. Hopefully it's clear you have some trust to win back. I wish you luck, b/c I like your service and want to see you succeed.

Dan Plaskon

Seems I'm not alone in my thoughts here. Please listen to your customers Pingdom. If I were you, I'd take all of the efforts you put into writing fluff blog articles and instead, put that into relevant and timely communications with your customers.

Troy Ackerman

Some downtime is understandable. Not updating your customers on a timely basis during the outage is not. When my company had a data center go offline for 7 hours due to a backbone ATS failure, I posted updates on Facebook 15+ times and sent out a couple of emails. My technology budget is $X,XXX a year not $X,XXX,XXX like you guys.

Cameron Banfield

You failed to mention the cause of the downtime... I'm sure many customers like me would like to know.

Aneil Singh

I would like to know in detail how this affects the stats and reports I am required to provide to my customers. Again. Maybe you should work on the core system instead of putting makeup and lipstick on your UI. Fix the network path issue you have!

Dan Plaskon

As you said, anyone can have issues - although this outage was pretty long by comparison. What I found inexcusable as a paying customer was the complete lack of updates yesterday. Surely someone at Pingdom can take 5 minutes out of their day to provide hourly updates during such severe and impactful downtime?

Warwick Poole

How about you use Twitter to talk to your 65k followers, and let them know you are fixing the service we pay for during a multiday service outage? Your lax attitude cost you this customer at least. I'm tolerant of outages but intolerant of hubris like not saying anything to your customers when the service they pay for stops working.

Jonathan Elliott

Good, I was hoping it wasn't just my internet going crazy like it normally does.

Pingdom moderator

@jasonjwwilliams Thanks for the input Jason, we do appreciate it. We put this comment on Facebook in response to comments (you can also see it below) but we'll add it here too since it applies:

"This incident was certainly a learning experience for all of us at Pingdom, when it comes to technology as well as in other ways, including communications. We have taken everything that our customers have said to heart and will do better in the future, that much we can promise. As we said in the blog post from a few days ago, we will this week publish our plan for what we're doing to get to grips with the points of failure that led to the incident this past weekend. Please keep the comments coming - everything you've said is greatly appreciated and we *will* do better."

Regarding increasing support, that's something that is in the plans for the near future.

About your email to support, if you could give us the ticket number, we'll follow up on that.
