In less than a week, we’ve experienced two outages. In fact, these two outages combined have been the worst since the company was founded in 2007.
We wanted to take this opportunity to give you an update on the situation and tell you where we go from here. The current status is that all our core websites and services are up and running as they should. This includes the monitoring you have set up of your websites, alerts, our API etc.
The only services remaining to be started up again are the non-core Ping and Traceroute service, parts of our Tools, as well as Report Banners.
When this most recent outage started yesterday, we were of course alerted right away through the various ways we monitor our own services. Despite this, the severity of the incident and the challenges we faced meant it took longer than anticipated to address the problems. During both incidents, we have worked closely with consultants as well as suppliers.
We are in the website monitoring business, and we think you will agree with us that everyone will, at some point, be struck by unscheduled downtime. To put things in perspective, our systems handle around 280,000 customer accounts and store almost 300 million monitoring results each day. We invest millions every year in hardware, software, services, and more, to provide the best possible monitoring solution available today. Unfortunately, even with the best of intentions and the most thorough preparations, things sometimes go wrong.
We’re not trying to play the blame game, but we want to be as upfront as we can with you, our customers. Next week we will present a detailed plan for how we will fix the points of failure we have today, including a specific timeline for when different things will be done. Following that plan, we will continuously keep you updated and informed about the progress we make.
While the long-term fixes are being implemented, we have already put measures in place that we hope will prevent any similar incident from happening again. We anticipate that we will be completely done with the first step of the coming changes and updates in 5-10 days from now.
Everyone at Pingdom is now dedicated to fixing these issues, and all other development has been put on hold. Even after what’s now happened, we’re very proud of the services we provide, and we’ll work exclusively on making our systems as reliable as they can possibly be.
Even though this is just the start of what will be an intensive process for us, we’d like to think we learn from our mistakes. This is now our chance to prove to you, our customers, that we do just that.
If you have any questions or concerns, please send an email to firstname.lastname@example.org.
As always, it would be very helpful for us if you could provide as much information as possible when you contact us, including your account information and which checks are affected. Also, wherever possible, sending us a screenshot of the exact issue will help us help you faster.