Bug temporarily affected monitoring for a portion of our customers today
Today, the Pingdom team deployed a software upgrade to some of our monitoring probes. Despite thorough testing, this upgrade contained a bug that led to false down alerts being sent to a portion of our customers.
Even though the issue affected monitoring for less than 90 minutes and for a limited number of customers, it’s of course frustrating if you were one of them. We take a lot of pride in delivering a reliable service, and this doesn’t represent what Pingdom stands for.
Let us first stress how rare it is that something like this happens at Pingdom. In fact, this is the first time anything like it has happened to us. That said, we want to take this opportunity to explain what happened, present the actions we’ve already taken, and tell you how we will move forward.
Our normal deployment of new and updated software includes a series of tests designed to make sure that our systems are reliable. This means that we roll out updates gradually to our infrastructure, and only after they’ve been thoroughly tested in our development and staging environments.
Today at around 8 am GMT we started to gradually roll out the update to a few selected monitoring probes. We immediately saw that there was an issue with the code and rolled it back. Unfortunately, a limited number of customers had false downtime recorded in their data and in some cases also received false down alerts for a limited time.
After a thorough investigation we’ve already initiated actions to minimize the effect this may have had, including:
- Affected Pingdom checks will have their up and down records marked as unmonitored for the period in question, up to a maximum of 90 minutes. Therefore, each site’s uptime record will not be affected. In other words, your uptime percentage will not change due to this incident.
- Any SMS credits lost due to incorrect alerts in connection with this issue have been refunded. You will receive double the number of credits that were used during the incident.
- We will take further steps to make sure that future upgrades to our infrastructure will be implemented with even more caution. This incident has already led to improvements in our deployment routines.
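To illustrate the first point above, here is a minimal sketch of why marking a period as unmonitored leaves the uptime percentage unchanged. It assumes uptime is computed as up time divided by monitored time (up plus down), with unmonitored time excluded from both. The `Interval` type and `uptime_percentage` function are hypothetical names used for illustration only, not Pingdom's actual calculation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interval:
    minutes: int
    state: str  # one of "up", "down", "unmonitored" (hypothetical labels)


def uptime_percentage(intervals: List[Interval]) -> Optional[float]:
    """Return uptime as a percentage of monitored time.

    Assumption: unmonitored minutes are excluded entirely, so they
    count neither for nor against the result.
    """
    up = sum(i.minutes for i in intervals if i.state == "up")
    down = sum(i.minutes for i in intervals if i.state == "down")
    monitored = up + down
    if monitored == 0:
        return None  # nothing was monitored, so no percentage exists
    return 100.0 * up / monitored


# One day (1440 minutes) in which a false 90-minute outage was recorded.
with_false_downtime = [Interval(1350, "up"), Interval(90, "down")]

# The same day after the 90 minutes are re-marked as unmonitored.
after_correction = [Interval(1350, "up"), Interval(90, "unmonitored")]

print(uptime_percentage(with_false_downtime))  # 93.75
print(uptime_percentage(after_correction))     # 100.0
```

Under that assumption, the corrected period simply drops out of the calculation, so a site that was actually up the whole time reports the same uptime it would have without the incident.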
We want you to rest assured that all of us working at Pingdom take significant pride in delivering the best possible service. Even though mistakes happen, they are not acceptable to us.
If you were affected by this, we’re really sorry. You can be sure that someone will be wearing the stupid hat today.
Please contact us at firstname.lastname@example.org if you have any questions or comments.