We recently moved our service to a new architecture that allows us to deploy changes several times per day with zero downtime, while also improving our service's availability, scalability and performance. In two words – no brainer.
As part of the move, however, we encountered an incident that presented us with a huge challenge, as well as a huge opportunity to test our communication and willingness to be transparent. And we think we passed the test. We received no complaints, and one of our customers even commented: “No problem – these things happen. Good communication from you and the team on it.”
Here is the timeline of what happened (and what worked for us):
1am GMT on 27th September – live deployment was completed
8:30am GMT – we understand that there is a problem with some users logging in
8:45am GMT – we place a message on the login page advising users how to proceed
Rest of the day – we continue investigating while patiently responding to all support tickets
8:30am GMT on 28th September – we discover the root cause of the issue
8:35am GMT – the problem is corrected in production
8:45am GMT – we amend the sign-in page message to reflect the new information
9:00am GMT – we make a decision to be completely transparent with our customers about the events
11:00am GMT – we have a draft of the formal communication we are about to send to all administrative users
12:00pm GMT – the send-to list is finalised and the campaign is all set up
12:15pm GMT – we send the message below to all our customers
12:20pm GMT – we change the sign-in page message once more to reflect the latest understanding
3:00pm GMT – this blog post is born
And here is the main communication we sent to all our customers:
Subject: A day to forget yet a day to remember
Dear <First Name>,
In the 7 years that LeaveWizard has been running we have not had an incident like the one that happened yesterday, so I wanted to reach out to you and explain what happened.
From 01:00 GMT on 27th September 2016 until 08:30 GMT on 28th September 2016, a recent deployment of LeaveWizard resulted in users having to change their passwords, and in some lost data.
What happened? – in Summary
As part of migrating to our new infrastructure, a configuration setting was set incorrectly, which resulted in our production environment pointing at the wrong database.
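To illustrate the kind of mistake involved (the setting and server names below are hypothetical, not our actual configuration), a mix-up like this can be as small as one wrong value in an environment-to-connection-string mapping:

```python
# Hypothetical illustration: one wrong value in an environment map
# is enough to point production at the wrong database.
CONNECTION_STRINGS = {
    "staging": "Server=staging-db;Database=leavewizard_staging",
    # This entry was meant to point at the regularly backed-up
    # production database, but was left pointing at a temporary one.
    "production": "Server=temp-db;Database=leavewizard_temp",
}

def get_connection_string(environment: str) -> str:
    """Look up the connection string for the given environment."""
    return CONNECTION_STRINGS[environment]

# The deployment proceeds without error, so nothing looks wrong --
# the application simply starts writing to the temporary database.
print(get_connection_string("production"))
```

The insidious part is that nothing fails at deployment time; the application runs normally against the wrong database, so the problem only surfaces through its symptoms.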
Unfortunately, the cause of the problem was not immediately apparent. The symptom was that users needed to reset their passwords, and the initial conclusion was that this was due to the hardware change. It took a while for our development team to identify the root cause of the issue, but as soon as the problem was found it was addressed.
We are looking to put steps in place to ensure this kind of thing does not happen again. If you would like more details about what happened, you can read more below.
What are we asking you to do?
We are asking for your support today. Because our backup process covers only the live database, we have no access to the events or approvals you may have added to the system between 1am GMT on 27th September and 8:30am GMT on 28th September.
Please could you double-check any events, approvals or other changes you may have made in this time window and, if you find something is missing, add it once more.
I would like to thank you for your patience and understanding over the last couple of days.

Kind regards,

Rich Allen
Co-Founder, LeaveWizard Ltd
What happened? – in Detail
To cope with the increasing popularity of the LeaveWizard platform, it became necessary for us to migrate our infrastructure so that the service could handle the increased demand without degrading performance.
As part of this move, and in order to streamline our software delivery process and make it easier to ship new features, we introduced a new deployment process that relies on various configuration settings to set up each environment correctly for deployment.
Part of this deployment process takes the application through a staging environment, which enables us to test the functionality before deploying it live. Once the application has passed testing on staging, it is promoted to our production environment and made live.
One of the settings, which controls the database that the production environment points to, was set incorrectly. This resulted in data being stored in a temporary database rather than the production database that is regularly backed up. Overnight, the temporary database was restored to its previous state, which resulted in the day's transactions being lost.

Why is it not going to happen again?

We take looking after your data very seriously and are disappointed in ourselves for letting this happen, but we must learn from our mistakes to ensure this kind of problem does not happen again.
Therefore, we shall be working on a solution that verifies the configuration is valid for the environment it is being deployed to before any future deployment is made live. This solution will run automated checks before we push the button to go live.
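A minimal sketch of such a pre-deployment check (the setting names and values here are hypothetical; the real implementation will depend on our deployment tooling) is to compare each critical setting against a known-good value for the target environment, and refuse to promote the build if anything is off:

```python
# Hypothetical pre-deployment configuration check: verify each
# critical setting against the expected value for the target
# environment before the build is promoted to live.
EXPECTED = {
    "production": {"database_host": "prod-db.internal"},
    "staging": {"database_host": "staging-db.internal"},
}

def validate_config(environment: str, config: dict) -> list:
    """Return a list of mismatches; an empty list means safe to deploy."""
    errors = []
    for key, expected_value in EXPECTED[environment].items():
        actual = config.get(key)
        if actual != expected_value:
            errors.append(f"{key}: expected {expected_value!r}, got {actual!r}")
    return errors

# A deployment script would call this and abort on any mismatch,
# catching exactly the kind of error described above:
problems = validate_config("production", {"database_host": "temp-db.internal"})
for problem in problems:
    print("Aborting deployment:", problem)
```

The point of the check is to turn a silent misconfiguration into a loud, deployment-blocking failure before any customer data is at risk.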