I am sorry for the unscheduled downtime this morning, Friday 25th of May, around 8 am UTC. A scheduled kernel upgrade of the server did not go as expected. The upgrade went correctly on the slave, including the reboot and resync, but the master failed to come back up. For data safety reasons, we took a backup of the slave before promoting it to master and switching the application over to the new master. This backup took longer than expected and is what caused the long downtime.
What is next?
- Need to set up a new slave and get the master to ship its logs to it. This will again force a short downtime of the master database while it picks up the new configuration.
- Need to be a bit more proactive in announcing issues. I set up a small list for the very active users, but a known central place to publish updates would be better.
- Need a regular backup of the slave, so that a backup does not have to be taken under pressure.
I will keep you informed. Sorry for the annoyance. Sometimes issues happen, and this one took me by surprise.
Update: Having reviewed the logs, I think the combination of a VM and hardware node restart, including a KVM upgrade, is most likely the culprit.