On 2021/12/02 at 13:27 EST, one of our database servers for SMP had a drive failure.
On 2021/12/06 at 07:32 EST, a second database server experienced a drive failure.
Unfortunately, both of these servers were part of the same three-member replica set, meaning the same data is duplicated on three different servers in three different racks of the datacenter. With two of the three machines down, the set no longer had a quorum (a majority of its members), so the final member was unable to serve either reads or writes.
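For readers curious about the quorum arithmetic, here is a minimal sketch. It assumes the replica set uses a simple majority rule, which is what the description above implies. The function names are hypothetical and are not taken from our database software.

```python
def majority(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members are healthy to form a majority."""
    return healthy >= majority(members)


def tolerated_failures(members: int) -> int:
    """How many members can fail before quorum is lost."""
    return members - majority(members)


if __name__ == "__main__":
    # Three-member set: one failure is survivable, two are not.
    print(has_quorum(3, healthy=2))
    print(has_quorum(3, healthy=1))
    print(tolerated_failures(3))
```

Under that assumption, a three-member set tolerates one failure, which is why the first drive failure alone did not cause an outage while the second one did.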
Monitoring failed to alert us to either failure, so we only discovered the outage when we investigated an uptick in SMP bug reports. We then took SMP offline completely to minimize the impact on players’ worlds.
Since both failed servers used a RAID 0 drive configuration, which stripes data across drives with no redundancy, all data on those drives was lost, and the only current copy was on the remaining server. We immediately backed up the latest version of this part of the database to keep data loss to a minimum and to avoid restoring an older backup, which could have rolled back days of work on your worlds.
We have since fully restored the database. All three servers are operational again with brand new drives, and starting SMP servers has been re-enabled. Monitoring has also been adjusted so that it properly alerts us to failures like these. We’ve also improved our recovery process for this kind of failure, should this highly unlikely situation ever happen again.