Posts

MySQL replication - master failed - how to recover

I recently experienced a problem with my huge MySQL database. The SSD on the master failed - basically it refused to be recognised, so it was useless as a drive. My databases are pretty big - terabytes. I have several slaves which are replicas of the master, so this failure was an irritation rather than a disaster. Here's how I recovered it and also recovered from some "edge cases" I discovered with the rather pathetic MySQL replication process. A key lesson I've learnt is that I probably need to set up RAID on the master. I've had hardware RAID in the past, but unfortunately I discovered that two drives on the same controller can suffer a common-mode failure which kills both drives - hence at the time I decided not to have RAID. I need to revisit this decision with soft RAID as it would save time in recovery. My databases are big, so the first problem I encountered is that there isn't enough spare disk space on the slaves to dump a copy o

The journey to DevOps begins

My experience with technology operations started about 1998. I had it ingrained into me that the right model was Plan Build Operate. I was never entirely happy with the model. It worked, but it was very much a waterfall paradigm. The Planners were usually oblivious to Operational issues, and usually they had moved on to the next project so had no incentive or interest to fix problems. More often than not, Test added no value other than letting Operations know what didn't work when it went live. All in all it was a bunch of silos, leading to Operations being the poor citizen, with things being blindly thrown over each wall, inexorably destined for Operations. Over time I started experimenting with organisational structures to address these shortcomings. Letting Developers loose on the network didn't work. They had no discipline. If they "fixed" something and it didn't work, they didn't know how to get back to a known state, and it was left to Operations to clean up the mess. The