MySQL replication - master failed - how to recover

I recently experienced a problem with my huge MySQL database.  The SSD on the master failed - it simply refused to be recognised, so was useless as a drive.

My databases are pretty big - terabytes.

I have several slaves which are replicas of the master so this failure was an irritation rather than a disaster.

Here's how I recovered it and also recovered from some "edge cases" I discovered with the rather pathetic MySQL replication process.

A key lesson I've learnt is that I probably need to set up RAID on the master.  I've had hardware RAID in the past, but I discovered a common failure mode: two drives on the same controller can suffer a common-mode failure which kills both drives - hence at the time I decided not to have RAID.  I need to revisit this decision with software RAID, as it would save time in recovery.

My databases are big, so the first problem I encountered is that there isn't enough spare disk space on the slaves to dump a copy of the database.  Fortunately I have lots of space on the master.

Solution: run mysqldump on the master against a slave. This approach doesn't store anything on the slave - the dump files are written on the master.

On the master, run mysqldump but specify the slave hostname so the data is read from the slave:

mysqldump -h my_slave_hostname -u root_user -p database_name > dump.sql

This is the basic syntax: you are connecting to the MySQL server on the remote machine but writing the SQL dump to the local disk.

In my case, the dump files for each table are in the region of 200GB even when gzipped, so I enabled compression to avoid gigabytes of unnecessary data going over the internet.

mysqldump --compress -h my_slave_hostname -u root_user -p database_name > dump.sql

The --compress flag enables compression of the data sent over the wire between the client and the server.

The next challenge is that I didn't want all the tables from the database in a single dump file, as it would be enormous.  I also didn't want each table as a separate file, as there are about 100 tables.

In my case there are several tables associated with the same data type, e.g. typeA_data, typeA_strings, typeA_metadata.

To keep things simple, I wanted each dump file to contain all the tables for one type.

mysqldump doesn't really allow wildcards for tables, so I created a series of text files, each containing the list of tables to go into one dump.

filename:  typeA.list - each table name on a separate line
typeA_data
typeA_strings
typeA_metadata

Then run mysqldump using this file to supply the list of tables to dump:

mysqldump --compress -h my_slave_hostname -u root_user -p database_name `cat typeA.list` > dump_file.sql

Note the same connection options as before - without -h the dump would read from the local (failed) master rather than the slave.

This creates a single dump file which contains all the tables listed in typeA.list
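With around 100 tables, the list files themselves are worth generating rather than typing. A rough sketch, assuming the typeX_suffix naming scheme above - the capture of the table names from the slave is shown commented out (hostnames are placeholders) and simulated with a hand-written listing:

```shell
# Capture the table names -- normally from the slave, e.g.:
#   mysql -h my_slave_hostname -u root_user -p -N -e 'SHOW TABLES' database_name > all_tables.txt
# Simulated here with a hand-written listing:
cat > all_tables.txt <<'EOF'
typeA_data
typeA_strings
typeA_metadata
typeB_data
typeB_strings
EOF

# One .list file per type, grouping on the prefix before the first underscore
cut -d_ -f1 all_tables.txt | sort -u | while read -r prefix; do
    grep "^${prefix}_" all_tables.txt > "${prefix}.list"
done
```

Each resulting .list file can then be fed to mysqldump via `cat` exactly as above.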

After repeating this process I had a series of dump files, which I gzipped and used to reload the master.

In my case the master had lost the user database, so the slaves couldn't connect to the master anyway. Even so, I would strongly recommend stopping each slave: otherwise, when the master is reloaded, the slaves will pick up the replication inserts from the master and create duplicate rows - or drop their copies of the tables, since a DROP TABLE is the first statement in each mysqldump file.

I would recommend doing the following on each slave:

1/ Take a copy of master.info (by default in the MySQL data directory - /var/lib/mysql/master.info)
2/ Run STOP SLAVE;
3/ Run RESET SLAVE;

This makes sure your slaves won't take any notice of the master whilst you're reloading the databases, and it keeps a copy of the details the slaves use to log in to the master for replication - RESET SLAVE deletes master.info.
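To gather those master.info copies in one place, here's a hypothetical sketch - it assumes SSH access, the default /var/lib/mysql data directory, and placeholder hostnames. The loop only prints the copy commands for review; drop the echo to run them:

```shell
hosts="slave1 slave2 slave3"   # placeholder slave hostnames
for host in $hosts; do
    # echo prints the command for review; remove it to actually copy
    echo scp "root@${host}:/var/lib/mysql/master.info" "master.info.${host}"
done
```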

OK I now started loading the dump files into the master database.

zcat dump.sql.gz | mysql -u root -p database_name

I had gzipped the dump.sql files, so used zcat to stream them into the mysql client.

You can also use pv to get a rough view on the progress of the loading.

pv dump.sql.gz | gunzip | mysql -u root -p database_name

So I loaded the tables, file-by-file.   I did this manually....
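Doing it again, I'd script the per-file loads. A minimal sketch - to keep it runnable without a live server, `wc -l` stands in for the mysql client and a tiny sample dump is generated first; swap in the commented line for the real load:

```shell
# Generate a tiny sample dump so the loop has something to process
printf 'DROP TABLE IF EXISTS t;\nCREATE TABLE t (id INT);\n' | gzip > typeA.sql.gz

for f in *.sql.gz; do
    echo "loading $f"
    zcat "$f" | wc -l > "${f%.sql.gz}.load.log"   # stand-in for the real load
    # zcat "$f" | mysql -u root -p database_name
done
```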

I finally got to the point where I thought I had restored the master so then restarted replication.

I recreated the replication users on the master for each slave.

I stopped and restarted the master to ensure the replication log file was flushed and a new log file was created on restart. I checked where the master was with

SHOW MASTER STATUS

I then used this, together with the details from the saved master.info copies, to reinstate each slave:

CHANGE MASTER TO MASTER_HOST='master_host_name', MASTER_USER='replication_user_name', MASTER_PASSWORD='replication_password', MASTER_LOG_FILE='new_master_log_file_name', MASTER_LOG_POS=new_master_log_position;

The log file name and position come from the SHOW MASTER STATUS output above.

Then I started replication on each slave with START SLAVE. The master has lots of batch update jobs, which I then turned back on.

Then I discovered a problem.......

The slaves were now syncing with the master as expected, with all the deltas from the batch updates being replicated. But when I manually reloaded the master I had made a mistake: I had missed some tables out.

Now I had a problem.  Changes were being made to the master and replicated to the slaves, but I couldn't load the missing tables into the master without either deleting those tables from the slaves (each dump file starts with DROP TABLE) or creating duplicate rows on each slave.

Fortunately it is possible to handle this scenario.

I knew which tables were missing, which was good news.

First step was to stop the slaves.

STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table='database_name.missingtableA%,database_name.missingtableB%';

Note that the ignore patterns are database.table wildcards, so they need the database name prefix.

I'm paranoid so I ran

SHOW SLAVE STATUS\G

this shows the ignore patterns in the Replicate_Wild_Ignore_Table field, confirming the change was applied.

START SLAVE;

The slaves are now updating but not applying any changes to the missing tables.

I then loaded the missing tables on the master.

I now restarted the master and ran SHOW MASTER STATUS to find out where the master was.

Then I waited for each slave to get past the point at which the master had loaded the missing tables (comparing Exec_Master_Log_Pos in SHOW SLAVE STATUS against the master's position), then cleared the filter:

STOP SLAVE;
SET GLOBAL replicate_wild_ignore_table='';
START SLAVE;

Hopefully you won't make the same mistakes, but if you do I hope my experiences are useful to you.


