
Migrating from a VPS with MySQL to AWS RDS, using AWS DMS

A small recap of our experience migrating an old MySQL server to AWS RDS using AWS DMS, plus some points of action.

Jelle De Laender

Like most software companies, we started out one day writing small (web)apps, using shared hosting or a VPS to host the code and a database to store the data and make it searchable. Over time, and as better practice, that database usually got migrated to its own VPS or dedicated server.

For one of our projects this was the case, and that server was getting old (Debian 7, MySQL 5.5), receiving only the occasional security patch, and even its long term support (LTS) had expired… oops… Upgrading this kind of VPS is always a risk, and often a guess… Of course, you can (and should) practice on a clone (aka staging) before you touch production, but even then it's a risky operation.

To get out of this situation, we decided to move this database server to the AWS Relational Database Service, RDS in short: a SaaS/IaaS that takes care of hosting the database, its management, applying patches, adding redundancy/failover features and scalability, for a reasonable price.

Amazon also offers AWS DMS, the Database Migration Service, to migrate your data from your own server. Sounds great, no? We blocked off a day to get our production database migrated, hopefully without much downtime, as advertised on the AWS DMS page.

We started by making a short plan-of-action:

  • Set up AWS, RDS & VPC. Add webservers to the VPC.
  • Set up DMS, sync data
  • Test if changes on the production DB are synced to RDS
  • Change webservers to use the new DB server, one by one
  • Shut down the old server, evaluate, check logs

Right, let’s get started:

We first created a new AWS account and added it under our organisation. We created a new RDS instance (still MySQL, and to be safe MySQL 5.5, the same version as the current DB server) and used the default Virtual Private Cloud (VPC) to control who can access the server.

Next was setting up DMS, which requires a user account on the source database. OK, let's create a user. As it's a production database, we wanted to limit access for this user at the IP level; however, since that IP wasn't known at this point, we had to allow access from everywhere. Not a big deal, it's only temporary and we can tighten the access later, once the DMS instance has been created. When setting up DMS there are a lot of options, like adding extra validation and enabling logging, all pretty well explained on the page and in the documents. We enabled them, as it all made sense and extra validation could never hurt on a production database, right?
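For reference, a minimal sketch of such a source user (user name, host and password below are placeholders, not our actual values):

-- Hypothetical DMS source user; tighten the host ('%') once the
-- DMS replication instance's address is known.
CREATE USER 'dms_user'@'%' IDENTIFIED BY 'a-strong-password';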

From this point on, things didn't go as smoothly as expected ☹️

The DMS page is pretty clear about how to proceed, and the help pages are pretty helpful. There is a documentation page that clearly indicates what we need to do when migrating from MySQL to RDS. Great 😀 Let's jump to “Using a Self-Managed MySQL-Compatible Database as a Source for AWS DMS”, as that's our situation. AWS clearly states which flags we need to enable in our MySQL config. Let's make those changes and restart the MySQL process… (right, this results in a small downtime, but that's acceptable during a migration).

A first warning here: be careful when adding the parameters to your config file if you are on MySQL 5.5, as log_slave_updates, binlog_row_image and binlog_checksum are only available in MySQL 5.6 and later. Adding them results in a MySQL instance that's unable to start. Based on the error logs (by default in /var/log/syslog on Debian), it's easy to figure this out. Right, can happen. Let's hope the rest goes smoothly. Enabling the binlog for the synchronisation makes sense, as the migration needs to keep syncing the changes.
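As a quick sanity check after restarting MySQL, you can verify the replication-related settings on the source; the variable list below is just our own pick of the ones relevant on MySQL 5.5:

-- The binlog must be enabled and in ROW format for ongoing replication.
SHOW VARIABLES WHERE Variable_name IN
  ('log_bin', 'binlog_format', 'server_id', 'expire_logs_days');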

Let's start the DMS. It started smoothly… “Starting”… “Running”…, but after 10% it failed. Right, luckily we had enabled the logs. The logs clearly stated that our MySQL user didn't have enough privileges.

2018-11-28T11:52:11 [SOURCE_CAPTURE ]E: Error 'Access denied; you need (at least one of) the SUPER,REPLICATION CLIENT privilege(s) for this operation'

Right, this makes sense, as access to the binlog is restricted by default. The same help page includes a section about the required privileges:

You must have an account for AWS DMS that has the Replication Admin role. The role needs the following privileges:
REPLICATION CLIENT — This privilege is required for change data capture (CDC) tasks only. In other words, full-load-only tasks don’t require this privilege.
REPLICATION SLAVE — This privilege is required for change data capture (CDC) tasks only. In other words, full-load-only tasks don’t require this privilege.
SUPER — This privilege is required only in MySQL versions before 5.6.6.
The AWS DMS user must also have SELECT privileges for the source tables designated for replication.

Not knowing what a CDC task exactly is, and after a bad “Google”, we thought it was only for detecting changes in DB structures, so we only granted the SELECT and SUPER privileges to the user.

Restarting the DMS… we saw some tables appearing as “pending validation”. Alright, this looks cool. We could see MySQL processes on the source database and on the newly created RDS database. Cool, it started…

However, at 30% it seemed to be failing again. The database connection seemed to be okay, as we had some migrated tables, but not all data was transferred. Right, let's check the logs for the error… The only thing we could find was “task ended”, but no errors.

We suspected the validation was acting up, so we disabled the validation part, but the job kept failing mysteriously at 30%. Is our database server too old? Not supported? Did we need those flags/features that are only available in MySQL 5.6 and later? But then, why does AWS allow us to create an RDS instance with MySQL 5.5 as the engine… Let's debug some more…

After quite some debugging, and after accidentally clicking the “Assess” button, it turned out that some IAM privileges were missing. Luckily, the Assess button showed a modal that could fix this for us 🙌 However, it didn't fix our problem. The only thing visible in the CloudWatch log was:

2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #0 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #1 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #2 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #3 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #4 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #5 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #6 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #7 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Subtask #8 ended (replicationtask_util.c:564)
2018-11-28T12:37:20 [TASK_MANAGER ]I: Task management thread terminated (replicationtask.c:3575)

As you can guess, the issue was that we still had to grant REPLICATION CLIENT and REPLICATION SLAVE to our MySQL user. After understanding CDC (change data capture) better, and granting those privileges, we hit the Start button again. Everything seemed to be working… 🎉
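For completeness, a sketch of the full set of grants as we understand it from the documentation quoted above (user name and host are placeholders, and SELECT can of course be limited to the schemas being migrated):

-- Global replication privileges, plus SUPER (needed before MySQL 5.6.6).
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE, SUPER
  ON *.* TO 'dms_user'@'%';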

We made some changes in the source database, and they seemed to be propagated nicely to the new database 🎉🎉🎉

After doing more tests, we were pretty happy and assumed that everything was working. However, we discovered, just in time, that the auto-increment values were missing. Not a big deal, we should be able to add those easily via some SQL statements like:

ALTER TABLE users AUTO_INCREMENT=1001;

As the target database was being changed and filled with new data (DMS kept syncing all new data and changes from the source), we made a note of all auto_increment values on the source and added a number to them to leave a gap, so that records created on the new database would not conflict with records created while we were migrating. However, this SQL query seemed to do nothing. There was no error, we got an “xxx rows affected” and the query took quite some time, but no next auto-increment value was defined. We discovered that the primary key in every table had lost its “auto increment” flag 😨

Right, after restoring the auto_increment flag on every primary key in every table in every database, we were happy to see that the original auto_increment value was being transferred. This value also kept changing when the source received new records, so there was no risk of conflicts on the ID (as long as you don't add records on the new DB, of course). Great.
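As an illustration of the kind of fix this required (table and column names are made up, and the column definition must match your existing one):

-- Restore the AUTO_INCREMENT attribute on the primary key column;
-- MySQL then sets the counter to MAX(id) + 1 on its own.
ALTER TABLE users MODIFY id INT UNSIGNED NOT NULL AUTO_INCREMENT;
-- Optionally bump the counter further to leave a gap for records
-- created during the migration.
ALTER TABLE users AUTO_INCREMENT = 1001;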

Good thing we saw this in time, before queries would start failing. Time for the big switch: let's change the config files on the webservers… 😳

Soon we noticed that our application was pretty slow… Time for more debugging: aha, indexes were not migrated. We had to manually add all indexes on all tables again. Not a fun task, as adding indexes often locks the table… however, surprisingly, the indexes were added pretty quickly, and our application was fast again 🚀
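One way to compare indexes between source and target is to run something like the query below on both and diff the output (the schema name 'xxx' is a placeholder, as in the query further down):

SELECT TABLE_NAME, INDEX_NAME, NON_UNIQUE,
       GROUP_CONCAT(COLUMN_NAME ORDER BY SEQ_IN_INDEX) AS index_columns
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = 'xxx'
GROUP BY TABLE_NAME, INDEX_NAME, NON_UNIQUE;
-- Then recreate any missing index on the target, e.g.:
CREATE INDEX idx_users_email ON users (email);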

We were happy. No errors in the logs. But we were getting some complaints… Let's debug. It seemed that some IDs couldn't be retrieved in some parts. Was a part of our webapp using a different config file and still pointing at the old database server? After more debugging and query evaluation, it turned out that some values in the records were null. That's unexpected… 😤

As you can guess, DMS didn't migrate the default values. It's bad engineering to rely on default values, but one field in one table did rely on its default value, and that one was critical. We ran a query to get all default values (and verified that the string encodings were set correctly):

`SELECT TABLE_NAME, COLUMN_NAME, IS_NULLABLE, COLUMN_DEFAULT, CHARACTER_SET_NAME, COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_SCHEMA='xxx';`

After fixing the default values, and repairing the new records by setting the missing values, everything was working again 🎉
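A sketch of that kind of fix, with a made-up table and column (the real one is specific to our application):

-- Restore the default on the target...
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
-- ...and repair the rows that were inserted while the default was missing.
UPDATE orders SET status = 'pending' WHERE status IS NULL;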

It took us almost a full day. We lost quite some time debugging and finding out how things work (sounds like a normal day, right?), but we finally managed to migrate our production database to AWS RDS with almost no downtime. Our database now uses the High Availability (Multi-AZ) feature with automatic failover, encrypted backups and encryption at rest, plus integrated monitoring and alerting.

If you are still running an older database, like MySQL 5.5, I hope this story helps you avoid the issues we had and reduces your migration time.
Be sure to grant the SUPER, REPLICATION CLIENT and REPLICATION SLAVE privileges. Verify which flags you are setting and check whether they are compatible with your MySQL version. Be sure to verify the indexes, auto-increment values and default values of your tables and fields after the migration, before making the final switch.

Let's hope everything runs smoothly now… without more unexpected, hidden issues…

Update: the AWS documentation page has a list of known limitations at the bottom.

“The AUTO_INCREMENT attribute on a column is not migrated to a target database column.”

It's easy to miss, but this one is documented. Oops.

About indexes, the following is listed:

“Indexes created on a portion of a column value aren’t migrated. For example, the index CREATE INDEX first_ten_chars ON customer (name(10)) isn’t created on the target.”

However, in our case we only had indexes on full columns, not on portions/substrings. Maybe we missed a setting somewhere…
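If you want to rule that out on your own source, prefix indexes can be spotted because SUB_PART is non-NULL for them (again with 'xxx' as a placeholder schema name):

-- Lists indexes built on a portion of a column, which DMS won't migrate.
SELECT TABLE_NAME, INDEX_NAME, COLUMN_NAME, SUB_PART
FROM INFORMATION_SCHEMA.STATISTICS
WHERE TABLE_SCHEMA = 'xxx' AND SUB_PART IS NOT NULL;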

There is also no mention of the loss of default values 🤷‍♂️

Another important aspect is:

“If the MySQL-compatible source is stopped during full load, the AWS DMS task doesn’t stop with an error. The task ends successfully, but the target might be out of sync with the source. If this happens, either restart the task or reload the affected tables.”

So, be warned 👮‍♂️🚨