Migrating Data with Elixir

Posted on Dec 18, 2021.

Recently I completed a rewrite of an application[1], and needed to migrate the existing data. Below I'll review the process and discuss some important things I learned along the way.

Let’s get started.


Setting the Stage

My first programming-related job was part of a creative services team at a university. Student-funded groups, whose purposes ranged from volunteerism and fundraising to entertainment, needed graphic and web design services to attract students. To facilitate these services, the creative services team used an electronic queue.

This queue went through several iterations over the years. In 2014 I created my iteration of the application, CreativeQ, a Ruby on Rails app that taught me many valuable lessons about software design. As is often the case when learning and adopting a new technology (Elixir), I was excited to try it out by rewriting the queue. This started in 2017 and took many years of occasional work.

Towards the end, I faced the task of migrating data from the old version. This was tricky for a few reasons:

  1. Data schemas were different. Even the types used for IDs changed for some tables.
  2. Data had been normalized. A record from one table in the old database may translate into one (or more!) of several different tables in the new database.
  3. Not everything resided in a database. Related files would be migrated from disk storage to S3-style object storage.

Starting from scratch with a new, empty database wasn’t an option, so it was time to figure out a migration strategy.


Do Painful Things More Often

Martin Fowler wrote about a phrase we hear in the realm of software:

If it hurts, do it more often.

Migrating data was going to be a painful task. I was dreading the process and how much downtime it might require. As the saying goes, this was a great time to consider what it might look like to perform data migrations more often.

In hindsight, this was already starting to happen: development of the new version required example data to validate whether the interactions made sense in the context of a real workflow (as opposed to all of the fake graphic and web requests I created over the years). Furthermore, users could benefit from being able to try the system early — with real data — and give feedback before the switch-over.

So I gave myself the following prompt: create a system to migrate data to the new version of the application every day. This created a necessary and beneficial change in thinking. SQL queries and scripts weren’t going to be enough. I needed an application that could migrate data in an automated and repeatable way.

Migrator Overview

I created a basic mix project to perform the migration. It only needs a few dependencies: Ecto (via ecto_sql and postgrex) to read from the old PostgreSQL database and write to the new one.
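For reference, the dependency list in mix.exs might look something like this (version requirements are illustrative):

# In mix.exs; version requirements are illustrative.
defp deps do
  [
    {:ecto_sql, "~> 3.7"},
    {:postgrex, "~> 0.15"}
  ]
end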

The main application structure (lib/) looks like this:

lib
├── migrator
│   ├── application.ex
│   ├── new
│   │   ├── schema1.ex
│   │   ├── schema2
│   │   │   ├── schema2_embed1.ex
│   │   │   └── schema2_embed2.ex
│   │   ├── schema2.ex
│   │   ├── ...
│   │   └── schema3.ex
│   ├── new.ex
│   ├── old
│   │   ├── schema1.ex
│   │   ├── ...
│   │   └── schema2.ex
│   ├── old.ex
│   ├── timer.ex
│   └── translator.ex
└── migrator.ex

Let’s break it down.

“Contexts” and Schemas

Migrator.New and Migrator.Old provide data access functions like you would find in a context module generated by Phoenix. For the old data, these functions only list records. For the new data, they only create, update, and truncate records. Beneath each of these modules are the Ecto schemas for each database table. Unlike in a working Phoenix application, these schemas read and write to each field as literally as possible. For example, here’s one of the schema modules for the new data:

defmodule Migrator.New.Group do
  use Ecto.Schema
  import Ecto.Changeset

  schema "groups" do
    field(:name, :string, default: "")

    field(:inserted_at, :utc_datetime_usec)
    field(:updated_at, :utc_datetime_usec)
  end

  def changeset(%__MODULE__{} = group, attrs) do
    group
    |> cast(attrs, [:id, :name, :inserted_at, :updated_at])
    |> validate_required([:name, :inserted_at, :updated_at])
  end
end

Notice there is no timestamps/1 call; instead, it manually specifies the inserted_at and updated_at fields. This may seem like a trivial change, but it eliminates normally helpful behavior on the part of Ecto that might overwrite the migrated data with automatically generated timestamps. In other places, I’ve also replaced calls to belongs_to and other associations with explicit field calls, like field(:group_id, :integer). This gives the application full control over the data entering the database.

Schema modules for the old data don’t require changeset functions, only schema and field calls.
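For example, a hypothetical old-data counterpart to the Group schema might look like this (the column names are assumptions based on Rails conventions):

defmodule Migrator.Old.Group do
  use Ecto.Schema

  # Read-only mirror of the old table; no changeset function needed.
  schema "groups" do
    field(:name, :string)

    field(:created_at, :naive_datetime)
    field(:updated_at, :naive_datetime)
  end
end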

Migrator

The top-level Migrator module contains the logic for moving data from old tables to new ones. Its main run function:

  1. Truncates data from the new tables, then
  2. Lists all of the data from an old table, and
  3. Creates or updates records in the new tables as appropriate.

Performing this logic at the Elixir level allows the migrator to make transformations that wouldn’t be easy in SQL alone. For example, the application can read the contents of a comment and determine whether it represents a user-generated comment or a system-generated audit message. This module is rich in case statements, with each branch ending in a call to Repo.insert or Repo.update.
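A simplified sketch of that flow, shown for groups only (the New, Old, and Translator helper names here are illustrative, not the real module's API):

defmodule Migrator do
  alias Migrator.{New, Old, Translator}

  def run do
    # 1. Start from a clean slate in the new database.
    New.truncate_all()

    # 2. List everything from an old table, and
    # 3. create the corresponding records in the new table.
    for group <- Old.list_groups() do
      {:ok, new_group} = New.create_group(translate_group(group))
      Translator.add_group(group.id, new_group.id)
    end

    :ok
  end

  # Field-by-field translation; the real logic branches on record contents.
  defp translate_group(group) do
    %{
      name: group.name,
      inserted_at: group.created_at,
      updated_at: group.updated_at
    }
  end
end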

Timer

Rather than use a third-party library like quantum, I decided to create a simple GenServer process to schedule migrations daily. On startup, this GenServer schedules a migration for midnight UTC the next day. Following each run, it schedules another one for 24 hours later. It isn’t perfect, but it does the job.
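Here's a minimal sketch of what that process might look like (the real module's scheduling details may differ):

defmodule Migrator.Timer do
  use GenServer

  @day_in_ms 24 * 60 * 60 * 1000

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  @impl true
  def init(_opts) do
    # Schedule the first run for midnight UTC tomorrow.
    Process.send_after(self(), :migrate, ms_until_midnight())
    {:ok, %{}}
  end

  @impl true
  def handle_info(:migrate, state) do
    Migrator.run()

    # Schedule the next run for 24 hours later.
    Process.send_after(self(), :migrate, @day_in_ms)
    {:noreply, state}
  end

  defp ms_until_midnight do
    now = DateTime.utc_now()
    tomorrow = Date.add(DateTime.to_date(now), 1)
    {:ok, midnight} = DateTime.new(tomorrow, ~T[00:00:00], "Etc/UTC")
    DateTime.diff(midnight, now, :millisecond)
  end
end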

Translator

The final module is Migrator.Translator, which deserves its own section.

Translating IDs

While implementing the migration logic, trouble struck almost immediately. Requests for a graphic or web design — the main record in the system — had integer IDs in the old system and binary IDs in the new system. By default, simply copying data couldn’t ensure that all of the records referencing requests with a foreign key (e.g. comments with a request_id column) would maintain their references.

In hindsight, it would have been perfectly reasonable to copy the old request IDs into a separate column in the new table. Later, a migration on the new database could change the foreign key relationships. But I wanted the migrator to just handle it, and to avoid changing the new schema for the sake of the migration. So instead, there’s a translator.

Migrator.Translator is a simple GenServer that manages an Erlang Term Storage (ETS) table. It has functions like this:

add_group(old_id, new_id)
get_group(old_id)
add_user(old_id, new_id)
get_user(old_id)

Hidden inside the module, the ETS table stores records like {{:group, 2}, 10} (denoting that the old table’s group 2 is the new table’s group 10). When the migrator encounters a foreign key, it runs it through the translator before saving. nil values naturally return nil. In the end, referential integrity between tables is preserved.
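A condensed sketch of how such a translator might look, shown for groups only (user functions would follow the same pattern):

defmodule Migrator.Translator do
  use GenServer

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts, name: __MODULE__)

  # Client API: one pair of functions per translated table.
  def add_group(old_id, new_id), do: GenServer.call(__MODULE__, {:add, {:group, old_id}, new_id})

  def get_group(nil), do: nil
  def get_group(old_id), do: GenServer.call(__MODULE__, {:get, {:group, old_id}})

  @impl true
  def init(_opts) do
    table = :ets.new(:translation_cache, [:named_table])
    {:ok, table}
  end

  @impl true
  def handle_call({:add, key, new_id}, _from, table) do
    :ets.insert(table, {key, new_id})
    {:reply, :ok, table}
  end

  def handle_call({:get, key}, _from, table) do
    case :ets.lookup(table, key) do
      [{^key, new_id}] -> {:reply, new_id, table}
      [] -> {:reply, nil, table}
    end
  end
end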

Asynchronicity

Whenever you use a GenServer like this, you have to consider the impact of its sequential processing of messages on the data. For example, will it be possible for the migrator to ask for an ID from the table before the GenServer has processed the message to store that ID? Will the processing of messages be a bottleneck for the rest of the application?

In this case, since the migration “run” occurs sequentially in a single process, we don’t need to worry too much about what the translator is doing. Even if the migration process were to asynchronously cast a large backlog of ID insertions to the translator, using call to retrieve IDs later on gives the translator time to catch up. Erlang guarantees message ordering between a given pair of processes, so the call message will definitely be handled after the earlier cast messages are finished.

What we should avoid, however, is using cast to insert ID pairings and then bypassing the GenServer altogether to retrieve data from the ETS table. While this is possible — with Erlang guaranteeing atomic reads and writes on the ETS table even with multiple processes reading — it could cause a situation where the migrator process reads data before the translator has written it.

The safest thing to do is use call everywhere. This does create a bottleneck: the migration process will only be able to create records as fast as the translator can insert and retrieve IDs from the ETS table. However, this creates a healthy back-pressure in the system. For a migration running daily, it doesn’t need to be incredibly fast. (In reality, working with the database will probably be the slowest part of the routine.) Using call guarantees that the migrator won’t build up a large backlog of messages for the translator (a backlog that could eventually cause a call to time out), and that’s an acceptable trade-off here.

Execution

Getting started with the migrator was easy: create some schemas that match the exact structure of existing database tables. Throw some OTP at the problem with a few GenServers[2], and you're well on your way.

At nearly 600 lines of code, the migration logic was slightly more involved than anticipated for an application of the queue’s size and complexity. Performing the migration daily gave me the opportunity to refine the logic over an extended period of time. As folks continued to use the old version of the app in new and interesting ways, small tweaks to the migrator gradually improved the new experience.


Persisting ID Translations

As mentioned above, the migration application used an ETS table to maintain mappings from old to new IDs. The original goal was to retain referential integrity within the context of a single migration: comments of request #1 should have request #1’s new ID in their request_id column on the other side. However, as the migration moved closer to the final rounds of testing, a new issue arose: how do we deal with files attached to requests?

In the process of completing a request, members of the creative services group may upload numerous drafts as attachments. Previously, these files were stored on disk where the application was hosted. The new version uses S3-compatible object storage. Regardless of how the files were migrated (see Migrating Files below), there was a problem: the randomly generated binary request IDs were changing every day. Request IDs become part of a file’s path, so after 24 hours, all of the files would need to be moved. This wasn’t going to work.

Now, in addition to maintaining referential integrity within a single migration, the migrator needed to provide referential stability across runs. This required two changes.

Stability from the Migration Logic

The first issue to address was the migrator code itself. Because IDs were assumed to be unstable between runs, every new record was inserted with a nil ID (allowing the database to sequence or generate an ID). Even if the translator already had an ID saved for that record from a previous run, the migrator would generate a new one and overwrite the cache.

Resolving this took a few steps:

  1. First, ensure every schema’s changeset function includes :id in the list of fields given to Ecto.Changeset.cast/4. Otherwise, the preset ID would be ignored.
  2. Check for an existing ID translation before each record insertion. Using nil as the default value works for new records.

It might also be a good practice (though not strictly necessary) to avoid re-inserting the same ID mapping in the translator.
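Concretely, the insertion step from the earlier sketch might change to something like this (again with illustrative helper names):

# Reuse an ID from a previous run when one exists; otherwise pass nil and
# let the database (or Ecto) generate a fresh one.
existing_id = Translator.get_group(group.id)

attrs = group |> translate_group() |> Map.put(:id, existing_id)
{:ok, new_group} = New.create_group(attrs)

# Only record the mapping the first time this group is migrated.
if is_nil(existing_id) do
  Translator.add_group(group.id, new_group.id)
end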

Resetting PostgreSQL Sequences

Now that the migrator no longer relies on the database to generate primary keys for most records, I decided to reset the ID-generating sequences (for tables using integers as primary keys) after each migration. This way, there’s no chance of the database attempting to generate an ID that collides with an existing record.

Thanks to the internet, I used the following (with the groups table as an example):

Migrator.New.Repo.query(
  "SELECT setval('groups_id_seq', COALESCE((SELECT MAX(id) + 1 FROM groups), 1), false);"
)

Tables with generated binary IDs have nothing to worry about.

Stability from Crashes

The ETS table storing the ID mappings was great, but it would be lost if the Erlang node (or even just the translator process) crashed for any reason. This included new deployments of the migrator code, in the absence of hot-code reloading.

My solution? Use DETS.

Disk-based Erlang Term Storage (DETS) looks a lot like ETS, except the underlying table is also saved to a file on disk. In my particular case, it was possible to save this file in such a way that it would persist between deployments of the migrator.

DETS looks deceptively simple to use:

# Replace...
table = :ets.new(:translation_cache, [:named_table])
# with...
{:ok, table} = :dets.open_file(:translation_cache, [{:file, '/path/to/translator.dets'}])

# Replace...
:ets.insert/2, :ets.lookup/2
# with...
:dets.insert/2, :dets.lookup/2

(The .dets extension is my own affectation.)

Of course, the details are a bit more involved, and anyone considering using DETS should read the documentation carefully. However, in this particular use-case, it worked.
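For example, a sketch of how the translator's lifecycle might change, assuming it traps exits so the file is closed cleanly on shutdown (the path is a placeholder, as above):

@impl true
def init(_opts) do
  # Trap exits so terminate/2 runs and the DETS file is closed on shutdown.
  Process.flag(:trap_exit, true)

  {:ok, table} =
    :dets.open_file(:translation_cache, [{:file, '/path/to/translator.dets'}])

  {:ok, table}
end

@impl true
def terminate(_reason, table) do
  # Flush any buffered writes and release the file.
  :dets.close(table)
end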

With the code now looking for existing IDs before inserting records, and the data persisted to disk between deployments (and crashes) of the migrator, IDs became stable. Any given file could be moved to the S3-compatible storage just once.


Migrating Files

While the migration application handled PostgreSQL data nicely, migrating attached files from disk to S3 storage was another task. As part of this migration, the files were placed into a new folder structure as well (based on request IDs rather than comment IDs).

I won’t include many details of this part in this post, because they aren’t particularly interesting and they are so specific to the needs of the queue. However, let me plug this:

Elixir’s Livebook is an amazing tool for facilitating routine tasks.

In a .livemd file, I now have detailed instructions and code snippets that walk me through the process of downloading CSV files with ID translations and structuring my filesystem properly. Then the executable code blocks read in the CSVs (thanks to nimble_csv), move the files appropriately, and prepare them for syncing to S3-like storage. There’s no need to rely on my memory between instances of moving files. Livebook makes the process as repeatable as possible.


Conclusion

Migrating data from the old version of the queue to the new version was a painful task, so I chose to do it more often. A small Ecto-based Elixir application handled the PostgreSQL data with referential integrity and stability. A Livebook file guided me through the migration of files from one storage to another. Beta testers saw up-to-date data every day, and up-to-date files with much less effort.

This project reinforces the idea that Elixir is a fantastic “glue” language. Whether it is coordinating runs of a purpose-built binary, managing external services or hardware, or migrating data between systems, Elixir can help. It has the right abstractions to both give you control and get out of your way.

As a result, migrating data doesn’t hurt anymore.


  1. Consider this a +1 to the sentiment that rewriting an application from scratch (a) sounds better than it is, (b) takes longer than expected, and (c) leaves existing users of the application dissatisfied while the rewrite occurs.

  2. Using a GenServer (or more generally, spawning a new process) is not the best solution for everything. But it is an awfully quick solution for many things.