Migrate and Populate the Content
An oft-overlooked step in a website project is actually getting the content into the CMS. This might be an automated or manual project, depending on where the content is now, and how it might need to change.
Does this Apply?
Unless you’re creating a brand new site with all new content, or throwing away all your old content and starting over, then you will be migrating content. Your specific project will dictate how much is automated or manual, but know that this workload will exist.
It’s time for a new house, you decide.
You spend a bunch of time looking at home plans and talking to home builders. You pick out finishes, buy an empty lot, and engage with an architect. You spend months visiting the construction site keeping careful track of how it’s progressing.
You realize that your new house is going to drive other changes in your life. You register your kids in a new school. You decide to buy a more fuel efficient car since your new house is 15 miles further from the old one. You buy a new file cabinet to store all the paperwork, from builder invoices to instructions for the new oven.
The house nears completion. You call your utilities and arrange for transfer of service. You file change of address cards with all your different subscriptions. You tell Trevor, your landlord, that you’re moving out at the end of the month. You even plan a house-warming party for 50 of your friends.
The end of the month comes, and the house is done. The builder gives you the keys. You sign the mortgage papers. It’s all yours, congratulations.
You visit your new house. You walk inside...and sit on the floor. There’s no furniture. You go to make dinner, but there are no dishes in the sink.
Your phone rings. It’s Trevor, your old landlord.
Hey, when are you planning to come get all your stuff?
[record scratch sound effect]
And this is the story of content migrations. In our excitement and rush to build something new, we sometimes forget that we have a ton of stuff that we’ll need to move when this is all done.
In our story above, our new homeowner might have thought:
Oh, I’ll just move that stuff this weekend.
But anyone who has ever moved house knows how ridiculous this is. Unless you’re right out of college with nothing to your name, moving to a new house can be an epic experience. It always takes so much longer than you think. You need five trucks, ten friends, and fifteen pizzas to make it happen.
In your head, you envision just dragging some couches out and throwing them in a truck. But in reality, you have cupboards full of dishes, closets full of clothes, and did you forget all that stuff you shoved in the attic above the garage? Try moving out of a house you’ve lived in for a decade or two and you’ll realize it’s a logistical nightmare that takes weeks or months of planning, and multiple long stretches of work spanning weeks or months.
In this chapter, we’re going to explain how to avoid this.
“Migration” can be a very dangerous word.
Some people will say “site migration” to mean the entire process of moving from one website to another. To them, the entire thing is a migration, from when you start talking about it, to when the new site goes live. 1 Planning, designing, development, and moving content are all "migration."
To other people, “migration” is moving content from one platform (your old CMS) to a new platform (your new CMS). The migration is one part of a much larger project.
When someone says “migration,” do they mean the entire project, or just moving the content? Be relentless in baselining the usage of this term, because mistakes can be disastrous.
The Three Questions of a Content Migration
When planning a content migration, there are three different questions you need to answer. Sadly, these aren’t yes or no answers – these are more like three complicated problems you need to have solutions for.
The Editorial Question: What content is moving, and how does it need to change on the way over?
The Functional Question: How will functional or logical aspects of the content work in the new CMS?
The Procedural Question: How will the actual bytes move from one disk to another, and what does the timing of that look like?
We’ll discuss each question below, but understand that some of this retreads over subjects we’ve covered in the past. You will likely have done some of the work already, and that’s a great thing. Now you’re just going to reconsider that work in the context of your content migration.
The Editorial Question
We need to know what we’re going to move. We need a list of everything that’s going over, so we can move it and check it off our list.
Thankfully, this is the easiest question for us to answer by this point in the process, because a lot of this work will have been done already.
If you’re moving from your apartment to a new house, the scope of what you need to move is pretty simple – you need to get everything out of the apartment. Where that stuff goes is up to you, of course. Most of it will make it to the new house, but some of it will likely go to the trash dump. Still, it’s easy to know when you have everything accounted for, because the apartment needs to be empty. A simple visual sweep can tell you if you got everything or not.
The same isn’t quite true of websites. You need a list of everything that needs to move, because you can very easily forget to move something, or not plan adequately for it, then turn off the old website and that content simply goes away, insofar as your website visitors are concerned.
By this point in the process, you should have done a content inventory that encompasses the scope of content that’s going to be on the new website. You also have a site map that defines content organization for the new site, including any new content to be written.
Creating new content is an operational/editorial concern, and not really part of a migration. At some point, your content team should be able to get into the new CMS, and start creating new content that doesn’t already exist in some form.
But for content that already exists, we need to define whether each page is a one-off, or part of a highly structured group:
Structured Groups of Content Objects: All news releases. All technical documentation. All forum posts. These are large groups of similar content that can be grouped together in a “bucket” that all have to go over to the new website. You do not need to list every news article, so long as everyone can agree on what a news article is, and that they’re all migrating.
Make It Easier On Yourself
It’s worth saying at this moment that you often don’t need to move all your content. Now might be a handy time to get rid of stuff.
Remember the apartment analogy above where some of the stuff in your apartment went to the trash dump? Same deal with digital content. Here’s a statement that you should repeat to yourself five times every single morning while you’re doing migration planning.
The easiest content to migrate is content that you throw away.
Seriously. We’re not even being flippant. Want to make your migration easier? Throw stuff away.
If you’ve performed a detailed content audit, you’ll have already reviewed this content from a few angles. Here are some factors you could take into account when deciding whether to keep or discard something:
Traffic: Check your analytics. How often has this content been consumed?
Centrality: How often is this content linked from other content? If it has 300 inbound links, that can be a problem. But if it’s not linked from anywhere, maybe there’s a reason?
Age: We’re not saying that old content is always bad, but consider how current or relevant the content is to your current content consumers. Are they likely to get any value out of it?
Keywords: Some organizations are much more concerned about SEO than others, but consider if any of your target keywords appear in the content. If the answer is no, then evaluate whether or not it’s serving a purpose for you.
By using a set of rules you might be able to identify a huge subgroup of irrelevant content in a larger group. Yes, all news articles are going over, but before you migrate them, can you perhaps cut down the “all” there? Delete what you can, then migrate the rest.
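As a rough illustration of what “a set of rules” might look like, here is a minimal Python sketch that applies the four factors above to a content inventory. The field names, the thresholds, and the decision to discard only when every signal agrees are all assumptions made for the example, not recommendations; your own audit will dictate the real rules.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class InventoryItem:
    # Hypothetical inventory fields; a real audit spreadsheet will differ.
    url: str
    pageviews_last_year: int
    inbound_links: int
    last_updated: date
    has_target_keyword: bool


def should_discard(item, today, min_views=50, max_age_days=5 * 365):
    """Discard only when traffic, centrality, age, and keywords all agree."""
    low_traffic = item.pageviews_last_year < min_views
    orphaned = item.inbound_links == 0
    old = (today - item.last_updated).days > max_age_days
    return low_traffic and orphaned and old and not item.has_target_keyword


today = date(2024, 1, 1)
inventory = [
    InventoryItem("/news/2009-picnic", 3, 0, date(2009, 6, 1), False),
    InventoryItem("/products/widget", 12000, 300, date(2023, 5, 1), True),
    InventoryItem("/news/old-but-linked", 10, 300, date(2010, 1, 1), False),
]
keep = [item.url for item in inventory if not should_discard(item, today)]
print(keep)
```

A deliberately conservative rule like this one produces a list of candidates for a human to review, rather than a final verdict.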
The goal of The Editorial Question is to identify a list of all the content or groups of content that are migrating. Once this is done, then we move onto the next question, where we figure out how all your content is going to work in the new CMS.
The Functional Question
There are certain features and common page models that are relatively standard, regardless of the CMS.
For instance, news releases display a title, publish date, and a body. You can likely count on this working reasonably well in most any CMS. The simple display of structured content is a well-handled pattern.
However, some content-based functionality can be more in-depth, and unique from CMS to CMS. For example, what about the page that lists all of those news releases? Let’s consider all the functionality that might exist here.
News releases are listed in reverse chronological order by date published
Certain news releases are hidden unless the user is logged in with the correct user permissions
News releases can be filtered or categorized
The user can perform a full-text search on all news releases, and this search includes functionality like type-ahead suggestions, stemming, fuzzy matching, and scoring
The user can subscribe to a particular category to get email updates when a news release is added
And so on.
This is essentially a software application, embedded as a subset of the larger website. Looking over that list, are you sure your new CMS will be able to do all of that?
If not, you’re going to need to make some decisions.
What of those requirements are you willing to abandon?
Do you have an integration partner that can find creative solutions to work around limitations in your new CMS?
Is your new CMS the right choice? I’m not saying you should bail out at the smallest problem, but if you keep evaluating functionality and your new CMS keeps coming up short, then you might need to take a beat and think through this some more.
And understand that this is just one page on your website. A typical website has a dozen or more of these content-based applications scattered around. You need to account for all of them and make sure there are plans to make this all work.
Remember: functionality has to migrate, too. And even if that functionality is migrated, there’s a very real chance your new CMS handles it completely differently, and it cannot be migrated so much as it has to be reconstructed.
Navigation is a good example. Many systems (especially web-focused systems) use a tree of content to organize their pages, as we discussed in the last chapter. In systems that use such an aggregation, this tends to become the most prominent feature of the CMS. Editors begin to think of their content in terms of parent-child relationships, and it often overlays directly on navigation, meaning that “branches” of the tree become sections of content.
If you’re moving from a content tree-based system to another content-tree based system – for instance, from Sitecore to Episerver – then you’re in pretty good shape. Both systems handle content in their trees pretty much the same way, and developers who implement those systems tend to rely on the trees in the same way to form navigation. A lot of patterns will simply overlay from one system to another.
But what if you’re moving to a system that doesn’t use a tree? Say, Drupal? Or one of the headless systems, which have mostly eschewed content trees altogether? If your entire navigation logic depends on parent-child relationships, then you’re going to have to find a new way to program this.
Finding all this programming is not easy. Some of it will be reflected in source code, and the developers who implemented the system can point this out. But other functionality will simply be inherent to the CMS, and might be so ingrained that you can inadvertently extrapolate that functionality onto all CMSs, assuming your new one is going to handle it in the same way.
Sometimes, the best way to catalog this functionality is to simply walk through your current website with your implementation team. Look at every unique page, and make sure they have an answer for every piece of logic and functionality that goes beyond the simple display of a single content object.
Coming out of this process, you should have a list of all the content that’s moving (from The Editorial Question above), and a list of all functionality that’s going to need to be replicated from one system to another.
The Procedural Question
By this point we know what content is migrating, and we’re aware of all the content functionality that’s going to have to be reconstructed (or reconsidered). A lot of that functionality was identified as development tasks, and is being incorporated into our new website.
Now we have to figure out exactly how we’re going to migrate this. Our migration is basically a separate but related project to the main website development. Our plan for this subproject is a roll-up of a number of different problems that need solutions:
Extraction: How are we going to get content out of our old system?
Transformation: How does the content need to be cleaned up?
Import: How are we going to get the content into the new system?
Timing: What does the order and spacing of this process look like?
Let’s look at each one individually.
Automated or Manual?
First, you need to make a pretty crucial decision: are you going to try to automate this process at all?
Make no mistake: you can do a migration manually. There’s nothing stopping you from buying pizza for all the interns and stuffing them in a conference room for a week to copy and paste content from your old CMS to your new one.
In fact, for some projects, this is exactly the right decision. Here’s when this makes sense:
Low Volume: If you’re migrating fewer than 100 pages, then there’s little point in automating the process. You’ll spend more budget to configure the automation than you would save over doing it manually.
High Transformation: Some content needs to be transformed considerably on the way over. For example, if you’re taking a bunch of content out of PDF files and putting this in HTML, that’s probably something you’re going to have to do manually. 2
Low Cost Labor: I wasn’t kidding about buying pizza for the interns. This happens all the time. Universities are legendary for using work-study students for this.
Copy-and-paste is low-tech and tedious, but reliable and low-overhead. A manual migration takes no prep to get started, and sometimes you can brute force a migration faster than getting clever with it.
If you decide to do a manual migration, then very little of what’s below will apply to you. You will (1) extract, (2) transform, and (3) import all in one step.
Also, know that any automated migration project will in reality include some manual migration. There’s always content that is not automatically migratable.[^pages] These pages are usually made up of a mixture of content from a single object, design components that display other content, and aggregations that bring in more content. The complexity of these pages in relation to their relative scarcity means these will need to be reconstructed, rather than migrated.
There’s a number of different options for getting content out of a CMS. Some are well-supported and helpful. Others are less so.
Export: Some systems actually have an export system. However, it’s getting less common, and even if it exists, it’s important to know in what format the content will be exported, and if all the detail you need will be included. What structured text format will be used? Will you be able to pick through this and make sense of it? Will all the relationships be accounted for? If the output is missing one crucial thing, then it might be useless.
API: Almost all systems have an API that you can manipulate from scripts. It’s not hard to use this to export content into some neutral format that you design. This generally gives you the most granular way to control exactly what gets exported and in what format it ends up.
Database: If a system doesn’t offer either of the above options, you might be able to go around the CMS entirely and manually search its database. Most systems store content in SQL-compatible databases, and in many situations, a developer can hook directly into it and retrieve content. The danger here is that the content in the database might not be exactly representative of what makes it to a browser.
HTTP: The final option would be to simply make an HTTP call to the URL where the content is located. If you can get a list of URLs where the content is the operative content object (see the last chapter), then you can very easily request it just like a browser would, then parse the HTML to break it up into the attributes you need. There are some drawbacks here – non-public content can be difficult, as can non-displayed attributes – but this is occasionally your last resort.
The goal of extraction is to get the content out into some neutral form where it can be easily accessed and manipulated. Usually this is a text markup format like JSON or XML, but occasionally you might deposit it directly into a simple database or other storage mechanism. The key is to get the content somewhere where you can manipulate and retrieve it easily.
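To make the HTTP option concrete, here is a minimal Python sketch that parses a news release page into a neutral JSON record using only the standard library. The markup it expects (a title in an h1 tag, a publish date in a time tag, body text in p tags) is an assumption for the example, and the sample HTML stands in for a page you would normally fetch by URL; real templates will need their own parsing rules.

```python
import json
from html.parser import HTMLParser


class NewsReleaseParser(HTMLParser):
    """Collects title, publish date, and paragraphs from an assumed page layout."""

    def __init__(self):
        super().__init__()
        self.record = {"title": "", "published": None, "body": []}
        self._current = None  # which attribute the incoming text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self._current = "title"
        elif tag == "time":
            self.record["published"] = dict(attrs).get("datetime")
        elif tag == "p":
            self._current = "body"
            self.record["body"].append("")

    def handle_endtag(self, tag):
        if tag in ("h1", "p"):
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.record["title"] += data
        elif self._current == "body":
            self.record["body"][-1] += data


def extract(html_text, old_url):
    parser = NewsReleaseParser()
    parser.feed(html_text)
    parser.close()
    record = parser.record
    record["old_url"] = old_url  # keep the source URL for later redirects
    return record


sample = """<html><body><h1>Acme Opens New Office</h1>
<time datetime="2019-03-04">March 4, 2019</time>
<p>Acme announced a new office today.</p><p>It opens in May.</p></body></html>"""
record = extract(sample, "/news/acme-opens-new-office")
print(json.dumps(record, indent=2))
```

Writing one JSON file per content object, as in the hypothetical project later in this chapter, keeps each record easy to inspect and re-export.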
Once you have the content out of the old CMS, it often needs to be transformed or adjusted before being imported into your new CMS. Occasionally the content was modeled poorly in the old CMS, and it would be inefficient to move it as-is.
For example, consider some scenarios of content in your old CMS.
The author’s name was represented in a single field. You want to be able to order authors by last name, so this is going to need to be split into first and last names.
The model for committee meetings had a set number of fields (say, six) for attendees. This becomes a problem when more than six people attended, so the plan is to break this out to a relational attribute in the new system. Each meeting attendee will need to form a separate content object, and be linked into the meeting object where they used to appear.
Comments were stored in a separate database table. This content needs to be extracted alongside the CMS content with a reference to the correct article content object from the old CMS.
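The first two scenarios can be sketched in a few lines of Python. This is only an illustration: the field names (attendee_1 through attendee_6, meeting_id) are hypothetical, and splitting a name on its last space is a simplifying assumption that will mishandle some real names, so expect to review the results.

```python
def split_author(full_name):
    """Split on the last run of whitespace; assumes 'First [Middle] Last' order."""
    parts = full_name.strip().rsplit(None, 1)
    if not parts:
        return {"first_name": "", "last_name": ""}
    if len(parts) == 1:
        return {"first_name": "", "last_name": parts[0]}
    return {"first_name": parts[0], "last_name": parts[1]}


def split_attendees(meeting):
    """Turn fixed attendee_1..attendee_6 fields into separate attendee records."""
    attendees = []
    for i in range(1, 7):
        name = (meeting.get(f"attendee_{i}") or "").strip()
        if name:
            attendees.append(
                {"type": "attendee", "name": name, "meeting_id": meeting["id"]}
            )
    # The meeting itself no longer carries the fixed attendee fields.
    slim = {k: v for k, v in meeting.items() if not k.startswith("attendee_")}
    return slim, attendees


meeting = {"id": 7, "title": "Budget", "attendee_1": "Ann Lee",
           "attendee_2": "", "attendee_3": "Raj Patel"}
slim_meeting, attendee_records = split_attendees(meeting)
print(split_author("Mary Ann Smith"))
print(slim_meeting, attendee_records)
```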
Remember, when you import to your new CMS, it has no idea where the content came from, which means the rules you used on your old site to dictate design mean nothing.
Many times, you need to “scrub” rich text. Your old CMS may have generated HTML from rich text editors – the body of an article, for instance – and this HTML is of varying quality. Sometimes it’s clean and can be imported without changes, but other times, you’ll need to write scripts to comb through this HTML and fix problems.
Embedded scripts and styles
Deprecated HTML tags
Invalid HTML, like improperly nested tags
Character encodings not compatible with the new CMS
So-called whitespace hacking, where editors inserted line breaks and non-breaking spaces in an attempt to manipulate whitespace
Empty paragraphs or list items
These operations can be tedious. If you’re lucky, the HTML is consistently bad, meaning you can fix these things at the global script level. Usually, the HTML is inconsistently bad, meaning you have to pick through it and fix them manually.
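For the “consistently bad” case, a global scrubbing script might look something like the Python sketch below. It handles a few of the problems listed above (embedded scripts and styles, whitespace hacking, and empty paragraphs or list items) with regular expressions. Regular expressions are a blunt tool for HTML and this is an assumption-laden sketch, not a complete sanitizer; invalid nesting and encoding problems generally call for a real HTML parser.

```python
import re


def scrub_html(html):
    # Remove embedded scripts and styles, including their contents.
    html = re.sub(r"<(script|style)\b[^>]*>.*?</\1\s*>", "", html,
                  flags=re.I | re.S)
    # Replace non-breaking spaces with ordinary spaces.
    html = html.replace("&nbsp;", " ").replace("\u00a0", " ")
    # Collapse runs of line breaks used for spacing into a single break.
    html = re.sub(r"(?:<br\s*/?>\s*){2,}", "<br>", html, flags=re.I)
    # Drop paragraphs and list items containing only whitespace or breaks.
    html = re.sub(r"<(p|li)\b[^>]*>(?:\s|<br\s*/?>)*</\1\s*>", "", html,
                  flags=re.I)
    return html.strip()


dirty = ("<p>Hello&nbsp;world</p><p>&nbsp;</p><script>alert(1)</script>"
         "<p>Line<br><br><br>two</p><ul><li></li><li>Item</li></ul>")
clean = scrub_html(dirty)
print(clean)
```

Because the function has no side effects, it can be re-run against the extracted files as often as needed while the rules are refined.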
What goes out must come back in again, and eventually you’re going to have to take that content you’ve gotten out of your old CMS and move it into your new CMS.
There are fewer options for this scenario. Most CMSs don’t have an “import content” function. 4
Usually, an import needs to be performed from code, by a developer, working against the new CMS’s API. It would seem fairly simple, but there are some nuances.
You will need to keep a “manifest” to tie content from your old site to its corresponding new content, so the import knows where to put things. This will say, for example, that Content ID #439 in the old system is now Content ID #562 in the new system. You can do this by temporarily adding an "Old Content ID" field to your content model, and storing the old identifier with the corresponding new object.
You will need to make your script “update aware,” so you can run it and re-run it over and over without creating new objects every time. We’ll discuss this more in the section on timing and iterations below.
Working with referential attributes – where an attribute on one content object points to another – can be tricky. If you’re importing 10,000 articles and each one is linked to one of 300 authors, then you need to make sure the authors exist first. You can’t link an article to something that isn’t there.
You will need to do a link resolution process, where you search all HTML for embedded links to other content and then point them at the correct content. Since embedded HTML is less strict than referential properties, you don’t need to worry about order of import so much, but you’ll likely need to delay this resolution until all the content is in, so you have all available link targets.
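The four nuances above can be seen working together in the following Python sketch. Everything here is hypothetical: FakeCMS is an in-memory stand-in for whatever API your new CMS exposes, and the /old/ID and /content/ID link patterns are invented for the example. The point is the shape of the script: a manifest mapping old IDs to new ones, an upsert that makes re-runs safe, authors imported before the articles that reference them, and link resolution saved for last.

```python
import re


class FakeCMS:
    """Hypothetical in-memory stand-in for the new CMS's API."""

    def __init__(self):
        self.objects = {}
        self._next_id = 500

    def create(self, fields):
        self._next_id += 1
        self.objects[self._next_id] = dict(fields)
        return self._next_id

    def update(self, new_id, fields):
        self.objects[new_id].update(fields)


def upsert(cms, manifest, record):
    """Create the object on the first run; update it on every later run."""
    fields = {k: v for k, v in record.items() if k != "id"}
    fields["old_content_id"] = record["id"]
    new_id = manifest.get(record["id"])
    if new_id is None:
        new_id = cms.create(fields)
        manifest[record["id"]] = new_id
    else:
        cms.update(new_id, fields)
    return new_id


def run_import(cms, manifest, authors, articles):
    # Pass 1: authors first, so articles have something to point at.
    for author in authors:
        upsert(cms, manifest, author)
    # Pass 2: articles, with author references translated via the manifest.
    for article in articles:
        upsert(cms, manifest, dict(article, author=manifest[article["author"]]))

    # Pass 3: resolve embedded links now that every target exists.
    def new_link(match):
        old_id = int(match.group(1))
        if old_id not in manifest:
            return match.group(0)  # leave unknown links for manual review
        return 'href="/content/%d"' % manifest[old_id]

    for article in articles:
        new_id = manifest[article["id"]]
        body = re.sub(r'href="/old/(\d+)"', new_link, cms.objects[new_id]["body"])
        cms.update(new_id, {"body": body})


authors = [{"id": 12, "name": "Ann Lee"}]
articles = [
    {"id": 439, "title": "First", "author": 12,
     "body": '<a href="/old/440">next</a>'},
    {"id": 440, "title": "Second", "author": 12, "body": "<p>Done</p>"},
]
cms, manifest = FakeCMS(), {}
run_import(cms, manifest, authors, articles)
run_import(cms, manifest, authors, articles)  # re-running creates nothing new
print(manifest)
```

In a real project the manifest would be persisted between runs (or derived from the “Old Content ID” field), since the script will be run many times over days or weeks.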
Timing and Iterations: A Hypothetical Migration Project
We’ve been talking about a “migration” like it’s a singular thing that happens at a point in time, but that’s not how it works. A migration is a process that occurs alongside your development project.
Many parts of a migration are iterative, meaning you’ll try them, realize you did something wrong, delete the results, fix your problem, and do it again. This happens on extraction, transformation, and import. These things go in cycles, continually being refined, closer and closer to an ideal.
Remember too that migration can start early. You should start inventorying as soon as you decide to move your website. You can start extraction shortly thereafter. And you can do a lot of transformation without knowing the final destination for the content. Don’t sit around and wait for the new website to be done. Start working on your migration as early as you can.
Here’s how the entire migration process might work for a large, hypothetical project with around 50,000 content objects (pages, components, relationships, etc.):
Your organization decides to move to a new CMS. Even before development has started – indeed, even before a new CMS is selected – the content team includes The Editorial Question in their content strategy. They start developing a list of what content is migrating and what isn’t.
The development team starts looking at ways to get content out of the old CMS. They determine they can do a rough export. Remember that the new CMS still hasn’t been selected, but the developers know they’re going to extract the content no matter what, and getting it out isn’t affected by what it’s going into.
They don’t know exactly what content they’ll need, so they just export every last bit of content they can find. This is an iterative process – they export, review, evaluate what they missed and what’s not ideal, then delete the result, tweak their script, and export again. This goes on for a couple weeks. The developers eventually get the content out into thousands of JSON files.
The developers get busy cleaning up the old HTML. It’s really old and messy. They still don’t know the CMS it’s going into, but they know it’s going to need to be fixed, regardless of destination.
Finally, a new CMS is selected.
In an epic working session on The Functional Question, the content strategy, design, and development teams figure out where content is going to go in the new system, and how it’s going to have to change to work under the new architecture. Much of this work begins in content strategy and information architecture, but this is when concept becomes action. These teams also create a plan for content that needs to be created or manually migrated.
The developers spend the next few weeks preparing the exported content in the JSON files. They write scripts to transform them, review the result, tweak their scripts, then transform again. This goes on until the JSON files are as perfectly clean and importable as possible.
The migration QA team comes up with a plan to import content in stages, then QA the result before moving on. The content is divided into groups and a schedule is put in place.
Import begins. As an example, the developers run a script to bring in 16,000 news articles. They take a quick look and realize something went wrong, so they delete them, fix their script, and re-import. QA starts, and this time the testers realize that none of the images imported. So, QA stops and scripts are tweaked. This happens a few more times – an import happens, a systemic problem is found, QA stops, the script is modified, and import happens again. Once an import is clean enough to keep, it runs one last time and manual editors are free to make adjustments to that content.
Up to this point, testers have just been recording issues, but not fixing them, in case everything needs to be re-imported. However, as QA continues, it appears that the last import is solid and they can commit it. A decision is made that an automated mass import will not be done again for the news articles, so testers are free to start correcting issues.
The developers will never run the news article import script again. They move on to the next group of content.
QA testing continues. Issues are fixed where possible, but occasionally there are issues that need further review. For example, one news article links to an older one that was discarded. Testers open tickets for these issues and assign them to the relevant staff to ensure the issues aren’t missed. The content strategy and editorial teams are busy analyzing problems and adapting the content to fit.
This cycle of import, re-import, commit, and QA continues until all content is in. This process might take weeks. During this time, the content team has been creating new content to fill in the gaps. The new website slowly starts to come together.
Once a content group is imported and the teams commit to that import, the content team acknowledges that this content is “frozen.” This means that it will not be re-imported, so if they want to change the content in the old system (which is still running the public website), they will need to wait on that change until after launch, or duplicate that change in the new system.
All QA completes, all new content is completed, and the new site is fully populated and ready to launch. From this moment, the new site gets “stale” over time. Remember, the old CMS with all the original content is running the public website. And the new CMS with all the imported content is just sitting there. So, either no content can change in either system, or any change to the public website (which would be on the old CMS) must be duplicated on the new website-in-waiting.
Thankfully, this period is short. The new website is cleared for launch. Once it launches, the other website is hidden from public access. It’s left online for a period of weeks in case the team realizes they missed something, or need to refer to old content or configuration.
The old CMS and website is eventually taken offline and archived.
Now, look back through the above narrative, and acknowledge one thing: we didn’t talk at all about building the new website. That was all migration. Throughout that narrative, it’s assumed there is another project and another group of developers actually building the new site.
Can you imagine if that team got all done building the new website, and then said:
Okay, now let’s figure out how to migrate all this. I don’t know, maybe we should start with a content inventory or something?
Do you see that dot disappearing over the horizon? That’s your launch date.
Users and URLs
There are two specific situations that come up in enough migrations to be worth discussing separately.
Clearly, you’ll have to retrain your editors and get them new accounts on your new CMS. But what if you have users who log into your website? You might have user accounts for your customers or the public that they use to access content.
Many times, these user accounts are stored in your CMS. If this is the case, these accounts are going to have to move to the new CMS.
Problem: you will likely not have access to the users’ passwords, by design.
Passwords are not normally stored in clear text. Rather, they’re hashed (one-way encrypted) so that you can’t ever view them in their original form. You’ll have nothing but a long string of random-looking text from which you can’t figure out what their original password was.
Again, this is a good security practice. But it means you can’t seamlessly create a new account for your users on your new CMS because you don’t know what their password was. Often, this means that moving all these user accounts will require your users to create new passwords, or even entirely new accounts. 5
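To see why the stored value is of no use for recreating a password, here is a small Python sketch of one common approach, a salted PBKDF2 hash from the standard library. That your old CMS uses this particular scheme is purely an assumption for illustration; systems vary widely in the algorithm and parameters they use.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=100_000):
    """Return (salt, digest); the digest cannot be reversed into the password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, iterations)
    return salt, digest


def verify(password, salt, digest, iterations=100_000):
    """The only possible check: hash the attempt again and compare."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                    salt, iterations)
    return hmac.compare_digest(candidate, digest)


salt, stored = hash_password("correct horse battery staple")
print(stored.hex())  # all the old CMS has on file
print(verify("correct horse battery staple", salt, stored))
print(verify("a wrong guess", salt, stored))
```

If the new CMS happens to support the same hashing scheme and parameters, the stored hashes can sometimes be carried over as-is, but that depends entirely on the two systems involved.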
If you determine that your users will require new accounts or password resets, you’re going to need to develop a communication plan around this. Not only do you need to contact them and clearly explain what’s necessary, but you need to convince them that this isn’t a phishing attempt and that you genuinely do need them to reset their passwords. 6
If your website has been online for any period of time, then you have published URLs to the world which have crept “outside the walls.” Search engines have indexed them, users have shared them on social media, and they might have bookmarked them.
You need to figure out how to get requests to old URLs to the new content.
Sure, in a perfect world, your URLs won’t change. But this isn’t common. Different CMSs have different ways of forming URLs. You can override this on some of them to mock up your old URL structure, but sometimes this creates more problems than it solves, and it can be an overbearing solution to a problem solved through other means.
The most straightforward way to manage redirection is to store the old URL with the new content. So every one of the 16,000 news articles you imported will have an attribute for “Old URL.” When a request comes into the new website and generates a 404 Not Found (since the old URL doesn’t exist anymore), you can do a lookup to figure out what they were looking for, then redirect them.
There’s a bit more to it than this, of course.
You need to be concerned with the format of the URL, since looking for an exact URL match might not find what you want, such as if text casing or querystring parameters have been changed.
There are performance concerns when your count of old URLs gets into six figures.
You’ll need to return a response that indicates this is a permanent change (an HTTP 301 Moved Permanently), so that search engines update their indices.
You need to consider what you do about requests for content you discarded. It’s easy to just send back a 404, 7 but should you do more? Should you try to get them to other relevant content? Or should you explain the content they wanted was archived and give them contact information if they have questions?
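A 404 handler that addresses these concerns might be sketched like this in Python. The normalization rules (lowercase the path, drop the querystring and any trailing slash), the choice of 410 Gone for discarded content, and the /archived-content page are all assumptions for the example. Using a dictionary keyed by normalized URL is one simple way to keep lookups fast as the count of old URLs grows.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a requested URL to a comparable key: lowercase path only."""
    path = urlsplit(url).path.lower().rstrip("/")
    return path or "/"


def resolve(requested_url, redirects, discarded):
    """Return (status, location) for a request that missed on the new site."""
    key = normalize(requested_url)
    if key in redirects:
        return 301, redirects[key]  # permanent move
    if key in discarded:
        return 410, "/archived-content"  # hypothetical explanation page
    return 404, None


# Built from the "Old URL" attribute stored with each imported object.
redirects = {normalize("/News/Acme-Opens"): "/newsroom/acme-opens"}
discarded = {normalize("/news/2009-picnic")}

print(resolve("/news/acme-opens/?utm_source=mail", redirects, discarded))
print(resolve("/news/2009-picnic", redirects, discarded))
print(resolve("/never-existed", redirects, discarded))
```

Whether dropping the querystring is safe depends on your old site; if it used querystring parameters to identify content, those parameters need to be part of the key.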
Many times, new sites have been launched without a URL redirection plan. They promptly fell out of every search engine, and 404 Not Found requests went through the roof.
Migrations are...unpleasant. No one wants to think about this stuff until it’s too late. Consequently, migrations are chronically overlooked, in terms of both budget and schedule.
A universal rule: plan more time or budget than you think you will need to migrate your content. You will absolutely use it.
Know too that migrations tend to be a little less planned and smooth than the development of the new website. Even if you run the most perfect development project in the world, your migration project might get a little rougher the closer you get to launch. Problems will be found, quick fixes will be hacked into place, and the entire thing will be looked at as disposable – you need to do “just enough” to get it done.
Most migrations come skidding across the finish line backwards, on their roof, and on fire. But the checkered flag tends to make all those problems seem insignificant.
Inputs and Outputs
The inputs and outputs are hard here, because this is not something that happens within the larger project, but rather alongside it. So there are many inputs and outputs, happening all throughout the main project.
At minimum, some of the tasks performed during content planning – a content inventory, audit, and site map – will inform the next steps.
The Big Picture
You need to start your migration early. Like, at the very, very beginning of all this – back when you started talking about goals and plans. You should have been inventorying content back then. And, as our extended narrative above explained, your migration runs alongside the main strategy and development project. At your very first meeting about this project, you should start talking about migration. No time is too early.
This is something that everyone will be involved with, except maybe designers (they might do a bit of QA, but that’s about it). You’ll need the full cooperation of the content, development, and management teams to pull a migration off.