Migrating Content from Sitecore to Umbraco CMS
We recently completed a project to transform a client's Content Management System (CMS) from the Sitecore platform to Umbraco.
My part was to figure out how to migrate the content using a systematic, repeatable methodology, and that's what we'll cover here to help you do the same.
The constraint? We did not have access to any of the customised source code, only the Sitecore-hosted databases.
Luckily, this was not an issue: the database is simple enough to query and contains all content and media (videos, images, and documents).
I settled into using the following high-level approach:
- Query the Sitecore database for all required content (and the properties related to that content) and store the results as consolidated objects for processing further down the line.
- Query the database for all media related to the content, for example references to images or documents in the HTML content, or specific properties on the content items referring to media, and save the files to disk for further processing.
- Upload all locally saved media into Umbraco using the provided APIs and store the newly created references (Umbraco auto-generates a file path, which must be used when the media is referenced in the content).
- Transform the content and upload it into Umbraco using the provided CMS APIs.
Querying Sitecore Content
I decided to use the Web database, as this is the published version of content in Sitecore. Master contains the working copies of this data and unpublished content.
Sitecore is fairly simple to query. Content Items are stored in a hierarchical tree structure in the Items table. Properties are stored in the Fields table and are referenced by Items. This is over-simplifying the schema a bit, as there are other aspects to the data like Templates and Inheritance.
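To make the Items/Fields relationship concrete, here is a minimal sketch of the consolidation step. The simplified schema below is an assumption for illustration: in a real Sitecore database, "Fields" is a view over the Shared/Unversioned/VersionedFields tables, and there are more columns than shown.

```python
import sqlite3  # stand-in for the real SQL Server connection

# Approximate shape of the Sitecore Web database: Items holds the
# content tree, Fields holds property values keyed by item ID.
CONTENT_QUERY = """
    SELECT i.ID, i.Name, i.TemplateID, i.ParentID,
           f.FieldID, f.Value
    FROM Items AS i
    LEFT JOIN Fields AS f ON f.ItemId = i.ID
    ORDER BY i.ID
"""

def load_content(conn) -> dict:
    """Consolidate item rows and their field values into one object
    per content item, keyed by item ID."""
    items = {}
    for item_id, name, template_id, parent_id, field_id, value in conn.execute(CONTENT_QUERY):
        item = items.setdefault(item_id, {
            "name": name,
            "template": template_id,
            "parent": parent_id,
            "fields": {},
        })
        if field_id is not None:  # LEFT JOIN: item may have no fields
            item["fields"][field_id] = value
    return items
```

The consolidated objects carry everything needed downstream: the tree position (parent), the template (which drives the transformation later), and all field values.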
At this stage it was also important to infer how pages behaved by observing the data itself, for example flags that changed the behaviour of a page. This was necessary mostly because we had no access to the custom code.
Recreating Media Locally
Media is stored in the Blobs table. It contains enough information to recreate the file and its content on disk.
With some effort in concatenating binary arrays (the data is stored across multiple rows), each file can be rebuilt in local storage. I will note that I did have some issues at first: the files would be created on disk, but some of them were unreadable. After some investigation, it turned out that I was only saving the first row for each object, so small files worked while larger files were truncated, meaning that documents like PDFs and Excel spreadsheets would fail to open.
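The multi-row concatenation can be sketched as below. The row shape (blob ID, chunk index, binary data) approximates the Blobs table; the exact column names are an assumption.

```python
from collections import defaultdict

def rebuild_blobs(rows) -> dict:
    """Reassemble media files from (blob_id, index, data) rows.
    Each blob is split across multiple rows; concatenating them in
    index order, rather than keeping only the first row, is what
    avoids the truncated-file bug."""
    chunks = defaultdict(list)
    for blob_id, index, data in rows:
        chunks[blob_id].append((index, data))
    return {
        blob_id: b"".join(data for _, data in sorted(parts))
        for blob_id, parts in chunks.items()
    }
```

Each resulting byte string can then be written straight to disk under the blob's file name.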
Accessing the Umbraco APIs
Before getting to the uploads, I will share some thoughts about the Umbraco solution.
I needed access to Umbraco's APIs to ensure that content items and media were created correctly and that any constraints were met, but I didn't want to pollute the new codebase with a migration process intended to be used only once.
For simplicity's sake, I chose to create a new project that referenced all the Umbraco APIs but was essentially headless: an Umbraco API without the web part. This turned out to be more difficult to achieve than first thought. I settled on creating a whole new Umbraco web solution and pointing the database connection strings and file storage references at the development version of the site (for production, it was simply a case of reconfiguring these settings). I then exposed a REST API with two methods: one for media creation and one for content creation. I chose not to overcomplicate the solution with any security, as it was only intended to be run locally on the go-live date.
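From the migration tool's side, calling that local API is plain HTTP. A minimal client sketch, where the base URL, endpoint paths, and field names are all illustrative assumptions rather than the project's actual contract:

```python
import json
from urllib import request

API_BASE = "http://localhost:5000"  # hypothetical local migration API

def media_payload(unique_ref: str, name: str, content_b64: str) -> bytes:
    """Build the JSON body for the media-creation endpoint.
    The unique reference enables upserts on re-runs."""
    return json.dumps({
        "reference": unique_ref,
        "name": name,
        "content": content_b64,  # file bytes, base64-encoded
    }).encode("utf-8")

def post_media(unique_ref: str, name: str, content_b64: str) -> None:
    """POST one media item to the headless Umbraco API."""
    req = request.Request(
        f"{API_BASE}/api/media",
        data=media_payload(unique_ref, name, content_b64),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    request.urlopen(req)  # no auth: the API only runs locally
```

A matching `post_content` function against `/api/content` would complete the pair.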
Note: I am sure that there are better ways to access the Umbraco API, but it was not a requirement for the project.
Uploading Media into Umbraco
Some pre-processing was required before uploading the media. Although each media item has a unique reference in Sitecore, there is no uniqueness constraint on the name attached to it, so I ensured unique names for each item before posting them into Umbraco. The API would then pass back a URL for each item, which was stored away for future processing. Although not strictly necessary, I built the API to allow for upserts of data: overwrite an item if it already exists, based on the unique reference associated with it. This was done entirely to assist the development process (there is little more annoying than having to manually delete hundreds of items when a bug is discovered).
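The name de-duplication itself is simple. A sketch, with the suffix scheme an illustrative choice (in practice you would insert the suffix before the file extension):

```python
def uniquify_names(media_items):
    """Give every media item a unique name before upload by appending
    a numeric suffix to duplicates. Sitecore guarantees unique item
    IDs, not unique names, so collisions are possible."""
    seen = {}
    result = []
    for ref, name in media_items:
        count = seen.get(name, 0)
        seen[name] = count + 1
        unique = name if count == 0 else f"{name}-{count}"
        result.append((ref, unique))
    return result
```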
Uploading Content into Umbraco
Fortunately, the canonical URL of a Sitecore item is very similar to that of an Umbraco item. With some de-duplication, I could pre-generate the mapping from Sitecore canonicals to Umbraco canonicals before uploading a single content item. This is again an over-simplification, because the hierarchy was remapped in Umbraco, but it was a well-defined mapping. The upshot is that all the HTML content could be pre-processed to link correctly to other content before saving.
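A sketch of that pre-generated mapping, where the lowercase/hyphen normalisation is an assumption about the target URL scheme rather than the project's actual rules:

```python
def build_url_map(sitecore_paths):
    """Pre-generate the Sitecore-to-Umbraco canonical URL map,
    de-duplicating paths that would collide after normalisation."""
    mapping = {}
    taken = set()
    for path in sitecore_paths:
        candidate = path.lower().replace(" ", "-")
        unique, suffix = candidate, 1
        while unique in taken:  # de-duplicate colliding canonicals
            unique = f"{candidate}-{suffix}"
            suffix += 1
        taken.add(unique)
        mapping[path] = unique
    return mapping
```

Because the whole map exists before any upload, every internal link in the HTML can be rewritten in one pass.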
The media references were also fairly simple to modify, because we had already stored away the URLs for the uploaded media items.
A quick note on HTML parsing. There are many ways to skin the proverbial cat, but I thought I would try the HtmlAgilityPack library to parse the HTML. In hindsight, I would have been better off with XML/XSLT transforms or, better still, Regular Expressions. But this is a decision best based on what you are comfortable with.
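As an illustration of the rewriting step (not the HtmlAgilityPack code used in the project), a minimal regex-based sketch:

```python
import re

# Match href="..." and src="..." attributes in the stored HTML.
HREF_OR_SRC = re.compile(r'\b(href|src)="([^"]+)"')

def rewrite_links(html: str, url_map: dict) -> str:
    """Rewrite internal links and media references using the
    pre-generated old-to-new URL map. URLs not in the map
    (e.g. external links) are left untouched."""
    def replace(match):
        attr, url = match.groups()
        return f'{attr}="{url_map.get(url, url)}"'
    return HREF_OR_SRC.sub(replace, html)
```

The same map drives both content links and media references, since both were collected before the content upload began.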
For this specific project, there were four distinct content item templates, so I chose a factory/strategy pattern to simplify the transformations for each. This is one part of the solution that I feel cannot be genericised, due to the complexity of different requirements for different clients.
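The pattern can be sketched as a registry of per-template strategies. The template name "NewsArticle" and the field it reads are hypothetical examples, not the project's actual templates:

```python
TRANSFORMERS = {}

def transformer(template_name):
    """Decorator that registers a transformation strategy for one
    content template."""
    def register(func):
        TRANSFORMERS[template_name] = func
        return func
    return register

@transformer("NewsArticle")  # hypothetical template
def transform_news(item):
    # Fall back to the item name if the headline field is missing.
    return {"title": item["fields"].get("Headline", item["name"])}

def transform(item):
    """Factory: pick the registered strategy for the item's template."""
    try:
        return TRANSFORMERS[item["template"]](item)
    except KeyError:
        raise ValueError(f"No transformer for template {item['template']!r}")
```

Adding a template then means adding one decorated function, with no change to the surrounding pipeline.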
Finally, it was simply a case of calling the API to save each individual content item.
Some of the content was intended not to be publicly accessible, so extra code needed to be written to map the security requirements where required.
A large majority of the migrated content referred to arrays of other content (categories, types, etc.). This was easily catered for in the transformations and by extending the API to handle this behaviour when creating the items.
During development, many mistakes, bugs and missing or partial features had to be handled, and reading or writing large volumes of data can take a long time. To optimise this, I broke the migration into three repeatable steps: querying Sitecore, saving media and saving content. The output of each step was stored locally in JSON documents, with the state of each operation alongside. That meant I could break the development process into discrete, testable units.
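A minimal sketch of that checkpointing, assuming each step is a function that receives the outputs of earlier steps and returns JSON-serialisable data:

```python
import json
from pathlib import Path

def run_steps(steps, state_dir: Path):
    """Run named migration steps in order, persisting each step's
    output as a local JSON document. On a re-run, completed steps
    are loaded from disk instead of re-executed, so each step can be
    developed and tested in isolation."""
    state_dir.mkdir(parents=True, exist_ok=True)
    outputs = {}
    for name, step in steps:
        state_file = state_dir / f"{name}.json"
        if state_file.exists():  # already completed on a previous run
            outputs[name] = json.loads(state_file.read_text())
            continue
        outputs[name] = step(outputs)
        state_file.write_text(json.dumps(outputs[name]))
    return outputs
```

Deleting a step's JSON file forces just that step to re-run, which is exactly what you want when a bug is found mid-pipeline.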
For the production run, we decided to host the database locally (the Umbraco solution uses Azure cloud infrastructure, with blob storage for the media). The content was created in the local database and the media was stored directly on the production blob store. The database was then uploaded back into Azure after sanity checks were completed.
The question of volume is important. How long will the migration take when going live? In this case it was a small set of data: 500 content items and 1.6 gigabytes of media. The process took just under two hours to complete (four hours including preparation, testing and uploads). A large portion of this time was spent uploading media.
The extraction of Sitecore can be mostly genericised for re-use, and the Umbraco API was already mostly genericised for re-use. The business domain logic of mapping from one to the other will most likely have to change on a per-project basis.
Re-indexing of content in Umbraco had to take place after the migration was complete, and this has its own performance issues. Potentially, one could build the import API directly into the solution, but that would require extra thought around security and maintainability.
The solution could potentially be adapted to allow for live streaming of data from Sitecore to Umbraco in situations where the two need to run concurrently.
Content Delivery Network (CDN) integration was not considered for the migration but would be easy to include.