    Maarten Kling

We migrated 1.3 million items from Plone to Django

How to solve problems with timezones, broken data, incorrect data, missing persons, inconsistent data, etc. We outline our general approach.

Import script: 3759 lines of code

Exporting was easy. Importing needed a mapping file to combine HRS1 and HRS2 fields, which grew to 3759 lines of code, not optimized in any way of course because it is a one-time job. Lots of inline comments and try/except nested in try/except to prevent errors and fix broken data. That all aside, here is what has been done.

The total process

The whole process takes about 50 hours, cut into three segments: export, import and prep production. Export and import each take about 20 hours, prep production is done in ten. So we only need three days and no night work to complete the full task. Starting Friday at 9:00 sharp, export is done before 9:00 on Saturday and import is done before 9:00 on Sunday. As you see, no sleep is lost. It was absolutely unnecessary to put effort into optimizing export and import any further: it would only make sense if we could cut them in half, which we never thought would be possible, so we didn't even try.

For the import several optimizations were done, as the first runs would take up to 5 days. Back in the days, way before any corona, on a Friday evening at the bar at Four Digits we had a look at the queries running during the import:

SELECT * FROM pg_stat_activity

It quickly became clear that adding just a few indexes to our database would significantly cut the import time. The reason was that before every object was created, we did several get checks in the database to ensure data integrity.
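
On the Django side this boils down to indexing the columns used in those get checks. A minimal sketch of the idea (model and field names are made up for illustration, not our actual models):

from django.db import models

class Profile(models.Model):
    # The importer checks whether an object already exists before creating it;
    # unique=True and db_index=True turn those lookups into index scans.
    uid = models.CharField(max_length=36, unique=True)   # unique already creates an index
    email = models.EmailField(blank=True, db_index=True)

Followed by python manage.py makemigrations and migrate, of course.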

The prep production phase is copying data to the production environment and doing manual checks and changes. Probably something we could have optimized even more, but there was no need to rush, as go-live was planned for Monday 9:00.

Exporting from Plone

Export is done using https://github.com/collective/collective.jsonify, which is easy to use and fully documented: https://collectivejsonify.readthedocs.io/en/latest/#using-the-exporter.

Only two paths were skipped, as they were irrelevant for the import: membrane_tool and portal_postcode, both Plone portal tools.

return export_content_orig(
    self,
    basedir='/tmp',  # absolute path to directory for the JSON export
    skip_callback=lambda item: False,  # optional callback. Returns True to skip an item.  # noqa
    extra_skip_classname=[],  # optional list of classnames to skip
    extra_skip_paths=['membrane_tool', 'portal_postcode'],
    # batch_start=0,
    # batch_size=5000,
    # batch_previous_path='/absolute/path/to/last/exported/item'
)

One other thing we did was add a lot of ComputedFields to the current codebase. This made it easier to map version 1 data to version 2, as each ComputedField is exported as its own field in the JSON.

ComputedField(
    name='coordinatoren_uid',
    widget=ComputedField._properties['widget'](
        visible={'view': 'invisible', 'edit': 'invisible'},
        label='coordinatoren_uid',
    ),
    mode='rw',
    expression="context._getMijnCoordinatorUID()",
    accessor="context._getMijnCoordinatorUID()",
    edit_accessor="context._getMijnCoordinatorUID()",
),
Humanitas hooray

Lots of data, time consuming

1324 directories

root@serverx:/tmp/content_portaal-nl_2020-04-17-13-09-35# ls
0     1029  1060  1092  1123  1155  1187  1218  125   1281  1312  152  184  215  247  279  31   341  373  404  436  468  5    530  562  594  625  657  689  72   751  783  814  846  878  909  940  972
1     103   1061  1093  1124  1156  1188  1219  1250  1282  1313  153  185  216  248  28   310  342  374  405  437  469  50   531  563  595  626  658  69   720  752  784  815  847  879  91   941  973
10    1030  1062  1094  1125  1157  1189  122   1251  1283  1314  154  186  217  249  280  311  343  375  406  438  47   500  532  564  596  627  659  690  721  753  785  816  848  88   910  942  974
100   1031  1063  1095  1126  1158  119   1220  1252  1284  1315  155  187  218  25   281  312  344  376  407  439  470  501  533  565  597  628  66   691  722  754  786  817  849  880  911  943  975
1000  1032  1064  1096  1127  1159  1190  1221  1253  1285  1316  156  188  219  250  282  313  345  377  408  44   471  502  534  566  598  629  660  692  723  755  787  818  85   881  912  944  976
...
1027  1059  1090  1121  1153  1185  1216  1248  128   1310  150   182  213  245  277  308  34   371  402  434  466  498  529  560  592  623  655  687  718  75   781  812  844  876  907  939  970
1028  106   1091  1122  1154  1186  1217  1249  1280  1311  151   183  214  246  278  309  340  372  403  435  467  499  53   561  593  624  656  688  719  750  782  813  845  877  908  94   971

1000 files each

root@serverx:/tmp/content_portaal-nl_2020-04-17-13-09-35/29# ls
29000.json  29056.json  29112.json  29168.json  29224.json  29280.json  29336.json  29392.json  29448.json  29504.json  29560.json  29616.json  29672.json  29728.json  29784.json  29840.json  29896.json  29952.json
29001.json  29057.json  29113.json  29169.json  29225.json  29281.json  29337.json  29393.json  29449.json  29505.json  29561.json  29617.json  29673.json  29729.json  29785.json  29841.json  29897.json  29953.json
29002.json  29058.json  29114.json  29170.json  29226.json  29282.json  29338.json  29394.json  29450.json  29506.json  29562.json  29618.json  29674.json  29730.json  29786.json  29842.json  29898.json  29954.json
29003.json  29059.json  29115.json  29171.json  29227.json  29283.json  29339.json  29395.json  29451.json  29507.json  29563.json  29619.json  29675.json  29731.json  29787.json  29843.json  29899.json  29955.json
29004.json  29060.json  29116.json  29172.json  29228.json  29284.json  29340.json  29396.json  29452.json  29508.json  29564.json  29620.json  29676.json  29732.json  29788.json  29844.json  29900.json  29956.json
...    
29054.json  29110.json  29166.json  29222.json  29278.json  29334.json  29390.json  29446.json  29502.json  29558.json  29614.json  29670.json  29726.json  29782.json  29838.json  29894.json  29950.json
29055.json  29111.json  29167.json  29223.json  29279.json  29335.json  29391.json  29447.json  29503.json  29559.json  29615.json  29671.json  29727.json  29783.json  29839.json  29895.json  29951.json

As there were 1.3 million items, each in its own JSON file, we created an index file to store basic information about type and path, so we could run the importer for a single content type or path. This way we didn't have to wait for the computer to open and close 1.3 million files when only 20.000 were needed.

def create_index_file(self, path, index, filepaths, paths, limit):
    for dirpath, dirnames, filenames in os.walk(path):
        logger.info("{}/{}".format(index, dirpath.split("/")[-1]))
        index += 1
        for filename in filenames:
            with open("{}/{}".format(dirpath, filename), "r") as json_file:
                try:
                    data = json.load(json_file)
                except Exception:  # broken or unreadable JSON
                    logger.info(f"cannot load {json_file}")
                else:
                    if data:
                        paths[data["_path"]] = {
                            "_type": data["_type"],
                            "_uid": data["_uid"],
                            "_filename": "{}/{}".format(
                                dirpath.split("/")[-1], filename
                            ),
                        }
                        if data["_type"] in filepaths.keys():
                            filepaths[data["_type"]].append(
                                "{}/{}".format(dirpath.split("/")[-1], filename)
                            )
                        else:
                            filepaths[data["_type"]] = [
                                "{}/{}".format(dirpath.split("/")[-1], filename)
                            ]
        if index > limit:
            break
    with open(path + "/listfile.txt", "w") as filehandle:
        filehandle.write(str(filepaths))
    with open(path + "/pathfile.txt", "w") as filehandle:
        filehandle.write(str(paths))
        return paths

We now have a listfile.txt and pathfile.txt based on the export information. They take about 40 minutes to create and only 2 minutes to load into memory when starting a new importer. This way we could export once, create the index files and then retry importing, with only a 2-minute setback when something was wrong.
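
Since the index files are plain str(dict) dumps, loading them back is trivial. A minimal sketch (load_index is a hypothetical helper, not the actual importer code):

import ast

def load_index(path):
    # listfile.txt and pathfile.txt are written with str(dict) above,
    # so ast.literal_eval turns them back into dictionaries
    with open(path + "/pathfile.txt", "r") as filehandle:
        paths = ast.literal_eval(filehandle.read())
    with open(path + "/listfile.txt", "r") as filehandle:
        filepaths = ast.literal_eval(filehandle.read())
    return paths, filepaths

# e.g. only open the ~20.000 files of one content type instead of all 1.3 million:
# paths, filepaths = load_index("/tmp/content")
# for filename in filepaths["Persoon"]: ...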

Also, we completely split up all content so everything could run separately. A dump was made after each step so going back and forth was absolutely possible.

[19/Apr/2020 00:53:22] INFO [website.management.commands.import_from_json:3409] Count: 551749
[19/Apr/2020 00:53:22] INFO [website.management.commands.import_from_json:3411] Dumping Note
[19/Apr/2020 00:55:04] INFO [website.management.commands.import_from_json:3421] Done tmp/Aantekening_Note.pgsql.gz
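
One way to make such a per-step dump, as a sketch only (assuming plain pg_dump output piped through gzip; the table name website_note is made up):

import gzip
import subprocess

# stream a single-table pg_dump into a compressed .pgsql.gz file
with gzip.open("tmp/Aantekening_Note.pgsql.gz", "wb") as out:
    proc = subprocess.Popen(
        ["pg_dump", "--table=website_note", "hrs2-import"],
        stdout=subprocess.PIPE,
    )
    for chunk in iter(lambda: proc.stdout.read(64 * 1024), b""):
        out.write(chunk)
    proc.wait()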

Importing JSON to Django

The importing process needed a mapping between the created JSON files and the new HRS2 database. As HRS1 was in Dutch and HRS2 is written in English, we get something like: zip_code = afdeling_postcode.

A small part of the complete mapping:

"Persoon": [
    {
        "class": Profile,
        "field_mapping": {
            "_uid": "_uid",
            "first_name": "persoon_voornaam",
            "last_name_prefix": "persoon_tussenvoegsel",
            "last_name": "persoon_achternaam",
            "email": "email",
            "address_street": "persoon_straat",
            "address_number": "persoon_huisnummer",
            "address_number_ext": "persoon_huisnummer_toevoeging",
            "zip_code": "persoon_postcode",
            "city": "persoon_plaats",
        },
    },
],
"Afdeling": [
    {
        "class": Area,
        "field_mapping": {
            "_uid": "_uid",
            "_original_id": "id",
            "title": "title",
            "name": "afdeling_naam",
            "description": "description",
            "code": "afdelingscode",
            "address_number_ext": "afdeling_huisnummer_toevoeging",
            "zip_code": "afdeling_postcode",
            "city": "afdeling_plaats",
            "type": "Department",
            "address_street": "afdeling_postadres",
            "address_number": "afdeling_huisnummer",
            "cost_heading": "kostenplaats",
            "provinces": "afdeling_provincie",
            "telephone": "afdeling_telefoonnummer",
            "email": "afdeling_email",
            "sync_intranet_and_website": "sync_to_ki",
        },
    }
],
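
To give an idea of how such an entry is consumed, here is a stripped-down sketch (import_item is a hypothetical helper; the real importer also runs per-field handlers like the date and user fixes shown below):

def import_item(data, mapping):
    # data is one exported JSON dict, mapping one entry of the table above,
    # e.g. the Profile entry for "Persoon"
    obj = mapping["class"]()
    for field, json_key in mapping["field_mapping"].items():
        value = data.get(json_key)
        if value not in (None, "", "None"):
            setattr(obj, field, value)
    obj.save()
    return obj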

Import was cut into six steps, as some content needed to be created first, then linked, then filled with more data, and so on.

Timezones

First my personal favourite, timezones: time, dates, datetime, now and today. Everything was incorrect in Plone.

def to_timezone(self, obj):
    import pytz
    from pytz import timezone

    # Make a timezone aware datetime object from a datetime object.
    return timezone("Europe/Amsterdam").localize(obj).astimezone(pytz.UTC)

def from_zope_date(self, obj, data, field, value):
    from dateutil import parser  # dateutil parses the Zope DateTime strings
    if data[value] and not data[value] == "None":
        setattr(
            obj, field, self.to_timezone(parser.parse(data[value], ignoretz=True))
        )
    return obj

def created(self, obj, data, field, value):
    return self.from_zope_date(obj, data, field, value)
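
For illustration, this is what the conversion does to one (made-up) naive Plone timestamp:

from dateutil import parser
import pytz

naive = parser.parse("2020-04-17 09:00:00", ignoretz=True)  # naive local time from the export
aware = pytz.timezone("Europe/Amsterdam").localize(naive).astimezone(pytz.UTC)
# datetime.datetime(2020, 4, 17, 7, 0, tzinfo=<UTC>) -- CEST is UTC+2 in April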

An activity has a start and end date. Even with all protections in place, there were for example 12 projects with a start date AFTER the end date. Same for Links, Volunteers and Participants.

def end(self, obj, data, field, value):
    end_date = data[value]
    start_date = data["startdatum"]
    if end_date < start_date:
        uid = data["_uid"]
        logger.info(
            f"{obj} {uid} has end: {end_date} before start: {start_date}"
        )
        data["einddatum"] = start_date
    return self.from_zope_date(obj, data, field, value)

Missing users

Content was created by a user, but the user was deleted (mostly LDAP users). This means there was no way to link that content back to its original owner in Django, as the foreign key could not point to a nonexistent user.

def get_user(self, obj, data, field, value):
    for username in data.get(value) or []:  # guard against a missing key
        try:
            user = get_user_model().objects.get(username=username)
        except get_user_model().DoesNotExist:
            # be quiet for now, as we are missing tons of users mainly ldap users

            # we do a retry based on creator in same project.
            creator = data.get("creator", None)
            if creator:
                creator = creator.split(" ")[0]
                project_code = data["_path"].split("/")[3]
                project = Project.objects.get(code=project_code)
                pv = ProjectVolunteer.objects.filter(
                    project=project,
                    volunteer__profile__first_name=creator,
                )
                if pv.count() == 1:
                    pv = pv.first()
                    user = pv.volunteer.profile.user
                    setattr(obj, field, user)
            pass
        else:
            setattr(obj, field, user)
    return obj

Usernames and Passwords

Passwords were only missing a bcrypt$ prefix in front of the existing bcrypt hashes, since Django prefixes every stored password with the name of the hasher that created it. After prepending the prefix, they all kept working fine.

# imports needed by this fragment
from django import forms
from django.core.validators import validate_email

try:
    validate_email(data["email"])
    email = data["email"]
except forms.ValidationError:
    email = ""  # empty when garbage
if "password" in data.keys() and data["password"]:
    password = "bcrypt$" + data["password"]
else:
    password = "Super-Secret"

How to get the project done

This was not a project done in a few days. It took over 15 months of development, more than 6000 commits and 1660 pull requests.

Humanitas hooray

In total 16 contributors worked on it, and not only during regular work hours and days, as you can see in this image. There is no way to describe the effort our team made during this project. The frontend is kickass: when you add a field in the backend, you don't have to worry about the layout. That way backend development could completely focus on the task at hand.

Humanitas hooray

We have commits at 22:00 in the evening and on Saturdays and Sundays, mainly because people were excited they could fix something and take the project to another level. The importer ran every day, and if it crashed late in the evening, a fix would be made so we could continue the next morning, and so on.

Humanitas hooray

We programmed 72% Python, 23% HTML, 2.5% JS and 2.5% CSS.

Starting on Friday

There were several steps to take before we could start the export: stop the current website, change DNS to show http://hrs.humanitas.nl/hoera/, and so on. After the GO from the team to stop HRS1, we could start copying data to our export/import setup. This data copy took about 4 hours, as the total size was 164GB.

Export took about 20 hours. Not bad for exporting 1.324.441 items from Plone. This created the same number of JSON files on disk, with a total size of 110GB. As this process took about 24 hours in total, we didn't have to set any alarms for the next day.

Importing on Saturday

Around 9:00 on Saturday morning the export was done and we could start an easy day in the process: just follow the script.

ssh serverX
cd /home/hrs/
rm -Rf media/*
rm -Rf private/*
rm -Rf tmp/*
sudo su postgres
dropdb hrs2-import
createdb hrs2-import
exit
git pull
python manage.py migrate
python manage.py assign_permissions
screen  
python manage.py import_from_json /tmp/content

Terminal 2:

tail -f /home/hrs/tmp/logfile.txt

And again, wait 18 hours. This process creates all the necessary content in Django. Everything was tested beforehand, and we cleaned up garbage along the way.

Working on Sunday

Time to move data again, from the export/import environment to the new production location. We had a script in place to follow. The data had shrunk to an 800MB database (29GB in Plone) and around 80GB of files. In this step we checked, rechecked and made sure everything was working. After a few hours (5), we were done. The pipelines were pushed to production and everything was in place.

Humanitas hooray

Live on Monday

Big day: we removed the IP restriction from hrs.humanitas.nl and the login page was shown to everyone who tried to log in. We celebrated using video conferencing, due to corona. Sad, as we had made big plans in February. The party will follow another time.

Everything is up and running. We are happy, and it's time for a small break. Then, in about two weeks, we start on HRS2.1, including many, many newly requested features. Yep, software is never done. Let's hope we can wait another 10+ years for the next migration.

Thanks to my team, Dennis from Dezzign, Joost from Goed Idee Media, many others and of course team Humanitas for testing, feedback, project support and giving us the power and opportunity to create HRS2!

:wave:

