Converting Plone data to Django

By Roel Bruggink | 18-10-2016
Getting data out of a Plone ZODB and into something else, like PostgreSQL.

Use case

The old application was built on Plone. We re-created it in Django. The main objective of this operation was to create a lighter, more easily scalable application.

Statistics

We converted 6,468 Member objects, 22,104 Content objects, 97,433 custom ZODB objects, and 21,575 blobs. Conversion time was about 2 hours for non-blob data. Another 3 hours were needed for the blobs, mainly because of image processing.

Database storage requirements went down from a 2+ GB Data.fs to a 31 MB .sql file, or 125 MB in PostgreSQL.

Blob storage went down from 92 GB (including scaled images) to 27 GB (no scaled images). Not much difference here; we now have the option to remove scales older than three months using file system tools only.

Response times went down from +/- 1s to <400ms uncached. Blob response times are now <100ms (<50ms on our fibre connection).

Instead of 12 Zope instances we now run 4 Django processes on uWSGI.

TL;DR

  • ZODB.broken.Broken is awesome!
  • patch zope.interface to allow unpickling data with unknown interfaces
  • Password hashing
  • Moving blobs around
  • No. automatic. solution.

The meat of the matter

The ZODB ships with ZODB.broken. This allows us to read data out of a ZODB without needing the actual classes. If you have ever seen an error containing 'persistent broken [some.dotted.name]', that is because the class could not be imported (it was removed or moved, or importing it raised an exception).
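
The idea can be illustrated without ZODB at all. The sketch below is written for Python 3 and uses only the standard library's pickle module. Placeholder, TolerantUnpickler and gone_module are hypothetical names made up for this example, and this is not how ZODB.broken is implemented; it only shows a similar trick: when a class cannot be found, substitute a stand-in and keep the raw state.

```python
import io
import pickle
import sys
import types


class Placeholder(object):
    """Stand-in for a class that cannot be imported (hypothetical name)."""

    def __setstate__(self, state):
        # Keep the raw pickled state instead of applying it to the instance.
        self.__dict__['raw_state'] = state


class TolerantUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        try:
            return super().find_class(module, name)
        except (ImportError, AttributeError):
            # Build a placeholder class that remembers the dotted name.
            return type(name, (Placeholder,), {'__module__': module})


def pickle_then_remove_class():
    """Pickle an instance of a throwaway class, then make it unimportable."""
    module = types.ModuleType('gone_module')
    cls = type('Project', (object,), {'__module__': 'gone_module'})
    module.Project = cls
    sys.modules['gone_module'] = module
    obj = cls()
    obj.id = 'my-first-project'
    data = pickle.dumps(obj)
    del sys.modules['gone_module']
    return data


data = pickle_then_remove_class()
obj = TolerantUnpickler(io.BytesIO(data)).load()
project_id = obj.raw_state['id']
```

The placeholder plays the same role as the __Broken_state__ attribute used in the rest of this post: the data stays readable even though the original class is gone.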

We need to patch zope.interface for a bit, so it doesn't crash while trying to unpickle Interface classes that do not exist.

import zope.interface.declarations  # noqa

def _normalizeargs(sequence, output=None):
    """
    Normalize declaration arguments.

    Normalization arguments might contain Declarations, tuples, or single
    interfaces.

    Anything but individual interfaces or implements specs will be expanded.
    """
    if output is None:
        output = []

    cls = sequence.__class__
    if zope.interface.declarations.InterfaceClass in cls.__mro__ or \
            zope.interface.declarations.Implements in cls.__mro__:
        output.append(sequence)
    elif type(sequence) is not type:  # -- THIS IS THE PATCH. It prevents TypeError.
        for v in sequence:
            _normalizeargs(v, output)
    return output

zope.interface.declarations._normalizeargs = _normalizeargs
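
To see why the extra elif matters, here is a stand-alone sketch of the same recursion, with strings standing in for interfaces. The names flatten and Unknown are made up for this example, and it assumes that an interface which cannot be imported ends up as a plain class in the sequence; iterating over a plain class is what raises the TypeError.

```python
def flatten(item, output=None):
    """Recursively collect string leaves from nested sequences."""
    if output is None:
        output = []
    if isinstance(item, str):
        output.append(item)
    elif type(item) is not type:  # same guard as in the patch above
        for value in item:
            flatten(value, output)
    return output


class Unknown(object):
    """Plays the role of a class substituted for a missing interface."""


# The plain class is skipped instead of being iterated over.
result = flatten(['IFoo', ('IBar', Unknown), ['IBaz']])

# Without the guard, the loop would try this and fail:
try:
    iter(Unknown)
    iterating_a_class_fails = False
except TypeError:
    iterating_a_class_fails = True
```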

We're completely ignoring the classes from Plone, CMF and Zope2, so we do not install those packages. We're using our FileStorage and BlobStorage directly here. You could use any of the other available storages, e.g. ClientStorage or RelStorage.

Even though we're not actually modifying the objects, it still is a bad idea to connect to a production database. Just saying.

DBROOT = '...' # Set this to the directory (with trailing slash) where your Data.fs and blobs are stored on disk.

import ZODB
import ZODB.FileStorage

storage = ZODB.FileStorage.FileStorage(DBROOT + 'Data.fs', blob_dir=DBROOT + 'blobstorage/')
db = ZODB.DB(storage)
connection = db.open()

root = connection.root() # ZODB root object.
app = root['Application']  # Zope 2 app object
site = app.__Broken_state__['Plone']  # Plone object

folder = site.__Broken_state__['project']
# <persistent broken client.contenttypes.content.projectfolder.Projectfolder instance '\x00\x00\x00\x00\x00\x005u'>
children = folder.__Broken_state__['_tree']  # ProjectFolder is a BTree based folder, which stores its children in the _tree attribute.

# Grab the first child from that tree
zope_project = next(iter(children.values()))
# <persistent broken client.contenttypes.content.project.Project instance '\x00\x00\x00\x00\x02\x1c\xc5\xc3'>

zope_project.__Broken_state__['id']
# 'my-first-project'

Now we can make new Django objects

from myproject.models import Project

# Convert the Zope data to Django data, YMMV here.
target_state = {'title': zope_project.__Broken_state__['title'], }

# I'm assuming you created Django models which use varchar for the id field.
# By passing the expected state as 'defaults', we can run this script multiple times and update the target DB.
django_project, created = Project.objects.update_or_create(id=zope_project.__Broken_state__['id'], defaults=target_state)

Converting blobs

This converts a ZODB BlobImage.

from django.core.files.images import ImageFile
from myproject.models import ProjectImage

# z_image is a BlobImage object.
fname = z_image.__Broken_state__['image'].__Broken_state__['filename']
blob = z_image.__Broken_state__['image'].__Broken_state__['_blob']
with blob.open() as f:
    defaults = {'image': ImageFile(f, name=fname)}
    try:
        fname = str(fname)
    except UnicodeError:
        fname = z_image.__Broken_state__['id']
    image, created = ProjectImage.objects.update_or_create(project=django_project, ref_filename=fname, defaults=defaults)

Exporting blobs

This exports blobs to a folder on disk.

import os
import shutil

z_image_state = z_image.__Broken_state__

fname = z_image_state['id']
blob = z_image_state['image'].__Broken_state__['_blob']

target_dir = '/some/path/'
target_fname = os.path.join(target_dir, fname)

if not os.path.exists(target_dir):
    os.makedirs(target_dir)

with blob.open() as f:
    # Copies the actual blob (f.name) to our target location (target_fname)
    shutil.copyfile(f.name, target_fname)
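
If you would rather copy from the open file object than from its path, the same export can be sketched as below (Python 3, standard library only). The helper name export_stream is hypothetical, and the io.BytesIO object merely stands in for what blob.open() returns.

```python
import io
import os
import shutil
import tempfile


def export_stream(source, target_dir, fname):
    """Copy an open binary file object to target_dir/fname; return the path."""
    os.makedirs(target_dir, exist_ok=True)
    # basename() guards against ids that contain path separators.
    target_fname = os.path.join(target_dir, os.path.basename(fname))
    with open(target_fname, 'wb') as target:
        shutil.copyfileobj(source, target)
    return target_fname


# Usage with an in-memory stand-in for an opened blob:
demo_dir = tempfile.mkdtemp()
path = export_stream(io.BytesIO(b'fake image bytes'), demo_dir, 'my-image.png')
```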

Converting RelationValues

def rv_to_object(plone_site, rv):
    # This takes a Plone site object and a RelationValue.
    # It returns the target object. The target object may be Broken!

    _components = plone_site.__Broken_state__['_components']
    intids = _components.__Broken_state__['intids']
    refs = intids.__Broken_state__['refs']
    _ref = refs[rv.__Broken_state__['to_id']]
    return _ref.__Broken_state__['object']

Converting Zope DateTime objects

import datetime as dt

import pytz

def to_datetime(zope_dt):
    # Make a naive datetime.datetime from a Broken DateTime object.
    # fromtimestamp() expects seconds since the epoch and returns local time.
    micros, timezone_naive, tz = zope_dt.__Broken_state__
    return dt.datetime.fromtimestamp(micros)

def to_timezone(obj):
    # Make a timezone aware (UTC) datetime object from a naive datetime object.
    return pytz.timezone('Europe/Amsterdam').localize(obj).astimezone(pytz.UTC)

# For example, convert the 'creation_date' of a Member object to 'date_joined':
date_joined = to_timezone(to_datetime(z_state['creation_date']))
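
The two-step conversion above depends on the machine's local timezone matching 'Europe/Amsterdam'. A possible alternative, sketched here for Python 3 with only the standard library, is to ask for UTC directly. The function name state_to_utc is made up, and the sketch assumes the first element of the state tuple is seconds since the epoch, as its use with fromtimestamp() above suggests.

```python
import datetime as dt


def state_to_utc(state):
    """Turn a (seconds, timezone_naive, tz) tuple into an aware UTC datetime."""
    # Assumption: the first element is seconds since the epoch.
    seconds, _timezone_naive, _tz = state
    return dt.datetime.fromtimestamp(seconds, tz=dt.timezone.utc)


# A plain tuple stands in for zope_dt.__Broken_state__ here.
epoch = state_to_utc((0.0, False, 'UTC'))
next_day = state_to_utc((86400.0, False, 'UTC'))
```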

Converting Plone passwords

Plone's SHA1 password hashes aren't Django compatible: Plone hashes 'password' + 'salt', Django hashes 'salt' + 'password'.

Put this in hashers.py

import base64
import hashlib
from collections import OrderedDict

from django.contrib.auth.hashers import SHA1PasswordHasher, mask_hash
from django.utils.crypto import constant_time_compare
from django.utils.encoding import force_bytes
from django.utils.translation import ugettext_lazy as _


class PloneSHA1PasswordHasher(SHA1PasswordHasher):
    """
    The SHA1 password hashing algorithm used by Plone.

    Plone uses `password + salt`, Django has `salt + password`.
    """

    algorithm = "plonesha1"
    _prefix = '{SSHA}'

    def encode(self, password, salt):
        """Encode a plain text password into a plonesha1 style hash."""
        assert password is not None
        assert salt
        password = force_bytes(password)
        salt = force_bytes(salt)

        hashed = base64.b64encode(hashlib.sha1(password + salt).digest() + salt)
        return "%s$%s%s" % (self.algorithm, self._prefix, hashed)

    def verify(self, password, encoded):
        """Verify the given password against the encoded string."""
        algorithm, data = encoded.split('$', 1)
        assert algorithm == self.algorithm

        # throw away the prefix
        if data.startswith(self._prefix):
            data = data[len(self._prefix):]

        # extract salt from encoded data
        intermediate = base64.b64decode(data)
        salt = intermediate[20:].strip()

        password_encoded = self.encode(password, salt)
        return constant_time_compare(password_encoded, encoded)

    def safe_summary(self, encoded):
        algorithm, hash = encoded.split('$', 1)
        assert algorithm == self.algorithm
        return OrderedDict([
            (_('algorithm'), algorithm),
            (_('hash'), mask_hash(hash)),
        ])

and add to settings.py:

PASSWORD_HASHERS = [
    # ... insert the current set here
    'myproject.hashers.PloneSHA1PasswordHasher',
]

and set the Django member's password using:

from myproject.hashers import PloneSHA1PasswordHasher

PASSWORD_PREFIX = PloneSHA1PasswordHasher.algorithm + '$'
pwd = PASSWORD_PREFIX + z_member.__Broken_state__['password']
member.password = pwd
member.save()
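
To see the scheme without Django in the way, the following sketch reproduces it with only the standard library (Python 3, working on bytes). The helper names ssha_encode and ssha_verify are made up for this illustration, and the password and salt are arbitrary example values; it mirrors the hasher above rather than any Plone code.

```python
import base64
import hashlib
import hmac

PREFIX = b'{SSHA}'


def ssha_encode(password, salt):
    """Hash password + salt (in that order) and append the salt to the digest."""
    digest = hashlib.sha1(password + salt).digest()
    return PREFIX + base64.b64encode(digest + salt)


def ssha_verify(password, encoded):
    """Recover the salt from the stored value and compare the re-encoded hash."""
    raw = base64.b64decode(encoded[len(PREFIX):])
    salt = raw[20:]  # a SHA1 digest is 20 bytes, the rest is the salt
    return hmac.compare_digest(ssha_encode(password, salt), encoded)


stored = ssha_encode(b'secret', b'NaCl')

# The two orderings give different digests, hence the custom hasher.
plone_order = hashlib.sha1(b'secret' + b'NaCl').hexdigest()
django_order = hashlib.sha1(b'NaCl' + b'secret').hexdigest()
```

Because the salt travels inside the stored value, verification needs nothing but the stored string and the candidate password.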

Getting group members from acl_users

This returns a list of user ids from the 'Administrators' group. We used this for setting member.is_staff.

# site is a Plone site object.
adminusers = list(site.__Broken_state__['acl_users'].__Broken_state__['source_groups'].__Broken_state__['_group_principal_map']['Administrators'])
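
Long chains like the one above get hard to read. A small helper can walk them for you. This is a sketch: dig and FakeBroken are hypothetical names, and FakeBroken only imitates a Broken object so the example runs without a ZODB.

```python
class FakeBroken(object):
    """Stand-in for a Broken object: only carries a __Broken_state__ dict."""

    def __init__(self, **state):
        self.__dict__['__Broken_state__'] = state


def dig(obj, *keys):
    """Follow keys through nested objects, unwrapping __Broken_state__ when present."""
    for key in keys:
        # Plain mappings have no __Broken_state__, so they are used as-is.
        state = getattr(obj, '__Broken_state__', obj)
        obj = state[key]
    return obj


site = FakeBroken(acl_users=FakeBroken(source_groups=FakeBroken(
    _group_principal_map={'Administrators': ('alice', 'bob')})))

adminusers = list(dig(site, 'acl_users', 'source_groups',
                      '_group_principal_map', 'Administrators'))
```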

Getting an object's current review state

If you need an object's current workflow state:

# workflow_id is the id of the workflow you are interested in.
review_state = obj.__Broken_state__['workflow_history'].__Broken_state__['data'][workflow_id][-1]['review_state']
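
Objects that never went through a workflow may have no entries, in which case the lookup above raises a KeyError or IndexError. A defensive variant could look like this sketch, where current_review_state is a made-up helper, 'my_workflow' is a placeholder id, and a plain dict stands in for the 'data' mapping:

```python
def current_review_state(history_data, workflow_id, default=None):
    """Return the review_state of the last entry for workflow_id, or default."""
    entries = history_data.get(workflow_id) or ()
    if not entries:
        return default
    return entries[-1]['review_state']


history_data = {
    'my_workflow': [
        {'review_state': 'private'},
        {'review_state': 'published'},
    ],
}

state = current_review_state(history_data, 'my_workflow')
missing = current_review_state(history_data, 'other_workflow', default='unknown')
```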

Reading a folder's children in order

This loops over an ordered folder.

for key in z_folder.__Broken_state__['__annotations__']['plone.folder.ordered.order']:
    z_child = z_folder.__Broken_state__['_tree'][key]

Uniquely identifying a ZODB object

Use the Plone uuid:

uuid = obj.__Broken_state__['_plone.uuid']