Kees Hink
8 May 2018

Searching through PDF files with Wagtail-Textract

Enable full-text search in PDF, Word, Excel and other uploaded Documents.

To enable full-text search in uploaded documents, we created Wagtail-Textract.

Back story

A customer of ours works with large numbers of PDF files, which need to be searchable. This means finding a Document not only by its title or tags, but also by words in the file contents.

Approach

Some Wagtail core team members had already provided hints for an elegant solution in a GitHub issue. We decided to implement that approach as an add-on package that can easily be installed in a Wagtail site.

Textract

This solution uses Textract, a Python library for extracting text from files. Textract relies on many other libraries under the hood, some of which depend on operating-system-level programs being installed, so deployment may require changes to the hosting environment.
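To give an idea of what text extraction with Textract looks like, here is a minimal sketch. It is not the code of Wagtail-Textract itself: the helper name `extract_text`, its `extractor` parameter and the choice to return an empty string on failure are assumptions made for illustration. The only Textract call used is `textract.process`, which takes a file path and returns the extracted text as bytes.

```python
from typing import Callable, Optional


def extract_text(path: str,
                 extractor: Optional[Callable[[str], bytes]] = None) -> str:
    """Return the text content of the file at `path` as a string.

    `extractor` is a hypothetical hook for this sketch: any callable that
    takes a path and returns bytes. It defaults to `textract.process`.
    """
    if extractor is None:
        # Imported lazily, so this module still loads on a machine where
        # Textract (or one of its system dependencies) is not installed.
        import textract
        extractor = textract.process

    try:
        raw = extractor(path)
    except Exception:
        # Assumed policy for this sketch: a file that cannot be parsed
        # simply has no searchable text, instead of breaking the upload.
        return ""

    # Decode the bytes, replacing any undecodable sequences.
    return raw.decode("utf-8", errors="replace").strip()
```

In a Wagtail site, a string like this could be stored in a field on the Document and added to the search index, so that words from the file contents become findable.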

Maturity and further development

The first (alpha) version was released on PyPI today. We intend to start using it in production in July 2018.

I’d like to thank everyone who has contributed already. We welcome your comments, questions and pull requests!
