Searching through PDF files with Wagtail-Textract
Enable full-text search in PDF, Word, Excel and other uploaded Documents.
To enable full-text search in uploaded documents, we created Wagtail-Textract.
A customer of ours works with large numbers of PDF files, which need to be searchable. This means finding a Document not only by its title or tags, but also by words in the file contents.
Some Wagtail core team members had already provided hints for an elegant solution in a GitHub issue. We decided to use that approach in an add-on package that can easily be installed in a Wagtail site.
This solution uses Textract, a Python library for extracting text from files. Textract relies on many other libraries under the hood, some of which depend on operating system-level programs being installed, so deployment may require changes to the hosting environment.
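To give an idea of what the extraction step looks like, here is a minimal sketch. It is not the actual Wagtail-Textract code: the helper name `extract_text` and its `extractor` parameter are hypothetical, introduced only for illustration. The one real library call is `textract.process`, which takes a file path and returns the extracted text as bytes.

```python
from typing import Callable, Optional


def extract_text(path: str,
                 extractor: Optional[Callable[[str], bytes]] = None) -> str:
    """Return the text content of the file at `path`, or "" on failure.

    `extractor` is a hypothetical hook (not part of Textract) that makes
    the function usable without Textract installed, e.g. in tests.
    """
    if extractor is None:
        # Imported lazily so this module loads even without Textract.
        import textract
        extractor = textract.process  # returns extracted text as bytes

    try:
        raw = extractor(path)
    except Exception:
        # Extraction can fail for unsupported or unreadable files, or when
        # a system-level program is missing; treat that as "no text".
        return ""

    # Decode defensively: extracted bytes are not guaranteed to be clean UTF-8.
    return raw.decode("utf-8", errors="replace").strip()
```

The returned string could then be stored in a text field on a custom Document model and added to that model's search index, so the search backend matches on file contents as well as on title and tags. Returning an empty string on failure is a design choice in this sketch: a file that cannot be parsed stays findable by its title and tags instead of blocking the upload.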
Maturity and further development
The first (alpha) version was released on PyPI today. We intend to start using it in production in July 2018.
I'd like to thank everyone who has contributed so far. We welcome your comments, questions and pull requests!