PageCatcher Technology Demonstration

23 March 2001

What is it for?

This distribution is a technology preview for PageCatcher.

PageCatcher is an add-on utility for ReportLab's suite of enterprise reporting tools, and a versatile tool in its own right for batch manipulation of PDF files. The suite runs on all common computing platforms.

The free ReportLab core API lets you create PDF files directly using the Python scripting language; our commercial Report Markup Language product lets you specify printed documents in easy-to-understand XML and converts these to PDF. However, in both cases all of the graphics and visual elements must be constructed from basic building blocks.

Many documents require elements such as fixed form layouts, headers, footers, corporate logos, or other artwork, which are most cost-effectively created by artists or design specialists using visual tools. With Adobe Acrobat, they can use any tools they wish and convert the result to PDF. PageCatcher then lets these visual elements be seamlessly integrated into PDF reports.

In addition, many applications require batch or server-side modification of existing PDF documents - adding simple annotations, combining documents or printing 2-up or 4-up. These can all be scripted trivially with PageCatcher. There are many single-purpose programs to append, rearrange and extract text from PDFs; we believe that providing a basic API and a scripting interface will lead to the most versatile solution on the market.
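
To give a feel for what "4-up" printing involves, the geometry can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not PageCatcher's API: the function name four_up_layout and its parameters are made up for this example, and it assumes PDF-style coordinates (origin at bottom-left, units in points).

```python
def four_up_layout(sheet_w, sheet_h, page_w, page_h):
    """Return (scale, positions) for placing four pages in a 2x2 grid.

    Hypothetical helper for illustration only; not part of PageCatcher.
    Coordinates follow the PDF convention: origin at bottom-left, points.
    """
    # Each source page gets one quarter of the sheet.
    cell_w = sheet_w / 2.0
    cell_h = sheet_h / 2.0

    # One uniform scale factor so the page fits its cell without distortion.
    scale = min(cell_w / page_w, cell_h / page_h)

    # Centre the scaled page inside its cell.
    pad_x = (cell_w - page_w * scale) / 2.0
    pad_y = (cell_h - page_h * scale) / 2.0

    positions = []
    for row in (1, 0):          # top row first, so pages read in order
        for col in (0, 1):      # left to right
            positions.append((col * cell_w + pad_x, row * cell_h + pad_y))
    return scale, positions
```

For an A4 sheet and A4 source pages (595 x 842 points), this gives a scale of 0.5, with the first page placed at (0, 421) and the last at (297.5, 0).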

Running the demos

The demos are for Windows only and are packaged as a zip file. This creates a subdirectory called (you guessed it) 'pageCatcher' under the location where you unzip it; so you can safely unzip into C:

This distribution consists of:

PageCatcher has many command-line options; the first argument is a command. The 'Exec' command runs a Python script, and is the only one we'll be demonstrating in this release. We provide five scripts to demonstrate the versatility of the API. Try these commands from a DOS prompt:
pageCatcher.exe Exec example1_fillform.py
pageCatcher.exe Exec example2_reverse.py
pageCatcher.exe Exec example3_append.py
pageCatcher.exe Exec example4_fourpage.py
pageCatcher.exe Exec example5_background.py

Each results in a PDF file being written which begins with 'out'; look at these as well as the samples to get an idea of the capabilities. You can also use the batch file 'runall.bat' to run all five demos in one go.
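
If you also happen to have a standalone Python installed, the same five commands can be driven from a short script, much as runall.bat does. This is a minimal sketch, assuming pageCatcher.exe and the example scripts sit in the current directory; the helper names demo_commands and run_all are invented here and are not part of the distribution.

```python
import subprocess

# The five demo scripts named above.
DEMO_SCRIPTS = [
    "example1_fillform.py",
    "example2_reverse.py",
    "example3_append.py",
    "example4_fourpage.py",
    "example5_background.py",
]


def demo_commands(exe="pageCatcher.exe"):
    """Build one 'Exec' command line per demo script (hypothetical helper)."""
    return [[exe, "Exec", script] for script in DEMO_SCRIPTS]


def run_all(exe="pageCatcher.exe"):
    """Run each demo in turn, stopping at the first one that fails."""
    for cmd in demo_commands(exe):
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Calling run_all() from the 'pageCatcher' directory should then leave the same 'out' PDF files behind as running the commands by hand.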

How does it work?

There are two logical steps in using PageCatcher. First, pages must be extracted into a special data file format using the PageCatcher filter script mode. Second, the extracted pages may be imported by ReportLab programs as PDF Form XObjects. In many applications, extraction is a one-off design-time step, and the data files produced can then be included in new documents at very high speeds.

The commercial product will consist of a compiled Python module (similar to a Java class file) which can be used in 3 ways:

  1. as a command line application with many useful options
  2. as a library within Python scripts
  3. controlled by tags within RML documents

In this release, we are focussing on the scripting interface, which provides the greatest flexibility. To save you from installing languages and libraries, we have produced a single binary containing the ReportLab core library, the Python scripting language and PageCatcher itself. You can write your own scripts as well as look at the ones we provided, and for many people this EXE will be all they ever need. To write your own scripts, you will almost certainly want to refer to the first few chapters of the ReportLab User Guide, and to look at the documentation for the Python scripting language.

Status

PageCatcher is currently a 'technology demonstrator'. It is 'alpha test' software, and is intended to prove the underlying technology rather than be a complete, fully featured product. It is being released at this point because we need to expose it to a wide variety of PDF files, and to prove that it works.

To Do List

We realise that PageCatcher still needs work before it can be classed as "production software". The things that need to be done include:

Known Deficiencies

PageCatcher does not support PDF pages with stream content arrays compressed using the LZW compression method. (Unfortunately, this is used in British tax forms.) We are working to add this support.

PageCatcher does not support PDF files which have been encrypted.

Since the preprocessor step for PageCatcher parses the entire PDF file, parsing very large files may consume a great amount of computational resources even if only one page is extracted from the file.

There is a very subtle bug in pdfdoc.py (ReportLab release 1.03) that, very rarely, shows up as a key error when using PageCatcher. If you see a key error, try replacing pdfdoc with the latest experimental version of pdfdoc from the ReportLab SourceForge tree.

Workarounds

If you have a copy of Adobe's Distiller, you can use it to work around the majority of these problems. Use Distiller's printer emulation to "print to PDF"; the file created will be digestible by PageCatcher. (One known exception: PDF files that are encrypted with printing not permitted.)

Feedback

We need and welcome feedback to help make this into a great product! Email info@reportlab.com, or join our group of 200+ existing users by emailing reportlab-users-subscribe@egroups.com. Enjoy!