Introduction
- What is Creodocs?
- Why Use LaTeX?
Bulk Document Creation
- Introduction
- The Template
- Populating Content
- Creating Documents
Complex Documents
- Introduction
- The Template
- Populating Content
- Creating Document
What is Creodocs?
Creodocs is a resource to promote the use of LaTeX to automate document production, primarily in a business setting. If you have a need to produce PDF documents programatically, consider reading through the resources here to see whether LaTeX may be a good fit.
Creodocs is operated by one person alongside LaTeXTypesetting.com and LaTeXTemplates.com as part of the same small business, Creodocs Limited. The resources provided here are free and will help you get started, but if you need support for any part of the process, the services offered at LaTeXTypesetting.com will be useful to create your document templates and write the automation code to populate them from your data sources.
Why Use LaTeX?
What Is LaTeX?
LaTeX is a document preparation system originally released in 1983 by Leslie Lamport. It is built on top of the TeX typesetting system, released in 1978 by Donald Knuth. LaTeX is widely used in academia and especially for mathematics, but is not commonly used outside these fields due to the relatively steep learning curve for those unfamiliar with text-based markup languages (such as HTML).
In LaTeX, a document design is defined using commands in plain text, instead of using a point-and-click graphical user interface such as Microsoft Word. This plain text code is then compiled (typeset) to produce the document in PDF format. While this approach is too cumbersome for most people to learn, it has immense benefits for the resulting documents and for those who are technically inclined enough to learn it.
Why Use LaTeX for Automated Document Production?
Typography The act of typesetting with LaTeX optimally arranges text on the page using best-practices typographic rules and algorithms. This influences such things as the spacing between words, hyphenation, how fonts are utilised and justification. Practically, resulting documents are more beautiful and feel nicer to read.
Powerful LaTeX has the ability to implement virtually any design through the use of a wide diversity of extensions (called packages). Combined with the ability to typeset mathematics, symbols and custom fonts with presets for any language, there are few document elements that can't be produced.
Programmatic LaTeX is programmatic in nature, which means variables, loops, if statements and switches can be readily utilised. This is particularly useful for Creodocs where small user inputs can be made to have big impacts on the document produced. For example, a single boolean (true/false) switch in the LaTeX code can have conditional impacts in many places in the document, such as whether tax should be calculated in an invoice, and whether the corresponding rows and columns for tax totals should be shown in the invoice table. Since LaTeX documents are plain text, templates are amenable to version control to easily track changes over time.
Free and Open Source LaTeX itself, and almost all LaTeX packages used to add functionality to documents, are free and open source, licensed under the LPPL (LaTeX Project Public License). This means there is no subscription to pay or forced upgrades over time to justify additional payments. Further, there is a large body of code examples, questions with answers and community spirit to LaTeX, opposed to the one-sided nature of a commercial product.
Longevity Documents produced with LaTeX are usually compilable years, or even decades, after they are written. Yearly releases of LaTeX rarely break backwards compatibility and packages tend to receive infrequent updates that don't often interfere with other packages. For extra security, the entirety of LaTeX is very amenable to running in a virtual machine or Docker container to ensure long term stability.
What is Bulk Document Creation?
Bulk document creation is the act of creating one to many documents at once with a very similar layout. An example may be an Internet Service Provider (ISP) producing PDF invoices at midnight each night for archiving on their systems and sending out to customers by email. A key feature of bulk documents is their rigid structure and content, with strictly defined areas where document content can change. Continuing with our ISP invoice example, the PDF invoices need to show each customer's account number, the date, their plan, usage information and total cost due, which all differ from invoice to invoice. However, company logos and information, along with legal disclaimers, are static and do not change from invoice to invoice.
Our goal with bulk document creation is to define a strict template for the document we want to produce, with fixed variables for dynamic content that we want to inject into the template. We then want to replace these fixed variables with document content from an external source (such as a database) and compile the template to produce a PDF. We may want to run this on demand just once, or we may want to produce thousands of documents in parallel.
Creating a Bulk Template
Continuing with our ISP invoice example, we will use the Minimal Invoice LaTeX template as the starting point in our bulk document creation workflow. Note that this template is not licensed for commercial use and cannot be used in this way within your company. We are using it here for example/development purposes only. For commercial use, you will need to create your own internal LaTeX template or have one created for you by LaTeX Typesetting.
Starting Template
The starting point for automation is a fully-functional LaTeX document which can be compiled independently and contains example content. This allows us to tweak the template for automation as needed and manually enter edge-case content to see how the template responds. Our starting template already has example content, but let's modify it to suit an ISP producing monthly invoices for customers. We end up with the following:
Download Pre-Automation Template Code
Adding Dynamic Variables
Our invoice template is currently composed of static example content. To be able to customize it for each customer, we need to specify the locations of dynamic content using automation hooks. Automation hooks are similar to variables in programming. They have unique names that correspond to a single piece of information (e.g. customer number), and they may be used in only one place or repeated in multiple places in a document. The syntax of the variables doesn't matter, as long as the format is something that wouldn't naturally occur in any document you create. For example, a dollar sign (e.g. $variable_name
) used in many programming languages is likely a bad idea, because the code of many LaTeX documents will feature dollar signs with text immediately after them so you may end up with a clash.
For our dynamic variables, we will use three pipe characters, followed by the variable name in all capitals, followed by another three pipe characters, e.g. |||VARNAME|||
. Pipe characters are not special symbols in LaTeX, unlike dollar signs which surround math environments, so are typeset neatly in the document without causing compilation problems. After replacing static example content with variables, we end up with the following:
Download Automated Template Code
The decision for what content should be static and/or handled by LaTeX vs. dynamic and populated automatically will differ for each document. In our case, six variables have been added for the account number, invoice number, customer name, customer address (2 lines) and invoice items. Notice that the invoice date and due date have been left to LaTeX to populate using the \today
command and automatic calculation of the due date 14 days from when the document is typeset. Notice also that the majority of the invoice has remained untouched because it is composed of company/payment information that we want displayed on all invoices.
The last thing to note before we move on to populating our variables is that variables can correspond to anything. For example, our |||ACCOUNTNUMBER|||
variable will always correspond to an account number with a fixed number of digits. This means we only need to ensure the places where this variable is used in the template can handle that length of content. You may notice, however, that we have replaced the two invoice table lines in the pre-automation template with a single variable |||INVOICEITEMS|||
. This means when we go to replace this variable, we will need to output the correct LaTeX syntax for however many lines the current invoice has.
Populating Bulk Document Content
In the previous section, we started with an existing LaTeX template and modified it to our business needs, then added dynamic variables to indicate where we want to automatically inject dynamic content. We now move on to replacing the dynamic variables with content from a database source.
Process
One of the major advantages of using LaTeX for document automation is that LaTeX documents are plain text. This means all we have to do to populate our automated template is iterate over all the documents we want to produce and replace the variables with real content. Thus, the steps are:
- Read in the automated LaTeX template file(s) into memory.
- Read in the database/source data for all the documents we want to produce.
- Iterate over all documents, and for each one replace each variable in the LaTeX code with the corresponding content.
- Output the populated LaTeX code for each document to a new file ready for typesetting.
Virtually every programming language is capable of reading and writing text files, replacing content in memory using regular expressions and iterating over a data structure of document data in a loop. Scripting languages are particularly easy to use and we will use Python for our examples because it is popular and accessible.
Document Data
Before we can populate the dynamic variables, we need example document data. This is a step where every business will widely differ in how the data for documents are stored and extracted. Typically, data for invoices would live in a database and be extracted in a standard format using a database query. Data storage and retrieval are outside the scope of this tutorial, so here we will use a simple tab-separated values (TSV) file where each line corresponds to a separate document and columns store the data for each variable. Here are the first 5 lines (documents) of the data file:
ACCOUNTNUMBER | INVOICENUMBER | CUSTOMERNAME | CUSTOMERADDR1 | CUSTOMERADDR2 | INVOICEITEMS |
---|---|---|---|---|---|
8109182 | 15235-735 | Mariah Wolf | 96 S. Brown Street | Aberdeen, SD 57401 | Unlimited 300 Mbps Internet,1,50 |
7870918 | 92965-417 | Jase Ibarra | 8834 Miller Ave. | Willoughby, OH 44094 | Unlimited 500 Mbps Internet,1,50 |
1423747 | 43555-576 | Madilynn Castaneda | 360 South Vernon Drive | Middleburg, FL 32068 | Unlimited 500 Mbps Internet,1,50 |
5165512 | 55112-712 | Peter Peck | 86 New Lane | New Lane Wadsworth, OH 44281 | Unlimited 1000 Mbps Internet,1,50;Retention Discount,1,-10;Late Payment Fee,1,20 |
1158708 | 72385-610 | Collin Gould | 23 Glen Creek St. | Euless, TX 76039 | Unlimited 300 Mbps Internet,1,50 |
Note that the last variable INVOICEITEMS
contains one or more lines of invoice items separated by semicolons, each with three values for the three invoice table columns separated by commas. This is an example of how you may want to compress data for multiple sequential outputs into a single dynamic variable.
Populate Template with Data
We now have our template and data ready for populating. Let's go through each step of the process outlined at the top of this page using real Python code. First, we read in the template and data like so:
# Import dependencies
import sys
# Store required script parameters
template_file_path = sys.argv[1]
db_file_path = sys.argv[2]
# Open the LaTeX template file
with open(template_file_path, 'r') as template_file:
# Read the template file into memory
template_content = template_file.read()
# Open the database file
with open(db_file_path , 'r') as db_file:
# Read the header line with variable names and split them into a list
variable_names = db_file.readline().rstrip('\n').split('\t')
Our script takes two parameters: the first for the path to the template file and the second the path to the database file. We first open and read the template file into the variable template_content
. Next, we open the database file and read the first line containing the variable names, and we split this line by a tab character into a Python list (variable_names
) because our database is tab-delimited. The internal counter for reading lines is now on the second line of the database, which is the first line of real document content. We can now proceed to iterate over all lines in the database, populating the template for each one as we go:
# Iterate each line of the database corresponding to a separate document
for db_line in db_file:
# Reset the current template content to the original template
current_template_content = template_content
# Split the current line of the database into the individual variables in a list
current_variable_content = db_line.rstrip('\n').split('\t')
# Iterate over each variable
for i in range(0, len(variable_names)):
# Process invoice item rows separately
if (variable_names[i] == 'INVOICEITEMS'):
# Individual invoice item rows are semicolon-delimited within the single INVOICEITEMS variable, so first split them into a list
invoice_items = current_variable_content[i].split(';')
# Clear the current variable content so we can recreate it below
current_variable_content[i] = ''
# Iterate over each invoice item and create the correct LaTeX syntax for the 3 elements
for x in range(0, len(invoice_items)):
# Split the current invoice item into the three comma-delimited columns in the table
current_invoice_item = invoice_items[x].split(',')
# Create the output LaTeX code for the invoice item using the custom LaTeX command in the template which takes three parameters
current_variable_content[i] += '\\invoiceitem{' + current_invoice_item[0] + '}{' + current_invoice_item[1] + '}{' + current_invoice_item[2] + '}{}\n'
# Replace all occurrences of the variable with its current content
current_template_content = current_template_content.replace('|||' + variable_names[i] + '|||', current_variable_content[i])
For each document, we first create a clean copy of the original template code in current_template_content
that we read in at the start of the script. We next remove the newline character from the current database line and split it by a tab character to end up with a list of each of the variable values for the current document. It's then a matter of iterating over each of the variables and replacing the variable in the template with its content in the database. You'll notice we need to handle the INVOICEITEMS
variable differently, because it can contain one or more semicolon-delimited lines of invoice items that are themselves comma-delimited. For invoice items, we first split by semicolons in case there are multiple lines, then iterate over each line and further split by commas. Each of the invoice table columns per line needs to be output as parameters in curly brackets in the \invoiceitem
command. After replacing all variables, we end up with current_template_content
containing the modified LaTeX code where variables have been replaced by real content for the current document. We then save this to a new file:
# Counter with the current document number
current_doc_num = 1
<previous block>
# Write the populated template out to a new file, where template.tex becomes template.<num>.tex using the current document number (row in database)
output_filename = template_file_path[:-4] + '.' + str(current_doc_num) + '.tex'
with open(output_filename, 'w') as populated_template_file:
populated_template_file.write(current_template_content)
# Iterate the document number
current_doc_num += 1
Here we name the output file by taking the filename of the original template file and adding the current row number from the database to it before the .tex
extension at the end. Including the line number is useful if we encounter compilation issues later, so we can go back to the database and quickly find the line to determine what in the content is causing the issue.
Putting all this code together, we have:
# Import dependencies
import sys
# Store required script parameters
template_file_path = sys.argv[1]
db_file_path = sys.argv[2]
# Open the LaTeX template file
with open(template_file_path, 'r') as template_file:
# Read the template file into memory
template_content = template_file.read()
# Open the database file
with open(db_file_path , 'r') as db_file:
# Read the header line with variable names and split them into a list
variable_names = db_file.readline().rstrip('\n').split('\t')
# Counter with the current document number
current_doc_num = 1
# Iterate each line of the database corresponding to a separate document
for db_line in db_file:
# Reset the current template content to the original template
current_template_content = template_content
# Split the current line of the database into the individual variables in a list
current_variable_content = db_line.rstrip('\n').split('\t')
# Iterate over each variable
for i in range(0, len(variable_names)):
# Process invoice item rows separately
if (variable_names[i] == 'INVOICEITEMS'):
# Individual invoice item rows are semicolon-delimited within the single INVOICEITEMS variable, so first split them into a list
invoice_items = current_variable_content[i].split(';')
# Clear the current variable content so we can recreate it below
current_variable_content[i] = ''
# Iterate over each invoice item and create the correct LaTeX syntax for the 3 elements
for x in range(0, len(invoice_items)):
# Split the current invoice item into the three comma-delimited columns in the table
current_invoice_item = invoice_items[x].split(',')
# Create the output LaTeX code for the invoice item using the custom LaTeX command in the template which takes three parameters
current_variable_content[i] += '\\invoiceitem{' + current_invoice_item[0] + '}{' + current_invoice_item[1] + '}{' + current_invoice_item[2] + '}{}\n'
# Replace all occurrences of the variable with its current content
current_template_content = current_template_content.replace('|||' + variable_names[i] + '|||', current_variable_content[i])
# Write the populated template out to a new file, where template.tex becomes template..tex using the current document number (row in database)
output_filename = template_file_path[:-4] + '.' + str(current_doc_num) + '.tex'
with open(output_filename, 'w') as populated_template_file:
populated_template_file.write(current_template_content)
# Iterate the document number
current_doc_num += 1
Place the Python script in the same directory as the template and document data and navigate to the directory in a Terminal window. Run the script with python creodocs_populate_template.py Invoice_Automated.tex Data.tsv
to produce 1,000 .tex files for each of the lines in the database. We are now ready to compile these 1,000 tex files to produce the final 1,000 PDF documents.
Limitations
Software developers will notice that the code on this page is very fragile. Examples of things that it doesn't do but should are:
- Does not check that each row of the database has the same number of columns.
- Does not escape/replace any LaTeX special characters in the document content, such as: $\_@%{}.
- Does not enforce any maximum lengths or types on the variable content. For example, maybe it should check that all account numbers are numbers that are 7 digits.
- The code only runs on one LaTeX file, but what if your template is spread across multiple .tex files, each with variables? Should it output a separate one for each row in the database, and how will that be handled in the
\input
commands in the LaTeX code? - Does not handle document data in other formats, such as in separate text files or CSVs.
For these reasons, you should treat the code on this page as a guide and skeleton for your own production code, rather than using it verbatim. If you are unsure how to do this, or would like code written for your specific use case, submit an enquiry to LaTeX Typesetting.
Creating Bulk Documents
Before this final stage of the bulk document creation workflow, we have used a LaTeX template with dynamic variables and a database of document data to produce 1,000 populated .tex files. We now need to compile these to produce the final PDF documents.
Compilation/Typesetting Options
Compiling (also known as typesetting) our documents is done using the standard pdflatex engine from a LaTeX distribution such as TeX Live. If you have a LaTeX distribution installed locally, you can manually open and compile the individual .tex files we produced in the preview step of the workflow, but that's not feasible for 1,000 documents. You may want to write a script to use your local distribution to compile the documents serially or in parallel, but this makes your workflow dependent on what you currently have installed, which may change as you update your distribution in the future. In short, this approach is not recommended because it is subject to unexpecated changes.
A better approach to compilation is to use a separate environment dedicated to the task, one which does not change over time unless you explicitly decide to update it. Here we have two main options: virtual machines and containers. A virtual machine involves spinning up a separate computer with its own operating system under your current operating system. You would install a LaTeX distribution on this machine and have full control over when it gets updated and what other software is installed on it (such as system fonts). This approach is feasible, but more complicated than the alternative of using containers.
Here we will use Docker containers to accomplish the same task as a virtual machine, but without the hassle of managing an entire operating system. A Docker virtual machine is called a Docker Image and is created using a Dockerfile. Creodocs has already created a Docker Image for typesetting LaTeX documents which can be found on Docker Hub here: https://hub.docker.com/r/creodocs/peon. This Docker image is based on Ubuntu Linux and installs the full distribution of TeX Live. It then creates a user for typesetting called peon
and creates a working directory at /typesetting
for data input and output that this user has access to.
Download the Creodocs Peon Dockerfile
Configuring Docker
Head to the Docker website and install Docker Desktop for your operating system. This will install the Docker Engine and command line tool that we will use from here onwards. Check that Docker is installed by entering docker
in a Terminal window. Make sure the Docker Engine (aka daemon) is running and enter docker pull creodocs/peon
in your Terminal window. This will pull the Creodocs peon typesetting Docker image from Docker Hub and make it available on your computer. The image will also show up in Docker Desktop.
Using the Docker Image to Compile One Document
If you've been following along with this tutorial, you will have a single directory with the automated template (and its assets: the class and Creodocs logo), document data, Python script and the 1,000 populated LaTeX documents. Let's now try using the Creodocs peon Docker image to compile one of our populated LaTeX files.
Docker containers are created on the command line using docker run
with a number of parameters. Make sure you are in the directory with the template assets and populated LaTeX files and run:
docker run --name peon --volume ./:/mnt/template:ro creodocs/peon /bin/bash -c "cp -R /mnt/template/* /typesetting; pdflatex Invoice_Automated.1.tex"
There's a lot going on here, so let's dissect it piece-by-piece. First of all, we are telling Docker to run a container with the name peon
. We use --volume ./:/mnt/template:ro
to mount the current directory we are in as /mnt/template
inside the container as a read-only directory. This is our link between the container and our local computer. We then specify the image we are running as the one we pulled above (creodocs/peon
). Finally, we specify what we want the container to do once it's created, first by saying that we want to use the bash shell with a single line of commands in quotes (-c
), and then we specify the commands to run inside the double quotes. Multiple commands are separated by semicolons, and the first one is to copy everything from the mount point /mnt/template
(i.e. your current local working directory) to /typesetting
, where the container's peon
user has access to read and write. The final command is to compile the first populated LaTeX document using the pdflatex engine like so: pdflatex Invoice_Automated.1.tex
.
If the Docker container runs as it should, you will see the usual LaTeX compilation log output to your Terminal window, because Docker passes through STDOUT from the container as it runs. You'll notice that once the compilation completes, Docker exits back to your prompt. At this point, our container has done its job and persists on your file system, but it is no longer running. We next want to copy out the compiled PDF file by reaching into the container like so:
docker cp peon:/typesetting/Invoice_Automated.1.pdf .
You should see a notice that the file was successfully copied. Finally, let's remove our container since we no longer need it:
docker rm peon
You should now have the first compiled populated LaTeX code as Invoice_Automated.1.pdf. This is what it should look like:
Compiling All the Documents
In our previous example, we copied all the populated LaTeX files into the container but only compiled one of them by specifying its filename to pdflatex manually. We can slightly tweak our bash code to compile all of the files in one go by using the GNU parallel command-line utility, which was installed in our Docker image as part of the Dockerfile. Parallel lets us run any number of concurrent compilations and acts like a messaging queue where it will only run up to the maximum concurrent compilations you specify, and when one finishes it automatically starts the next in line. This means you can completely utilize your CPU to carry out the compilations, rather than running them one at a time. We make use of parallel like so:
docker run --name peon --volume ./:/mnt/template:ro creodocs/peon /bin/bash -c "cp -R /mnt/template/* /typesetting; find . -type f -regex '.*\.[0-9]*\.tex$' | parallel -j 20 'pdflatex {}'"
Notice that we are no longer simply calling pdflatex on a single .tex file. Instead, we are using find
to find all the .tex files that have a period, followed by a number, followed by a .tex extension (i.e. all the populated LaTeX files we want to compile). After finding these, we pipe them into parallel
and specify the maximum number of concurrent compilations to run (20), then run pdflatex on the current file, which is specified with {}
. Running this takes 48 seconds on a modern CPU and successfully produces all 1,000 final PDFs. Now we can copy out the entire directory and remove the Docker container like before:
docker cp peon:/typesetting/ ./outputs/
docker rm peon
You should now have an outputs
directory which contains all 1,000 final compiled PDF documents. Since we copied the entire /typesetting
directory out of the container, the compilation logs are also present as .log
files. If any of your documents fail to compile, you can now open them manually to check the code to see what is causing the issue, then modify your Python code to handle it in the future. We currently don't handle compilation failures, but to do so, you will need to run pdflatex in nonstop mode so it does not hang your threads on failure, but fails to produce a PDF and moves on (leaving the log for you to interrogate). Click the button below to download all the outputs copied out of the container for the first 50 documents.
Download the First 50 Compiled Documents
Some final thoughts on organizing the project. In this tutorial, we have been keeping all the working files in a single working directory which has swollen to over 1,000 files and over 5,000 in the outputs subdirectory. This is still manageable for a project this size, but what if you are producing millions of documents? It may be a good idea to modify your Python code that populates the automated template to output each populated LaTeX file to its own folder (with all the LaTeX assets of course). You can then modify the Docker commands to run pdflatex in each of these folders, or change the process entirely to create a new container for each document rather than running them all in one. How you structure your project really depends on your requirements, and you should think carefully about it before deciding on a final approach.
Conclusion
This tutorial has gone through the full process of automating a LaTeX template to enable the production of many PDF documents simultaneously. Its purpose is to promote the use of LaTeX as a viable tool for creating business documents in a cheap and, most importantly, durable way that you can set and forget. While the example code provided here is functional, it is largely a skeleton to show the process and enable you to build on top of it to implement your business logic. If you need help in doing so, reach out to the service arm of Creodocs at LaTeXTypesetting.com.
What is Complex Document Creation?
Complex document creation is the act of creating single documents with content that extensively differs from document to document, but with a fixed layout. Complex documents are usually longer and an example may be monthly sales reports for a company showing the latest figures, projections and providing extensive breakdowns per sector. The key feature of complex documents is that the amount of content within them is flexible and diverse. Continuing with our example of monthly sales reports, we may wish to add a new major section one month for an emerging opportunity, which itself will contain subsections, graphs, tables and several pages of text. Despite this flexibility, we want all of our sales reports to feature the same layout, title page, headers/footers and summary information on the first page.
Our goal with complex document creation is to define the general layout of the document in a template, such as the fonts, margins, headers/footers and section styling, and to allow populating the main body of the template with any amount of content the user wishes to input. The template will have a mixture of rigid content containing several fixed dynamic variables, such as the title page, with a fluid main body. We want to allow users to populate the main body of the template with a standard hierarchy of content, without requiring them to learn LaTeX and allowing them to modify the overall design of the document. We want the document to be compilable iteratively as the main body is fleshed out, to eventually end up with a single final PDF version for distribution and/or archiving.
Creating a Complex Template
Coming eventually...
Populating Complex Document Content
Coming eventually...
Creating Complex Documents
Coming eventually...