The main benefit of using Notebooks (R Notebooks or Jupyter Notebooks) is that the document is reproducible: the reader knows exactly how the results of the analysis were obtained. I wrote about the use of Notebooks in an earlier post.

Most organizations have a certain report format: a certain cover sheet layout, a certain font, a log of revisions, etcetera. For the most part, organizations have an MS Word template for this report format. If you want to use a Notebook for your analysis and to write your report, you have a few options:

  • You could write front matter in MS Word using your company’s report template and then attach the Notebook as an appendix.
  • You could also use Pandoc (more about what this is later) to convert the Notebook into a .docx file and then merge it into the report template.
  • You could create your own Pandoc template to convert a Notebook directly into a PDF with the correct formatting.

The first option of attaching a Notebook as an appendix to a report otherwise created in MS Word is effective, but it means that you need to maintain two different files: the MS Word report and the Notebook itself. The second option of exporting the Notebook to MS Word and merging it into the template is problematic when it comes to document revisions. If part of the analysis is revised, there is a temptation to change the affected part by either only re-exporting that section from the Notebook into docx, or worse, making the change directly in MS Word. In both cases, there is the possibility of breaking the reproducibility. For example, let’s say that in your report you define some constants at the beginning and do some math using these constants:

P = 1000
A1 = 2
A2 = 4

sigma1 = P / A1
print(sigma1)
# 500

sigma2 = P / A2
print(sigma2)
# 250

Now let’s say that you ask your new intern to revise the document so that \(P = 1200\). They just edit the MS Word version of the report thinking that they will save some time. They don’t notice that \(P\) is used twice in the calculation and only update the result from the first time it’s used. Now the report reads:

P = 1200
A1 = 2
A2 = 4

sigma1 = P / A1
print(sigma1)
# 600

sigma2 = P / A2
print(sigma2)
# 250

The report is now wrong. In a simple case like this, you’ll probably notice the error when you review your intern’s work, but if the math were significantly more complex, there is a good chance that you wouldn’t pick up on the newly introduced error.
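By contrast, when the change is made in the Notebook and everything is re-run, both results follow from the single definition of the load. Here is a minimal sketch of that idea. The function name compute_stresses is only illustrative and is not part of the original report:

```python
def compute_stresses(P, areas):
    """Return the stress P / A for each area, all derived from the same load P."""
    return [P / A for A in areas]


# Original revision of the report
print(compute_stresses(1000, [2, 4]))
# [500.0, 250.0]

# Revised load: re-running updates every dependent result
print(compute_stresses(1200, [2, 4]))
# [600.0, 300.0]
```

Because the load appears in exactly one place, there is no second result to forget.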

For this reason, I think that the best option is to create a Pandoc template that reproduces your company’s report format. This means that you’ll be creating a PDF directly from the Notebook. In order to revise the report, you have to re-run the Notebook, and that means the whole Notebook.

For those unfamiliar with Pandoc, it is a program for converting between various file formats. It’s also free and open-source software. Commonly, it’s used for converting from Markdown into HTML or PDF (actually, Pandoc converts to a LaTeX format and LaTeX converts to PDF, but this happens transparently). Pandoc can also convert into MS Word (.docx) and several other formats.

When I decided to create a corporate format for use with notebooks, I looked at the types of notebooks that we use. Generally, statistics are done in an R-Notebook and other analysis is done in a Jupyter notebook. Unfortunately, R-Notebooks and Jupyter Notebooks use different templates. R-Notebooks use pandoc templates, while Jupyter uses its own template. Fortunately, there is a workaround. Jupyter is able to export to markdown, which can be read by pandoc and translated to PDF using a pandoc template. Thus, I made the decision to write a pandoc template.

When pandoc converts a markdown file to PDF, it actually uses LaTeX. The pandoc template is actually a template for converting markdown into LaTeX. Pandoc then calls pdflatex to turn this .tex file into a PDF.

When I first started figuring out how to write a template for converting markdown to PDF, I thought I was going to have to write a LaTeX class or style. I got scared. LaTeX classes are not for the faint of heart. But, I soon realized that I didn’t actually have to do that. The pandoc template that I needed to write was just a regular LaTeX document that has some parameters that pandoc can fill in. I’m not sure that I could figure out how to write a LaTeX class in a reasonable amount of time, but I sure can write a document using LaTeX. This is something that I learned to do when I wrote my undergraduate thesis, and while I don’t write LaTeX often anymore, it’s really not that hard.

A very basic LaTeX file would look something like this:

\documentclass{article}
\begin{document}

\title{My Report Title}
\author{A. Student}

\maketitle

\section{Introduction}
Some text

\end{document}

A pandoc template is just a LaTeX file, but with placeholders for the content that pandoc will insert. These placeholders are just variables surrounded with dollar signs. For example, pandoc has a variable called body. This variable will contain the body of the report. We would simply put $body$ in the part of the template where we want pandoc to insert the body of the report.
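To make the idea of placeholders concrete, here is a toy illustration of the substitution written in Python. This is not how pandoc is implemented, and the fill helper and the template text are made up for this example. It only shows what it means for pandoc to fill in the variables:

```python
import re

# A tiny template with three placeholders (illustrative only)
TEMPLATE = r"""\documentclass{article}
\title{$title$}
\author{$author$}
\begin{document}
\maketitle
$body$
\end{document}
"""


def fill(template, variables):
    """Replace each $name$ with its value (a toy stand-in for pandoc)."""
    return re.sub(r"\$([\w-]+)\$", lambda m: variables[m.group(1)], template)


print(fill(TEMPLATE, {
    "title": "Report Title",
    "author": "A. Student",
    "body": r"\section{Introduction}" + "\nSome text",
}))
```

The result is an ordinary LaTeX document, much like the basic file shown earlier.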

Pandoc also supports for and if statements. A common pattern is to check for the existence of a variable and use it if it does exist and use a default value if it does not. The syntax for this would look something like:

$if(myvar)$
    $myvar$
$else$
    Default text
$endif$

I’ve written the above code on multiple lines for readability, but it could be written on a single line too.
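As a rough mental model of what this conditional does, here is a toy Python version that handles only a single, non-nested if/else written on one line. The render_if function is hypothetical and is far simpler than pandoc’s real template engine (it ignores pandoc’s whitespace rules, for example):

```python
import re


def render_if(template, variables):
    """Toy evaluation of $if(x)$...$else$...$endif$ (no nesting supported)."""
    pattern = re.compile(
        r"\$if\(([\w-]+)\)\$(.*?)\$else\$(.*?)\$endif\$", re.S
    )

    def choose(match):
        name, then_part, else_part = match.groups()
        # Use the "then" branch only when the variable is defined and non-empty
        chosen = then_part if variables.get(name) else else_part
        return chosen.replace(f"${name}$", str(variables.get(name, "")))

    return pattern.sub(choose, template)


one_line = "$if(myvar)$$myvar$$else$Default text$endif$"
print(render_if(one_line, {"myvar": "Hello"}))  # Hello
print(render_if(one_line, {}))                  # Default text
```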

Similarly, if a variable is a list, you’d use a for statement to iterate over the list. We’ll cover this later when we talk about adding logs of revisions.

Defining New Template Variables

Pandoc defines a number of variables by default. However, you’ll likely need to define some variables of your own. First of all, you’ll likely need to define a variable for the report number and the revision.

To create the variable, it’s just a matter of defining it in the YAML header of the markdown file. Variables can either have a single value or they can be lists. Elements of a list start with a dash at the beginning of the line.

Once we add the report number (which we’ll call report-no) and the revision (which we’ll call rev) to the YAML header, the YAML header will look like the following:

title: "Report Title"
author: "A. Student"
report-no: "RPT-001"
rev: B

(Bonus points if you immediately thought of William Sealy Gosset when you read that).

We’ll probably want to add a log of revisions to the report. The contents of this log of revisions will have to come from somewhere, and the YAML header is the most logical place. The log of revisions will be a list with one element of the list corresponding to each revision in the log. Lists can have nested members. In our case, an entry within the log of revisions will have a revision letter, a date and a description. Including the log of revisions, the YAML header will look like this:

title: "Report Title"
author: "A. Student"
report-no: "RPT-001"
rev: B
rev-log:
-   rev: A
    date: 1-Jun-2019
    desc: Initial release
-   rev: B
    date: 18-Jun-2019
    desc: Updated loads based on flight test data

We can now use these variables in our pandoc template. Using the variables report-no and rev is straightforward and works just the same as using the default variables (like title and author).

Using the list variables will require the use of a for statement. In the case of a log of revisions, each revision will get a row in a LaTeX table. Using the variable rev-log, this table will look like this:

% The m{...} column type requires \usepackage{array} in the preamble
\begin{tabular}{| m{0.25in} | m{0.95in} | m{4.0in} |}
    \hline
    Rev Ltr & Date & Description \\
    $for(rev-log)$
        \hline
        $rev-log.rev$ & $rev-log.date$ & $rev-log.desc$ \\
    $endfor$
    \hline
\end{tabular}

In the above LaTeX code, everything between $for(...)$ and $endfor$ gets repeated for each item in the list rev-log. We can access the nested members using dot notation.
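If it helps, the expansion of the loop can be mimicked in a few lines of Python. This is only a sketch of what the for statement produces, not pandoc’s implementation. The rev_log list mirrors the YAML header above, and table_rows is a name I made up for the example:

```python
# The rev-log list from the YAML header, as nested Python data
rev_log = [
    {"rev": "A", "date": "1-Jun-2019", "desc": "Initial release"},
    {"rev": "B", "date": "18-Jun-2019",
     "desc": "Updated loads based on flight test data"},
]


def table_rows(entries):
    """Mimic the for loop: one hline plus one table row per entry."""
    lines = []
    for entry in entries:
        lines.append(r"\hline")
        # Dot notation in the template corresponds to key lookup here
        row = f"{entry['rev']} & {entry['date']} & {entry['desc']} "
        lines.append(row + r"\\")
    return "\n".join(lines)


print(table_rows(rev_log))
# \hline
# A & 1-Jun-2019 & Initial release \\
# \hline
# B & 18-Jun-2019 & Updated loads based on flight test data \\
```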

Using the Pandoc Template from an R-Notebook

RStudio handles a lot of the interface with pandoc. Adding the following to the YAML header of the R-Notebook should cause RStudio to use your new template when it compiles the R-Notebook to PDF. This should be all you need to do.

output:
  pdf_document:
    template: my_template_file.tex
    toc_depth: 3
    fig_caption: true
    keep_tex: false
    df_print: kable

Using the Pandoc Template from a Jupyter Notebook

Using your new pandoc template from a Jupyter Notebook is a bit more complicated because Jupyter doesn’t work directly with pandoc. First of all, we need to tell nbconvert to convert the notebook to markdown. I think that it’s best to re-run the notebook at the same time (to make sure that it is, in fact, fully reproducible). You can do this using nbconvert as follows:

jupyter nbconvert --execute --to markdown my-notebook.ipynb

But, Jupyter notebooks don’t have YAML headers like R-Notebooks do, so we need a place to put all the variables that the template needs. The easiest way to do this is to create a cell at the beginning of the notebook with the cell type set as raw, then enter the YAML header into this cell, including the starting and ending fences (---). Cells of type raw simply get copied to the output, so this becomes the YAML header in the resulting markdown file. The cell would have content similar to the following:

---
title: "Report Title"
author: "A. Student"
report-no: "RPT-001"
rev: B
rev-log:
-   rev: A
    date: 1-Jun-2019
    desc: Initial release
-   rev: B
    date: 18-Jun-2019
    desc: Updated loads based on flight test data
---
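Since it’s easy to forget this cell, a quick check before converting can help. The sketch below relies on the fact that an .ipynb file is JSON containing a list of cells, each with a cell_type and a source. The function name has_yaml_header and the file name in the comment are placeholders of my own:

```python
def has_yaml_header(notebook):
    """Return True if the first cell is a raw cell fenced by '---' lines."""
    cells = notebook.get("cells", [])
    if not cells or cells[0].get("cell_type") != "raw":
        return False
    source = cells[0].get("source", "")
    # In the .ipynb JSON, source may be one string or a list of lines
    text = "".join(source) if isinstance(source, list) else source
    lines = text.strip().splitlines()
    return (
        len(lines) >= 2
        and lines[0].strip() == "---"
        and lines[-1].strip() == "---"
    )


# Usage with a notebook loaded from disk (file name is a placeholder):
# import json
# with open("my-notebook.ipynb", encoding="utf-8") as f:
#     print(has_yaml_header(json.load(f)))

example = {
    "cells": [
        {
            "cell_type": "raw",
            "metadata": {},
            "source": ["---\n", 'title: "Report Title"\n', "---"],
        }
    ]
}
print(has_yaml_header(example))  # True
```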

Once you’ve used nbconvert to create the markdown file, you can call pandoc. You’ll have to provide the template as a command-line argument and also specify the output filename (so that pandoc knows you want a pdf) and also give the code highlighting style. The call to pandoc will look something like this.

pandoc my-notebook.md -N --template=my_template_file.tex -o my-notebook.pdf --highlight-style=tango
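If you find yourself running these two commands often, they can be chained from a small Python script. This is a sketch under a few assumptions: jupyter and pandoc are both on your PATH, the script is run from the notebook’s directory, and nbconvert writes the markdown file next to the notebook with the same base name. The function names build_commands and build_report are my own:

```python
import subprocess
from pathlib import Path


def build_commands(notebook, template="my_template_file.tex"):
    """Return the nbconvert and pandoc command lines for one notebook."""
    nb = Path(notebook)
    markdown = nb.with_suffix(".md")
    pdf = nb.with_suffix(".pdf")
    return [
        ["jupyter", "nbconvert", "--execute", "--to", "markdown", str(nb)],
        ["pandoc", str(markdown), "-N", f"--template={template}",
         "-o", str(pdf), "--highlight-style=tango"],
    ]


def build_report(notebook, template="my_template_file.tex"):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(notebook, template):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Only prints the commands; call build_report(...) to actually run them
    for command in build_commands("my-notebook.ipynb"):
        print(" ".join(command))
```

Using check=True means a notebook that fails to execute stops the build, rather than producing a PDF from stale markdown.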

Documentation of Your Template

A “trick” that I’ve used is to add some documentation about how to use the template inside the template itself. It’s pretty unlikely that the user will actually open up the template, but it’s relatively likely that the user will forget one of the variables that the template expects. Since pandoc allows if/else statements, I’ve added the following to my template:

$if(abstract)$
    \abstract{$abstract$}
$else$
    \abstract{
        The documentation for using the template goes here
    }
$endif$

This means that if the user forgets to define the abstract variable, the cover page of the report (where the abstract normally goes in my case) will contain the documentation for the template.

Change Bars: Future Work

One of the things that I haven’t yet figured out is change bars. In my organization, we put vertical bars in the margin of reports to indicate what part of a report has been revised. There are LaTeX packages for (manually) inserting change bars into documents. However, I haven’t yet figured out how to automatically insert these into a report generated using pandoc. I’m sure there’s a way, though.

Conclusion

I hope that this demystifies the process of writing a pandoc template to allow you to create reports directly from Jupyter Notebooks or R-Notebooks in your company’s report format.

(Edited to fix a few typos)