pdfgrep

It is important to check the statistical terminology in manuscripts, as doing so often reveals confusion and misunderstandings. However, reviewers often do not prioritize terminology, and corrections are not always appreciated. Nevertheless, ideal scientific writing is clear, specific, and unambiguous: the reader should not have to guess what the author really means, for example when terms such as variable, parameter, and quartile are used. The correct definitions of statistical terms can be found in The Oxford Dictionary of Statistical Terms (The International Statistical Institute; Oxford University Press, New York, 2003).

To facilitate my own checking of the statistical terminology in manuscripts, I use the Linux command-line utility pdfgrep, a program that scans one or more PDF documents for defined keywords and reports each occurrence it finds. A Windows version of the program exists, but the Linux version can also be run directly on Windows using the Windows Subsystem for Linux (WSL).

I have written a short Bash shell script to facilitate my terminology checks. The script calls pdfgrep and searches the PDF manuscripts in a specified folder for the keywords defined in a separate text file.
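
A minimal sketch of how such a routine might look is given below. It is not the downloadable script itself, only an illustration under stated assumptions: the function name check_terms and the PDFGREP variable are hypothetical, and the keyword file is assumed to hold one keyword per line. The pdfgrep options used are -i (ignore case), -n (print the page number), and -H (print the file name).

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not the original script: search every PDF in a folder
# for each keyword listed (one per line, by assumption) in a terms file.

# Name of the search program; the variable exists only so that the command
# can be swapped out, for example when pdfgrep is not installed.
PDFGREP="${PDFGREP:-pdfgrep}"

check_terms() {
    local terms_file="$1" folder="$2" term pdf
    # Read the keyword file line by line, including a last line that
    # lacks a trailing newline.
    while IFS= read -r term || [ -n "$term" ]; do
        [ -n "$term" ] || continue          # skip blank lines
        echo "== $term =="
        for pdf in "$folder"/*.pdf; do
            [ -e "$pdf" ] || continue       # folder contains no PDF files
            # A non-zero exit status only means "no match", so ignore it.
            "$PDFGREP" -i -n -H -- "$term" "$pdf" || true
        done
    done < "$terms_file"
}
```

With the function loaded in the shell, a call such as `check_terms statrevterms.cfg ./manuscripts` (the folder name is only an example) would print, under a heading for each keyword, the file name, page number, and matching line of every hit.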

The keywords I wish to check are listed in a separate text file, statrevterms.cfg. The content of this file is currently:

Running the shell script with a PDF manuscript in the folder produces output that allows a simple and quick check of whether the manuscript uses correct terminology. The files can be downloaded here.