makeit package
The makeit package provides a simple make-like utility to run R scripts if needed, based on the last modified time. It is implemented in base R with no additional software requirements, organizational overhead, or structural requirements.
The general idea is to run a workflow without repeating tasks that are already completed. A workflow consists of one or more R scripts, where each script generates output files.
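The "run if needed" rule can be sketched in a few lines of base R. This is only an illustration of the idea, not the code used inside makeit: the helper name `needs_run` is hypothetical, and treating the script itself as a prerequisite is an assumption made for this sketch.

```r
# Hypothetical helper, not part of makeit: decide whether a script needs to run.
# Assumption: the target is out of date if it is missing, or if it is older
# than the script or any prerequisite file.
needs_run <- function(script, prereq, target) {
  stopifnot(file.exists(c(script, prereq)))
  if (!all(file.exists(target))) return(TRUE)
  newest_input <- max(file.mtime(c(script, prereq)))
  oldest_output <- min(file.mtime(target))
  newest_input > oldest_output
}

# Demonstration with temporary files and explicitly set timestamps
dir <- tempfile("demo")
dir.create(dir)
script <- file.path(dir, "analysis.R")
input <- file.path(dir, "input.dat")
output <- file.path(dir, "output.dat")
file.create(script, input)
before <- needs_run(script, input, output)     # output does not exist yet

file.create(output)
t0 <- Sys.time()
Sys.setFileTime(script, t0 - 60)               # script older than output
Sys.setFileTime(input, t0 - 60)                # input older than output
Sys.setFileTime(output, t0)
up_to_date <- !needs_run(script, input, output)

Sys.setFileTime(input, t0 + 60)                # simulate a later edit of the input
after_edit <- needs_run(script, input, output)
```

Setting the timestamps explicitly keeps the demonstration independent of how finely the file system records modification times.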
The following tutorials come with the package and can be copied from library/makeit/examples to a working directory, or downloaded from GitHub.
This example consists of a script analysis.R that uses input.dat to produce output.dat.
Before
analysis.R
input.dat
Run
make("analysis.R", "input.dat", "output.dat")
#> Running analysis.R
#> Sorting numbers ... estimated run time is 3 seconds
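Try running the same call again: output.dat is now newer than analysis.R and input.dat, so the script is not run a second time.
Note how a make() call has the general form: script x uses y to produce z.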
After
analysis.R
input.dat
output.dat
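The run-then-skip cycle of this example can be imitated with base R alone. The sketch below is a stand-in for the idea behind make(), not the makeit implementation: `run_if_needed` and its messages are hypothetical, and the one-line analysis.R written to a temporary folder is a simplified substitute for the tutorial script.

```r
# Hypothetical stand-in for the idea behind make(); not the makeit implementation
run_if_needed <- function(script, prereq, target) {
  stale <- !all(file.exists(target)) ||
    max(file.mtime(c(script, prereq))) > min(file.mtime(target))
  if (stale) {
    message("Running ", basename(script))
    source(script, chdir = TRUE)   # run with the script folder as working directory
  } else {
    message("Skipping ", basename(script), ", target is up to date")
  }
  invisible(stale)
}

# Throwaway copy of the example: analysis.R sorts input.dat into output.dat
dir <- tempfile("analysis")
dir.create(dir)
script <- file.path(dir, "analysis.R")
input <- file.path(dir, "input.dat")
output <- file.path(dir, "output.dat")
writeLines(c("3", "1", "2"), input)
writeLines(
  'writeLines(as.character(sort(scan("input.dat", quiet = TRUE))), "output.dat")',
  script
)

# Backdate the inputs so the result does not depend on timestamp resolution
Sys.setFileTime(script, Sys.time() - 60)
Sys.setFileTime(input, Sys.time() - 60)

first <- run_if_needed(script, input, output)    # output.dat is missing, script runs
second <- run_if_needed(script, input, output)   # output.dat is newer, script is skipped
```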
This example consists of three scripts, where one runs after the other.
The plot script produces files inside a plots folder and the table script produces files inside a tables folder.
Before
01_model.R
02_plots.R
03_tables.R
data.dat
Run
make("01_model.R", "data.dat", "results.dat")
#> Running 01_model.R
make("02_plots.R", "results.dat", c("plots/A.png", "plots/B.png"))
#> Running 02_plots.R
make("03_tables.R", "results.dat", c("tables/A.csv", "tables/B.csv"))
#> Running 03_tables.R
For convenience, a _make.R file is provided, containing these make() calls.
After
plots/A.png
plots/B.png
tables/A.csv
tables/B.csv
01_model.R
02_plots.R
03_tables.R
data.dat
results.dat
Similar to the ‘sequential’ example above, but based on the four-minutes tutorial that comes with the targets package.
Before
data_raw.csv
fit_model.R
get_data.R
plot_model.R
Run
make("get_data.R", "data_raw.csv", "data/data.csv")
#> Running get_data.R
make("fit_model.R", "data/data.csv", "output/coefs.dat")
#> Running fit_model.R
make("plot_model.R", c("data/data.csv", "output/coefs.dat"), "output/plot.pdf")
#> Running plot_model.R
#> Saving 7 x 5 in image
For convenience, a _make.R file is provided, containing these make() calls.
After
data/data.csv
output/coefs.dat
output/plot.pdf
data_raw.csv
fit_model.R
get_data.R
plot_model.R
DAG example based on the diagram provided in the Wikipedia article on directed acyclic graph.
Each script produces a corresponding output file: a.R produces out/a.dat, b.R produces out/b.dat, etc.
Before
a.R
b.R
c.R
d.R
e.R
Run
make("a.R", prereq=NULL, target="out/a.dat")
#> Running a.R
#> Writing out/a.dat
make("b.R", prereq="out/a.dat", target="out/b.dat")
#> Running b.R
#> Writing out/b.dat
make("c.R", prereq="out/a.dat", target="out/c.dat")
#> Running c.R
#> Writing out/c.dat
make("d.R", prereq=c("out/b.dat", "out/c.dat"), target="out/d.dat")
#> Running d.R
#> Writing out/d.dat
make("e.R", prereq="out/d.dat", target="out/e.dat")
#> Running e.R
#> Writing out/e.dat
For convenience, a _make.R file is provided, containing these make() calls.
After
out/a.dat
out/b.dat
out/c.dat
out/d.dat
out/e.dat
a.R
b.R
c.R
d.R
e.R
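To see how a change travels through such a graph, the following base R sketch rebuilds the same five-step layout in a temporary folder and simulates an edit to b.R. It is an illustration under stated assumptions, not the makeit implementation: `run_if_needed`, `run_all`, and the one-line placeholder scripts are hypothetical, and timestamps are set by hand so the outcome does not depend on file system resolution.

```r
# Hypothetical runner, not the makeit implementation
run_if_needed <- function(script, prereq, target) {
  stale <- !all(file.exists(target)) ||
    max(file.mtime(c(script, prereq))) > min(file.mtime(target))
  if (stale) source(script, chdir = TRUE)
  stale
}

# Same dependency layout as the make() calls above
steps <- list(
  a = list(prereq = character(0),                target = "out/a.dat"),
  b = list(prereq = "out/a.dat",                 target = "out/b.dat"),
  c = list(prereq = "out/a.dat",                 target = "out/c.dat"),
  d = list(prereq = c("out/b.dat", "out/c.dat"), target = "out/d.dat"),
  e = list(prereq = "out/d.dat",                 target = "out/e.dat")
)

# Placeholder scripts that only write their own output file
dir <- tempfile("dag")
dir.create(file.path(dir, "out"), recursive = TRUE)
now <- Sys.time()
for (name in names(steps)) {
  script <- file.path(dir, paste0(name, ".R"))
  writeLines(sprintf('writeLines("%s", "%s")', name, steps[[name]]$target), script)
  Sys.setFileTime(script, now - 120)
}

# Visit the steps in dependency order and report which ones ran
run_all <- function() {
  vapply(names(steps), function(name) {
    s <- steps[[name]]
    run_if_needed(file.path(dir, paste0(name, ".R")),
                  file.path(dir, s$prereq),
                  file.path(dir, s$target))
  }, logical(1))
}

first <- run_all()     # all targets are missing, so all five scripts run
for (f in list.files(file.path(dir, "out"), full.names = TRUE)) {
  Sys.setFileTime(f, now - 60)
}
second <- run_all()    # nothing has changed, so nothing runs

Sys.setFileTime(file.path(dir, "b.R"), now - 30)   # simulate an edit to b.R
third <- run_all()
names(third)[third]    # "b" "d" "e"
```

Only b.R and the steps downstream of it are rerun; a.R and c.R are left alone because none of their inputs changed.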
DAG example based on the example from the targets user manual.
The second_target depends on first_target and outer_function, which in turn depends on inner_function and global_object.
Before
first_target.R
global_object.R
inner_function.R
outer_function.R
second_target.R
Run
make("first_target.R", NULL, "output/first_target.dat")
#> Running first_target.R
#> Writing output/first_target.dat
make("global_object.R", NULL, "output/global_object.dat")
#> Running global_object.R
#> Writing output/global_object
make("second_target.R",
     prereq=c("output/first_target.dat", "output/global_object.dat",
              "inner_function.R", "outer_function.R"),
     target="output/second_target.dat")
#> Running second_target.R
#> Writing second_target.dat
For convenience, a _make.R file is provided, containing these make() calls.
After
output/first_target.dat
output/global_object.dat
output/second_target.dat
first_target.R
global_object.R
inner_function.R
outer_function.R
second_target.R
The make() function is a tool that can be applied to many types of workflows, consisting of one or many R scripts. It is especially useful when the complete workflow takes many minutes or hours to run. Changing one part of the analysis will then update the related plots and tables, without rerunning every part of the analysis.
Most analyses resemble the sequential example above, dividing the workflow into steps that run one after another. As an introductory example, the sequential workflow consists of only three steps: model, plots, and tables.
In practice, it is usually helpful to divide a workflow into more steps than that, the first step being data preparation, such as importing, filtering, aggregating, and converting the data to the format that the model expects.
If the model is non-trivial, it can be practical to have an output.R step right after model.R, extracting the results from the model-specific format to a more general format that is easy to browse, tabulate, and plot. This way, the model.R script can be very short, making it easy to see and understand the modelling approach and configuration. Separating the fundamental modelling step from the manual labor of data preparation and plotting can make an analysis more open and reproducible, and easier for others to browse and reuse.
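As a rough illustration of such an output step, the sketch below converts a fitted model object into plain tables and writes them as CSV files. The linear model on the built-in mtcars data, the file names, and the temporary output folder are all assumptions chosen for the example; in a real workflow the fitted object would come from the model step.

```r
# In a real workflow, model.R would fit and save the model;
# here the fit is created inline for illustration
fit <- lm(mpg ~ wt, data = mtcars)

# output.R idea: convert the model-specific object to plain tables
m <- coef(summary(fit))
coefs <- data.frame(term = rownames(m), m, row.names = NULL, check.names = FALSE)
fits <- data.frame(obs = mtcars$mpg,
                   fit = unname(fitted(fit)),
                   res = unname(residuals(fit)))

# Write general-format files that are easy to browse, tabulate, and plot
outdir <- tempfile("output")
dir.create(outdir)
write.csv(coefs, file.path(outdir, "coefs.csv"), row.names = FALSE)
write.csv(fits, file.path(outdir, "fits.csv"), row.names = FALSE)
```

Later plot and table steps can then read the CSV files without knowing anything about the model object.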
The paradigm of using small dedicated scripts with clear input and output files (read and write function calls near the beginning and end of each script) is usually a better workflow design than managing a large monolithic script where the user navigates between sections to run selected blocks of code.
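A small dedicated script in this style might be laid out as follows. The file names and the aggregation are hypothetical, and the first few lines exist only to create an input file so the sketch can run on its own.

```r
# Set-up for the demonstration only: a temporary folder with an input file
dir <- tempfile("step")
dir.create(dir)
write.csv(data.frame(group = c("a", "a", "b"), value = c(1, 3, 10)),
          file.path(dir, "input.csv"), row.names = FALSE)

# --- body of a hypothetical step script ---

# Read input (near the beginning)
dat <- read.csv(file.path(dir, "input.csv"))

# Do the work of this step
means <- aggregate(value ~ group, data = dat, FUN = mean)

# Write output (near the end)
write.csv(means, file.path(dir, "means.csv"), row.names = FALSE)
```

With the read and write calls at the edges of the script, its prerequisite and target files can be seen at a glance.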
The four_minutes and dag_targets examples above provide an interesting comparison between the makeit package and the targets package.
makeit
The makeit package is script-based, where each step passes the results to the next step as output files. The user organizes their workflow by writing scripts that produce files.
The makeit package relies only on base R and takes a very short time to learn, and can be used to run any existing workflows, as long as they are based on scripts with input and output files. The scripts may include functions, but that is not a requirement.
The package consists of a single function that does one thing: run an R script if underlying files have changed, otherwise do nothing.
TAF
The TAF package contains a similar make() function and is an ancestor of the makeit package. The overall aim of TAF is to support and strengthen reproducibility in science, as well as reviewability.
The TAF package provides a structured modular design that divides a workflow into four main stages: data.R, model.R, output.R, and report.R.
The initial data are declared in a DATA.bib file and an optional SOFTWARE.bib file can be used to declare specific versions of R packages and other software.
The package consists of many useful tools to support reproducible workflows for scientific analyses.
targets
The targets package is function-based, where each step passes the results to the next step as objects in memory. The user organizes their workflow by writing functions that produce objects. It is the successor of the older drake package.
The targets package relies on many underlying packages, takes some time to learn, and some work may be required to realign existing workflows into functions. The functions may produce files, but that is not a requirement.
The package consists of many useful tools to support workflow design and management.
Comparison
Package | Paradigm | State | Package dependencies | Time to learn | Run existing workflow | Features |
---|---|---|---|---|---|---|
makeit | Scripts | Files | None | Very short | Must be file-based | One |
TAF | Scripts | Files | None | Some | Must be file-based | Many |
targets | Functions | Memory | Many | Some | Must be function-based | Many |
The CRAN task view for reproducible research provides an annotated list of packages related to pipeline toolkits and project workflows.
Magnusson, A. makeit: Run R Scripts if Needed. https://cran.r-project.org/package=makeit
Magnusson, A. and C. Millar. TAF: Transparent Assessment Framework for Reproducible Research. https://cran.r-project.org/package=TAF
Landau, W.M. 2021. The targets R package: a dynamic Make-like function-oriented pipeline toolkit for reproducibility and high-performance computing. Journal of Open Source Software, 6(57), 2959. https://doi.org/10.21105/joss.02959, https://cran.r-project.org/package=targets
Stallman, R.M. et al. An introduction to makefiles. Chapter 2 in the GNU Make manual. https://www.gnu.org/software/make/manual/