The recordr package collects information about R script executions (also referred to as “runs”). The recorded information includes the files that were read and written by the R script, and details of the execution environment, such as the operating system version and the R packages loaded.
The recorded information constitutes data provenance for the data products and analysis outputs (graphs, .csv files, etc.) generated by a script execution: it describes how those data products were created.
The record() method takes an R script filename as an argument and sources it, recording files that were read and written by R functions that are registered with recordr. It is not necessary to modify an R script in order to use record().
The following example runs a sample script that is included with the recordr package:
library(recordr)
if (require("ggplot2")) {
  rc <- new("Recordr")
  sampleScript <- system.file("extdata/EmCoverage.R", package = "recordr")
  firstRunId <- record(rc, sampleScript, tag = "first recordr run")
}
Information about the script execution is stored in the recordr cache (~/.recordr). recordr provides methods to search and view items stored in the cache. It is not recommended that files or directories be manually edited or deleted from the cache directories, with the exception of the items mentioned in this document.
Script runs that have been recorded can be listed using the listRuns() method. The listing can be filtered by the tag value specified when a run was recorded. Runs can also be filtered by run start time, run end time, the text of error messages for a run and by a sequence number, which is an integer value assigned to each run to assist in easily specifying a particular run for listing or viewing.
In this example, all runs with a tag containing the string “first” are listed. Because record() has been called only once in this demo, only one run is listed:
## Seq Script Tag Start Time
## 1 /tmp/Rtmpestg...a/EmCoverage.R first recordr run 2024-10-30 05:01:52 UTC
If no search parameters are specified to listRuns, then all recorded runs are listed.
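The two listing calls described above might look like the following sketch. It assumes the recordr package is installed and that the tag filter is passed as an argument named tag; the variable tagFilter is introduced here purely for illustration.

```r
# Substring to match against the tag given when the run was recorded.
tagFilter <- "first"

if (requireNamespace("recordr", quietly = TRUE)) {
  library(recordr)
  rc <- new("Recordr")
  listRuns(rc, tag = tagFilter)  # only runs whose tag contains "first"
  listRuns(rc)                   # no search parameters: all recorded runs
}
```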
The first time that record() is run, an initial metadata template file is copied to “~/.recordr/package_metadata_template.R”. Each time that record() is called, a metadata file is created for the current run: the template file is used as a starting point to generate an EML document, with EML elements created for the items in the template file. In addition, an ‘otherEntity’ element is created for each data object that is created by the run and for the R script that was run.
Edit the metadata template file before a run to control the values that appear in the generated EML document.
If you are using RStudio, click File->Open File (Ctrl-O), open ~/.recordr, then click “package_metadata_template.R” in the Files pane.
Currently only the items already present in the template file, for example ‘title’, ‘abstract’ and ‘creators’, can be edited; new elements cannot be added.
recordr can also collect information during an R console session using the startRecord() and endRecord() methods. When startRecord() is typed in the R console, information capture begins. Information will be captured for any function registered with recordr, while all other console input will not cause any information capture. Information capture is terminated when endRecord() is entered in the console, and execution information is written to the recordr cache.
startRecord(rc, tag="first console run")
df <- read.csv(file = system.file("./extdata/coverages_2001-2010.csv", package="recordr"))
endocladia_coverage <- df[df$final_classification=="endocladia muricata",]
myDir <- tempdir()
csvOutFile <- sprintf("%s/Endocladia_muricata.csv", myDir)
write.csv(endocladia_coverage, file = csvOutFile)
endRecord(rc)
The history of all statements typed during this recorded console session is saved in the recordr cache and will be included in the data package uploaded to a data repository when publishRun() is called.
More detailed information can be retrieved and viewed for a run or set of runs using the viewRuns() method, for example:
## [details]: Run details
## ----------------------
## "/tmp/RtmpestgFF/Rinst146f67eb7d9c/recordr/extdata/EmCoverage.R" was executed on 2024-10-30 05:01:52 UTC
## Tag: "first recordr run"
## Run sequence #: 1
## Publish date: Not published
## Published to: NA
## Published Id: NA
## View at: NA
## Run by user: root
## Account subject: NA
## Run Id: urn:uuid:e381a16a-6080-4b65-8b32-830b6e0c912f
## Data package Id: urn:uuid:305b7f80-f94d-4b85-b682-2322b7da3920
## HostId: e19772d47ad4
## Operating system: x86_64-pc-linux-gnu
## R version: R version 4.4.1 (2024-06-14)
## Dependencies: stats, graphics, grDevices, utils, datasets, methods, base, rappdirs_0.3.3, sass_0.4.9, utf8_1.2.4, generics_0.1.3, xml2_1.3.6, stringi_1.8.4, RSQLite_2.3.7, digest_0.6.37, magrittr_2.0.3, grid_4.4.1, evaluate_1.0.1, EML_2.0.6.1, fastmap_1.2.0, blob_1.2.4, jsonlite_1.8.9, zip_2.3.1, jqr_1.3.5, DBI_1.2.3, purrr_1.0.2, fansi_1.0.6, scales_1.3.0, XML_3.99-0.17, lazyeval_0.2.2, jquerylib_0.1.4, cli_3.6.3, rlang_1.1.4, munsell_0.5.1, bit64_4.5.2, withr_3.0.2, cachem_1.1.0, yaml_2.3.10, tools_4.4.1, emld_0.5.1, uuid_1.2-1, memoise_2.0.1, dplyr_1.1.4, colorspace_2.1-1, hash_2.2.6.3, curl_5.2.3, buildtools_1.0.0, vctrs_0.6.5, R6_2.5.1, lifecycle_1.0.4, jsonld_2.2.1, stringr_1.5.1, V8_6.0.0, bit_4.5.0, pkgconfig_2.0.3, gtable_0.3.6, pillar_1.9.0, bslib_0.8.0, glue_1.8.0, Rcpp_1.0.13, xfun_0.48, tibble_3.2.1, tidyselect_1.2.1, sys_3.4.3, knitr_1.48, htmltools_0.5.8.1, datapack_1.4.1, maketools_1.3.1, compiler_4.4.1, roxygen2_7.3.2, redland_1.0.17-18, ggplot2_3.5.1, recordr_1.0.3.9000, rmarkdown_2.28
## Run start time: 2024-10-30 05:01:52 UTC
## Run end time: 2024-10-30 05:01:56 UTC
## Error message from this run: The "filename" argument value "/tmp/Rtmp9E6uJP/emCoverage.png" must be for file that exists
##
## [used]: 1 items used by this run
## -----------------------------------
## Location Size (kb) Modified time
## /tmp/RtmpestgFF/Rinst146f67e...tdata/coverages_2001-2010.csv 138365 2024-10-30 05:01:45.299826
##
## [generated]: 1 items generated by this run
## -----------------------------------------
## Location Size (kb) Modified time
## /tmp/Rtmp9E6uJP/Endocladia_muricata.csv 8052 2024-10-30 05:01:56.127906
Information for all matching runs is retrieved and displayed. The output displayed by viewRuns is divided into the sections “details”, “used” and “generated”, which can be selectively displayed using the sections parameter.
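As a rough sketch of selective display, the call below asks for only two of the three sections. It assumes recordr is installed, that viewRuns accepts the same tag filter as listRuns, and that sections takes a character vector of section names; wantedSections is a name made up for this example.

```r
# Sections to display; "used" is left out on purpose.
wantedSections <- c("details", "generated")

if (requireNamespace("recordr", quietly = TRUE)) {
  library(recordr)
  rc <- new("Recordr")
  viewRuns(rc, tag = "first", sections = wantedSections)
}
```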
The record() method currently records information for the following functions:
package | function |
---|---|
dataone | getObject |
dataone | create |
dataone | updateObject |
utils | read.csv |
utils | write.csv |
ggplot2 | ggsave |
base | readLines |
base | writeLines |
png | readPNG |
png | writePNG |
base | scan |
Other information about the execution environment is also recorded, such as the R packages that were loaded, the operating system and the system user name.
recordr can save copies of files that were read and written by R scripts that are run with record(). In addition, the R script that was run is also retained. You may wish to do this so that you have copies of the files as they existed when the program was run.
This provides reproducibility, so that your scripts can be rerun with the same inputs. You may also wish to create a package of the set of files that were read or written by a particular script run, and archive the package locally or publish it to a data repository.
By default, recordr does not archive copies of files that were read or written by the R scripts that are run with recordr. To enable archiving, set the R option max_archive_file_size to the maximum size of a file that can be copied to the recordr archive. If this option is unset or set to 0, no files are copied to the archive. When files are not copied to the archive, recordr tries to access them at the disk location they were in when record() ran.
Setting max_archive_file_size
#options(recordr_max_archive_file_size=1000000.0)
#options(blocked_replica_node_list = TRUE)
#options(capture_dataone_reads = TRUE)
#options(capture_dataone_writes = TRUE)
#options(capture_file_reads = TRUE)
#options(capture_file_writes = TRUE)
#options(certificate_path = "")
#options(dataone_env = "DEV")
#options(dataone_env = "DEV2")
options(dataone_env = "SANDBOX2")
#options(dataone_env = "STAGING")
#options(dataone_env = "STAGING2")
# mnTestKNB
##options(foaf_name = as.character(NA))
#options(number_of_replicas = 3)
##options(orcid_identifier = "orcid.org/0000-0002-2192-403X")
##options(package_metadata_template_path = "~/.recordr/package_metadata_template.R")
#options(preferred_replica_node_list = list())
##options(provenance_storage_directory = "~/.recordr")
#options(public_read_allowed = TRUE)
#options(replication_allowed = TRUE)
##options(rights_holder = "CN=Peter Slaughter A34456,O=Google,C=US,DC=cilogon,DC=org")
#options(source_member_node_id = "urn:node:KNB")
##options(submitter = "CN=Peter Slaughter A34456,O=Google,C=US,DC=cilogon,DC=org")
#options(target_member_node_id = "urn:node:mnDevUCSB2")
#options(target_member_node_id = "urn:node:mnTestKNB")
#options(target_member_node_id = "urn:node:mnStageUCSB2")
options(target_member_node_id = "urn:node:mnDemo2")
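For the archive size limit specifically, a minimal base-R sketch of setting an option and reading it back follows. The option name is copied from the commented-out options() line in this document; the prose refers to it as max_archive_file_size, so it is worth checking which name your installed version reads. The variables maxSize and archivingEnabled are illustrative only and are not part of recordr.

```r
# Set the limit (value taken from the commented-out example in this document).
options(recordr_max_archive_file_size = 1000000)

# Read it back; an unset option would fall through to the default of 0.
maxSize <- getOption("recordr_max_archive_file_size", default = 0)

# Mirrors the rule described in the text: unset or 0 means nothing is archived.
archivingEnabled <- maxSize > 0
```

Placing the options() call in your .Rprofile would apply it to every session, which is standard R behaviour rather than anything specific to recordr.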
The following description is provided for informational purposes only and is not required to use the recordr package.
The recordr package can record execution information for the commonly used R functions mentioned in the previous section by using wrapper functions that are called before a requested function is called. This overriding of functions is only in effect while the record() function is running. The overriding is accomplished by temporarily adding an entry to the R search path so that the recordr wrapper functions are first in the search path. For example, if a script that is run with the record() function calls the following function:
df <- read.csv(file = "/usr/smith/data/coverages_2001-2010.csv")
then the wrapper function recordr_read.csv is called first, because record() has temporarily bound recordr_read.csv to the function name read.csv in the temporary environment named “.recordr” that is attached to the search path. The overridden function therefore appears first in the search path, regardless of package load order. The function recordr_read.csv records that the file /usr/smith/data/coverages_2001-2010.csv was read. The wrapper then searches for the next function named read.csv in the search path, which is the function that would have run if record() were not active, and calls it.
When the script has completed, the record() function detaches the “.recordr” environment from the search path, thereby restoring the R search path to its previous state, as it was before record() was called.
Note that the mechanism that record() uses to override functions doesn’t work for function calls that are fully qualified, i.e. calls in which the package name is included. For example, the following function call would not be recorded:
df <- utils::read.csv(file = "/usr/smith/data/coverages_2001-2010.csv")
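The search-path override and its blind spot for fully qualified calls can be reproduced in base R without recordr. The sketch below is not recordr's implementation: the names filesRead, wrappers and “.recordr_demo” are invented for the demonstration, and the wrapper simply delegates to utils::read.csv instead of searching for the next read.csv on the search path.

```r
# Toy log standing in for recordr's provenance store (hypothetical name).
filesRead <- character(0)

# A wrapper that shares its name with the function it shadows.
wrappers <- list(
  read.csv = function(file, ...) {
    filesRead <<- c(filesRead, file)  # note which file was read
    utils::read.csv(file, ...)        # delegate to the real function
  }
)

# A small input file to read.
csvFile <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:3), csvFile, row.names = FALSE)

# Attach the wrappers at position 2, ahead of package:utils.
attach(wrappers, name = ".recordr_demo", warn.conflicts = FALSE)

df1 <- read.csv(csvFile)         # unqualified: resolves to the wrapper
df2 <- utils::read.csv(csvFile)  # fully qualified: bypasses the wrapper

# Restore the search path to its previous state.
detach(".recordr_demo", character.only = TRUE)

# Only the unqualified call was logged.
length(filesRead)
```

Both calls return the same data frame; the only difference is whether the read was noticed.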
Also, the record() function currently cannot record information for input or output files that are opened as a connection. For example, the following call to writeLines would not be recorded:
# Write out to a file using a connection
sbuf <- paste(LETTERS, collapse="")
tfile <- sprintf("%s/letters.dat", tempdir())
fcon <- file(description=tfile, open="w")
writeLines(sbuf, fcon)
close(fcon)
This problem will be addressed in the next feature release of recordr.
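Until then, one possible workaround is to pass the file path directly instead of opening a connection. This is a sketch only: it assumes that the registered writeLines wrapper picks up a plain path given as the con argument, which is not confirmed in this document.

```r
# Write out to a file by path, without creating a connection object
sbuf <- paste(LETTERS, collapse = "")
tfile <- file.path(tempdir(), "letters.dat")
writeLines(sbuf, con = tfile)

# Read the file back to confirm its contents
written <- readLines(tfile)
```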
In demo mode, recordr stores information in the R temp directory, so any recorded information is lost when the current R session ends. To retain information permanently:
recordrConfig(rc, "homedir")
or
recordrConfig(rc, "homedir", "/Users/smith/recordr")