--- title: "Uploading Datasets to DataONE" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Uploading Datasets to DataONE} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\usepackage[utf8]{inputenc} --- ## Introduction This document describes how to use the *dataone* R package to upload data to DataONE, and how to perform maintenance operations on the data after upload. The *dataone* R package provides methods to enable R scripts to interact with DataONE Coordinating Nodes (CN) and Member Nodes (MN), to search for, download, upload and update data and metadata. The *dataone* R package takes care of the details of calling the corresponding DataONE web service on a DataONE node. For example, the *dataone* `createObject` R method calls the DataONE web service [MNStorage.create()](https://purl.dataone.org/architecture/apis/MN_APIs.html#MNStorage.create) that uploads a dataset to a DataONE MN. Before uploading any data to a DataONE MN, it is necessary to obtain a DataONE user identity that will be provided with each request to upload or update data. The method that DataONE uses to achieve this is known as *user identity authentication*, and requires that an *authentication token*, which is a character string, be provided during upload. The process to obtain this token is described in the *DataONE Federation* vignette, in the section **DataONE User Authentication With Tokens**, which is viewable with the R command `vignette("dataone-overview")`. (Note: DataONE originally used X.509 certificates for authentication, which are still supported.) ## Uploading A Package Using `uploadDataPackage` Datasets and metadata can be uploaded individually or as a collection. Such a collection, whether contained in local R objects or existing on a DataONE repository, will be informally referred to as a `package` or 'data package'. Figure 1. is a diagram of a typical DataONE package showing a metadata file that describes, or `documents` the data granules that the package contains. ![](https://raw.githubusercontent.com/DataONEorg/rdataone/main/vignettes/package-diagram.png) The steps necessary to to prepare and upload a package to DataONE using the `uploadDataPackage` method will be shown. A complete script that uses these steps is shown here, with detailed explanations of the steps following. ```{r} library(dataone) library(datapack) library(uuid) dp <- new("DataPackage") emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone") metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile) dp <- addMember(dp, metadataObj) sourceData <- system.file("extdata/OwlNightj.csv", package="dataone") sourceObj <- new("DataObject", format="text/csv", filename=sourceData) dp <- addMember(dp, sourceObj, metadataObj) progFile <- system.file("extdata/filterObs.R", package="dataone") progObj <- new("DataObject", format="application/R", filename=progFile, mediaType="text/x-rsrc") dp <- addMember(dp, progObj, metadataObj) outputData <- system.file("extdata/Strix-occidentalis-obs.csv", package="dataone") outputObj <- new("DataObject", format="text/csv", filename=outputData) dp <- addMember(dp, outputObj, metadataObj) myAccessRules <- data.frame(subject="https://orcid.org/0000-0002-2192-403X", permission="changePermission") ``` ```{r, eval=FALSE} d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2") packageId <- uploadDataPackage(d1c, dp, public=TRUE, accessRules=myAccessRules, quiet=FALSE) ``` This particular package contains the R script `filterObs.R`, the input file `OwlNightj.csv` that was read by the script and the output file `Strix-occidentalis-obs.csv` that was created by the R script, which was run at a previous time. The following sections describe each line of the above script in detail. ### 1. Create a DataPackage object. In order to use `uploadDataPackage`, it is necessary to prepare an R *DataPackage* object which is a container for the set of files that will be included in the package. The following commands load the required libraries and creates an empty DataPackage object that will be added to later: ```{r} library(dataone) library(datapack) library(uuid) dp <- new("DataPackage") ``` When using the `uploadDataPackage` method, data structures that are required by DataONE are created, configured and uploaded automatically with the package. These data structures include a [ResourceMap](https://purl.dataone.org/architecture/design/DataPackage.html#generating-resource-maps) that details the contents of the package, and SystemMetadata objects that contain DataONE system information for each of the science datasets and associated science metadata. A `dataone` *DataObject* is a container that holds both the data bytes and the system information for a metadata file, data or other type of file. A DataObject is created for each file that will be included in a DataPackage. ### 2. Prepare a metadata file that will describe the files in the package The next step is to prepare a metadata file that will describe the science datasets and other files in the package. The most common metadata format used in the DataONE network is the [Ecological Metadata Langauge](https://knb.ecoinformatics.org/#external//emlparser/docs/index.html) (EML). Other supported formats include FGDC, ISO 19115 and others. Additional information about EML is available at https://knb.ecoinformatics.org/#external//emlparser/docs/index.html. Detailed directions regarding authoring metadata documents are outside the scope of this document. DataONE requires that any file uploaded to a member node have a unique identifier associated with it. When a DataObject is created, a unique identifier is generated for the DataObject if one is not specified using the `id` parameter. This automatically generated identifier has the format "urn:uuid:", for example "urn:uuid:c3443142-6260-4ea5-aaa1-1114981e04ad". The following commands create the DataObject for the science metadata, using an automatically generated identifier: ```{r} emlFile <- system.file("extdata/strix-pacific-northwest.xml", package="dataone") metadataObj <- new("DataObject", format="eml://ecoinformatics.org/eml-2.1.1", filename=emlFile) ``` Now add the metadata object to the DataPackage: ```{r} dp <- addMember(dp, metadataObj) ``` Files are considered members of a package when they are enumerated and described by a metadata file, and a relationship between the metadata and data object is explicitly stated. DataONE (and the `dataone` R package) has adopted the package guidelines detailed by the [DataONE package](https://purl.dataone.org/architecture/design/DataPackage.html) implementation. In this specification, the relationship that links a metadata object and a science object is CiTO (Citation Typing Ontology) *documents*. This relationship between the science metadata and data objects will be added to the DataPackage automatically for each data object as it is added to the DataPackage, if the metadata object is first added, then referenced as the DataObjects are added. As the metadata object has already been added, it can be referenced as each DataObject is added. ```{r, eval=FALSE} dp <- addMember(dp, sourceObj, metadataObj) ``` Since `metadataObj` is included as the third argument here, the CiTO *documents* relationship will automatically be added between `metadataObj` and `sourceObj`. Alternatively, this relationship between the metadata and science objects can be made explicitly using the `insertRelationship()` method: ```{r, eval=FALSE} dp <- addMember(dp, metadataObj) dp <- addMember(dp, sourceObj) dp <- insertRelationship(dp, getIdentifier(metadataObj), getIdentifier(sourceObj)) ``` Note that the relationship type, using the `insertRelationship()` `predicate` argument does not have to be specified in this case, as the CiTO *documents* relationship is the default value for `insertRelationship`. ### 3. Create and add a DataObject for each data file A DataObject must be created for each metadata file, data file or any other type of file that will be included in the package. A `dataone` *SystemMetadata* R object will be created automatically and stored in each DataObject. The information from the *SystemMetadata* R object will be used by DataONE to maintain low level information about the dataset, such as the access policy, the user identity of the *rightsholder* (the user identity that can modify access the dataset), which Member Nodes it can be replicated to, etc. The example below creates a DataObject for a science dataset: ```{r} sourceData <- system.file("extdata/OwlNightj.csv", package="dataone") sourceObj <- new("DataObject", format="text/csv", filename=sourceData) dp <- addMember(dp, sourceObj, metadataObj) ``` An optional *user* argument can be specified when creating a DataObject, which will be used to set the DataONE *submitter* and *rightsholder* of the dataset when it is uploaded. The rightsholder is granted all access privileges to the object. If *user* is not specified for a *DataObject*, then the submitter and rightsholder for an object will automatically be set, when the object is uploaded to DataONE, to the DataONE user that created the authentication token or X.509 certificate. Now DataObjects for an R script and for a file created by the R script will be created: ```{r} progFile <- system.file("extdata/filterObs.R", package="dataone") progObj <- new("DataObject", format="application/R", filename=progFile, mediaType="text/x-rsrc") dp <- addMember(dp, progObj, mo=metadataObj) outputData <- system.file("extdata/Strix-occidentalis-obs.csv", package="dataone") outputObj <- new("DataObject", format="text/csv", filename=outputData) dp <- addMember(dp, outputObj, mo=metadataObj) ``` *Important Note*: files previously uploaded to DataONE can be added to the DataPackage, for more information see the section "Adding Existing DataONE Files To A Package". ### 4. Determine what access your data and metadata should have DataONE provides a mechanism that allows data submitters to control access to their data. The levels of access available to objects in DataONE are "read", "write", and "changePermission". The "read" permission allows a user the ability to view the content of a DataONE object. The "write" permission allows a user the ability to change the content of an object via update services. The "changePermission" permission allows the ability to change the access policy for an object and includes both read and write permissions. The access rules that are added to DataObjects in a DataPackage will determine the access that is granted to users accessing the package after it is uploaded to DataONE. Each of these permissions can be granted to a single user, a group of users, or the special *public* user which means all users. Each object in DataONE can have one or more access rules that control the access of that object. The complete set of access rules for an object is referred to as its *access policy*. Access rules can be added to each *DataObject* individually after it has been created. Alternatively, access rules can be specified for all package members when a package is uploaded using `uploadDataPackage`. This method is shown at the end of this section. To grant read permission to all users: ```{r} sourceObj <- setPublicAccess(sourceObj) ``` Individual access rules to be added for a DataONE user identity can also be added to the access policy. Access rules are added to a *DataObject* using the `addAccessRule` method. The following access rule will grant the user with the ORCID `https://orcid.org/0000-0002-2192-403X` `changePermission` access to the dataset: ```{r} myAccessRules <- data.frame(subject="https://orcid.org/0000-0002-2192-403X", permission="changePermission") sourceObj <- addAccessRule(sourceObj, myAccessRules) ``` DataONE user identities and user authentication are described in section *DataONE User Authentication* in the vignette *dataone-overview* (to view this vignette, type this command in the R console: `vignette("dataone-overview")`) ### 5. Upload the DataPackage When all DataObjects have been added to the DataPackage, call the `uploadDataPackage` method to upload the entire DataPackage. As mentioned previous, as an alternative to adding access rules to each DataObject individually before adding it to the DataPackage, the access rules can be specified once when the package is uploaded to DataONE. For example, to add public access to every object in the package, and add the custom access rule show above, the `public` and `accessRules` arguments are used when calling `updateDataPackage`: ```{r, eval=FALSE} d1c <- D1Client("STAGING", "urn:node:mnStageUCSB2") packageId <- uploadDataPackage(d1c, dp, public=TRUE, accessRules=myAccessRules, quiet=FALSE) message(sprintf("Uploaded package with identifier: %s", packageId)) ``` (Note that the example uses a DataONE test environment *STAGING*, and not the production environment.) After *uploadDataPackage* has been called successfully, the package can be viewed on the member node, searched for using the DataONE search facility. Note that if objects in DataONE are not publicly readable, and the authenticated user performing the search isn't granted access in an object's access policy, then the objects will not be viewable or discoverable via the search facility for that user. ## Using a Digital Object Identifier with a package. A Digital Object Identifier (DOI) may be assigned to the metadata *DataObject*, using the *generateIdentifier* method: ```{r, eval=FALSE} cn <- CNode("STAGING") mn <- getMNode(cn, "urn:node:mnStageUCSB2") doi <- generateIdentifier(mn, "DOI") metadataObj <- new("DataObject", id=doi, format="eml://ecoinformatics.org/eml-2.1.1", file=sampleEML) ``` The `generateIdentifier` method requests that the DataONE MN generate a properly formatted DOI. ## Adding Existing DataONE Files To A Package Files that have previously been uploaded to DataONE can be added to a new DataPackage. New DataObjects can be created by downloading an existing DataONE item: ```{r, eval=FALSE} dataObj <- getDataObject(d1c, id="urn:uuid:1234", lazyLoad=T, limit="1TB") dp <- addMember(dp, dataObj, mo=metadatqObj) ``` The above example creates a new DataObject by downloading the existing item from DataONE that has the identifier "urn:uuid:1234" and using the 'd1c` variable previously defined. This DataObject is then added to the package. In this example, the optional "lazyLoad" and "limit" parameters are used. These parameters cause *getDataObject* to only download the system information (i.e. access rules, format information) and not the data bytes associated with this identifier. Using these parameters can be helpful if the file associated with this identifier is very large. For the purpose of adding this DataONE identifier to a DataPackage, the data bytes are not required. Reusing DataONE items this way results in one copy of the item existing in DataONE but being included in multiple DataONE packages. ## Uploading Individual Data And Metadata Files A single data or metadata file can be uploaded to a DataONE MN using the *createObject* method. When uploading a single file using this method, additional information must be supplied to DataONE that controls how DataONE interacts with the uploaded file. This additional information is stored in DataONE as a *system metadata* object and contains information such as who can access or update the file, how many copies of the file should be maintained, whether the file has been superseded by another object, etc. The system metadata information that will be uploaded to DataONE is collected and stored in an R object type *datapack::SystemMetadata*, as shown below: ```{r} library(digest) # Create a system metadata object for a data file. # Just for demonstration purposes, create a temporary data file. testdf <- data.frame(x=1:20,y=11:30) csvfile <- paste(tempfile(), ".csv", sep="") write.csv(testdf, csvfile, row.names=FALSE) format <- "text/csv" size <- file.info(csvfile)$size sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE) # Generate a unique identifier for the dataset pid <- sprintf("urn:uuid:%s", UUIDgenerate()) sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha256) sysmeta <- addAccessRule(sysmeta, "public", "read") ``` Alternatively, the system metadata could have been created with a *seriesId*. The *seriesId* is explained in the *dataone_overview* vignette. The following example shows the creation of a *SystemMetadata* object using the optional *seriesId*: ```{r} # Create a system metadata object for a data file. # Just for demonstration purposes, create a temporary data file. testdf <- data.frame(x=1:20,y=11:30) csvfile <- paste(tempfile(), ".csv", sep="") write.csv(testdf, csvfile, row.names=FALSE) format <- "text/csv" size <- file.info(csvfile)$size sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE) # Generate a unique identifier for the dataset pid <- sprintf("urn:uuid:%s", UUIDgenerate()) # The seriesId can be any unique character string. seriesId <- sprintf("urn:uuid:%s", UUIDgenerate()) sysmeta <- new("SystemMetadata", identifier=pid, formatId=format, size=size, checksum=sha256, seriesId=seriesId) ``` A unique identifier must be specified for each system metadata, whether or not a seriesId is used. The dataset can now be uploaded to DataONE with the associated system metadata: ```{r,eval=F} cn <- CNode("STAGING") mn <- getMNode(cn, "urn:node:mnStageUCSB2") response <- createObject(mn, pid, csvfile, sysmeta) ``` Note that for this example, the DataONE test environment *STAGING* is used, and not the production environment. ## Maintaining Uploaded Datasets After data has been uploaded to DataONE, maintenance operations can be performed on these objects using the methods described in the following sections. ### Update the DataONE system metadata for an object (MNode: updateSystemMetadata) The system metadata can be updated for an object in DataONE without updating the data bytes of the object itself. For example, if an object was only readable by the data submitter, the access policy for an object can be updated to enable public access. System metadata is updated for an object using the *updateSystemMetadata* method: The following example first downloads the current system metadata for the pid from the previous example, then updates the access policy that will be applied to the object and uploads the new system metadata to DataONE so that the changes will be applied: ```{r, eval=F} cn <- CNode("STAGING") mn <- getMNode(cn, "urn:node:mnStageUCSB2") sysmeta <- getSystemMetadata(mn, pid) sysmeta <- addAccessRule(sysmeta, "public", "read") status <- updateSystemMetadata(mn, pid, sysmeta) ``` Note that updating an object in DataONE requires the proper access. For example, for the identifier shown above (urn:uuid:17d61d5a-061a-4778-9cdf-4e14751aaddc), only the DataONE user identity that is the *rightsholder*, or another user identity that has been granted write access by the rightsholder will be able to update the object on the DataONE member node. Running the previous example without the proper authentication will produce an error. ### Replace an object with a newer version (MNode: updateObject) The *updateObject* updates an existing object by creating a new object identified by a new PID on the Member Node. The new object replaces and *obsoletes* the old object. An obsoleted object in DataONE does not appear in search results, however it is still available for download if the identifier is known. ```{r, eval=F} # Update object from previous example with a new version updateid <- sprintf("urn:uuid:%s", UUIDgenerate()) testdf <- data.frame(x=1:20,y=11:30) csvfile <- paste(tempfile(), ".csv", sep="") write.csv(testdf, csvfile, row.names=FALSE) size <- file.info(csvfile)$size sha256 <- digest(csvfile, algo="sha256", serialize=FALSE, file=TRUE) # Start with the old object's sysmeta, then modify it to match # the new object. We could have also created a sysmeta from scratch. sysmeta <- getSystemMetadata(mn, pid) sysmeta@identifier <- updateid sysmeta@size <- size sysmeta@checksum <- sha256 sysmeta@obsoletes <- pid # Now update the object on the member node. response <- updateObject(mn, pid, csvfile, updateid, sysmeta) # Get the new, updated sysmeta and check it to ensure that the update # worked, i.e. "obsoletes" is the old pid that was replaced by the update. updsysmeta <- getSystemMetadata(mn, updateid) updsysmeta@obsoletes ``` The Member Node will mark the object as being *obsolete* by setting a property in the system metadata on the object being replaced. An object marked as *obsolete* will not appear in search results, however, such an object is still available for download if the PID is known. ### Remove an object from DataONE search An object can be removed from searches done with the DataONE search mechanism by calling the *archive* method with the PID of the object. This operation does not delete the object bytes, but instead updates the system metadata for the object to set the *archived* flag to true. The object can still be referenced with its PID and downloaded, but it will not appear in any search results. Objects that are archived can not be updated using the *updateObject* method. Once an object is archived it cannot be un-archived. The following statement archives the object that was just created in the previous example with the *updateObject* method. ```{r, eval=FALSE} response <- archive(mn, updateid) ``` The following commands can be used to verify that the object was archived. ```{r, eval=FALSE} sysmeta <- getSystemMetadata(mn, updateid) sysmeta@archived ```