The primary method to locate data in the DataONE network is to use the web-based search tool located at https://search.dataone.org.
Data can also be discovered programmatically from R using the dataone R package method query. Both the web page and the R method access the same underlying search mechanism, the Apache Foundation Solr search engine that runs on the DataONE Coordinating Node. The DataONE Solr index, similar to a catalog or database, contains information for every dataset that is available from the DataONE network.
The same search mechanism is available on DataONE Member Nodes. However, the search index on a Member Node only contains information for datasets that are held on that Member Node.
Information about the Solr search engine can be obtained at Solr Resources.
Additional information about searching DataONE can be viewed at Content Discovery.
query Method
Some familiarity with Solr is helpful for using the query method effectively; however, basic queries can be very powerful, and the examples in this document provide a starting point. As an alternative to composing queries using Solr syntax, a simpler search mechanism is available with the query method that uses name, value pairs (see the section “A Simplified Search”). Additional information about the query method can be obtained from the R help facility, e.g. ?query from the R command line.
The following example queries DataONE and returns values in an R list, with each value converted from the Solr result to the appropriate R data type:
library(dataone)
cn <- CNode("PROD")
queryParamList <- list(q="id:doi*", rows="5", fq="abstract:carbon", fl="id,title,dateUploaded,abstract,datasource,size")
result <- query(cn, solrQuery=queryParamList, as="list")
The ‘solrQuery’ argument takes a list of query parameters that are sent to the DataONE Solr search engine and control how the search is performed. The name of each list element is a Solr keyword, which is combined with the list element value to create a Solr query term. All these terms are combined to create the complete Solr query. The queryParamList in the example above will be used to construct this Solr query:
?q=id:doi*&rows=5&fq=abstract:carbon&fl=id,title,dateUploaded,abstract,datasource,size
The q=id:doi* term is the main query term and specifies that all DataONE objects that have an id field value beginning with doi should be returned.
The &fq=abstract:carbon term is a filter query, which filters the results so that only results with an abstract field containing the word carbon will be returned.
The &fl=id,title,dateUploaded,abstract,datasource,size term is a field specifier, so only those fields in the list will be included in the result set.
The &rows=5 term specifies that a maximum of five results will be returned.
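The way list elements turn into query terms can be illustrated with a few lines of base R. This is only a sketch of the idea described above, not the dataone package's actual implementation, which may also URL-encode the values; the variable names queryTerms and solrQueryString are made up for this illustration.

```r
# Same parameter list as in the example above.
queryParamList <- list(q="id:doi*", rows="5", fq="abstract:carbon",
                       fl="id,title,dateUploaded,abstract,datasource,size")

# Each list element becomes one "name=value" query term.
queryTerms <- paste(names(queryParamList), unlist(queryParamList), sep="=")

# The terms are joined with "&" to form the complete query string.
solrQueryString <- paste0("?", paste(queryTerms, collapse="&"))
print(solrQueryString)
```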
The result object contains all the data values found and returned from the query. Each element of the returned list contains information for one dataset. Each returned attribute for a dataset can be accessed with the appropriate element name; for example, the title of the first dataset returned is accessed with the R expression result[[1]]$title.
To print out selected information for all returned values, use:
ids <- lapply(result, function(x) {
  message(sprintf("id: %s", x$id))
  message(sprintf("origin member node: %s", x$datasource))
  message(sprintf("title: %s", x$title))
  message(sprintf("date uploaded: %s", x$dateUploaded))
  x$id
})
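When only the values are needed, without printing, a single field can be pulled out of a list result with vapply. The sketch below uses a small hand-made list standing in for what query returns with as="list"; the name mockResult and the identifiers and titles in it are placeholders, not real DataONE records.

```r
# Hypothetical stand-in for a query() result returned with as="list".
# The records below are placeholders, not real DataONE content.
mockResult <- list(
  list(id="doi:10.xxxx/example.1", title="First example dataset"),
  list(id="doi:10.xxxx/example.2", title="Second example dataset")
)

# Title of the first dataset, accessed by element name.
firstTitle <- mockResult[[1]]$title

# Collect the identifier of every dataset into a character vector.
allIds <- vapply(mockResult, function(x) x$id, character(1))
print(allIds)
```

With real results a field may be absent for some datasets, in which case x$title would be NULL and vapply would stop with an error, so a check such as is.null may be needed for optional fields.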
The complete list of possible searchable values stored for a dataset can be viewed using getQueryEngineDescription().
However, the values available for a particular dataset may be a subset of these, depending on the metadata provided when the dataset is uploaded to DataONE.
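A call to getQueryEngineDescription might look like the following sketch. It assumes that the Coordinating Node's Solr query engine is registered under the name "solr"; str is used to inspect the returned object because its exact structure is not described here. The name engineDescription is chosen for illustration.

```r
library(dataone)

# Connect to the production Coordinating Node.
cn <- CNode("PROD")

# Assumption: the query engine is identified by the name "solr".
engineDescription <- getQueryEngineDescription(cn, "solr")

# Show only the top level of the returned object.
str(engineDescription, max.level=1)
```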
The DataONE Coordinating Node (CN) contains metadata about datasets from all Member Nodes (MN) in the network. As the above example shows, sending a query to the CN may find matching datasets located on potentially any Member Node in the network.
To return results as a data frame, specify as="data.frame" in the call to query. When parse=FALSE is also specified, all values are stored as character.
The result is then a data frame with each row containing information for one dataset. The returned DataONE identifiers are held in the id column of the result, for example result[, "id"].
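A data frame query along these lines might look like the sketch below. It reuses the parameter list from the earlier example together with the as and parse arguments described above; the name resultDf is chosen here for illustration.

```r
library(dataone)

cn <- CNode("PROD")
queryParamList <- list(q="id:doi*", rows="5", fq="abstract:carbon",
                       fl="id,title,dateUploaded,abstract,datasource,size")

# as="data.frame" returns one row per dataset;
# parse=FALSE keeps every value as character.
resultDf <- query(cn, solrQuery=queryParamList, as="data.frame", parse=FALSE)

# Print the DataONE identifiers (the id column requested through fl).
print(resultDf[, "id"])
```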
A Simplified Search
A search may be performed by just specifying query fields and values using the searchTerms parameter. For example, to search for datasets that mention ‘kelp’ in the abstract and have an attribute description that contains the word ‘biomass’, the following query could be used:
cn <- CNode("PROD")
mn <- getMNode(cn, "urn:node:KNB")
mySearchTerms <- list(abstract="kelp", attribute="biomass")
result <- query(cn, searchTerms=mySearchTerms, as="data.frame")
Using the searchTerms parameter causes the query method to construct a Solr query from the list, which is then passed on to the DataONE node specified.
The names in the searchTerms list are the query field names available from the Solr query engine being used. These names can be determined using the getQueryEngineDescription function.
If it is known which Member Node holds the data of interest, then a search can be limited to just that MN by sending the search directly to that MN instead of the CN. For example, if the dataset of interest is on The Knowledge Network for Biocomplexity (KNB) Member Node, then the search is performed by obtaining that node with getMNode(cn, "urn:node:KNB") and passing it to query in place of the CN.
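Such a Member Node search might be sketched as follows. The node identifier urn:node:KNB and the getMNode call are taken from the earlier example; reusing the same Solr parameters against the MN, and the name mnResult, are assumptions made for illustration.

```r
library(dataone)

cn <- CNode("PROD")
# Look up the KNB Member Node through the Coordinating Node.
mn <- getMNode(cn, "urn:node:KNB")

queryParamList <- list(q="id:doi*", rows="5", fq="abstract:carbon",
                       fl="id,title,dateUploaded,abstract,size")

# Send the query to the MN instead of the CN.
mnResult <- query(mn, solrQuery=queryParamList, as="data.frame", parse=FALSE)
print(mnResult[, "id"])
```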
An alternative way to locate datasets on a particular node is to send the query to the CN but limit the returned results to data holdings that originated from the specific Member Node, by applying a Solr filter query parameter (&fq) to the datasource field.
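A filter on the datasource field might look like the sketch below. The escaped colons in the node identifier are an assumption about how the value must be written so that Solr does not read them as field separators; depending on the package version, a different form of escaping or encoding may be needed. The name knbResult is chosen for illustration.

```r
library(dataone)

cn <- CNode("PROD")

# Assumption: colons inside the node identifier are escaped for Solr.
queryParamList <- list(q="id:doi*", rows="5",
                       fq="datasource:urn\\:node\\:KNB",
                       fl="id,title,dateUploaded,datasource")

# Query the CN, but keep only holdings that originated from KNB.
knbResult <- query(cn, solrQuery=queryParamList, as="data.frame", parse=FALSE)
print(knbResult[, c("id", "datasource")])
```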