This document provides background information about data standards, JSON-schemas in general, and the structure of the Wildlife Disease Data Standard (WDDS) specifically. WDDS focuses on making it easy to store disease data in a consistent and FAIR format.
A JSON-schema is a human- and machine-readable document that defines a
data standard by describing the structure, properties, and constraints
for a dataset. For those of us more accustomed to thinking about
spreadsheet files and data frames, a property is roughly equivalent to a
field or column. The JSON-schema defines the rules around the type of
data used in a particular property (character, numeric, logical, etc.), and
its values (e.g. massUnits must be one of kg, mg, or g;
latitude must be between -90 and 90; sampleID
must be unique). The schema also describes how those fields should be
combined into a coherent whole (i.e. the structure of the dataset).
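As a rough illustration, the three example rules above could be written as a JSON-schema fragment. The sketch below builds such a fragment as a Python dictionary and prints it as JSON. It is illustrative only and is not copied from the WDDS schema files. It assumes each property holds an array of values (one column of a table), and it assumes the uniqueness rule for sampleID is expressed with `uniqueItems`.

```python
import json

# Illustrative schema fragment (an assumption, not the actual WDDS schema).
# Each property is modelled as an array of values, i.e. one table column.
example_schema = {
    "type": "object",
    "properties": {
        "massUnits": {
            "type": "array",
            # Controlled vocabulary: only these strings are allowed.
            "items": {"type": "string", "enum": ["kg", "mg", "g"]},
        },
        "latitude": {
            "type": "array",
            # Inclusive numeric bounds on every item.
            "items": {"type": "number", "minimum": -90, "maximum": 90},
        },
        "sampleID": {
            "type": "array",
            "items": {"type": "string"},
            # No two items in the array may be equal.
            "uniqueItems": True,
        },
    },
}

schema_text = json.dumps(example_schema, indent=2)
print(schema_text)
```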
In a JSON-schema, fields can have parent-child relationships. A field
may itself be a schema. For example, the data property in
this standard defines a data object that is a flat table
with constraints, types, and/or requirements. In this way, JSON-schema
allows for the construction of modular schema documents that can
leverage existing schemas (e.g. Darwin Core or DataCite).
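To make the parent-child idea concrete, here is a minimal sketch of a schema whose data property is itself a schema with its own properties. The property names and the helper `list_property_paths` are hypothetical and chosen only for illustration. The helper walks the nested structure and lists each property with its parent.

```python
# Illustrative parent schema: the "data" property is itself a schema.
# Property names are examples only, not the full WDDS definition.
parent_schema = {
    "type": "object",
    "properties": {
        "data": {
            "type": "object",
            "properties": {
                "sampleID": {"type": "array"},
                "latitude": {"type": "array"},
            },
        },
        "title": {"type": "string"},
    },
}


def list_property_paths(schema, prefix=""):
    """Return the dotted path of every property, parents before children."""
    paths = []
    for name, subschema in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        paths.append(path)
        # Recurse: a property may itself be a schema with properties.
        paths.extend(list_property_paths(subschema, prefix=f"{path}."))
    return paths


print(list_property_paths(parent_schema))
# ['data', 'data.sampleID', 'data.latitude', 'title']
```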
Once we have created a schema, we can then validate data against it. The validation process happens via a validation engine and tells us if the data conform to the standard. If the data do not conform, then the validation engine tells us precisely where the data are non-conformant and what the data standard expected to see.
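The sketch below imitates what a validation engine reports, using only the Python standard library. It is a hand-rolled stand-in, not a real JSON-schema validator, and the rule format, the function `validate_records`, and the example records are all assumptions made for illustration. The point is the shape of the output: each problem names the row, the field, and what the rules expected.

```python
def validate_records(records, rules):
    """Check each record against simple per-field rules.

    Returns a list of (row_index, field, message) tuples. An empty list
    means every record conforms to the rules.
    """
    problems = []
    for row, record in enumerate(records):
        for field, rule in rules.items():
            if field not in record:
                if rule.get("required", False):
                    problems.append((row, field, "required field is missing"))
                continue
            value = record[field]
            types = rule.get("types")
            if types and not isinstance(value, types):
                names = " or ".join(t.__name__ for t in types)
                problems.append((row, field, f"expected type {names}"))
                continue  # skip value checks when the type is wrong
            if "enum" in rule and value not in rule["enum"]:
                problems.append((row, field, f"expected one of {rule['enum']}"))
            if "minimum" in rule and value < rule["minimum"]:
                problems.append((row, field, f"expected a value >= {rule['minimum']}"))
            if "maximum" in rule and value > rule["maximum"]:
                problems.append((row, field, f"expected a value <= {rule['maximum']}"))
    return problems


# Hypothetical rules and records, for illustration only.
rules = {
    "sampleID": {"types": (str,), "required": True},
    "latitude": {"types": (int, float), "minimum": -90, "maximum": 90, "required": True},
    "massUnits": {"types": (str,), "enum": ["kg", "mg", "g"]},
}
records = [
    {"sampleID": "A1", "latitude": 12.5, "massUnits": "g"},
    {"sampleID": "A2", "latitude": 123.0, "massUnits": "lbs"},
]

for row, field, message in validate_records(records, rules):
    print(f"row {row}, field {field!r}: {message}")
```

Here the first record passes, while the second is flagged twice: once for a latitude above 90 and once for a unit outside the allowed set.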
For more detailed information, see JSON-Schema.org.
The Wildlife Disease Data Standard is composed of two sub-schemas: (1)
disease_data and (2) project_metadata.
disease_data describes the structure and contents of the
wildlife disease data. It has certain required fields and is extensible.
These data should be stored as a tidy dataset in a flat
file like a CSV. This component of the standard relies heavily on the Darwin Core data standard.
project_metadata describes the structure and contents of
the descriptive metadata. That is, metadata about the project that
enables discovery, identification, and attribution. This component of
the standard relies heavily on the DataCite
Metadata Schema.
Researchers may validate their data against each sub-schema
separately, or use them in tandem to validate an entire data package.
The term “data package” refers to a list or JSON object that contains
both the disease_data and project_metadata
components.
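As a sketch of the "separately or in tandem" idea, the snippet below represents a data package as a Python dictionary with the two components and checks each one for the required fields listed in this document. It only checks that required fields are present, which is far less than full schema validation. The helper names, the column-wise layout of disease_data, and the example values are assumptions.

```python
# Required fields as listed in this document for each sub-schema.
REQUIRED = {
    "disease_data": [
        "sampleID", "latitude", "longitude", "sampleCollectionMethod",
        "hostIdentification", "detectionTarget", "detectionMethod",
        "detectionOutcome", "parasiteIdentification",
    ],
    "project_metadata": [
        "methodology", "creators", "titles", "publicationYear",
        "language", "descriptions", "fundingReferences",
    ],
}


def missing_fields(component_name, component):
    """Return the required fields absent from one component."""
    return [f for f in REQUIRED[component_name] if f not in component]


def check_package(package):
    """Check both components of a data package in one pass."""
    return {
        name: missing_fields(name, package.get(name, {}))
        for name in REQUIRED
    }


# Hypothetical, deliberately incomplete package.
package = {
    "disease_data": {"sampleID": ["A1"], "latitude": [12.5], "longitude": [-60.1]},
    "project_metadata": {"titles": [{"title": "Example project"}], "language": "en"},
}

# Validate one component on its own ...
print(missing_fields("disease_data", package["disease_data"]))
# ... or the whole package at once.
print(check_package(package))
```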
Property: synonymous with field or column in a
table. A property corresponds to a particular attribute (e.g. age,
collectedBy, latitude, etc) of the data.
Required: A property that must be included for a given
schema or object within a schema.
Type: The type
of data. Common values include array, object, string, number,
integer, null, and boolean.
Array: A comma-separated group of values. Similar to a
vector in R but a little more flexible.
Array Items: Array items define acceptable values for
an array.
- minItems: how many items must be present in the array
- minimum: inclusive, smallest value allowed in an array
- maximum: inclusive, largest value allowed in an array
- enum: controlled vocabulary for an array
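The four keywords above can be illustrated with a small checker. This is a simplified sketch rather than a JSON-schema implementation: it keeps all four keywords in one flat dictionary, and the function `check_array` and the example rules are hypothetical.

```python
def check_array(values, items_rule):
    """Return messages describing how `values` breaks `items_rule`."""
    messages = []
    # minItems: the array must contain at least this many items.
    if len(values) < items_rule.get("minItems", 0):
        messages.append(f"expected at least {items_rule['minItems']} item(s)")
    for position, value in enumerate(values):
        # enum: every item must come from the controlled vocabulary.
        if "enum" in items_rule and value not in items_rule["enum"]:
            messages.append(f"item {position}: {value!r} is not in the controlled vocabulary")
        # minimum / maximum are inclusive bounds.
        if "minimum" in items_rule and value < items_rule["minimum"]:
            messages.append(f"item {position}: {value} is below the minimum")
        if "maximum" in items_rule and value > items_rule["maximum"]:
            messages.append(f"item {position}: {value} is above the maximum")
    return messages


latitude_rule = {"minItems": 1, "minimum": -90, "maximum": 90}
units_rule = {"minItems": 1, "enum": ["kg", "mg", "g"]}

print(check_array([45.0, -90, 91.2], latitude_rule))
print(check_array([], units_rule))
print(check_array(["kg", "lbs"], units_rule))
```

Because the bounds are inclusive, a latitude of exactly -90 passes while 91.2 is reported.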
We recommend that data producers use controlled vocabularies or ontologies when filling out free text fields. We recognize that selecting an appropriate vocabulary can be challenging and recommend the following platforms for finding appropriate terms.
Recommended ontology hosting and search platforms with distinct funding sources.
| Name | URL |
|---|---|
| Ontobee | https://ontobee.org/ |
| Ontology Lookup Service | https://www.ebi.ac.uk/ols4/ |
| BioPortal | https://bioportal.bioontology.org/ |
All three platforms allow users to search for terms stored in ontologies, explore relationships between terms, and find analogues. A user will have to explore a given ontology to find the most appropriate term. In Table S2 we list specific ontologies or authorities that may be appropriate for a given field.
Table S2. Recommended ontologies or authorities for specific fields.
| Field | URL |
|---|---|
| Host Identification | https://www.gbif.org/species/search |
| Gene Target | https://www.ebi.ac.uk/ols4/ontologies/go |
| Sample Collection Method | http://purl.obolibrary.org/obo/OBI_0000659 |
| Sample Collection Body Part | https://www.ebi.ac.uk/ols4/ontologies/uberon |
| Sample Collection Material | http://purl.obolibrary.org/obo/OBI_0001479 |
disease_data
Type: object
Description: REQUIRED Wildlife
disease data. Stored in tidy form.
Required Fields: sampleID, latitude, longitude,
sampleCollectionMethod, hostIdentification, detectionTarget,
detectionMethod, detectionOutcome, parasiteIdentification
Reference: schemas/disease_data.json
project_metadata
Type: object
Description: REQUIRED Metadata
for a project that largely follows the DataCite data standard.
Required Fields: methodology, creators, titles,
publicationYear, language, descriptions, fundingReferences
Reference: schemas/project_metadata.json
Type: object
Description: REQUIRED A broad
categorization of how data were collected.
Properties:
Type: array
Description: REQUIRED The full
names of the creators. Should be in the format familyName,
givenName.
Array Items
Type: string
Description: REQUIRED DataCite
name
Type: string
Description: DataCite
nameType
Type: string
Description: DataCite
givenName
Type: string
Description: DataCite
familyName
Type: array
Description: DataCite
nameIdentifiers
Array Items
Type: array
Description: DataCite
affiliation
Array Items
Type: string
Description: DataCite
lang
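For orientation, one creator entry with the properties listed above might be assembled as follows. This is a sketch under assumptions: the helper `make_creator` is hypothetical, the person is a placeholder, and only a few of the optional properties are shown. It demonstrates the recommended familyName, givenName format for the required name property.

```python
def make_creator(family_name, given_name, **optional):
    """Build one creator entry with `name` as "familyName, givenName"."""
    creator = {
        "name": f"{family_name}, {given_name}",  # required property
        "givenName": given_name,
        "familyName": family_name,
    }
    # Optional properties such as nameType, affiliation, or lang.
    creator.update(optional)
    return creator


# Placeholder person, for illustration only.
creators = [make_creator("Doe", "Jane", nameType="Personal", lang="en")]
print(creators[0]["name"])
```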
Type: array
Description: REQUIRED A name
or title by which a resource is known.
Array Items
Type: array
Description: A unique string that identifies a
resource.
Array Items
Type: array
Description: Subject, keyword, classification code, or
key phrase describing the resource.
Array Items
Type: string
Description: REQUIRED The year
when the data were or will be made publicly available.
Type: array
Description: Any rights information for this
resource.
Array Items
Type: array
Description: REQUIRED All
additional information that does not fit in any of the other categories.
May be used for technical information or detailed information associated
with a scientific instrument.
Array Items
Type: string
Description: REQUIRED The
primary language of the resource.
Type: array
Description: REQUIRED Name and
other identifying information of a funding provider.
Array Items
Type: array
Description: DataCite
relatedIdentifiers
Array Items
Type: string
Description: REQUIRED DataCite
relationType
Type: string
Description: DataCite
relatedMetadataScheme
Type: string
Description: DataCite
schemeUri
Type: string
Description: DataCite
schemeType
Type: string
Description: DataCite
resourceTypeGeneral
Type: string
Description: REQUIRED DataCite
relatedIdentifier
Type: string
Description: REQUIRED DataCite
relatedIdentifierType