
Dataset Metadata Schema

This page describes the current schema used by the ISI Datamart to represent datasets.

  • Schema Version: 1.0.0
  • Release date: June 12th, 2020
  • Authors: Pedro Szekely, Ke-Thia Yao and Daniel Garijo

Dataset Definition

We define a Dataset as a collection of files which contain information (typically observations) about one or multiple variables that describe entities of interest. For example, consider the sample table below:

Country | Number of homicides | Year
Burundi | 1000 | 2000
USA | 2000 | 2000

The table contains information about the variable number of homicides, which describes countries (Burundi, USA) in some year. Here year is a special type of variable which describes the information in the row (i.e., the number of homicides in a particular year). We refer to these special variables as qualifiers.

Warning

If a dataset contains several files, it is not required to declare all of its parts as datasets.

Describing dataset metadata

Datasets have the following required, recommended and optional properties. Required properties MUST be submitted as part of the metadata in order to be inserted in Datamart. Recommended properties may be omitted, but including them is strongly encouraged in order to exploit the full features of Datamart. Optional properties provide additional insight into the dataset, helping others understand its context. Note that some properties have qualifiers. Qualifiers are additional fields which add more information about the property and object being described, and are used by concatenating them to the described property. For example, to describe the file format of the dataset at a URL, we can use the url_file_format qualifier.
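The concatenation rule for qualifiers can be sketched in Python as follows. This is only an illustration, assuming an underscore separator as in the url_file_format example; the helper name qualifier_key and the sample URL are hypothetical and not part of Datamart.

```python
def qualifier_key(prop: str, qualifier: str) -> str:
    """Build a qualifier field name by appending the qualifier to its property.

    Assumes an underscore separator, as in the url_file_format example.
    """
    return f"{prop}_{qualifier}"


# Hypothetical metadata fragment: a download URL plus its file format qualifier.
metadata = {"url": "http://example.com/data.zip"}
metadata[qualifier_key("url", "file_format")] = "ZIP"
print(metadata)
```

The same helper would produce end_time_precision from the property end_time and the qualifier precision.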

Required Property Description and Examples
name [P1476] Expected value: String
Description: Full name of the dataset
Example: "Criminal records in the US for the year 2000"
description [schema:description] Expected value: String
Description: Text with a brief explanation of the dataset and its context
Example: "This dataset contains criminal records in the US (homicides, robbery, assault) organized by State and County as reported by their local administrations."
url [P2699] Expected value: URL
Description: URL where the Dataset can be downloaded. If the dataset includes several files, this is the URL from which all of them can be downloaded.
Example: http://s3-us-gov-west-1.amazonaws.com/cg-d4b776d0-d898-4153-90c8-8336f86bdfec/2018/AL-2018.zip
Qualifiers [OPTIONAL]: of [P642] digital_data_download [Q165194]; file_format [P2701] (e.g., ZIP [Q136218], N-Triples [Q44044])
dataset_id [P1813] Expected value: String
Description: ID of the dataset to be used in Datamart.
Example: "OECD"
Recommended Property Description and Examples
keywords [P2006020006] Expected value: String
Description: Keywords describing the dataset. Multiple entries are delimited by commas
Example: "crime, homicide"
creator [P170] Expected value: String (will be mapped to QNode of Person or Organization)
Description: Person or Organization responsible for the creation of the Dataset
Example: "Federal Bureau of Investigation"
Example: "John Doe"
contributor [P767] Expected value: String (will be mapped to QNode of Person or Organization)
Description: Person or Organization who helped in the development of the Dataset.
Example: "John Doe"
cites_work [P2860] Expected value: String
Description: Bibliographic citation for the dataset
Example: "Doe J (2014) Influence of X ... https://doi.org/10.1111/111"
Example: https://doi.org/10.1111/111
copyright_license [P275] Expected value: String
Description: License under which this copyrighted work is released
Example: "Creative Commons Attribution-ShareAlike 4.0 International" (Q18199165)
version [P2006020007] Expected value: String
Description: Version number of the Dataset. Semantic versioning in the form of X.Y.Z is preferred (where X indicates a major version, Y a minor version and Z indicates a patch or bug fixes).
Example: "1.0.0"
doi [P356] Expected value: String
Description: Digital Object Identifier (DOI) of the dataset. Note that this identifier is different from the DOI used in "Cites Work".
Example: "https://doi.org/10.1000/182"
main_subject [P921] Expected value: String (will be mapped to QNode)
Description: Primary topic(s) of a Dataset. This property may be used to identify all the entities described in a dataset
Example: "USA" (Q30)
Example: "Burundi" (Q967)
coordinate_location [P625] Expected value: String
Description: Geocoordinates of the subject in WGS84 format.
Example: "14°S, 53°W"
geoshape [P3896] Expected value: String
Description: Geographic data in Well Known Text (WKT) format.
Example: "POINT (30 10)"
Example: "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))"
country [P17] Expected value: String
Description: Country where the dataset observations were collected
Example: "USA" (Q30)
Example: "Burundi" (Q967)
location [P276] Expected value: String (will be mapped to QNode)
Description: Location of the Dataset
Example: "Los Angeles" (Q65)
Example: "Burundi" (Q967)
start_time [P580] Expected value: String
Description: Time at which the Dataset starts collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years.
Example: "2020-04-06"
Example: "2020"
Qualifiers [OPTIONAL]: precision [P2803] (e.g., Year [Q577])
end_time [P582] Expected value: String
Description: Time at which the Dataset stops collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years.
Example: "2020-04-06"
Example: "2020"
Qualifiers [OPTIONAL]: precision [P2803] (e.g., Year [Q577])
data_interval [P6339] Expected value: String [Millennium (Q36507) OR Century (Q578) OR Decade (Q39911) OR Year (Q577) OR Month (Q5151) OR Day (Q573) OR Hour (Q25235) OR Minute (Q7727) OR Second (Q11574)]
Description: Interval at which the observations are collected in the dataset.
Qualifiers [OPTIONAL]: start_time [P580], end_time [P582]
variable_measured [P2006020003] Expected value: Variable
Description: Variables that are measured in a Dataset. Variables MUST be described at least with their corresponding full name (name property).
Example: {"variable_id":"Price",
"name":"Published price listed or paid for a product",
"identifier":"https://www.wikidata.org/wiki/Property:P2284"}
mapping_file [P2006020005] Expected value: URL
Description: File used to map the dataset statements to Wikidata triples
Example: http://example.com/T2WMLProject-FBI
Qualifiers [OPTIONAL]: file_format (P2701)
Optional Property Description and Examples
official_website [P856] Expected value: String
Description: URL of the official homepage of a Dataset
Example: https://crime-data-explorer.fr.cloud.gov
date_created [P2006020008] Expected value: Date
Description: Creation date of the Dataset in ISO 8601 format (YYYY-MM-DD)
Example: 2020-04-06
api_endpoint [P6269] Expected value: String
Description: Base URL of a web service
Example: https://www.wikidata.org/w/api.php
included_in_data_catalog [P2006020009] Expected value: String (will be mapped to QNode)
Description: Catalog where the Dataset is included
Example: "FigShare" (Q17013516)
has_part [P527] Expected value: String
Description: Link to the files that are included on a Dataset (in case the dataset contains multiple files)
Example: http://example.com/example.csv1
Qualifiers [OPTIONAL]: file_format (P2701) (e.g., CSV [Q935809])
last_update [P5017] Expected value: Date
Description: Date a dataset was last updated in ISO 8601 format (YYYY-MM-DD)
Example: 2020-04-06
updated_by [P2010280001] Expected value: String
Description: Person who edited a dataset last
Example: John Doe

When a property is marked as (will be mapped to QNode) it means that Datamart will automatically transform the target string into an entity with a QNode in Wikidata. If no match is found, a new QNode will be created.
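As a rough illustration of the required properties listed above, the following Python sketch reports which of them are missing from a metadata record before submission. The function name missing_required is hypothetical, and the check (present and non-empty) is an assumption; Datamart itself may validate differently.

```python
# The four properties the schema marks as required for a dataset.
REQUIRED_DATASET_PROPERTIES = ("name", "description", "url", "dataset_id")


def missing_required(metadata: dict) -> list:
    """Return the required dataset properties that are absent or empty."""
    return [
        prop
        for prop in REQUIRED_DATASET_PROPERTIES
        if not str(metadata.get(prop) or "").strip()
    ]


complete = {
    "name": "UAZ Indicators",
    "description": "Collection of indicators",
    "url": "https://github.com/ml4ai/delphi",
    "dataset_id": "UAZ",
}
incomplete = {"name": "Criminal records in the US for the year 2000"}

print(missing_required(complete))    # []
print(missing_required(incomplete))  # ['description', 'url', 'dataset_id']
```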

Variable Metadata

Dataset variables describe the contents of a table (typically a column). When describing variables, we have the following required and recommended properties:

Required Property Description and Examples
name [P1476] Expected value: String
Description: Full name of the variable
Example: "Number of homicides"
Recommended Property Description and Examples
variable_id [P1813] Expected value: String
Description: Identifier associated with the variable. It identifies this variable in particular in this dataset, using the name it has in its corresponding column header
Example: "homicides_n"
dataset_id [P1813] Expected value: String
Description: Identifier of the dataset this variable belongs to.
Example: "UAZ"
description [schema:description] Expected value: String
Description: Text with a brief explanation of the Variable and its context
Example: "The number of homicides in a region."
corresponds_to_property [P1687] Expected value: URL
Description: URL of the variable in Wikidata. If provided, this value helps Datamart relate the variable to other variables that measure the same thing
Example: https://www.wikidata.org/wiki/Property:P2284 (for price)
main_subject [P921] Expected value: List[Object]
Description: Primary topic(s) of a variable. This property may be used to identify all the entities described by the variable. Each main subject is described by an identifier and a name.
Example: {"name":"USA", "identifier": "https://www.wikidata.org/wiki/Q30"}
Example: {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"}
unit_of_measure [P1880] Expected value: List[String] (Will be mapped to QNode)
Description: Unit of measurement used to measure the variable value.
Example: "Ethiopian Dollars per Kilogram"
Example: "ETB/Kg"
country [P17] Expected value: List[Object]
Description: Country where the variable observations were collected. Each country is described by a name and an identifier
Example: {"name":"USA", "identifier": "https://www.wikidata.org/wiki/Q30"}
Example: {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"}
location [P276] Expected value: List[Object]
Description: Location of the variable. Each location is described with a name and an identifier
Example: {"name":"Los Angeles", "identifier":"https://www.wikidata.org/wiki/Q65"}
Example: {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"}
start_time [P580] Expected value: String
Description: Time at which the Dataset starts collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years.
Example: "2020-04-06"
Example: "2020"
Qualifiers [OPTIONAL]: precision [P2803] (e.g., Year [Q577]), calendar [P2803] (e.g., Gregorian Q12138)
end_time [P582] Expected value: String
Description: Time at which the Dataset stops collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years.
Example: "2020-04-06"
Example: "2020"
Qualifiers [OPTIONAL]: precision [P2803] (e.g., Year [Q577]), calendar [P2803] (e.g., Gregorian Q12138)
data_interval [P6339] Expected value: String [Millennium (Q36507) OR Century (Q578) OR Decade (Q39911) OR Year (Q577) OR Month (Q5151) OR Day (Q573) OR Hour (Q25235) OR Minute (Q7727) OR Second (Q11574)]
Description: Interval at which the observations are collected in the dataset.
has_column_index [P2006020001] Expected value: Integer
Description: Column number that corresponds to the variable.
Example: 2
has_qualifier [P2006020002] Expected value: List[String]
Description: Qualifiers used to describe the variable
Example: "Fertilizer"
Example: "Source"
count [P1114] Expected value: Integer
Description: Number of instances of this property in this table. For instance, number of rows in a CSV that use this property.
Example: 150
geospatial_granularity [P2006180001] Expected value: String
Description: Administrative area (admin1..admin3) the variable belongs to. This classification depends on the administrative territorial entities used by countries.
Example: 150
tag [P2010050001] Expected value: String
Description: An external category that we may want to map a variable to.
Example: "Precipitation_volume" (used as tag for variable name precip)
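Because start_time and end_time accept values of varying precision (for example "2020" or "2020-04-06"), a consumer of the metadata may want to infer the precision from the string itself. The Python sketch below shows one possible way to do that with the standard library. The function name parse_time and the list of accepted formats are assumptions made for illustration, not Datamart behavior.

```python
from datetime import datetime

# Candidate ISO 8601 layouts paired with a precision label.
# The labels reuse the names from the data_interval value list.
_FORMATS = [
    ("%Y", "Year"),
    ("%Y-%m", "Month"),
    ("%Y-%m-%d", "Day"),
    ("%Y-%m-%dT%H", "Hour"),
    ("%Y-%m-%dT%H:%M", "Minute"),
    ("%Y-%m-%dT%H:%M:%S", "Second"),
]


def parse_time(value: str):
    """Return (datetime, precision) for the first layout that matches."""
    for fmt, precision in _FORMATS:
        try:
            return datetime.strptime(value, fmt), precision
        except ValueError:
            continue  # try the next, more detailed layout
    raise ValueError(f"Unsupported time value: {value!r}")


print(parse_time("2020")[1])        # Year
print(parse_time("2020-04-06")[1])  # Day
```

The inferred label could then be stored in the corresponding precision qualifier, such as end_time_precision.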

Additional qualifiers may identify description properties that have not been included in the variable schema. These properties often describe a single variable. For example, if the variable measures production, then fertilizer_type may be a qualifier, while if the variable measures an observation in the sea soil, the depth and point in time at which the measurement was collected are qualifiers.

Example:

The JSON snippet below illustrates the metadata of the food production index for Ethiopia. The fertilizer property is considered a qualifier of the variable being described, as it is not included in the schema above:

{
    "name": "Food production index",
    "variable_id": "FPI",
    "corresponds_to_property": "https://datamart.isi.edu/wiki/Property:P110026",
    "description": "Food production index, calculated from ...",
    "main_subject": [
        {"name":"Kercha", 
        "identifier":"https://www.wikidata.org/wiki/Q6393737"},
        {"name":"Liben", 
        "identifier":"https://www.wikidata.org/wiki/Q3237714"}],
    "unit_of_measure": ["tonnes/year"],
    "country": [{"name":"Ethiopia", 
        "identifier":"https://www.wikidata.org/wiki/Q115"}],
    "start_time": "1993",
    "end_time": "2016",
    "end_time_precision": "Year",
    "data_interval": "Monthly",
    "has_qualifier": "Fertilizer"
} 

An example of dataset metadata with only the required properties:

{
    "name": "UAZ Indicators",
    "description": "Collection of indicators, including indicators from FAO, WDI, FEWSNET, CLiMIS, UNICEF, ieconomics.com, UNHCR, DSSAT, WHO, IMF, WHP, ACLDE, World Bank and IOM-DTM",
    "url": "https://github.com/ml4ai/delphi",
    "dataset_id": "UAZ"
}

Some variables may belong to already existing CSVs, and therefore we may have information about their position in the table. In the example below, has_column_index identifies the variable as the second column of the spreadsheet, while dataset_id indicates the URL of the dataset the variable was included in:

{
    "name": "Number of homicides worldwide",
    "variable_id": "NumberH",
    "description": "Number of homicides per country/year as collected by ...",
    "main_subject": [
        {"name": "United States of America",
        "identifier": "https://www.wikidata.org/wiki/Q30"},
        {"name": "Ethiopia",
        "identifier": "https://www.wikidata.org/wiki/Q115"}],
    "start_time": "2000",
    "end_time": "2020",
    "end_time_precision": "Year",
    "end_time_calendar": "Gregorian",
    "data_interval": "Year",
    "has_column_index": 2,
    "dataset_id": "http://example.org/Crimes.csv"
} 
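To show how has_column_index might be consumed, the Python sketch below pulls the values of a variable out of a CSV table. It assumes the index is 1-based, since the text above uses the value 2 for the second column. The function name column_values is hypothetical, and the sample data is the small homicides table from the Dataset Definition section.

```python
import csv
import io


def column_values(csv_text: str, column_index) -> tuple:
    """Return (header, values) for a 1-based column index (assumed convention)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    position = int(column_index) - 1  # convert the 1-based index to 0-based
    if not rows or position < 0 or position >= len(rows[0]):
        raise IndexError(f"Column index {column_index} is out of range")
    header = rows[0][position]
    values = [row[position] for row in rows[1:]]
    return header, values


# Sample table taken from the Dataset Definition section.
SAMPLE = "Country,Number of homicides,Year\nBurundi,1000,2000\nUSA,2000,2000\n"
variable = {"name": "Number of homicides", "has_column_index": 2}

header, values = column_values(SAMPLE, variable["has_column_index"])
print(header, values)  # Number of homicides ['1000', '2000']
```

The call to int() also accepts the index when it arrives as a string such as "2".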

Example with minimal metadata:

{
    "name": "gross domestic product based on purchasing power parity",
     "variable_id": "GDP2", 
     "dataset_id": "WDI"
}

Acknowledgements:

We have used Wikidata and Schema.org as reference schemas to build the Datamart Dataset Schema. We have also used the Google Dataset Search guide as a reference for structuring our suggested minimum and required properties.


Contribution Guidelines

If you have suggestions or concerns about any of the aspects covered in this schema, please open an issue in our GitHub repository with the headline [DatasetSchema].