Dataset Metadata Schema
This page describes the current schema used by the ISI Datamart to represent datasets.
- Schema Version: 1.0.0
- Release date: June 12th, 2020
- Authors: Pedro Szekely, Ke-Thia Yao and Daniel Garijo
Dataset Definition
We define a Dataset as a collection of files that contain information (typically observations) about one or more variables that describe entities of interest. For example, consider the sample table below:
Country | Number of homicides | Year |
---|---|---|
Burundi | 1000 | 2000 |
USA | 2000 | 2000 |
The table contains information about the variable number of homicides, which describes countries (Burundi, USA) in some year. Here, year is a special type of variable that qualifies the information in the row (i.e., the number of homicides in a particular year). We refer to these special variables as qualifiers.
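To make the distinction between variables and qualifiers concrete, the sketch below models the sample table as a list of observations. The `Observation` class and the `to_observations` helper are hypothetical names introduced only for illustration; they are not part of the Datamart schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One observed value of a variable for one entity (hypothetical model)."""
    entity: str
    variable: str
    value: float
    qualifiers: dict = field(default_factory=dict)


# The sample table above, one dict per row.
ROWS = [
    {"Country": "Burundi", "Number of homicides": 1000, "Year": 2000},
    {"Country": "USA", "Number of homicides": 2000, "Year": 2000},
]


def to_observations(rows, entity_column, variable_column, qualifier_columns):
    """Turn table rows into observations, keeping qualifier columns separate."""
    return [
        Observation(
            entity=row[entity_column],
            variable=variable_column,
            value=row[variable_column],
            qualifiers={name: row[name] for name in qualifier_columns},
        )
        for row in rows
    ]


observations = to_observations(ROWS, "Country", "Number of homicides", ["Year"])
for obs in observations:
    print(obs.entity, obs.variable, obs.value, obs.qualifiers)
```

Here the country is the entity, the number of homicides is the variable, and the year travels with each observation as a qualifier.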
Warning
If a dataset contains several files, it is not required to declare all of its parts as datasets.
Describing dataset metadata
Datasets have the following required, recommended and optional properties. Required properties MUST be submitted as part of the metadata in order for the dataset to be inserted in Datamart. Recommended properties may be omitted, but including them is highly recommended in order to exploit the full features of Datamart. Optional properties provide additional insight into the dataset, helping others understand its context. Note that some properties have qualifiers. Qualifiers are additional fields that add more information about the property and the object being described; they are used by appending the qualifier name to the described property. For example, to describe the file format of the dataset found at a url, we can use the url_file_format qualifier.
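The naming convention for qualifiers can be sketched in a few lines of Python. This is a minimal illustration that assumes the property and qualifier names are joined with an underscore, as in url_file_format; the helper names `qualified_key` and `attach_qualifier` are hypothetical and not part of Datamart.

```python
def qualified_key(prop: str, qualifier: str) -> str:
    """Build the metadata key for a qualifier of a property."""
    return f"{prop}_{qualifier}"


def attach_qualifier(metadata: dict, prop: str, qualifier: str, value) -> dict:
    """Return a copy of the metadata with a qualifier added to an existing property."""
    if prop not in metadata:
        raise KeyError(f"cannot qualify missing property: {prop}")
    updated = dict(metadata)
    updated[qualified_key(prop, qualifier)] = value
    return updated


metadata = {"url": "http://example.com/data.zip"}  # hypothetical URL
metadata = attach_qualifier(metadata, "url", "file_format", "ZIP")
print(metadata)
```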
Required Property | Description and Examples |
---|---|
name [P1476] | Expected value: String. Description: Full name of the dataset. Example: "Criminal records in the US for the year 2000" |
description [schema:description] | Expected value: String. Description: Text with a brief explanation of the dataset and its context. Example: "This dataset contains criminal records in the US (homicides, robbery, assault) organized by State and County as reported by their local administrations." |
url [P2699] | Expected value: URL. Description: URL from which the Dataset can be downloaded. If the dataset includes several files, this is the URL from which all of them can be downloaded. Example: http://s3-us-gov-west-1.amazonaws.com/cg-d4b776d0-d898-4153-90c8-8336f86bdfec/2018/AL-2018.zip Qualifiers [OPTIONAL]: of [P642] digital_data_download [Q165194]; file_format [P2701] (e.g., ZIP [Q136218], N-Triples [Q44044]) |
dataset_id [P1813] | Expected value: String. Description: ID of the dataset to be used in Datamart. Example: "OECD" |
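Before submitting metadata, it can help to check that the four required properties are present. The following is a minimal sketch of such a check; the function name `missing_required` is hypothetical, and treating empty or non-string values as missing is an assumption rather than documented Datamart behavior.

```python
# Required dataset properties, as listed in the table above.
REQUIRED_DATASET_PROPERTIES = ("name", "description", "url", "dataset_id")


def missing_required(metadata: dict) -> list:
    """Return the required properties that are absent or empty in the metadata."""
    missing = []
    for prop in REQUIRED_DATASET_PROPERTIES:
        value = metadata.get(prop)
        # Assumption: every required value is a non-empty string.
        if not isinstance(value, str) or not value.strip():
            missing.append(prop)
    return missing


complete = {
    "name": "Criminal records in the US for the year 2000",
    "description": "Criminal records organized by State and County.",
    "url": "http://example.com/records.zip",  # hypothetical URL
    "dataset_id": "OECD",
}
incomplete = {"name": "Criminal records in the US for the year 2000"}

print(missing_required(complete))
print(missing_required(incomplete))
```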
Recommended Property | Description and Examples |
---|---|
keywords [P2006020006] | Expected value: String. Description: Keywords describing the dataset. Multiple entries are delimited by commas. Example: "crime, homicide" |
creator [P170] | Expected value: String (will be matched to QNode of Person or Organization). Description: Person or Organization responsible for the creation of the Dataset. Examples: "Federal Bureau of Investigation", "John Doe" |
contributor [P767] | Expected value: String (will be matched to QNode of Person or Organization). Description: Person or Organization who helped in the development of the Dataset. Example: "John Doe" |
cites_work [P2860] | Expected value: String. Description: Bibliographic citation for the dataset. Examples: "Doe J (2014) Influence of X ... https://doi.org/10.1111/111", https://doi.org/10.1111/111 |
copyright_license [P275] | Expected value: String. Description: License under which this copyrighted work is released. Example: "Creative Commons Attribution-ShareAlike 4.0 International" (Q18199165) |
version [P2006020007] | Expected value: String. Description: Version number of the Dataset. Semantic versioning in the form X.Y.Z is preferred (where X indicates a major version, Y a minor version and Z a patch or bug fix). Example: "1.0.0" |
doi [P356] | Expected value: String. Description: Digital Object Identifier (DOI) of the dataset. Note that this identifier is different from the DOI used in cites_work. Example: "https://doi.org/10.1000/182" |
main_subject [P921] | Expected value: String (will be mapped to QNode). Description: Primary topic(s) of a Dataset. This property may be used to identify all the entities described in a dataset. Examples: "USA" (Q30), "Burundi" (Q967) |
coordinate_location [P625] | Expected value: String. Description: Geocoordinates of the subject in WGS84 format. Example: "14°S, 53°W" |
geoshape [P3896] | Expected value: String. Description: Geographic data in Well Known Text (WKT) format. Examples: "POINT (30 10)", "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))" |
country [P17] | Expected value: String. Description: Country where the dataset observations were collected. Examples: "USA" (Q30), "Burundi" (Q967) |
location [P276] | Expected value: String (will be mapped to QNode). Description: Location of the Dataset. Examples: "Los Angeles" (Q65), "Burundi" (Q967) |
start_time [P580] | Expected value: String. Description: Time at which the Dataset starts collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years. Examples: "2020-04-06", "2020". Qualifiers [OPTIONAL]: precision [P2803] (e.g., Year [Q577]) |
end_time [P582] | Expected value: String. Description: Time at which the Dataset stops collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years. Examples: "2020-04-06", "2020". Qualifiers [OPTIONAL]: precision [P2803] (e.g., Year [Q577]) |
data_interval [P6339] | Expected value: String [Millennium (Q36507) OR Century (Q578) OR Decade (Q39911) OR Year (Q577) OR Month (Q5151) OR Day (Q573) OR Hour (Q25235) OR Minute (Q7727) OR Second (Q11574)]. Description: Interval at which the observations are collected in the dataset. Qualifiers [OPTIONAL]: start_time [P580], end_time [P582] |
variable_measured [P2006020003] | Expected value: Variable. Description: Variables that are measured in a Dataset. Variables MUST be described at least with their corresponding full name (name property). Example: {"variable_id":"Price", "name":"Published price listed or paid for a product", "identifier":"https://www.wikidata.org/wiki/Property:P2284"} |
mapping_file [P2006020005] | Expected value: URL. Description: File used to map the dataset statements to Wikidata triples. Example: http://example.com/T2WMLProject-FBI Qualifiers [OPTIONAL]: file_format [P2701] |
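Because start_time and end_time accept values of varying precision, a consumer of the metadata may want to infer the precision from the string itself. The sketch below does this with the Python standard library. The list of accepted layouts (in particular the `T` separator for times of day) is an assumption based on common ISO 8601 usage, and `parse_time` is a hypothetical helper, not a Datamart function.

```python
from datetime import datetime

# Precision labels and QNodes, as listed for data_interval in the table above.
PRECISION_QNODES = {
    "Year": "Q577", "Month": "Q5151", "Day": "Q573",
    "Hour": "Q25235", "Minute": "Q7727", "Second": "Q11574",
}

# Assumed accepted layouts, ordered from coarsest to finest precision.
_FORMATS = [
    ("%Y", "Year"),
    ("%Y-%m", "Month"),
    ("%Y-%m-%d", "Day"),
    ("%Y-%m-%dT%H", "Hour"),
    ("%Y-%m-%dT%H:%M", "Minute"),
    ("%Y-%m-%dT%H:%M:%S", "Second"),
]


def parse_time(value: str):
    """Return (datetime, precision label) for a start_time or end_time string."""
    for fmt, precision in _FORMATS:
        try:
            return datetime.strptime(value, fmt), precision
        except ValueError:
            continue  # try the next, finer-grained layout
    raise ValueError(f"not a supported ISO 8601 value: {value!r}")


for example in ("2020", "2020-04-06"):
    parsed, precision = parse_time(example)
    print(example, parsed.isoformat(), precision, PRECISION_QNODES[precision])
```

The inferred label could then be stored in the optional precision qualifier (for example, end_time_precision).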
Optional Property | Description and Examples |
---|---|
official_website [P856] | Expected value: String. Description: URL of the official homepage of a Dataset. Example: https://crime-data-explorer.fr.cloud.gov |
date_created [P2006020008] | Expected value: Date. Description: Creation date of the Dataset in ISO 8601 format (YYYY-MM-DD). Example: 2020-04-06 |
api_endpoint [P6269] | Expected value: String. Description: Base URL of a web service. Example: https://www.wikidata.org/w/api.php |
included_in_data_catalog [P2006020009] | Expected value: String (will be mapped to QNode). Description: Catalog where the Dataset is included. Example: "FigShare" (Q17013516) |
has_part [P527] | Expected value: String. Description: Link to the files that are included in a Dataset (in case the dataset contains multiple files). Example: http://example.com/example.csv1 Qualifiers [OPTIONAL]: file_format [P2701] (e.g., CSV [Q935809]) |
last_update [P5017] | Expected value: Date. Description: Date a dataset was last updated, in ISO 8601 format (YYYY-MM-DD). Example: 2020-04-06 |
updated_by [P2010280001] | Expected value: String. Description: Person who last edited the dataset. Example: John Doe |
When a property is marked as "(will be mapped to QNode)" or "(will be matched to QNode)", Datamart will automatically transform the target string into an entity with a QNode in Wikidata. If no match is found, a new QNode will be created.
Variable Metadata
Dataset variables describe the contents of a table (typically a column). Variables have the following required and recommended properties:
Required Property | Description and Examples |
---|---|
name [P1476] | Expected value: String. Description: Full name of the variable. Example: "Number of homicides" |
Recommended Property | Description and Examples |
---|---|
variable_id [P1813] | Expected value: String. Description: Identifier associated with the variable. It identifies this particular variable in this dataset, using the name it has in its corresponding column header. Example: "homicides_n" |
dataset_id [P1813] | Expected value: String. Description: Identifier of the dataset this variable belongs to. Example: "UAZ" |
description [schema:description] | Expected value: String. Description: Text with a brief explanation of the Variable and its context. Example: "The number of homicides in a region." |
corresponds_to_property [P1687] | Expected value: URL. Description: URL of the variable in Wikidata. If provided, this value helps Datamart relate the variable to other variables that measure the same thing. Example: https://www.wikidata.org/wiki/Property:P2284 (for price) |
main_subject [P921] | Expected value: List[Object]. Description: Primary topic(s) of a variable. This property may be used to identify all the entities described by the variable. Each main subject is described by an identifier and a name. Examples: {"name":"USA", "identifier":"https://www.wikidata.org/wiki/Q30"}, {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"} |
unit_of_measure [P1880] | Expected value: List[String] (will be mapped to QNode). Description: Unit of measurement used to measure the variable value. Examples: "Ethiopian Dollars per Kilogram", "ETB/Kg" |
country [P17] | Expected value: List[Object]. Description: Country where the variable observations were collected. Each country is described by a name and an identifier. Examples: {"name":"USA", "identifier":"https://www.wikidata.org/wiki/Q30"}, {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"} |
location [P276] | Expected value: List[Object]. Description: Location of the variable. Each location is described by a name and an identifier. Examples: {"name":"Los Angeles", "identifier":"https://www.wikidata.org/wiki/Q65"}, {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"} |
start_time [P580] | Expected value: String. Description: Time at which the Dataset starts collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years. Examples: "2020-04-06", "2020". Qualifiers [OPTIONAL]: precision [P2803] (e.g., Year [Q577]), calendar [P2803] (e.g., Gregorian Q12138) |
end_time [P582] | Expected value: String. Description: Time at which the Dataset stops collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years. Examples: "2020-04-06", "2020". Qualifiers [OPTIONAL]: precision [P2803] (e.g., Year [Q577]), calendar [P2803] (e.g., Gregorian Q12138) |
data_interval [P6339] | Expected value: String [Millennium (Q36507) OR Century (Q578) OR Decade (Q39911) OR Year (Q577) OR Month (Q5151) OR Day (Q573) OR Hour (Q25235) OR Minute (Q7727) OR Second (Q11574)]. Description: Interval at which the observations are collected in the dataset. |
has_column_index [P2006020001] | Expected value: Integer. Description: Column number that corresponds to the variable. Example: 2 |
has_qualifier [P2006020002] | Expected value: List[String]. Description: Qualifiers used to describe the variable. Examples: "Fertilizer", "Source" |
count [P1114] | Expected value: Integer. Description: Number of instances of this property in this table. For instance, the number of rows in a CSV that use this property. Example: 150 |
geospatial_granularity [P2006180001] | Expected value: String. Description: Administrative area (admin1..admin3) the variable belongs to. This classification depends on the administrative territorial entities used by countries. Example: 150 |
tag [P2010050001] | Expected value: String. Description: An external category that we may want to map a variable to. Example: "Precipitation_volume" (used as tag for variable name precip) |
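The expected value types in the tables above lend themselves to a lightweight structural check. The sketch below covers only the required name and the three List[Object] properties whose entries carry a name and an identifier; `check_variable` is a hypothetical helper, and the exact error messages are illustrative.

```python
# Properties whose value is a list of {"name": ..., "identifier": ...} objects.
ENTITY_LIST_PROPERTIES = ("main_subject", "country", "location")


def check_variable(metadata: dict) -> list:
    """Return a list of human-readable problems found in variable metadata."""
    problems = []
    name = metadata.get("name")
    if not isinstance(name, str) or not name.strip():
        problems.append("name: required non-empty string")
    for prop in ENTITY_LIST_PROPERTIES:
        if prop not in metadata:
            continue  # recommended properties may be omitted
        entries = metadata[prop]
        if not isinstance(entries, list):
            problems.append(f"{prop}: expected a list of objects")
            continue
        for i, entry in enumerate(entries):
            has_keys = isinstance(entry, dict) and all(
                key in entry for key in ("name", "identifier")
            )
            if not has_keys:
                problems.append(f"{prop}[{i}]: expected 'name' and 'identifier'")
    return problems


variable = {
    "name": "Number of homicides",
    "country": [{"name": "USA", "identifier": "https://www.wikidata.org/wiki/Q30"}],
}
print(check_variable(variable))
print(check_variable({"country": "USA"}))
```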
Additional qualifiers may identify description properties that have not been included in the variable schema. These properties often describe a single variable. For example, if the variable measures production, then fertilizer_type may be a qualifier, while if the variable measures an observation in the sea soil, the depth and the point in time at which the measurement was collected are qualifiers.
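One way to see which keys of a metadata record act as additional qualifiers is to compare them against the property names in the variable schema. The sketch below is a heuristic under stated assumptions: keys that extend a schema property with an underscore suffix (such as end_time_precision) are treated as qualifiers of that property, and everything else is treated as an additional qualifier. The function name `split_keys` and the sample fertilizer value are hypothetical.

```python
# Property names from the variable schema tables above.
VARIABLE_PROPERTIES = {
    "name", "variable_id", "dataset_id", "description",
    "corresponds_to_property", "main_subject", "unit_of_measure",
    "country", "location", "start_time", "end_time", "data_interval",
    "has_column_index", "has_qualifier", "count",
    "geospatial_granularity", "tag",
}


def split_keys(metadata: dict) -> dict:
    """Group metadata keys into schema properties and two kinds of qualifiers."""
    groups = {"schema": [], "property_qualifiers": [], "additional_qualifiers": []}
    for key in metadata:
        if key in VARIABLE_PROPERTIES:
            groups["schema"].append(key)
        elif any(key.startswith(prop + "_") for prop in VARIABLE_PROPERTIES):
            # Heuristic: "<property>_<qualifier>", e.g. end_time_precision.
            groups["property_qualifiers"].append(key)
        else:
            groups["additional_qualifiers"].append(key)
    return groups


record = {
    "name": "Food production index",
    "end_time": "2016",
    "end_time_precision": "Year",
    "fertilizer_type": "urea",  # hypothetical value, for illustration only
}
print(split_keys(record))
```

Note that the prefix heuristic would misclassify an additional qualifier whose name happens to start with a schema property name followed by an underscore.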
Example:
The following JSON snippet illustrates the metadata of the food production index variable for Ethiopia. The fertilizer property is considered a qualifier of the variable being described, as it is not included in the schema above:
{
  "name": "Food production index",
  "variable_id": "FPI",
  "corresponds_to_property": "https://datamart.isi.edu/wiki/Property:P110026",
  "description": "Food production index, calculated from ...",
  "main_subject": [
    {"name": "Kercha",
     "identifier": "https://www.wikidata.org/wiki/Q6393737"},
    {"name": "Liben",
     "identifier": "https://www.wikidata.org/wiki/Q3237714"}],
  "unit_of_measure": ["tonnes/year"],
  "country": [{"name": "Ethiopia",
               "identifier": "https://www.wikidata.org/wiki/Q115"}],
  "start_time": "1993",
  "end_time": "2016",
  "end_time_precision": "Year",
  "data_interval": "Monthly",
  "has_qualifier": ["Fertilizer"]
}
Another example, this time describing a dataset with minimal metadata:
{
  "name": "UAZ Indicators",
  "description": "Collection of indicators, including indicators from FAO, WDI, FEWSNET, CLiMIS, UNICEF, ieconomics.com, UNHCR, DSSAT, WHO, IMF, WHP, ACLDE, World Bank and IOM-DTM",
  "url": "https://github.com/ml4ai/delphi",
  "dataset_id": "UAZ"
}
Some variables may belong to already existing CSVs, and therefore we may have information about their position in the table. In the example below, has_column_index indicates that the variable is in the second column of the spreadsheet, while the dataset_id property indicates the URL of the dataset the variable is included in:
{
  "name": "Number of homicides worldwide",
  "variable_id": "NumberH",
  "description": "Number of homicides per country/year as collected by ...",
  "main_subject": [
    {"name": "United States of America",
     "identifier": "https://www.wikidata.org/wiki/Q30"},
    {"name": "Ethiopia",
     "identifier": "https://www.wikidata.org/wiki/Q115"}],
  "start_time": "2000",
  "end_time": "2020",
  "end_time_precision": "Year",
  "end_time_calendar": "Gregorian",
  "data_interval": "Year",
  "has_column_index": 2,
  "dataset_id": "http://example.org/Crimes.csv"
}
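To illustrate how has_column_index can be used, the sketch below pulls the matching column out of a CSV. It assumes the index is 1-based, which is consistent with the value 2 denoting the second column in the example above; `column_for_variable` is a hypothetical helper and the CSV content reuses the sample table from the Dataset Definition section.

```python
import csv
import io

# The sample table from the Dataset Definition section, as CSV text.
CSV_TEXT = "Country,Number of homicides,Year\nBurundi,1000,2000\nUSA,2000,2000\n"


def column_for_variable(csv_text: str, variable: dict):
    """Return (header, values) of the column a variable points to."""
    # Assumption: has_column_index is 1-based; accept an int or a numeric string.
    index = int(variable["has_column_index"]) - 1
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, *data = rows
    return header[index], [row[index] for row in data]


variable = {"name": "Number of homicides worldwide", "has_column_index": 2}
header, values = column_for_variable(CSV_TEXT, variable)
print(header, values)
```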
Example with minimal metadata:
{
  "name": "gross domestic product based on purchasing power parity",
  "variable_id": "GDP2",
  "dataset_id": "WDI"
}
Acknowledgements:
We have used Wikidata and Schema.org as reference schemas to build the Datamart Dataset Schema. We have also used the Google Dataset Search guide as a reference for structuring our suggested minimum and required properties.
Contribution Guidelines
If you have suggestions or concerns about any of the aspects covered in this schema, please open an issue in our GitHub repository with the headline [DatasetSchema].