Dataset Metadata Schema

This page describes the current schema used by the ISI Datamart to represent datasets.

Schema Version: 1.0.0
Release date: June 12th, 2020
Authors: Pedro Szekely, Ke-Thia Yao and Daniel Garijo

Dataset Definition

We define a Dataset as a collection of files which contain information (typically observations) about one or multiple variables that describe entities of interest. For example, consider the sample table below:

Country	Number of homicides	Year
Burundi	1000	2000
USA	2000	2000

The table contains information about the variable number of homicides, which describes countries (Burundi, USA) in some year. Here, year is a special type of variable that describes the information in the row (i.e., the number of homicides in a particular year). We refer to these special variables as qualifiers.
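As a rough illustration of this definition, the sketch below turns the rows of the sample table into observation records, treating the year column as a qualifier. The record layout (`entity`, `variable`, `value`, `qualifiers`) and the `to_observations` helper are assumptions made for this example only; they are not part of the Datamart schema.

```python
# Rows of the sample table above.
rows = [
    {"Country": "Burundi", "Number of homicides": 1000, "Year": 2000},
    {"Country": "USA", "Number of homicides": 2000, "Year": 2000},
]


def to_observations(rows, entity_column, variable_column, qualifier_columns):
    """Turn table rows into observation records (hypothetical layout)."""
    observations = []
    for row in rows:
        observations.append({
            "entity": row[entity_column],        # what the observation is about
            "variable": variable_column,         # what is being measured
            "value": row[variable_column],       # the measured value
            # Qualifiers give context for the value in this row (here, the year).
            "qualifiers": {q: row[q] for q in qualifier_columns},
        })
    return observations


observations = to_observations(rows, "Country", "Number of homicides", ["Year"])
for observation in observations:
    print(observation)
```

Each record keeps the value together with the qualifier that tells us which year it refers to.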

Warning

If a dataset contains several files, it is not required to declare each of its parts as a separate dataset.

Describing dataset metadata

Datasets have required, recommended and optional properties. Required properties MUST be submitted as part of the metadata for a dataset to be inserted into Datamart. Recommended properties may be omitted, but including them is strongly encouraged in order to exploit the full features of Datamart. Optional properties provide additional insight into the dataset, helping others understand its context. Note that some properties have qualifiers. Qualifiers are additional fields that add more information about the property and the object being described; they are named by appending the qualifier name to the name of the described property. For example, to describe the file format of the file that a dataset's url points to, we can use the url_file_format qualifier.
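The naming convention for qualifiers can be sketched as follows. The underscore separator is inferred from the names url_file_format and end_time_precision used on this page; the `qualifier_key` helper and the example URL are hypothetical and only serve as illustration.

```python
def qualifier_key(property_name: str, qualifier_name: str) -> str:
    """Build a qualifier field name by appending the qualifier to the property.

    Assumption: property and qualifier names are joined with an underscore,
    as in "url_file_format".
    """
    return f"{property_name}_{qualifier_name}"


# Hypothetical dataset metadata with a qualifier on the "url" property.
dataset = {"url": "http://example.com/data.zip"}
dataset[qualifier_key("url", "file_format")] = "ZIP"

print(dataset)
```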

Required Property	Description and Examples
`name` [P1476]	*Expected value: String* *Description: Full name of the dataset Example*: "Criminal records in the US for the year 2000"
`description` [schema:description]	*Expected value: String* *Description: Text with a brief explanation of the dataset and its context Example*: "This dataset contains criminal records in the US (homicides, robbery, assault) organized by State and County as reported by their local administrations."
`url` [P2699]	*Expected value: URL* *Description: URL where to download the Dataset. If the dataset includes several files, this would be the URL where to download all of them. Example: http://s3-us-gov-west-1.amazonaws.com/cg-d4b776d0-d898-4153-90c8-8336f86bdfec/2018/AL-2018.zip Qualifiers [OPTIONAL]*: `of` [P642] `digital_data_download` [Q165194]; `file_format` [P2701] (e.g., ZIP [Q136218], N-Triples [Q44044])
`dataset_id` [P1813]	*Expected value: String* *Description: ID of the dataset to be used in Datamart. Example*: "OECD"
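Since required properties MUST be present before a dataset can be inserted, a client may want to check for them before submitting. The following is a minimal sketch of such a check; the `missing_required` helper is hypothetical and does not reflect how Datamart itself validates metadata, and the example values are illustrative.

```python
# Required dataset properties, as listed in the table above.
REQUIRED_DATASET_PROPERTIES = ("name", "description", "url", "dataset_id")


def missing_required(metadata: dict) -> list:
    """Return the required properties that are absent or empty in `metadata`."""
    return [prop for prop in REQUIRED_DATASET_PROPERTIES if not metadata.get(prop)]


complete = {
    "name": "UAZ Indicators",
    "description": "Collection of indicators",
    "url": "https://github.com/ml4ai/delphi",
    "dataset_id": "UAZ",
}
incomplete = {"name": "UAZ Indicators"}

print(missing_required(complete))
print(missing_required(incomplete))
```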

Recommended Property	Description and Examples
`keywords` [P2006020006]	*Expected value: String* *Description: Keywords describing the dataset. Multiple entries are delimited by commas Example*: "crime, homicide"
`creator` [P170]	*Expected value: String* (will be matched to QNode of Person or Organization) *Description: Person or Organization responsible for the creation of the Dataset Example: "Federal Bureau of Investigation" Example*: "John Doe"
`contributor` [P767]	*Expected value: String* (will be matched to QNode of Person or Organization) *Description: Person or Organization who helped in the development of the Dataset. Example*: "John Doe"
`cites_work` [P2860]	*Expected value: String* *Description: Bibliographic citation for the dataset Example: "Doe J (2014) Influence of X ... https://doi.org/10.1111/111" Example*: https://doi.org/10.1111/111
`copyright_license` [P275]	*Expected value: String* *Description: license under which this copyrighted work is released Example*: "Creative Commons Attribution-ShareAlike 4.0 International" (Q18199165)
`version` [P2006020007]	*Expected value: String* *Description: Version number of the Dataset. Semantic versioning in the form of X.Y.Z is preferred (where X indicates a major version, Y a minor version and Z indicates a patch or bug fixes). Example*: "1.0.0"
`doi` [P356]	*Expected value: String* *Description: Digital Object Identifier (DOI) of the dataset. Note that this identifier is different from the DOI used in "Cites Work". Example*: "https://doi.org/10.1000/182"
`main_subject` [P921]	*Expected value: String* (will be mapped to QNode) *Description: Primary topic(s) of a Dataset. This property may be used to identify all the entities described in a dataset Example: "USA" (Q30) Example*: "Burundi"(Q967)
`coordinate_location` [P625]	*Expected value: String* *Description: Geocoordinates of the subject in WGS84 format. Example*: "14°S, 53°W"
`geoshape` [P3896]	*Expected value: String* *Description: Geographic data in Well Known Text (WKT) format. Example: "POINT (30 10)" Example*: "POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))"
`country` [P17]	*Expected value: String* *Description: Country where the dataset observations were collected Example: "USA" (Q30) Example*: "Burundi"(Q967)
`location` [P276]	*Expected value: String* (will be mapped to QNode) *Description: Location of the Dataset Example: "Los Angeles" (Q65) Example*: "Burundi"(Q967)
`start_time` [P580]	*Expected value: String* *Description: Time at which the Dataset starts collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years. Example: "2020-04-06" Example: "2020" Qualifiers [OPTIONAL]*: `precision` [P2803] (e.g., Year [Q577])
`end_time` [P582]	*Expected value: String* *Description: Time at which the Dataset stops collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years. Example: "2020-04-06" Example: "2020" Qualifiers [OPTIONAL]*: `precision` [P2803] (e.g., Year [Q577])
`data_interval` [P6339]	*Expected value: String [Millennium (Q36507) OR Century (Q578) OR Decade (Q39911) OR Year (Q577) OR Month (Q5151) OR Day (Q573) OR Hour (Q25235) OR Minute (Q7727) OR Second (Q11574)]* *Description: Interval at which the observations are collected in the dataset. Qualifiers [OPTIONAL]*: `start_time` [P580], `end_time` [P582]
`variable_measured` [P2006020003]	*Expected value: Variable* *Description: Variables that are measured in a Dataset. Variables MUST be described at least with their corresponding full name (`name` property). Example*: {"variable_id":"Price", "name":"Published price listed or paid for a product", "identifier":"https://www.wikidata.org/wiki/Property:P2284"}
`mapping_file` [P2006020005]	*Expected value: URL* *Description: File used to map the dataset statements to Wikidata triples Example: http://example.com/T2WMLProject-FBI Qualifiers [OPTIONAL]*: `file_format` (P2701)
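The start_time and end_time values may be given at different precisions (for example "2020" or "2020-04-06"), with the precision optionally stated in a qualifier. The sketch below shows one way a client could infer the precision from the value itself. The `infer_precision` helper, the set of accepted formats and the "T" separator for times of day are assumptions for illustration; the page does not specify how Datamart parses these values.

```python
from datetime import datetime

# Candidate formats, from most to least precise. The labels reuse the
# interval names listed for data_interval (Year, Month, Day, ...).
_FORMATS = [
    ("%Y-%m-%dT%H:%M:%S", "Second"),
    ("%Y-%m-%dT%H:%M", "Minute"),
    ("%Y-%m-%dT%H", "Hour"),
    ("%Y-%m-%d", "Day"),
    ("%Y-%m", "Month"),
    ("%Y", "Year"),
]


def infer_precision(value: str) -> str:
    """Return the precision label of the first format that parses `value`."""
    for fmt, precision in _FORMATS:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            continue  # not this format, try the next (coarser) one
        return precision
    raise ValueError(f"Unrecognized time value: {value!r}")


for value in ("2020", "2020-04-06"):
    print(value, infer_precision(value))
```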

Optional Property	Description and Examples
`official_website` [P856]	*Expected value: String* *Description: URL of the official homepage of a Dataset Example*: https://crime-data-explorer.fr.cloud.gov
`date_created` [P2006020008]	*Expected value: Date* *Description: Creation date of the Dataset in ISO 8601 format (YYYY-MM-DD) Example*: 2020-04-06
`api_endpoint` [P6269]	*Expected value: String* *Description: Base URL of a web service Example*: https://www.wikidata.org/w/api.php
`included_in_data_catalog` [P2006020009]	*Expected value: String* (will be mapped to QNode) *Description: Catalog where the Dataset is included Example*: "FigShare"(Q17013516)
`has_part` [P527]	*Expected value: String* *Description: Link to the files that are included in a Dataset (in case the dataset contains multiple files) Example: http://example.com/example.csv1 Qualifiers [OPTIONAL]*: `file_format` (P2701) (e.g., CSV [Q935809])
`last_update` [P5017]	*Expected value: Date* *Description: Date a dataset was last updated in ISO 8601 format (YYYY-MM-DD) Example*: 2020-04-06
`updated_by` [P2010280001]	*Expected value: String* *Description: Person who edited a dataset last Example*: John Doe

When a property is marked as (will be mapped to QNode) it means that Datamart will automatically transform the target string into an entity with a QNode in Wikidata. If no match is found, a new QNode will be created.

Variable Metadata

Dataset variables describe the contents of a table (typically a column). When describing variables, we have the following required and recommended properties:

Required Property	Description and Examples
`name` [P1476]	*Expected value: String* *Description: Full name of the variable Example*: "Number of homicides"

Recommended Property	Description and Examples
`variable_id` [P1813]	*Expected value: String* *Description: Identifier associated with the variable. It identifies this variable in particular in this dataset, using the name it has in its corresponding column header Example*: "homicides_n"
`dataset_id` [P1813]	*Expected value: String* *Description: Identifier of the dataset this variable belongs to. Example*: "UAZ"
`description` [schema:description]	*Expected value: String* *Description: Text with a brief explanation of the Variable and its context Example*: "The number of homicides in a region."
`corresponds_to_property` [P1687]	*Expected value: URL* *Description: URL of the variable in Wikidata. If provided, this value helps Datamart relating the variable to other variables that measure the same thing Example*: https://www.wikidata.org/wiki/Property:P2284 (for price)
`main_subject` [P921]	*Expected value: List[Object]* *Description: Primary topic(s) of a variable. This property may be used to identify all the entities described by the variable. Each main subject is described by an identifier and a name. Example: {"name":"USA", "identifier": "https://www.wikidata.org/wiki/Q30"} Example*: {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"}
`unit_of_measure` [P1880]	*Expected value: List[String]* (Will be mapped to QNode) *Description: Unit of measurement used to measure the variable value. Example: "Ethiopian Dollars per Kilogram" Example*: "ETB/Kg"
`country` [P17]	*Expected value: List[Object]* *Description: Country where the variable observations were collected. Each country is described by a name and an identifier Example: {"name":"USA", "identifier": "https://www.wikidata.org/wiki/Q30"} Example*: {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"}
`location` [P276]	*Expected value: List[Object]* *Description: Location of the variable. Each location is described with a name and an identifier Example: {"name":"Los Angeles", "identifier":"https://www.wikidata.org/wiki/Q65"} Example*: {"name":"Burundi", "identifier":"https://www.wikidata.org/wiki/Q967"}
`start_time` [P580]	*Expected value: String* *Description: Time at which the Dataset starts collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years. Example: "2020-04-06" Example: "2020" Qualifiers [OPTIONAL]*: `precision` [P2803] (e.g., Year [Q577]), `calendar` [P2803] (e.g., Gregorian Q12138)
`end_time` [P582]	*Expected value: String* *Description: Time at which the Dataset stops collecting observations. The value should follow the ISO 8601 format (YYYY-MM-DD). Precision may vary from seconds to years. Example: "2020-04-06" Example: "2020" Qualifiers [OPTIONAL]*: `precision` [P2803] (e.g., Year [Q577]), `calendar` [P2803] (e.g., Gregorian Q12138)
`data_interval` [P6339]	*Expected value: String [Millennium (Q36507) OR Century (Q578) OR Decade (Q39911) OR Year (Q577) OR Month (Q5151) OR Day (Q573) OR Hour (Q25235) OR Minute (Q7727) OR Second (Q11574)]* *Description*: Interval at which the observations are collected in the dataset.
`has_column_index` [P2006020001]	*Expected value: Integer* *Description: Column number that corresponds to the variable. Example*: 2
`has_qualifier` [P2006020002]	*Expected value: List[String]* *Description: Qualifiers used to describe the variable Example: "Fertilizer" Example*: "Source"
`count` [P1114]	*Expected value: Integer* *Description: Number of instances of this property in this table. For instance, number of rows in a CSV that use this property. Example*: 150
`geospatial_granularity` [P2006180001]	*Expected value: String* *Description: Administrative area (admin1..admin3) the variable belongs to. This classification depends on the administrative territorial entities used by countries. Example*: 150
`tag` [P2010050001]	*Expected value: String* *Description: An external category that we may want to map a variable to. Example*: "Precipitation_volume" (used as `tag` for variable name `precip`)

Additional qualifiers may identify description properties that have not been included in the variable schema. These properties often describe a single variable. For example, if the variable measures production, then the fertilizer_type may be a qualifier, while if the variable measures an observation in the sea soil, the depth and point in time at which the measurement was collected are qualifiers.

Example:

The following JSON snippet illustrates the metadata of the food production index variable for Ethiopia. The fertilizer property is considered a qualifier of the variable being described, as it is not included in the schema above:

{
    "name": "Food production index",
    "variable_id": "FPI",
    "corresponds_to_property": "https://datamart.isi.edu/wiki/Property:P110026",
    "description": "Food production index, calculated from ...",
    "main_subject": [
        {"name":"Kercha", 
        "identifier":"https://www.wikidata.org/wiki/Q6393737"},
        {"name":"Liben", 
        "identifier":"https://www.wikidata.org/wiki/Q3237714"}],
    "unit_of_measure": ["tonnes/year"],
    "country": [{"name":"Ethiopia", 
        "identifier":"https://www.wikidata.org/wiki/Q115"}],
    "start_time": "1993",
    "end_time": "2016",
    "end_time_precision": "Year",
    "data_interval": "Month",
    "has_qualifier": "Fertilizer"
}

Another example, this time of a dataset described with minimal metadata (only the required properties):

{
    "name": "UAZ Indicators",
    "description": "Collection of indicators, including indicators from FAO, WDI, FEWSNET, CLiMIS, UNICEF, ieconomics.com, UNHCR, DSSAT, WHO, IMF, WHP, ACLDE, World Bank and IOM-DTM",
    "url": "https://github.com/ml4ai/delphi",
    "dataset_id": "UAZ"
}

Some variables may belong to already existing CSVs, and therefore we may have information about their position in the table. In the example below, has_column_index indicates that the variable is in the second column of the spreadsheet, while the dataset_id property indicates the URL of the dataset that includes the variable:

{
    "name": "Number of homicides worldwide",
    "variable_id": "NumberH",
    "description": "Number of homicides per country/year as collected by ...",
    "main_subject":[
        {"name":"United States of America", 
        "identifier":"https://www.wikidata.org/wiki/Q30"},
        {"name":"Ethiopia", 
        "identifier":"https://www.wikidata.org/wiki/Q115"}],
    "start_time": "2000",
    "end_time": "2020",
    "end_time_precision": "Year",
    "end_time_calendar": "Gregorian",
    "data_interval": "Year",
    "has_column_index": 2,
    "dataset_id": "http://example.org/Crimes.csv"
}
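To illustrate how has_column_index could be used, the sketch below pulls the values of a variable out of a CSV by its column number. It assumes the index is 1-based, since the text describes index 2 as the second column; the `column_values` helper and the inline CSV (taken from the sample table at the top of this page) are illustrative only.

```python
import csv
import io

# Inline CSV with the same content as the sample table on this page.
csv_text = "Country,Number of homicides,Year\nBurundi,1000,2000\nUSA,2000,2000\n"

# Abridged variable metadata; only the column index matters here.
variable = {"name": "Number of homicides worldwide", "has_column_index": 2}


def column_values(csv_file, column_index):
    """Return (header, values) for a 1-based column index (assumption)."""
    reader = csv.reader(csv_file)
    header = next(reader)
    position = column_index - 1  # convert the 1-based index to a list position
    return header[position], [row[position] for row in reader]


header, values = column_values(
    io.StringIO(csv_text), int(variable["has_column_index"])
)
print(header, values)
```

The `csv` module returns every cell as a string, so numeric values would still need to be converted before use.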

Example with minimal metadata:

{
    "name": "gross domestic product based on purchasing power parity",
    "variable_id": "GDP2",
    "dataset_id": "WDI"
}

Acknowledgements:

We have used Wikidata and Schema.org as reference schemas to build the Datamart Dataset Schema. We have also used the Google Dataset Search guide as a reference for structuring our suggested minimum and required properties.

Contribution Guidelines

If you have suggestions or concerns about any of the aspects covered in this schema, please open an issue in our GitHub repository with the headline [DatasetSchema].