Data Schema and Examples¶
This page describes the data format returned by ISI Datamart.
- Schema Version: 0.0.3
- Release date: April 15th, 2020
- Authors: Pedro Szekely, Ke-Thia Yao and Daniel Garijo
Canonical Data Format¶
The canonical data format specifies the preferred schema for downloading or uploading data for individual variables from or to Datamart. Additional formats will be supported for different use cases, and these are documented in the next sections.
Data can be downloaded or uploaded to Datamart as text delimited files (CSV or TSV) and as JSON where the names of the attributes in the JSON format corresponds to the names of columns in the text delimited files.
Columns¶
The canonical data format supports a large number of columns, which provide significant flexibility and convenience for both upload (data registration) and download. The table below lists all the columns that may appear in a canonical data file:
Column Name
: the name of the column in the data fileType
: the type of value; the typeidentifier
designates strings to used as identifiers and should contain only alphanumeric, underscore or minus.Up
: whether this column is required (r
) or optional (o
) for data upload.Down
: whether this column is required (r
) or optional (o
) for data download. When a column is optional, capitalO
specifies that the column will be included by default.Description
: the meaning of the column.Examples
: examples of values that may be present in a cell, different examples appear one per line.
Column Name | Type | Up | Down | Description | Examples |
---|---|---|---|---|---|
dataset_id |
identifier | o | r | Globally unique identifier of a dataset in Datamart | Wikidata |
dataset |
string | o | r | Name of a dataset | Wikidata Knowledge Base |
variable_id |
identifier | o | r | Identifier of a variable; must be unique within the dataset, but variables with the same identifier may be present in multiple datasets, e.g., the population variable may be present in a dataset from the census and a dataset from the World Bank | population price |
variable |
string | o | O | Name of a variable | Total population both sexes Price based on local currency |
category |
string | o | o | The category of data present in a variable. For example, variable production could be tagged with category Crops, and may have multiple main subjects corresponding to individual corps such as maize, wheat, etc. |
Crops |
main_subject |
string | r | r | The subject for which a variable is measured, for example population of Ethiopia , production of Maize , speed of TGV . Often, additional values are needed to fully specify a variable, for example the population may be measured for a specific age group, ethnicity or gender. The main subject is meant to be used as a primary key for joins as it is expected that joins on main subject are meaningful. |
Ethiopia Maize TGV |
main_subject_id |
identifier | o | O | Wikidata identifier for the entity present in the main_subject column. The value may be missing if Datamart is unable to entity link the main subject to Wikidata. |
Q25618328 (for Maize) Q843942 (for Teff) |
value |
number | r | r | The value of the variable. The value is a number and will be left blank when the value of the variable is non numeric; in such cases the value will be present in the qualifier columns. | 1.182 |
value_unit |
string | o | O | The unit of measure of the variable value. When possible the unit of measure will be specified in SI units in a machine and human readable language. It is possible to list the units as a string. | quintal M quintal (mega quintal) |
time |
string | r | r | The time when the value of a variable was measured. For download, the time will be formated in ISO format with seconds resolution. For upload the dates may be provided in a wide variety of formats. | 2016-01-01T00:00:00 (download) Jan 1, 2016 (upload) |
time_precision |
string | o | O | Precision of the time (second, minute, hour, day, month, year). On download, times will are formatted in ISO format with second precision for easy parsing. The time_precision attribute specifies the intended precision, and is useful for determining how to format the axis of charts. For upload, the time precision is optional and if not provided is inferred automatically. |
year |
country |
string | o | O | The country associated with the location information of the variable value. The country is optional for upload; if not provided, it will be automatically inferred from other geospatial attributes, or left empty. For download, the country will be present if it was provided on upload or inferred. The country will be formatted using the Wikipedia English label. | Ethiopia |
country_id |
string | o | o | The Wikidata identifier for the country. | Q115 (Ethiopia) |
admin1 |
string | o | o | First-level administrative country subdivision, such as states in USA, provinces in Canada, and regions in Ethiopia. This attribute is optional for upload, and will be inferred automatically from lower level admin attributes or geospatial coordinates if provided. For download, the admin1 will be formatted using the Wikipedia English label. | Oromia Region California |
admin1_id |
string | o | o | Wikidata identifier for the first-level administrative country subdivision. | Q202107 (Oromia Region) |
admin2 |
string | o | o | Second-level administrative country subdivision, such as counties in USA and zones in Ethiopia. This attribute is optional for upload, and will be inferred automatically from lower level admin attributes or geospatial coordinates if provided. For download, the admin2 will be formatted using the Wikipedia English label. | Bale Zone |
admin2_id |
string | o | o | Wikidata identifier for the second-level administrative country subdivision. | Q804883 (Bale Zone) |
admin3 |
string | o | o | Third-level administrative country subdivision, such as municipalities in USA or woredas in Ethiopia. This attribute is optional for upload, and will be inferred automatically from geospatial coordinates if provided. For download, the admin3 will be formatted using the Wikipedia English label. | Goba |
admin3_id |
string | o | o | Wikidata identifier for the third-level administrative country subdivision | Q3109573 (Goba)) |
place |
string | o | o | Geographic place, such as cities, neighborhoods, mountains, lakes, etc. This attribute is used to specify places that do not fit into any of the other attributes for geopolitical entities. For upload any string can be entered and the system will attempt to link it to a place in Wikidata (geonames). If linking is possible and unambiguous the Wikidata identifier will be recorded in place_id . |
Addis Ababa |
place_id |
string | o | o | Wikidata identifier for a place. This attribute is optional for upload, if not provided the system will attempt to infer it from the value of the place attribute if provided. |
Q3624 (Addis Ababa) |
coordinate |
string | o | O | Latitude and longitude coordinates in WKT format. This attribute is optional for upload. If not provided, it will be inferred from the shape, if provided, otherwise from the lowest level admin level or place provided. | POINT(9.001 38.757) |
shape |
string | o | o | Geometric shape in WKT format. This attribute is not inferred if not provided. | POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10)) |
stated_in |
string | o | O | Source of the data variable | WDI |
stated_in_id |
string | o | O | Wikidata identifier for the source of the data variable | Q8035640 |
Qualifiers¶
The table above specifies the standard columns in the canonical data format. In many applications, it is useful or necessary to supply additional information to fully specify the context for the measurement of a variable. Qualifiers are an extension mechanism to specify additional attributes for a variable. Each variable may have its own collection of qualifiers.
For example, a population
variable may include qualifiers to specify that population values are for a specific ethnicity, age group or
gender; a number of casualties
variable may include qualifiers to distinguish civilian or combatant casualties.
Qualifiers can appear as additional columns in canonical data files for variables. The following conventions are used:
- Any column whose name is different from the names used in the table above will be interpreted as a column containing qualifier information.
- Columns with suffix
_id
will be interpreted as containing identifiers for another column. For example,gender_id
is interpreted as containing identifiers for agender
column.
For upload, Datamart will provide little or no understanding of qualifier columns; Datamart will automatically detect common data types such as numbers and dates, and store them appropriately to support range queries. The user in the loop curation tools will provide additional support for qualifier columns, normalizing values by linking them to Wikidata.
Examples¶
Country Population¶
Uploading tables to Datamart must be with respect to a particular variable (variable_id
) within a dataset
(dataset_id
). ISI Datamart follows the REST software architectural style. The dataset_id
and variable_id
are
specified as part of the REST endpoint URL.
The following table is an example of a simple canonical data file for a variable called population
.
variable_id | value | time | country |
---|---|---|---|
population | 100,000,000 | 2018 | Ethiopia |
population | 109,000,000 | 2019 | Ethiopia |
population | 320,000,000 | 2018 | USA |
population | 328,000,000 | 2019 | USA |
When this table is uploaded to Datamart, say to the dataset_id
Wikidata
, it will automatically perform the following enhancements:
- Check that
dataset_id
Wikidata
exists - Check that
variable_id
population
is variable ofWikidata
- Validates whether the cells of the
variable_id
column matches the identifierpopulation
. - Validates whether the cells of the
value
column are numeric. - Validates whether the cells of the
time
column are dates. - Parse the values as numbers assuming American conventions for commas and decimal points.
- Convert the times to ISO format and automatically determine the precision as year.
- Link the countries to Wikidata so that country names are standardized (United States of America instead of USA), and support downloading the country identifiers to support accurate joins.
- Infer
country
as the main subject of the variable.
When a client downloads the variable, it will by default include columns as show below. Note: the API enables clients to configure the set of optional columns present in the downloaded files; the API also supports downloading the raw uploaded file.
dataset_id | variable_id | main_subject | main_subject_id | value | time | time_precision | country | coordinate |
---|---|---|---|---|---|---|---|---|
Wikidata | population | Ethiopia | Q115 | 100000000 | 2018-01-01T00:00:00 | year | Ethiopia | POINT(9.145 40.490) |
Wikidata | population | Ethiopia | Q115 | 109000000 | 2019-01-01T00:00:00 | year | Ethiopia | POINT(9.145 40.490) |
Wikidata | population | United States of America | Q30 | 320000000 | 2018-01-01T00:00:00 | year | United States of America | POINT(37.090 -95.713) |
Wikidata | population | United States of America | Q30 | 328000000 | 2019-01-01T00:00:00 | year | United States of America | POINT(37.090 -95.713) |
Production Data¶
Upload admin1-level (region-level) food production data. In this example the units (M quintal) are written in line with the values. The units also could have been separated out in this own value_unit column, or included in the column header.
variable_id | main_subject | value | time | admin1 |
---|---|---|---|---|
production | maize | 1.182 M quintal | 2016-01-01T00:00:00 | Oromia |
production | teff | 2.345 M quintal | 2016-01-01T00:00:00 | Oromia |
production | maize | 2.234 M quintal | 2017-01-01T00:00:00 | Oromia |
production | teff | 3.356 M quintal | 2017-01-01T00:00:00 | Oromia |
Download food production data with optional identifier columns.
dataset_id | variable_id | main_subject | main_subject_id | value | value_unit | time | time_precision | country | country_id | admin1 | admin1_id |
---|---|---|---|---|---|---|---|---|---|---|---|
Wikidata | production | maize | Q25618328 | 1.182 | M quintal | 2016-01-01T00:00:00 | year | Ethiopia | Q115 | Oromia | Q202107 |
Wikidata | production | teff | Q843942 | 2.345 | M quintal | 2016-01-01T00:00:00 | year | Ethiopia | Q115 | Oromia | Q202107 |
Wikidata | production | maize | Q25618328 | 2.234 | M quintal | 2017-01-01T00:00:00 | year | Ethiopia | Q115 | Oromia | Q202107 |
Wikidata | production | teff | Q843942 | 3.356 | M quintal | 2017-01-01T00:00:00 | year | Ethiopia | Q115 | Oromia | Q202107 |
Grid Data¶
Upload grid data of precipitation for the month of June 2016. In this example, the precipitation unit is included in the column header.
variable | value in mm | time | time_precision | coordinate |
---|---|---|---|---|
rain | 543 | 2016-06 | month | POINT(9.15 40.70) |
rain | 490 | 2016-06 | month | POINT(9.25 40.70) |
rain | 550 | 2016-06 | month | POINT(9.15 40.80) |
rain | 528 | 2016-06 | month | POINT(9.25 40.80) |