Commit

Reworded and hopefully clarified in a few places.
danbri committed Aug 1, 2019
1 parent bc51fe7 commit c1a53f9
Showing 1 changed file with 17 additions and 8 deletions.
25 changes: 17 additions & 8 deletions docs/data-and-datasets.html
@@ -85,11 +85,10 @@ <h1>

<h1>Data and Datasets overview</h1>

<p>This note provides some background on the various notions of "data" and "dataset" related to Schema.org.</p>
<p>This note provides some background on various notions of "data" and "dataset" related to Schema.org.</p>

<p>Schema.org as a project, and as a collection of terms, is entirely devoted to data. We define types
such as <a href="/Event">Event</a>, <a href="/NewsArticle">NewsArticle</a>, <a href="/Review">Review</a>, <a href="/Person">Person</a>, as well as properties that characterize and interlink instances of
these types. For example, the "alumni" property links <a href="/Person">people</a> with <a href="/EducationalOrganization">educational organizations</a>.
<p>Schema.org as a project, and as a collection of terms, is entirely devoted to data. In other words, it <em>always</em> provides, characterises, describes, or encodes some form of data. Schema.org defines particular types
such as <a href="/Event">Event</a>, <a href="/NewsArticle">NewsArticle</a>, <a href="/Review">Review</a>, <a href="/Person">Person</a>, as well as properties that characterize and interlink instances of these types. For example, the <a href="/alumni">alumni</a> property links <a href="/Person">people</a> with <a href="/EducationalOrganization">educational organizations</a>. The <a href="/alumni">alumni</a> property exists to provide information about people being alumni of organizations; <a href="/Volcano">Volcano</a> exists to provide information about volcanoes, and so on. However, there can sometimes be confusion when the thing we are providing information about is itself thought of as (typically a bundle of) data.
</p>

<p>Schema.org itself also contains some dedicated vocabulary that can be used in applications which publish,
@@ -98,10 +97,11 @@ <h1>Data and Datasets overview</h1>
collection of structured data schemas, and complements numerous other data-related formats and standards.
</p>

<p>In particular, schema.org defines vocabulary for providing Dataset metadata, alongside (proposed) vocabulary for describing aggregate statistics:</p>
<ul>
<li>
When describing collections of data, for example as published in scientific, scholarly or governmental
"open data" repositories, the <a href="/Dataset">Dataset</a> type can be used, alongside <a href="/DataCatalog">DataCatalog</a> to indicate the larger collection, and <a href="/DataDownload">DataDownload</a> for specific representations of a dataset.
When describing collections of packaged data, for example as published in scientific, scholarly or governmental
"open data" repositories, the <a href="/Dataset">Dataset</a> type can be used, alongside <a href="/DataCatalog">DataCatalog</a> to indicate the overall collection, and <a href="/DataDownload">DataDownload</a> for specific representations of a dataset.
These "datasets", unlike typical use of Schema.org, can be in arbitrary formats. For example, they may include data that is stored in collections of spreadsheet files, or as digital images, or in dedicated scientific, geospatial and engineering file formats. Such diversity reflects the complexity of real-world data, but the use of diverse and often incompatible
formats also makes it hard to integrate the information that they encode, e.g. for use in unified "knowledge graphs" such as <a href="https://wikidata.org">Wikidata</a> and <a href="https://DataCommons.org">DataCommons.org</a>.
Schema.org's <a href="/Dataset">Dataset</a> vocabulary was originally based on <a href="https://en.wikipedia.org/wiki/Data_Catalog_Vocabulary">DCAT</a>, which in turn <a href="https://www.w3.org/TR/vocab-dcat/#basic-example">used</a> <a href="http://dublincore.org/">Dublin Core</a> and <a href="http://xmlns.com/foaf/spec/">FOAF</a> terms.
@@ -112,14 +112,23 @@ <h1>Data and Datasets overview</h1>
</li>
</ul>

<p>
To take a specific example, the Volcano type in schema.org is useful for volcano data, but in a different way from a <a href="/Dataset">Dataset</a> type being used to describe a collection of data about volcanoes (e.g. in CSV or XML format). Similarly, the Population / Observation types can be used to represent aggregate statistics of "populations" of volcanoes.
While http://schema.org/Volcano can be used to directly provide information about specific volcanoes, the http://schema.org/Dataset and http://schema.org/Observation types emphasise the data level of abstraction more directly.
</p>
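<p>The contrast drawn here between describing a volcano directly and describing a dataset about volcanoes can be sketched in code. The following is only an illustration, not part of the Schema.org documentation: every name, coordinate and URL in it (such as "Example Peak" and example.org) is a hypothetical placeholder, and the helper <code>as_script_tag</code> is invented for this sketch.</p>

```python
import json

SCHEMA = "https://schema.org"

# Direct description of one volcano. All values are hypothetical placeholders.
volcano = {
    "@context": SCHEMA,
    "@type": "Volcano",
    "name": "Example Peak",
    "geo": {"@type": "GeoCoordinates", "latitude": 0.0, "longitude": 0.0},
}

# Description of a packaged collection of data about volcanoes.
# The Dataset says nothing about any single volcano; it describes the bundle.
dataset = {
    "@context": SCHEMA,
    "@type": "Dataset",
    "name": "Example volcano survey",
    "description": "Hypothetical CSV file listing observed volcanoes.",
    "includedInDataCatalog": {
        "@type": "DataCatalog",
        "name": "Example open data catalog",
    },
    "distribution": [
        {
            "@type": "DataDownload",
            "encodingFormat": "text/csv",
            "contentUrl": "https://example.org/volcanoes.csv",
        }
    ],
}


def as_script_tag(doc):
    """Serialize a JSON-LD document for embedding in an HTML page."""
    body = json.dumps(doc, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


if __name__ == "__main__":
    print(as_script_tag(volcano))
    print(as_script_tag(dataset))
```

<p>In this sketch the first description is about a place, while the second is about a file and the catalog that lists it; the facts inside the CSV file stay in the file's own format.</p>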
<p>
Other related work includes W3C's <a href="https://www.w3.org/TR/tabular-data-primer/">CSVW</a>
and <a href="https://www.w3.org/TR/vocab-data-cube/">RDF Data Cube</a> specifications, as well as
the <a href="https://google.github.io/dspl/dspl2-spec.html">DSPL 2.0</a> specification. DSPL 2.0 combines Schema.org
for per-dataset metadata with the use of CSV files to represent code lists, enumerations and statistical observations.
These technologies all in turn depend on lower-level standards, such as for JSON-LD, RDFa, Microdata, XML, Unicode etc.,
DSPL2 provides an explicit high-fidelity representation of datasets in their own terms, rather than mapping everything into
Schema.org.</p>
<p>These technologies all in turn depend on lower-level standards, such as for JSON-LD, RDFa, Microdata, XML, Unicode etc.,
and share a broadly <a href="https://en.wikipedia.org/wiki/Resource_Description_Framework">RDF-like</a> approach to
representing information.
representing information. There are also related standards from W3C and elsewhere dedicated to lifting factual data out of
various kinds of dataset, into RDF statements that use vocabularies such as Schema.org. For examples, see <a href="https://www.w3.org/TR/r2rml/">R2RML</a>, which addresses this for SQL; <a href="https://en.wikipedia.org/wiki/GRDDL">GRDDL</a> for XML via XSLT; the
<a href="https://www.w3.org/TR/csv2rdf/">CSVW to RDF mappings</a> for static tabular data, and JSON-LD's context mechanism for
certain forms of JSON data.
</p>
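<p>As a rough illustration of the last point, a JSON-LD context maps the keys of ordinary JSON onto vocabulary terms. The sketch below is an assumption-laden toy: the record, its keys and the function names are hypothetical, and <code>toy_expand</code> handles only flat, top-level string mappings. It is not the JSON-LD expansion algorithm, for which a conforming JSON-LD processor would be needed.</p>

```python
# A plain JSON record as some existing application might produce it.
# Keys and values are hypothetical.
plain = {
    "title": "Example volcano survey",
    "summary": "Hypothetical CSV of volcano observations.",
    "internalId": 42,
}

# A context mapping two of the local keys onto Schema.org property IRIs.
context = {
    "title": "https://schema.org/name",
    "summary": "https://schema.org/description",
}


def attach_context(record, ctx):
    """Return a JSON-LD document: the original record plus an @context."""
    return {"@context": ctx, **record}


def toy_expand(doc):
    """Rewrite top-level keys to the IRIs given in @context.

    Toy version only: keys without a mapping are dropped, and nested
    values, @vocab, datatypes and other context features are ignored.
    """
    ctx = doc.get("@context", {})
    return {ctx[key]: value for key, value in doc.items() if key in ctx}


if __name__ == "__main__":
    print(toy_expand(attach_context(plain, context)))
```

<p>The original JSON is left unchanged; only the added context says how its keys relate to Schema.org terms.</p>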

</div>
