A presentation of our work on the Participatory Accounting of Social Impacts

A presentation of our work on the Participatory Accounting of Social Impacts (in Scotland) – PASI.

Click on the image below to open the presentation in your web browser, then use <, > or SPACE BAR to navigate through the slides (and s to see the speaker notes).

DCS PASI presentation title page screenshot

Data flow in the proof-of-concept implementation for PASI

We have been exploring the idea of building a platform for the Participatory Accounting of Social Impacts (in Scotland) – PASI. Waste reduction is the social impact that we are focussing on for our proof-of-concept (PoC) implementation, and the diagram below shows the flow of waste reduction-related data through our PoC.

A few notes about this PoC:

  • Potentially any individual/organisation can be a “participant” (a peer-actor) in the PASI information system. A participant might publish data into PASI’s open, distributed (RDF) data graph, and/or consume data from it.
    In our PoC, participants can…
  • Supply measurement/observational data (quantities, times, descriptions). E.g. the instances of reuse/recycling supplied by ACE, STCMF, FRSHR and STCIL.
  • Provide reference metrics (measuring and categorisation standards). E.g. the carbon impact metric provided by ZWS.
  • Contribute secondary data (joining data, secondary calculations). E.g. the source→reference mappings, and the  calculated standardised waste reduction data contributed by DCS.
  • Build apps which consume the data from the PASI information system. E.g. a webapp which provides a dashboard onto waste reduction, for the general public.
  • Directly use the data in the distributed PASI graph. E.g. a federated SPARQL query constructed by a data analyst.

Data flow in the PASI PoC

Linked Open Data using a low-code, GraphQL based approach

“Might GraphQL be an easy means to query Linked Open Data?  Moreover, might it even be a handy means for building it?”

We use a product called JUXT Site to explore these questions.

JUXT Site

JUXT Site is a software product which offers a low-code approach for building HTTP-accessible databases. It has an add-on which allows for those databases (data models with their read & write operations) to be defined using GraphQL schemas.

JUXT Site is built on top of a database engine called XTDB which natively supports (temporal) graph queries (in the Datalog language). It has an add-on which supports (a subset of) SPARQL.
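
To give a flavour of that native query support, here is a minimal sketch of putting a document into an in-memory XTDB node and querying it back with Datalog. It assumes XTDB 1.x's Clojure API and uses a made-up document – it is not code from our PoC:

(require '[xtdb.api :as xt])

(with-open [node (xt/start-node {})]                          ; throw-away, in-memory node
  (xt/submit-tx node [[::xt/put {:xt/id "pasi:ent/StcmfDestination/example"   ; hypothetical id
                                 (keyword "pasi:pred/type") "StcmfDestination"
                                 (keyword "pasi:pred/name") "Example destination"}]])
  (xt/sync node)                                              ; wait until the document is indexed
  (xt/q (xt/db node)
        {:find  '[name]
         :where [['e (keyword "pasi:pred/type") "StcmfDestination"]
                 ['e (keyword "pasi:pred/name") 'name]]}))
;; => #{["Example destination"]}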

JUXT Site is Open-source software. It is in the pre-alpha phase of development (although its XTDB substrate is production ready).

So, JUXT Site has components which make it a promising platform on which to explore the questions at the head of this article. Here’s a diagram which summarises these components:

The relevant components of JUXT Site

(We made a small modification, site-mod-sparql, to JUXT Site to surface XTDB’s SPARQL support.)

Next, we use JUXT Site’s GraphQL to build and then query our linked data model.

Using GraphQL to build our linked data model

We took a subset of the linked data model that we defined for carbon savings and defined it using a GraphQL schema. The following snippets provide a flavour of that GraphQL definition.

Defining a record type

A StcmfRedistributedFood record says that a batch (weight batchKg) of food material was redistributed to a destination during a time period (from → to). In our linked data model, further information about this – such as how the food material gets repurposed at the destination, and the lookup tables to calculate carbon savings – may be found by following links to other nodes in the data graph.

Here is how a StcmfRedistributedFood record type is defined in GraphQL (on JUXT Site):

""" A batch of redistributed food material """
type StcmfRedistributedFood {

  id: ID!

  " The start of the period, inclusive "
  from: Date! @site(a: "pasi:pred/from") (1)

  " The end of the period, exclusive "
  to: Date! @site(a: "pasi:pred/to")

  " How the food material got used "
  destination: StcmfDestination
    @site( (2)
      q: { find: [e]
           where: [[e {keyword: "pasi:pred/type"} "StcmfDestination"]
                   [object {keyword: "pasi:pred/destination"} e]]
         }
    )

  " The weight in kilograms of this batch of food material "
  batchKg: Float! @site(a: "pasi:pred/batchKg")
}
  1. In GraphQL, a directive (@…) can be used to say how a field should be mapped to/from the underlying system.

    In JUXT Site, @site(…) directives are used to map to/from structures in the underlying XTDB database. On this specific line, a: says that the field named from at the GraphQL level should be mapped from the field named pasi:pred/from at the XTDB level.

    At the XTDB level, we use names like pasi:pred/from for our fields because such names are IRI-compliant, which means that they can be used as RDF predicates and queried using SPARQL.

  2. In this directive, we use the Datalog language to code how to find the appropriate StcmfDestination record in the underlying XTDB database.

Defining a query

Here is how a query operation to return all StcmfRedistributedFood records is defined in GraphQL (on JUXT Site):

type Query {

  """ Return all records about batches of redistributed food material """
  stcmfRedistributedFood: [StcmfRedistributedFood]! (1)
}
  1. Simply declare that this returns a list ([…]) of StcmfRedistributedFood records, and JUXT Site will take care of the implementation details.

Defining a mutation

Here is how a mutation operation to create or update a StcmfRedistributedFood record is defined in GraphQL (on JUXT Site):

type Mutation {

  """ Create or update a record about a batch of redistributed food material """
  upsertStcmfRedistributedFood(

    id: ID
      @site(
        a: "xt/id"
        gen: {
          type: TEMPLATE
          template: "pasi:ent/StcmfRedistributedFood/{{from}}/{{to}}/{{destination}}" (1)
        }
      )

      " The start of the period, inclusive "
      from: Date! @site(a: "pasi:pred/from")

      " The end of the period, exclusive "
      to: Date! @site(a: "pasi:pred/to")

      " How the food material got used "
      destination: String! (2)
      destinationRef: ID
        @site(
          a: "pasi:pred/destination"
          gen: {
            type: TEMPLATE
            template: "pasi:ent/StcmfDestination/{{destination}}"
          }
        )

      " The weight in kilograms of this batch of food material "
      batchKg: Float! @site(a: "pasi:pred/batchKg")

  ): StcmfRedistributedFood @site(mutation: "update")
}
  1. We specify that a StcmfRedistributedFood record is identified by an IRI-compliant natural key, composed from the from, to and destination values. Uniqueness is enforced over ID values, therefore that combination of from, to and destination values will identify one or zero existing record(s).

  2. On invocation, this mutation will be supplied with a String value for the destination parameter. The destination String value is used to construct the ID of the targeted StcmfDestination record, and this ID is stored in a field named pasi:pred/destination in the underlying XTDB database.
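
For illustration, here is what an invocation of that mutation might look like (the destination value is just an illustrative stand-in; JUXT Site generates the id and destinationRef values from the templates shown above):

mutation {
  upsertStcmfRedistributedFood(
    from: "2021-01-28"
    to: "2021-01-29"
    destination: "Composted"
    batchKg: 0.48
  ) {
    id
    batchKg
  }
}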

Querying our linked data model

We used JUXT Site’s GraphQL to build our linked data model (in terms of data structures and operations). Now let’s see what querying our data model looks like – firstly using GraphQL, then using SPARQL.

We will query not only for our StcmfRedistributedFood records but also for the associated information that we would need to create a waste reduction report which includes estimates of carbon savings. (Although we haven’t discussed this associated information in this article, showing the queries for it will make this exploration more informative.)

Querying using GraphQL

The query:

query PASI {
  stcmfRedistributedFood {
    batchKg
    from
    to
    destination {
      name
      refDataConnectors { (1)
        fraction
        refMaterial {
          carbonWeighting
          wasteStream
        }
        refProcess {
          name
        }
        enabler {
          name
        }
      }
    }
  }
}
  1. We haven’t discussed it in this article but we introduced an artificial direct connection, called refDataConnectors, into our data model to allow a query to walk easily to the reference data records that are needed to report on carbon savings.

The query’s raw result (truncated):

{
  "data": {
    "stcmfRedistributedFood": [
      {
        "batchKg": 87.61,
        "from": "2021-01-28",
        "to": "2021-01-29",
        "destination": {
          "name": "Used for human-food, bio-etc & sanctuary",
          "refDataConnectors": [
            {
              "fraction": 0.2,
              "refMaterial": {
                "carbonWeighting": "2.7",
                "wasteStream": "Mixed Food and Garden Waste (dry AD)"
              },
              "refProcess": {
                "name": "recycling"
              },
              "enabler": {
                "name": "Stirling Community Food"
              }
            },
            {
              "fraction": 0.8,
              "refMaterial": {
                "carbonWeighting": "4.35",
                "wasteStream": "Food and Drink Waste (wet AD)"
              },
              "refProcess": {
                "name": "reusing"
              },
              "enabler": {
                "name": "Stirling Community Food"
              }
            }
          ]
        }
      },
      {
        "batchKg": 0.48,
        "from": "2021-01-28",
        "to": "2021-01-29",
        "destination": {
          "name": "Used for compost-indiv",
          "refDataConnectors": [
            {
              "fraction": 1,
              "refMaterial": {
                "carbonWeighting": "3.48",
                "wasteStream": "Food and Drink Waste (Composting)"
              },
              "refProcess": {
[TRUNCATED]

The query’s result after formatting into a tabular report and calculating the carbonSaving column:

:enabler | :from | :to | :batchKg | :foodDestination | :ref_process | :ref_wasteStream | :ref_carbonSavingCo2eKg
Stirling Community Food | 2021-01-28 | 2021-01-29 | 0.48 | Used for compost-indiv | recycling | Food and Drink Waste (Composting) | 1.67
Stirling Community Food | 2021-01-28 | 2021-01-29 | 17.52 | Used for human-food, bio-etc & sanctuary | recycling | Mixed Food and Garden Waste (dry AD) | 47.31
Stirling Community Food | 2021-01-28 | 2021-01-29 | 70.09 | Used for human-food, bio-etc & sanctuary | reusing | Food and Drink Waste (wet AD) | 304.88
Stirling Community Food | 2021-01-29 | 2021-01-30 | 8.00 | Used for compost-indiv | recycling | Food and Drink Waste (Composting) | 27.84
Stirling Community Food | 2021-01-29 | 2021-01-30 | 56.02 | Used for human-food, bio-etc & sanctuary | recycling | Mixed Food and Garden Waste (dry AD) | 151.26
Stirling Community Food | 2021-01-29 | 2021-01-30 | 224.10 | Used for human-food, bio-etc & sanctuary | reusing | Food and Drink Waste (wet AD) | 974.82

Querying using SPARQL

SPARQL is used extensively by the Open Data community to query RDF datasets/graph databases.

Our chosen platform, JUXT Site (with XTDB), supports (a subset of) SPARQL. And we have defined our GraphQL-built data model to include RDF/SPARQL compliant names (i.e. IRI names for records and predicates/fields). So we can use SPARQL to query our data.

Here’s the SPARQL (almost) equivalent of the above GraphQL query:

PREFIX pasi: <pasi:pred/> (1)
SELECT ?enabler ?from ?to ?batchKg ?foodDestination ?ref_process ?ref_wasteStream ?ref_carbonSavingCo2eKg
WHERE {
  ?stcmfRedistributedFood pasi:type "StcmfRedistributedFood" ; (2)
                          pasi:from ?from ;
                          pasi:to ?to ;
                          pasi:batchKg ?origBatchKg ;
                          pasi:destination ?destination .
  ?destination pasi:name ?foodDestination .
  ?opsStcmfToRefData pasi:type "OpsStcmfToRefData" ; (2)
                     pasi:destination ?destination ;
                     pasi:fraction ?fraction ;
                     pasi:refMaterial/pasi:wasteStream ?ref_wasteStream ;
                     pasi:refMaterial/pasi:carbonWeighting ?carbonWeighting ;
                     pasi:refProcess/pasi:name ?ref_process ;
                     pasi:enabler/pasi:name ?enabler .
  BIND((?origBatchKg * ?fraction) AS ?batchKg) (3)
  BIND((?batchKg * ?carbonWeighting) AS ?ref_carbonSavingCo2eKg) (3)
}
ORDER BY ?enabler ?from ?to
  1. We use pasi as the scheme part of all our IRIs. PASI is our abbreviation for the (waste reduction) case study whose data model we’ve sampled in this article. It’s kind-of our root-level namespace.

  2. This SPARQL query uses two graph entry points (StcmfRedistributedFood and OpsStcmfToRefData) in order to walk to all the required graph nodes. Whereas, in GraphQL, we introduced an artificial direct connection (refDataConnectors) which allowed the query to walk seamlessly to all the required graph nodes from a single graph entry point.

  3. The carbonSavings calculation is performed within the SPARQL query. Whereas, with GraphQL, we performed the calculation outside of the query (a sketch of that post-processing follows). Although, we could add an explicit carbonSavings field into the data model, with a GraphQL directive which specifies how to perform the calculation.
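
For illustration, here is a minimal Clojure sketch (not the actual project code) of that outside-the-query post-processing, assuming the GraphQL response has been parsed into maps with keyword keys:

(defn report-rows
  "Flatten the stcmfRedistributedFood GraphQL result into report rows,
   apportioning each batch by fraction and deriving the carbon saving."
  [stcmf-redistributed-food]
  (for [batch     stcmf-redistributed-food
        connector (get-in batch [:destination :refDataConnectors])
        :let [batch-kg (* (:batchKg batch) (:fraction connector))]]
    {:enabler                (get-in connector [:enabler :name])
     :from                   (:from batch)
     :to                     (:to batch)
     :batchKg                batch-kg
     :foodDestination        (get-in batch [:destination :name])
     :ref_process            (get-in connector [:refProcess :name])
     :ref_wasteStream        (get-in connector [:refMaterial :wasteStream])
     :ref_carbonSavingCo2eKg (* batch-kg
                                (Double/parseDouble
                                  (get-in connector [:refMaterial :carbonWeighting])))}))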

This SPARQL query can support the same tabular report as that supported by the GraphQL query, so we won’t bother (re)displaying that tabular report here.

Conclusions

  • JUXT Site offers an appealing low-code, GraphQL based approach for defining transactional, linked data systems. It’s a pre-alpha. Its sweet spot will probably be to back websites where humans drive query and transaction volumes.

  • With its ability to support RDF data models and SPARQL, it is a promising platform for Open Data. Currently, it supports only a subset of SPARQL but (again) it is only a pre-alpha.

  • So, “might GraphQL be an easy means to query Linked Open Data?”.

    Well, GraphQL was designed to describe the services that apps use. But its query syntax is easier to understand than SPARQL’s (compare the above GraphQL and SPARQL queries) – so there is something to be said for providing a GraphQL interface as a means to explore an open dataset. With the proviso that GraphQL is more abstract/less exact than SPARQL, and it doesn’t directly support federated queries.

    They are, of course, different beasts. But a platform which is capable of supporting both over the same data might be a great way of servicing the audience for both.

  • Also – and we’ve not addressed these in this article but – the XTDB database (used by JUXT Site) has a number of other features that are important for transacting Linked Open Data: immutable records, temporal queries, and an upcoming data-level authorisation scheme.

  • We see JUXT Site as a candidate platform on which to prototype our ‘PASI‘ system which will allow organisations to: upload their social impact data (including waste reduction data); validate it; assure security and track provenance; compose and accumulate it; and publish it as open linked data.

“How is waste in my area?” – a regional dashboard

Introduction

Our aim in this piece of work is:

to surface facts of interest (maximums, minimums, trends, etc.) about waste in an area, to non-experts.

Towards that aim, we have built a prototype regional dashboard which is directly powered by our ‘easier datasets’ about waste.

The prototype is a webapp and it can be accessed here.

our prototype regional dashboard

Curiosities

Even this early prototype manages to surface some curiosities [1] …​

Inverclyde

Inverclyde is doing well.

Charts: Inverclyde’s household waste positions; Inverclyde’s household waste generation; Inverclyde’s household waste CO2e

In the latest data (2019), it generates the fewest tonnes of household waste (per citizen) of any of the council areas. And its matching 1st position for CO2e underlines the close relationship between the amount of waste generated and its carbon impact.

…​But why is Inverclyde doing so well?

Highland

Highland isn’t doing so well.

Charts: Highland’s household waste positions; Highland’s household waste generation; Highland’s household waste % recycled

In the latest data (2019), it generates more tonnes of household waste (per citizen) than any other council area except Argyll & Bute. And it has the worst trend for percentage recycled.

…Why has Highland’s percentage recycled been getting worse since 2014?

Fife

Fife has the best trend for household waste generation. That said, it has still been generating an above-average amount of waste per citizen.

Charts: Fife’s household waste positions; Fife’s household waste generation

The graphs for Fife business waste show that there was an acute reduction in combustion wastes in 2016.

Fife’s business waste

We investigated this anomaly before and discovered that it was caused by the closure of Fife’s coal fired power station (Longannet) on 24th March 2016.

Angus

In the latest two years of data (2018 & 2019), Angus has noticeably reduced the amount of household waste that it landfills.

Angus' household waste management

During the same period, Angus has increased the amount of household waste that it processes as ‘other diversion’.

…​What underlies that difference in Angus’ waste processing?

Technologies

This prototype is built as a ‘static’ website with all content-dynamics occurring in the browser. This makes it simple and cheap to host, but results in heavier, more complex web pages.

  • The clickable map is implemented with Leaflet – using OpenStreetMap map tiles.
  • The charts are constructed using Vega-lite.
  • The content-dynamics are coded in ClojureScript – with Hiccup for HTML, and Reagent for events (a minimal sketch follows this list).
  • The website is hosted on GitHub.
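
As a flavour of that browser-side approach, here is a minimal, hypothetical Reagent component (not the dashboard’s actual code): Hiccup-style vectors describe the HTML, and a Reagent atom holds the selected region.

(ns dashboard.sketch                      ;; hypothetical namespace
  (:require [reagent.core :as r]
            [reagent.dom :as rdom]))

(defonce selected-region (r/atom "Inverclyde"))   ;; app state lives in a Reagent atom

(defn region-tile []
  [:div.tile
   [:h2 @selected-region]
   [:button {:on-click #(reset! selected-region "Highland")}   ;; event handler swaps the region
    "Show Highland"]])

(defn ^:export init []
  (rdom/render [region-tile] (.getElementById js/document "app")))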

Ideas for evolving this prototype

  1. Provide more qualitative information. This version is quite quantitative because, well, that is the nature of the datasets that currently underlie it. So there’s a danger of straying into the “management by KPI” approach when we should be supporting the “management by understanding” approach.
  2. Include more localised information, e.g. about an area’s re-use shops, or bin collection statistics.
  3. Support deeper dives, e.g. so that users can click on a CO2e trend to navigate to a choropleth map for CO2e.
  4. Allow users to download any of the displayed charts as (CSV) data or as (PNG) images.
  5. Enhance the support of comparisons by allowing users to multi-select regions and overlay their charts.
  6. Allow users to choose from a menu, what chart/data tiles to place on the page.
  7. Provide a what-if? tool. “What if every region reduced by 10% their landfilling of waste material xyz?” – where the tool has a good enough waste model to enable it to compute what-if? outcomes.

1. One of the original sources of data has been off-line due to a cyberattack so, at the time of writing, it has not been possible to double-check all figures from our prototype against original sources.

‘Easier’ open data about waste in Scotland

Objective

Several organisations are doing a very good job of curating & publishing open data about waste in Scotland but the published data is not always “easy to use” for non-experts. We have seen several references to this at open data conference events and on social media platforms:

Whilst statisticians/coders may think that it is reasonably simple to knead together these somewhat diverse datasets into a coherent knowledge, the interested layman doesn’t find it so easy.

One of the objectives of the Data Commons Scotland project is to address the “ease of use” issue over open data. The contents of this repository are the result of us re-working some of the existing source open data so that it is easier to use, understand, consume and parse, and is all in one place. It may not be as detailed or as nuanced as the source data, but it aims to be better for the purpose of making the information accessible to non-experts.

We have processed the source data just enough to:

  • provide value-based cross-referencing between datasets
  • add a few fields whose values are generally useful but not easily derivable by a simple calculation (such as latitude & longitude)
  • make it available as simple CSV and JSON files in a Git repository.

We have not augmented the data with derived values that can be simply calculated, such as per-population amounts, averages, trends, totals, etc.
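
For example, because household-waste and population share the region and year dimensions, a user can derive a per-citizen figure with a simple join – sketched here in Clojure (assuming the CSV rows have been parsed into maps keyed by their column names; this is illustrative only, not code we ship):

(defn tonnes-per-citizen
  "Join household-waste rows to population rows on [region year],
   then derive waste tonnes per citizen."
  [household-waste-rows population-rows]
  (let [population-by (into {} (map (juxt (juxt :region :year) :population)
                                    population-rows))]
    (for [{:keys [region year tonnes]} household-waste-rows
          :let [population (population-by [region year])]
          :when population]
      {:region             region
       :year               year
       :tonnes-per-citizen (double (/ tonnes population))})))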

The 10 easier datasets

The datasets were generated in February 2021; the source data (described in the creator, supplier, URL and licence columns) was obtained in January 2021.

name | description | file | number of records | creator | supplier | URL | licence
household-waste | The categorised quantities of the (‘managed’) waste generated by households. | CSV JSON | 19008 | SEPA | statistics.gov.scot | URL | OGL v3.0
household-co2e | The carbon impact of the waste generated by households. | CSV JSON | 288 | SEPA | SEPA | URL | OGL v2.0
business-waste-by-region | The categorised quantities of the waste generated by industry & commerce. | CSV JSON | 8976 | SEPA | SEPA | URL | OGL v2.0
business-waste-by-sector | The categorised quantities of the waste generated by industry & commerce. | CSV JSON | 2640 | SEPA | SEPA | URL | OGL v2.0
waste-site | The locations, services & capacities of waste sites. | CSV JSON | 1254 | SEPA | SEPA | URL | OGL v2.0
waste-site-io | The categorised quantities of waste going in and out of waste sites. | CSV | 2667914 | SEPA | SEPA | URL | OGL v2.0
material-coding | A mapping between the EWC codes and SEPA’s materials classification (as used in these datasets). | CSV JSON | 557 | SEPA | SEPA | URL | OGL v2.0
ewc-coding | EWC (European Waste Classification) codes and descriptions. | CSV JSON | 973 | European Commission of the EU | Publications Office of the EU | URL | CC BY 4.0
households | Occupied residential dwelling counts. Useful for calculating per-household amounts. | CSV JSON | 288 | NRS | statistics.gov.scot | URL | OGL v3.0
population | People counts. Useful for calculating per-citizen amounts. | CSV JSON | 288 | NRS | statistics.gov.scot | URL | OGL v3.0

(The fuller, CSV version of the table above.)

The dimensions of the easier datasets

One of the things that makes these datasets easier to use is that they use consistent dimension values/controlled code-lists. This makes it easier to join/link datasets.

So we have tried to rectify the inconsistencies that occur in the source data (in particular, the inconsistent labelling of waste materials and regions). However, this is still “work-in-progress” and we have yet to tease out & make consistent further useful dimensions.

dimension | description | dataset | example value | count of values | min value | max value
region | The name of a council area. | household-waste | Falkirk | 32 | |
 | | household-co2e | Aberdeen City | 32 | |
 | | business-waste-by-region | Falkirk | 34 | |
 | | waste-site | North Lanarkshire | 32 | |
 | | households | West Dunbartonshire | 32 | |
 | | population | West Dunbartonshire | 32 | |
business-sector | The label representing the business/economic sector. | business-waste-by-sector | Manufacture of food and beverage products | 10 | |
year | The integer representation of a year. | household-waste | 2011 | 9 | 2011 | 2019
 | | household-co2e | 2013 | 9 | 2011 | 2019
 | | business-waste-by-region | 2011 | 8 | 2011 | 2018
 | | business-waste-by-sector | 2011 | 8 | 2011 | 2018
 | | waste-site | 2019 | 1 | 2019 | 2019
 | | waste-site-io | 2013 | 14 | 2007 | 2020
 | | households | 2011 | 9 | 2011 | 2019
 | | population | 2013 | 9 | 2011 | 2019
quarter | The integer representation of the year’s quarter. | waste-site-io | 4 | 4 | |
site-name | The name of the waste site. | waste-site | Bellshill H/care Waste Treatment & Transfer | 1246 | |
permit | The waste site operator’s official permit or licence. | waste-site | PPC/A/1180708 | 1254 | |
 | | waste-site-io | PPC/A/1000060 | 1401 | |
status | The label indicating the open/closed status of the waste site in the record’s timeframe. | waste-site | Not applicable | 4 | |
latitude | The signed decimal representing a latitude. | waste-site | 55.824871489601804 | 1227 | |
longitude | The signed decimal representing a longitude. | waste-site | -4.035165962797409 | 1227 | |
io-direction | The label indicating the direction of travel of the waste from the PoV of a waste site. | waste-site-io | in | 2 | |
material | The name of a waste material in SEPA’s classification. | household-waste | Animal and mixed food waste | 22 | |
 | | business-waste-by-region | Spent solvents | 33 | |
 | | business-waste-by-sector | Spent solvents | 33 | |
 | | material-coding | Acid, alkaline or saline wastes | 34 | |
management | The label indicating how the waste was managed/processed (i.e. what its end-state was). | household-waste | Other Diversion | 3 | |
ewc-code | The code from the European Waste Classification hierarchy. | waste-site-io | 00 00 00 | 787 | |
 | | material-coding | 11 01 06* | 557 | |
 | | ewc-coding | 01 | 973 | |
ewc-description | The description from the European Waste Classification hierarchy. | ewc-coding | WASTES RESULTING FROM EXPLORATION, MINING, QUARRYING, AND PHYSICAL AND CHEMICAL TREATMENT OF MINERALS | 774 | |
operator | The name of the waste site operator. | waste-site | TRADEBE UK | 753 | |
activities | The waste processing activities supported by the waste site. | waste-site | Other treatment | 50 | |
accepts | The kinds of clients/wastes accepted by the waste site. | waste-site | Other special | 42 | |
population | The population count as an integer. | population | 89800 | | 21420 | 633120
households | The households count as an integer. | households | 42962 | | 9424 | 307161
tonnes | The waste related quantity as a decimal. | household-waste | 0 | | 0 | 183691
 | | household-co2e | 251386.54 | | 24768.53 | 762399.92
 | | business-waste-by-region | 753 | | 0 | 486432
 | | business-waste-by-sector | 54 | | 0 | 1039179
 | | waste-site-io | 0 | | -8.56 | 2325652.83
tonnes-input | The quantity of incoming waste as a decimal. | waste-site | 154.55 | | 0 | 1476044
tonnes-treated-recovered | The quantity of waste treated or recovered as a decimal. | waste-site | 133.04 | | 0 | 1476044
tonnes-output | The quantity of outgoing waste as a decimal. | waste-site | 152.8 | | 0 | 235354.51

(The CSV version of the table above.)

Trialling Wikibase for our data layer

Introduction

The architectural proposal for our WCS platform contains a data layer for collecting, linking, caching and making accessible the source datasets…

bilayered architecture
Note
Our assumption is that, for our near-term aims, linked data provides the most useful foundation.

 

The idea that we’re trialling here is to use Wikibase as the core component in our data layer…

implementation using wikibase

The case for using Wikibase

Wikibase is a proven off-the-shelf solution that makes it easier to work with linked data. It provides:

  • A linked data store.
  • An interface for humans to view and manually edit linked data.
  • An API that can be used by computer programs to (bulk) edit linked data.
  • SPARQL support.

So why not just use Wikidata? (Wikidata is Wikibase’s common, public instance.) …​Ideally we would but:

  • Our domain specifics aren’t supported in Wikidata.
    • E.g. Wikidata doesn’t (yet) have a full vocabulary to describe waste management.
    • We want to experiment and move fast. Wikidata is sensibly cautious about change, therefore too slow for us.
  • We can still make use of Wikidata for some aspects of our work by referencing its data, using its vocabulary, and using it to store very general data.

Novelty…​  The use of a customised Wikibase instance is not novel but our intended specific customisation and application does have some novelty:

  • Wikibase provides easier-to-use, more human-friendly access to linked data than typical triple stores.

    Will this facilitate more engagement and use, compared with sites with less human-oriented surface area? …Perhaps a worthwhile study.

    Also, the greater human-oriented surface area in this solution should directly help when it comes to implementing user-based features such as a recommender system and community forums.

  • By their nature, wiki solutions support crowd sourcing.

    Our platform could support a limited form of this by encouraging councils, recycling shops, etc. to contribute their data about waste; data which currently isn’t open or linked.

  • Our platform will be built using open & inexpensive (often free) components and services.

    It should be straightforward to apply the approach to other domains of open data for Scotland.

Hosting on WBStack

WBStack is an alpha (software-as-a-service) platform created by Adam Shorland. It allows invitees to create their own, publicly accessible Wikibase instances.

Adam invited us to create our own Wikibase instance on his platform.
Our Wikibase is at https://waste-commons-scotland.wiki.opencura.com

screenshot wcs wikibase

Populating our Wikibase with data about waste

The datasets

In this trial, we want to populate our Wikibase with 4 datasets:

  1. area – reference data describing administrative areas
  2. population – reference data describing populations
  3. household waste – describing the tonnes of solid waste generated by households
  4. co2e – describing the tonnes of carbon equivalent from household waste

The data model

Representing a dataset record in Wikibase

Let’s consider a couple of records from the population dataset:

Aberdeen City 2018 227,560
Aberdeen City 2017 228,800

In Wikibase, we could represent each of those records as a statement on the “Aberdeen City” item. This is the approach that we took in our previous work about The usefulness of putting datasets into Wikidata?. This screenshot shows the resulting Wikidata statements…

screenshot population statements wikidata

The problem with this approach is that it can result in an unwieldy number of statements on a single item.

The alternative approach we’ve taken for our Wikibase is to represent each of those records as an item in its own right. So that first record is represented as the Wikibase item…

screenshot popAc2018 wikibase

Some predicates and dimensions are common: they are used across most of the datasets.

Some predicates and dimensions are dataset specific. For example: the predicate has UK government code is used only to describe the area dataset; while the dimension end-state is used only to describe the household waste dataset.

Loading the data

I’ve hacked together a software script – dcs-wdt – which writes the datasets into our Wikibase. It is very rough’n’ready (however, it might be the seed of something more generic for automatically re-loading our datasets of interest). Its outline is:

/* order datasets & dataset-aspects, most independent first */
for each dataset in [base, area, population, household-waste, co2e]
  for each dataset-aspect in [class-item, predicates, supporting-dimensions, measurements]
    for each record in the dataset-aspect
      if the record is not already represented in the Wikibase
        write-to-wikibase a property or item to represent the record

Assessment

So, should we use a Wikibase as the core component in our data layer?

Pros

  • The bundled SPARQL query service and UI work well.

    There is an oddity w.r.t. implicit prefixes but this can be worked around by explicitly declaring the prefixes.

  • It has out-of-the-box search functionality which automatically indexes content and provides a search feature (with ‘completion-suggestion’).
    screenshot search wikibase

    It is primarily configured for searching items by their labels but it does fall back to providing a more full-text style of search.

  • It has a baked-in, very full and well documented HTTP-based API (in addition to the programmatically accessible SPARQL query service) for reading & writing data.

    (The dcs-wdt script makes use of both its SPARQL query service and API.)

  • Its human-oriented web pages (UI) are sort of nice – making it easy to explore the data, and to perform data management tasks.
  • It comes with a raft of features for supporting community-contributed content, including: user accounts and permissions, discussion forums, and easy-ish to use bulk data uploads via QuickStatements. I haven’t explored these in any depth, but they are potentially useful if the project decides that supporting user content on the WCS platform is in-scope.

Cons

  • It doesn’t come with all the bells’n’whistles I thought it would…​

    I think that I’ve been naive in thinking that many of the easy-to-use MediaWiki rendering features (especially over SPARQL queries) that I’ve read about (particularly those of LinkedWiki) would just be there. Unfortunately those are all extras… the LinkedWiki extension and its transitive dependencies need to be installed; the relevant templates imported; OpenStreetMap etc. access keys must be configured.

    Those bells’n’whistles are not supported by WBStack and the installation of them would take some expertise.

  • WBStack’s service has been running for one year now but, as a free alpha, it provides no guarantees.

    For example, a recent update of some of its software stack caused a short outage and an ongoing problem with label rendering on our Wikibase instance.

Conclusions

For the project, the main reason for using Wikibase is two-fold:

  1. Out-of-the-box support for a simple linked data model that can be SPARQL-ed.
  2. The use of the wiki’s data-table, graphing & mapping widgets for rapid prototyping of, and inclusion in, WCS web pages.

As it stands, the WBStack Wikibase is useful for (1) but not (2).

I’m thinking that we should keep it on the back burner for now – while we find out what the front-end needs. Its support of (1) might turn out to be a good enough reason to use it, although there are alternatives – including use of a standalone triple store; or, if we have just a few datasets, building our own linking software and file-based store. Not having (2) means extra work for us to build/configure widgets for graphing, mapping, etc.

The usefulness of putting datasets into Wikidata?

A week ago, I attended Ian Watt‘s workshop on Wikidata at the Scottish Open Data Unconference 2020. It was an interesting session and it got me thinking about how we might upload some of our datasets of interest (e.g. amounts of waste generated & recycled per Scottish council area, ‘carbon impact’ figures) into Wikidata. Would having such datasets in Wikidata be useful?

There is interest in “per council area” and “per citizen” waste data so I thought that I’d start by uploading into Wikidata a dataset that describes the populations per Scottish council area per year (source: the Population Estimates data cube at statistics.gov.scot).

This executable notebook steps through the nitty-gritty of doing that. SPARQL is used to pull data from both Wikidata and statistics.gov.scot; the data is compared and the QuickStatements tool is used to help automate the creation and modification of Wikidata records. 2232 edits were executed against Wikidata through QuickStatements (taking about 30 mins). Unfortunately QuickStatements does not yet support a means to set the rank of a statement so I had to individually edit the 32 council area pages to mark, in each, its 2019 population value as the Preferred rank population value …​indicating that it is the most up-to-date population value.
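
For a flavour of those edits: a QuickStatements (v1) command is a tab-separated line giving an item, a property, a value and, optionally, qualifier pairs. A single such command to add a 2018 population statement might look like the line below (shown with spaces rather than tabs; Q42 is only a stand-in item ID, P1082 is Wikidata’s population property, P585 its point-in-time qualifier, and the date syntax should be treated as a sketch rather than copy-paste input):

Q42   P1082   227560   P585   +2018-00-00T00:00:00Z/9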

But, is having this dataset in Wikidata useful?

The uploaded dataset can be pulled (de-referenced) into Wikipedia articles quite easily. As an example, I edited the Wikipedia article Council areas of Scotland to insert into its main table, the new column “Number of people (latest estimate)” whose values are pulled (each time the page is rendered) directly from the data that I uploaded into Wikidata:

Visualisations based on the upload dataset can be embedded into web pages quite easily. Here’s an example that fetches our dataset from Wikidata and renders it as a line graph, when this web page is loaded into your web browser:

 

Concerns, next steps, alternative approaches.

Interestingly, there is some discussion about the pros & cons of inserting Wikidata values into Wikipedia articles. The main argument against is the immaturity of Wikidata’s structure, and therefore a concern about the durability of references into its data structure. The counter point is that early use & evolution might be the best path to maturity.

The case study for our Data Commons Scotland project is open data about waste in Scotland. So a next step for the project might be to upload into Wikidata, datasets that describe the amounts of household waste generated & recycled, and ‘carbon impact’ figures. These could also be linked to council areas – as we have done for the population dataset – to support per council area/per citizen statistics and visualisations. Appropriate properties do not yet exist in Wikidata for the description of such data about waste, so new ones would need to be ratified by the Wikidata community.

Should such datasets actually be uploaded into Wikidata?…​These are small datasets and they seem to fit well enough into Wikidata’s knowledge graph. Uploading them into Wikidata may make them easier to access, de-silo the data and help enrich Wikidata’s knowledge graph. But then, of course, there is the keeping it up-to-date issue to solve. Alternatively, those datasets could be pulled dynamically and directly from statistics.gov.scot into Wikipedia articles with the help of some new MediaWiki extensions.

 

 

The geography of household waste generation

Working on his human geography homework, Rory asks…

Which areas in Scotland are reducing their household waste?

This week, in a step towards supporting the above scenario, I investigated how we might generate choropleths to help us visualise the variations in the amounts of household-generated waste across geographic areas in Scotland.

The cube-to-chart executable notebook steps through the nitty-gritty of this experiment. The steps include:

    1. Running a SPARQL query against statistics.gov.scot’s very useful data cubes to find the waste tonnage generated per council citizen per year.
    2. For each council area, derive the 3 values:
      • recent – 2018’s tonnage of waste generated per council citizen.
      • average – 2011-2018’s average (mean) tonnage of waste generated per council citizen.
      • trend – 2011-2018’s trend in tonnage of waste generated per council citizen. Each trend value is calculated as the gradient of a linear approximation to the tonnage over the years – see the sketch after this list. (A statistician might well suggest a more appropriate method for computing this trend value.)

      The derived data can be seen in this file.

    3. Use Vega to generate 3 choropleths which help visualise the statistical values from the above step, against the council-oriented geography of Scotland. (The geography data comes from Martin Chorely’s good curation work.)
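
Here is the minimal least-squares gradient sketch referred to in step 2 (in Clojure; illustrative only, not the notebook’s exact code):

(defn trend
  "Least-squares gradient of tonnage over years,
   e.g. (trend [2011 2012 2013] [0.52 0.50 0.49])"
  [years tonnages]
  (let [n     (count years)
        mean  (fn [xs] (/ (reduce + xs) n))
        x-bar (mean years)
        y-bar (mean tonnages)
        num   (reduce + (map (fn [x y] (* (- x x-bar) (- y y-bar))) years tonnages))
        den   (reduce + (map (fn [x] (let [d (- x x-bar)] (* d d))) years))]
    (/ num den)))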

The resulting choropleths can be seen on >> this page <<

Rory looks at the “2011-2018 trend in tonnage” choropleth, and thinks…

It’s good to see that most areas are reducing waste generation but why not all…?

Looking at the “2018 tonnage” and 2011-2018 average tonnage” choropleths, Niamh wonders…

I wonder why urban populations seem to generate less waste than rural ones?

Stirling Council’s waste-management dataset as linked open data 

Kudos to Stirling Council for being the only Scottish local authority to have published household waste collection data as open data. This data is contained in their waste-management dataset. It consists of: 

  • Core data – per-year CSV files
  • Metadata that includes a basic schema for the CSV files, maintenance information and a descriptive narrative. 

For that, Stirling Council have attained 3 stars on this openness measure.  

To reach 5 stars, that data would have to be turned into linked open data, i.e. gain the following: 

  • URIs denoting things. E.g. have a URI for each waste type, each collection route and each measurement. 
  • Links to other data to provide context. E.g. reference commonly accepted identifiers/URIs for dates, waste types and route geographies. 

This week I investigated aspects of what would be involved in gaining those extra two stars.

This executable notebook steps through the nitty-gritty of doing that. The steps include: 

  1. Mapping the data into the vocabulary for the statistical data cube structure – as defined by the W3C and used by the Scottish government’s statistic office. 
  2. Mapping the date values to the date-time related vocabulary – as defined by the UK government. 
  3. Defining placeholder vocabularies for waste type and collection routes. Future work would be to: map waste types to (possibly “rolled-up”) values in a SEPA-defined vocabulary; and map collection routes to a suitable geographic vocabulary.
  4. Converting the CSV source data into RDF data in accordance with the above mappings. This results in a set of .ttl – RDF Turtle syntax – files (an illustrative snippet is sketched after this list).
  5. Loading the .ttl files into a triplestore database so that their linked data graph can be queried easily. 
  6. Running a few SPARQL queries against the triplestore to sanity-check the linked data graph. 
  7. Creating an example infographic (showing the downward trend in missing bins) from the linked data graph:  
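
To make the mappings of steps 1–4 concrete, here is an illustrative Turtle observation (made up for this article – the wm: namespace, identifiers and value are placeholders, not the notebook’s actual output):

@prefix qb:             <http://purl.org/linked-data/cube#> .
@prefix sdmx-dimension: <http://purl.org/linked-data/sdmx/2009/dimension#> .
@prefix sdmx-measure:   <http://purl.org/linked-data/sdmx/2009/measure#> .
@prefix day:            <http://reference.data.gov.uk/id/day/> .
@prefix wm:             <https://example.org/stirling/waste-management/> .   # placeholder namespace

wm:obs-2019-03-01-green-route-7
    a                        qb:Observation ;
    qb:dataSet               wm:dataset ;
    sdmx-dimension:refPeriod day:2019-03-01 ;          # UK government date-time vocabulary
    wm:wasteType             wm:wasteType-green ;      # placeholder waste-type vocabulary
    wm:route                 wm:route-7 ;              # placeholder collection-route vocabulary
    sdmx-measure:obsValue    12.4 .                    # tonnes collected (illustrative value)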

 Conclusions 

  • It took a not insignificant amount of consideration to convert the 3-star non-linked data to (almost) 5-star linked data. But I expect that the effort involved will tail off as we similarly convert further datasets, because of the experience and knowledge gained along the way.
  • Having a linked data version of the waste-management dataset promises to make its information more explicit and more composable. But for the benefits to be fully realised, more cross-linking needs to be carried out. In particular, we need to map waste types to a common (say, SEPA controlled) vocabulary; and map collection routes to a common geographic vocabulary.
  • We might imagine that if such a linked dataset were to be published & maintained – with other local authorities contributing data into it – then SEPA would be able to directly and continually harvest its information, making periodic report preparation unnecessary.
  • JimT and I have discussed how the Open Data Phase2 project might push for the publication of linked open data about waste, using common vocabularies, and how our Data Commons project could aim to fuel its user interface using that linked open data. In other words, the linked open data layer is where the two projects meet.