Wikidata-based curation approaches

Wikidata items without SMILES

Wikipedia has its own chemistry community. Although some Wikidata chemistry content is shown on Wikipedia, it regularly happens that Wikipedia has a SMILES for a chemical compound while Wikidata does not. DBpedia, which extracts structured data from Wikipedia, helps find these cases [1].

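The following SPARQL query retrieves up to ten thousand Wikipedia pages that use the Chembox template (ten thousand is the default result limit in DBpedia) and checks, for each of them, whether the matching Wikidata item has a SMILES. An item counts as missing a SMILES when it has none of canonical SMILES (P233), isomeric SMILES (P2017), and CXSMILES (P10718):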

SPARQL sparql/missingSMILES.rq

PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?s ?article ?item ?itemLabel WITH {
  SELECT DISTINCT ?s ?article WHERE {
    SERVICE <https://dbpedia.org/sparql> {
      ?s dbpedia2:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:Chembox>.
      ?article_db foaf:primaryTopic ?s.
    }
    BIND (IRI(REPLACE(STR(?article_db), "http://", "https://", "i")) AS ?article)
  }
} AS %DBPEDIA WITH {
  SELECT DISTINCT ?s ?article ?item WHERE {
    INCLUDE %DBPEDIA
    ?article schema:about ?item .
    MINUS { ?item wdt:P233 [] }
    MINUS { ?item wdt:P2017 [] }
    MINUS { ?item wdt:P10718 [] }
  }
} AS %CHEMICALS WHERE {
  INCLUDE %CHEMICALS
  VALUES ?chemicals { wd:Q113145171 wd:Q59199015 }
  ?item wdt:P31 ?chemicals.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

The results look like this:

| s | article | item |
| --- | --- | --- |
| http://dbpedia.org/resource/Vanadium(V)_chloride_chlorimide | https://en.wikipedia.org/wiki/Vanadium(V)_chloride_chlorimide | vanadium(V) chloride chlorimide |
| http://dbpedia.org/resource/Potassium_tetracarbonyliron_hydride | https://en.wikipedia.org/wiki/Potassium_tetracarbonyliron_hydride | potassium tetracarbonyliron hydride |
| http://dbpedia.org/resource/(Triphenylphosphine)iron_tetracarbonyl | https://en.wikipedia.org/wiki/(Triphenylphosphine)iron_tetracarbonyl | (triphenylphosphine)iron tetracarbonyl |
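
For curation work it can help to fetch such results programmatically instead of reading them in the browser. The following is a minimal Python sketch, not part of the original text, that sends a query to the Wikidata Query Service and flattens the standard SPARQL JSON results into plain dictionaries. The names `run_query`, `flatten_bindings`, and the `USER_AGENT` contact string are hypothetical choices for illustration; the endpoint URL is assumed to be the public one at query.wikidata.org.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata Query Service endpoint.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"
# Hypothetical contact string; replace it with your own details.
USER_AGENT = "missing-smiles-sketch/0.1 (mailto:you@example.org)"


def run_query(query: str) -> dict:
    """Send a SPARQL query via POST and return the parsed JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        WDQS_ENDPOINT,
        data=data,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": USER_AGENT,
        },
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)


def flatten_bindings(results: dict) -> list[dict]:
    """Turn SPARQL JSON results into one {variable: value} dict per row.

    Variables that are unbound in a row are left out of that row's dict.
    """
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        rows.append(
            {var: binding[var]["value"] for var in variables if var in binding}
        )
    return rows
```

With the query text loaded from a local copy of `sparql/missingSMILES.rq`, `flatten_bindings(run_query(query_text))` would give one dictionary per row, with keys such as `s`, `article`, `item`, and `itemLabel`. Keeping the parsing separate from the network call makes it possible to test it on saved results.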

Polymers without CXSMILES

Many polymers can be described with a CXSMILES (P10718). The following query lists the polymers that do not yet have this property:

SPARQL sparql/polymersWithoutCXSMILES.rq

SELECT ?cmp ?cmpLabel WHERE {
  ?cmp wdt:P31 wd:Q81163 .
  MINUS { ?cmp wdt:P10718 [] }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}

This returns items such as:

| cmp |
| --- |
| polyisoprene |
| Styrene-acrylonitrile resin |
| chondroitin sulfate |

Functional groups without CXSMILES

We can do the same thing for functional groups:

SPARQL sparql/functionalGroupsWithoutCXSMILES.rq

SELECT ?fg ?fgLabel ?cxsmiles WHERE {
  ?fg wdt:P31/wdt:P279* wd:Q170409 .
  MINUS { ?fg wdt:P10718 ?cxsmiles }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". }
}

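Here too, the results are a list of curation opportunities. The cxsmiles column is always empty, because the MINUS clause removes every functional group that has a CXSMILES: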

| fg | cxsmiles |
| --- | --- |
| acetamido group | |
| peroxyacetyl group | |
| benzhydryl | |
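
The polymer and functional group queries follow the same pattern: select the instances of a class and subtract those that have a given property. As a hedged illustration, that pattern could be captured in a small Python helper that builds the query text. The function name `missing_property_query`, its parameters, and the `?item` variable name are assumptions made for this sketch and do not come from the queries above.

```python
import re


def missing_property_query(
    class_qid: str, property_pid: str, include_subclasses: bool = False
) -> str:
    """Build a SPARQL query for instances of a class that lack a property.

    include_subclasses=True also follows "subclass of" (P279) links, as the
    functional group query does; False matches direct instances only, as
    the polymer query does.
    """
    # Validate the identifiers so arbitrary text cannot end up in the query.
    if not re.fullmatch(r"Q[1-9]\d*", class_qid):
        raise ValueError(f"not a Wikidata item identifier: {class_qid!r}")
    if not re.fullmatch(r"P[1-9]\d*", property_pid):
        raise ValueError(f"not a Wikidata property identifier: {property_pid!r}")

    path = "wdt:P31/wdt:P279*" if include_subclasses else "wdt:P31"
    return (
        "SELECT ?item ?itemLabel WHERE {\n"
        f"  ?item {path} wd:{class_qid} .\n"
        f"  MINUS {{ ?item wdt:{property_pid} [] }}\n"
        "  SERVICE wikibase:label { bd:serviceParam wikibase:language"
        ' "[AUTO_LANGUAGE],en". }\n'
        "}"
    )
```

For example, `print(missing_property_query("Q81163", "P10718"))` produces a query equivalent to the polymer one, and `missing_property_query("Q170409", "P10718", include_subclasses=True)` corresponds to the functional group one, apart from the variable names and the label language list. The printed text can be pasted into the Wikidata Query Service.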

References

  1. Auer S, Bizer C, Kobilarov G, Lehmann J, Cyganiak R, Ives Z. DBpedia: A Nucleus for a Web of Open Data. In: The Semantic Web. 2007. p. 722–35. doi:10.1007/978-3-540-76298-0_52