wikidata-chemistry-curation

Wikidata property constraints

One way to keep control over the content of Wikidata is provided by Wikidata itself: property constraints. For example, the property for the isomeric SMILES requires that each item in Wikidata has a different value (distinct-values constraint) and that an item has only one (best) value (single-best-value constraint). Many properties also use a regular expression to describe the allowed values (format constraint).
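The constraints declared on a property can be listed with a query. The following is a minimal sketch, assuming it is run on the Wikidata Query Service (where the wd:, p:, ps:, and pq: prefixes are predeclared) and taking the ChEMBL ID (P592) as the example property. It assumes that constraints are stored as P2302 statements and that a format constraint carries its regular expression in a P1793 qualifier; both should be verified on the property page before relying on the output.

```sparql
# List the constraint types declared on the ChEMBL ID property (P592).
# Assumption: P1793 ("format as a regular expression") is only present
# on format constraints, so ?regex stays unbound for the other rows.
SELECT DISTINCT ?constraint ?constraintLabel ?regex
WHERE {
  wd:P592 p:P2302 ?statement .
  ?statement ps:P2302 ?constraint .
  OPTIONAL { ?statement pq:P1793 ?regex }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,mul" . }
}
ORDER BY ?constraintLabel
```

Replacing wd:P592 with another property gives the same overview for that property.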

Using these constraints, automated tests are run routinely and their results are published as reports. Not every constraint violation is a true error, however, and the mechanism allows exceptions to be recorded. For example, the reports for the MassBank accession ID list format violations, unique-value violations, and more.
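Which exceptions have been recorded for a property can also be queried. This is a sketch under stated assumptions: it uses the ChEMBL ID (P592) as the example property, expects exceptions to be stored as P2303 qualifiers on the property's constraint statements (P2302), and is meant for the Wikidata Query Service, where the prefixes are predeclared.

```sparql
# Items recorded as exceptions to any constraint on the ChEMBL ID (P592).
SELECT ?constraint ?constraintLabel ?exception ?exceptionLabel
WHERE {
  wd:P592 p:P2302 ?statement .
  ?statement ps:P2302 ?constraint ;   # the constraint type
             pq:P2303 ?exception .    # an item exempted from it
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,mul" . }
}
ORDER BY ?constraintLabel
```

An empty result simply means that no exceptions have been recorded for that property.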

These reports give both curators and users an idea of the status of the chemistry in Wikidata.

For up-to-date information about properties used for chemicals: https://www.wikidata.org/wiki/Wikidata:WikiProject_Chemistry/Properties#Up-to-date_list_of_properties_about_chemical

Unique value

Many external identifiers are expected to be found on only a single Wikidata item. This is particularly the case when the external database, like Wikidata, uses the InChIKey as its uniqueness criterion.

All Wikidata properties have discussion pages that list constraint violations, and most of these have matching SPARQL queries. For example, the ChEMBL ID has the following SPARQL query to find ChEMBL identifiers used on more than one Wikidata item:

SPARQL sparql/P592UniqueValue.rq

# ChEMBL IDs (P592) that are used on more than one Wikidata item
SELECT
  ?value (SAMPLE(?ct) AS ?ct)
  (GROUP_CONCAT(DISTINCT(STRAFTER(STR(?item), "/entity/")); separator=", ") AS ?items)
  (GROUP_CONCAT(DISTINCT(?itemLabel); separator=", ") AS ?labels)
WHERE
{
  # Subquery: up to 100 identifier values shared by several items, most shared first
  {
    SELECT ?value (COUNT(DISTINCT ?item) AS ?ct)
    WHERE
    {
      ?item wdt:P592 ?value
    }
    GROUP BY ?value HAVING (?ct > 1)
    ORDER BY DESC(?ct)
    LIMIT 100
  }
  # Fetch every item that carries one of those values, with its label
  ?item wdt:P592 ?value .
  SERVICE wikibase:label {
    bd:serviceParam wikibase:language "en, mul" .
    ?item rdfs:label ?itemLabel .
  }
}
GROUP BY ?value
ORDER BY DESC(?ct)

While this query shows a few false positives caused by tautomerism, it provides a useful list to regularly check:

value | ct | items | labels
CHEMBL521177 | 2 | Q6469057, Q105287434 | lactucin, 4-epi-lactucin
CHEMBL1206440 | 2 | Q2130929, Q27124801 | cyclamic acid, cyclamate
CHEMBL2303614 | 2 | Q408014, Q74511001 | chondroitin sulfate, (2S,3S,4S,5R,6S)-6-[[(2R,3R,4S,5R,6S)-3-Acetamido-2,6-bis(hydroxymethyl)-5-(sulfomethyl)oxan-4-yl]methoxymethyl]-4,5-dihydroxy-3-(hydroxymethyl)oxane-2-carboxylic acid
CHEMBL1433 | 2 | Q422442, Q82982262 | doxycycline, doxycycline tautomer
CHEMBL91 | 2 | Q410534, Q75163056 | miconazole, rac-miconazole
CHEMBL1201341 | 2 | Q4352952, Q27077150 | echothiophate
CHEMBL1201668 | 2 | Q6997373, Q66360952 | nesiritide, brain natriuretic peptide
CHEMBL2110884 | 2 | Q27284343, Q76005793 | cetocycline, cetocycline tautomer
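To help separate true duplicates from cases such as tautomers or stereoisomers, the structure identifiers of the items sharing a ChEMBL ID can be put side by side. This is a minimal sketch, assuming the InChIKey is stored in P235 and that the query is run on the Wikidata Query Service; the subquery limit of 100 is an arbitrary choice.

```sparql
# Items that share a ChEMBL ID (P592), listed with their InChIKey (P235).
SELECT ?value ?item ?itemLabel ?inchikey
WHERE {
  # Subquery: ChEMBL IDs that occur on more than one item
  {
    SELECT ?value
    WHERE { ?item wdt:P592 ?value }
    GROUP BY ?value HAVING (COUNT(DISTINCT ?item) > 1)
    LIMIT 100
  }
  ?item wdt:P592 ?value .
  # Not every item has an InChIKey, so keep it optional
  OPTIONAL { ?item wdt:P235 ?inchikey }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,mul" . }
}
ORDER BY ?value
```

Items with an identical InChIKey are likely true duplicates. InChIKeys that share only their first block of 14 characters typically point to the same connectivity with different stereochemistry or isotopes.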

Single value

Similarly, we can use SPARQL to find Wikidata items with more than one ChEMBL identifier:

SPARQL sparql/P592SingleValue.rq

# Wikidata items with more than one non-deprecated ChEMBL ID (P592).
# The named subquery %formatter looks up the formatter URL (P1630) of the property, if any.
SELECT DISTINCT ?itemLabel ?itemLabelURL ?count ?sample1 ?sample2 ?exception
WITH {
  SELECT ?formatter WHERE {
    OPTIONAL { wd:P592 wdt:P1630 ?formatter }
  } LIMIT 1
} AS %formatter
WHERE
{
  {
    # Count the values per item and keep two samples, as links when a formatter URL exists
    SELECT ?item (COUNT(?value) AS ?count) (MIN(?value) AS ?sample1) (MAX(?value) AS ?sample2) {
      ?item p:P592 [ ps:P592 ?val; wikibase:rank ?rank ] .
      FILTER( ?rank != wikibase:DeprecatedRank ) .
      INCLUDE %formatter .
      BIND( IF( BOUND( ?formatter ), URI( REPLACE( ?formatter, '\\$1', ?val ) ), ?val ) AS ?value ) .
    } GROUP BY ?item HAVING ( ?count > 1 ) LIMIT 100
  } .
  # Flag items recorded as exceptions (P2303) to the single-value constraint (Q19474404)
  OPTIONAL {
    wd:P592 p:P2302 [ ps:P2302 wd:Q19474404; pq:P2303 ?exc ] .
    FILTER( ?exc = ?item ) .
  } .
  BIND( BOUND( ?exc ) AS ?exception ) .
  BIND( ?item AS ?itemLabelURL )
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,mul" } .
}
ORDER BY DESC(?count)

This gives the following list of Wikidata items with more than one ChEMBL ID:

itemLabelURL | count | sample1 | sample2 | exception
http://www.wikidata.org/entity/Q425293 | 2 | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3039598/ | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3306578/ | false
http://www.wikidata.org/entity/Q4008670 | 2 | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL2103975/ | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL264186/ | false
http://www.wikidata.org/entity/Q417219 | 2 | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL2079587/ | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3182301/ | false
http://www.wikidata.org/entity/Q417227 | 2 | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL1286/ | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL150361/ | false
http://www.wikidata.org/entity/Q420532 | 2 | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3039593/ | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL553025/ | false
http://www.wikidata.org/entity/Q7784695 | 2 | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL10247/ | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL298827/ | false
http://www.wikidata.org/entity/Q422301 | 2 | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL1201488/ | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL3039582/ | false
http://www.wikidata.org/entity/Q75830 | 2 | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL18041/ | https://www.ebi.ac.uk/chembl/compound_report_card/CHEMBL28992/ | false
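When deciding which of the values to keep, deprecate, or mark as preferred, it can help to look at the individual statements of an item. The following is a sketch, assuming the Wikidata Query Service as the endpoint and using Q425293 from the list above as the example item.

```sparql
# All ChEMBL ID (P592) statements on one item, with their ranks.
SELECT ?value ?rank
WHERE {
  wd:Q425293 p:P592 ?statement .
  ?statement ps:P592 ?value ;
             wikibase:rank ?rank .
}
```

The rank is returned as wikibase:NormalRank, wikibase:PreferredRank, or wikibase:DeprecatedRank; the query above the list only counts statements that are not deprecated.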

References