wikidata-chemistry-curation

Comparing against databases

Wikidata uses the chemical graph as its central criterion for identity, and with that the InChI and InChIKey: a different InChI basically means a different entry. The reverse does not always hold: because of phenomena like tautomerism, different Wikidata items can actually have the same InChI. Even so, the InChIKey is a powerful tool for checking consistency with other databases. The results of such a comparison are useful input for manual curation efforts, because when an inconsistency is found, it is not known in advance whether its cause lies in Wikidata or in the other database.
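The shared-InChI situation described above can be surfaced with a simple grouping step. The following is a minimal sketch, not the method used by the curation scripts: it assumes a list of (item, InChI) pairs has already been exported from Wikidata, and the item identifiers shown are made-up placeholders. The InChI string is the one commonly reported for the tautomer pair 2-pyridone and 2-hydroxypyridine and is used here for illustration only.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def items_sharing_inchi(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Return the InChIs that occur on more than one item, with those items."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for item_id, inchi in pairs:
        groups[inchi].append(item_id)
    # Keep only InChIs carried by two or more items.
    return {inchi: sorted(ids) for inchi, ids in groups.items() if len(ids) > 1}


# Placeholder item identifiers (hypothetical, not real Wikidata items).
sample_pairs = [
    ("ITEM-A", "InChI=1S/C5H5NO/c7-5-3-1-2-4-6-5/h1-4H,(H,6,7)"),  # one tautomer
    ("ITEM-B", "InChI=1S/C5H5NO/c7-5-3-1-2-4-6-5/h1-4H,(H,6,7)"),  # other tautomer
    ("ITEM-C", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),               # ethanol
]

shared = items_sharing_inchi(sample_pairs)
for inchi, ids in shared.items():
    print(inchi, ids)
```

Groups found this way are candidates for manual review, not automatic merges, since distinct items with the same InChI can be legitimate.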

This chapter covers examples of these kinds of checks. Most checks compared records from the external database with Wikidata using InChIKeys calculated from SMILES. If the chemical graph is the same (and thus the InChIKey), then the external identifiers in Wikidata should also match.
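As an illustration of this kind of check, the sketch below compares two lookup tables keyed by InChIKey. It is a simplified stand-in written in Python, not a port of the R scripts linked in this chapter. It assumes the InChIKeys have already been calculated from SMILES on both sides; the function name, the report structure, and the EXT-... identifiers are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Report(NamedTuple):
    consistent: List[str]                    # InChIKeys where both sides agree
    conflicting: List[Tuple[str, str, str]]  # (InChIKey, Wikidata value, external value)
    missing: List[Tuple[str, str]]           # (InChIKey, external value) not yet in Wikidata


def compare_identifiers(
    wikidata: Dict[str, Optional[str]],
    external: Dict[str, str],
) -> Report:
    """Compare external identifiers for structures found on both sides.

    wikidata maps InChIKey -> identifier stored on the item (None if absent).
    external maps InChIKey -> identifier in the external database.
    """
    consistent, conflicting, missing = [], [], []
    # Only structures present in both sources can be compared.
    for key in sorted(wikidata.keys() & external.keys()):
        wd_id, ext_id = wikidata[key], external[key]
        if wd_id is None:
            missing.append((key, ext_id))
        elif wd_id == ext_id:
            consistent.append(key)
        else:
            conflicting.append((key, wd_id, ext_id))
    return Report(consistent, conflicting, missing)


# Real InChIKeys, but made-up external identifiers (placeholders).
wikidata_records = {
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N": "EXT-0001",  # aspirin
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N": "EXT-0002",  # caffeine
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": None,        # ethanol, no identifier yet
}
external_records = {
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N": "EXT-0001",
    "RYYVLZVUVIJVGH-UHFFFAOYSA-N": "EXT-0999",
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "EXT-0003",
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "EXT-0004",  # water, not in the Wikidata sample
}

report = compare_identifiers(wikidata_records, external_records)
print("consistent:", report.consistent)
print("conflicting:", report.conflicting)
print("missing:", report.missing)
```

The conflicting entries are the ones that need manual curation, because the comparison itself cannot tell which side holds the error. The missing entries are candidates for adding the identifier to Wikidata.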

Chemical Entities of Biological Interest (ChEBI)

The ChEBI [1] identifier comparison was performed as described above, but based on InChIs rather than InChIKeys. The ChEBI ID property also takes the mapping relation type property as a qualifier; most of the time, the match is an exact match [2]. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_chebi_ids.R

Common Chemistry

One systematic comparison was performed with the CAS Common Chemistry database [3]. CAS Common Chemistry and Wikidata were compared and, because Wikidata has sitelinks to Wikipedia, the paper also looked at CAS registry numbers in Wikipedia. A similar effort checked CAS registry numbers in the HMDB [4]. The three scripts used for these analyses are provided in the supplementary information.

DrugBank

The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_drugbank_ids.R

DSSTox

The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_dsstox_ids.R

HMDB

The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_hmdb_ids.R

KNApSAcK

No open dump is available.

nmrshiftdb2

No open dump is available.

NP Atlas

The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_npatlas_ids.R

PubChem

The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_pubchem_ids.R

SureChEMBL

The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_surechembl_ids.R

SwissLipids

The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_swisslipids_ids.R

Unique Ingredient Identifier (UNII)

The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_unii_ids.R

References

  1. ChEBI [Internet]. ELIXIR EMBL-EBI Node. Cambridge, United Kingdom: EMBL’s European Bioinformatics Institute; Available from: http://www.ebi.ac.uk/chebi
  2. exact match.
  3. Jacobs A, Williams D, Hickey K, Patrick N, Williams AJ, Chalk S, et al. CAS Common Chemistry in 2021: Expanding Access to Trusted Chemical Information for the Scientific Community. JCIM. 2022 May 13; doi:10.1021/ACS.JCIM.2C00268
  4. Wishart D, Guo A, Oler E, Wang F, Anjum A, Peters H, et al. HMDB 5.0: the Human Metabolome Database for 2022. NAR. 2022 Jan 1;50(D1):D622–31. doi:10.1093/NAR/GKAB1062