Wikidata uses the chemical graph as an important central criterion, and with it the InChI and InChIKey: a different InChI basically means a different entry. However, because of phenomena like tautomerism, different Wikidata items can actually have the same InChI. Even so, this makes the InChIKey a powerful tool for checking consistency with other databases. The results of such a comparison are useful input for manual curation efforts: when there is an inconsistency, the comparison alone does not establish where the cause of the inconsistency lies.
This chapter covers examples of these kinds of checks. Most checks compared InChIKeys, calculated from SMILES, between records from the external database and records from Wikidata. If the chemical graph is the same (and thus the InChIKey), then the external identifiers in Wikidata should also match.
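The InChIKey-based check described above can be sketched as follows. This is a minimal illustration, not the scripts used for this chapter: it assumes the InChIKeys have already been calculated from SMILES upstream (for example with a cheminformatics toolkit), and the function name, the result type, and the two input dictionaries are hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional


class Comparison(NamedTuple):
    consistent: List[str]           # same InChIKey, same identifier on both sides
    conflicting: List[str]          # same InChIKey, different identifiers
    missing_in_wikidata: List[str]  # same InChIKey, no identifier in Wikidata yet


def compare_identifiers(
    wikidata: Dict[str, Optional[str]],
    external: Dict[str, str],
) -> Comparison:
    """Compare external identifiers for structures present in both sources.

    `wikidata` maps an InChIKey to the external identifier stored in Wikidata
    (None when the item has no such identifier); `external` maps an InChIKey
    to the identifier used by the external database. Both are assumptions
    about how the data was prepared.
    """
    consistent: List[str] = []
    conflicting: List[str] = []
    missing: List[str] = []
    # Only structures known to both sources can be compared.
    for inchikey in sorted(wikidata.keys() & external.keys()):
        wikidata_id = wikidata[inchikey]
        if wikidata_id is None:
            missing.append(inchikey)
        elif wikidata_id == external[inchikey]:
            consistent.append(inchikey)
        else:
            conflicting.append(inchikey)
    return Comparison(consistent, conflicting, missing)
```

The `conflicting` list is the part that needs manual curation, since the comparison cannot tell which of the two sources holds the wrong identifier. The `missing_in_wikidata` list contains candidates for new identifier statements.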
The comparison of ChEBI [1] identifiers was performed as described above, but based on InChIs rather than InChIKeys.
The ChEBI ID property also uses the mapping relation type property as a qualifier.
Most of the time, the mapping is an exact match [2].
The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_chebi_ids.R
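As an illustration of the qualifier described above, the following sketch formats one ChEBI ID statement with a mapping relation type qualifier as a tab-separated QuickStatements (version 1) line. It is not taken from the R script linked above. The property and item identifiers (`P683` for ChEBI ID, `P4390` for mapping relation type, `Q39893449` for exact match) and the convention of storing the ChEBI ID without its `CHEBI:` prefix are assumptions that should be checked against Wikidata before use; the function name is hypothetical.

```python
# Assumed Wikidata identifiers; verify against Wikidata before use.
CHEBI_ID = "P683"                # ChEBI ID property
MAPPING_RELATION_TYPE = "P4390"  # qualifier property
EXACT_MATCH = "Q39893449"        # qualifier value for an exact match


def chebi_statement(item_qid: str, chebi_id: str, relation: str = EXACT_MATCH) -> str:
    """Format one QuickStatements V1 line: item, property, value, qualifier pair."""
    if not (item_qid.startswith("Q") and item_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item identifier: {item_qid!r}")
    # Assumption: Wikidata stores only the numeric part of the ChEBI ID.
    if chebi_id.startswith("CHEBI:"):
        chebi_id = chebi_id[len("CHEBI:"):]
    if not chebi_id.isdigit():
        raise ValueError(f"not a ChEBI identifier: {chebi_id!r}")
    # String values are wrapped in double quotes in QuickStatements V1.
    return "\t".join([item_qid, CHEBI_ID, f'"{chebi_id}"', MAPPING_RELATION_TYPE, relation])
```

Passing a different `relation` item would cover the cases where the mapping is not an exact match.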
One systematic comparison was performed with the CAS Common Chemistry database [3]. CAS Common Chemistry and Wikidata were compared and, because Wikidata has sitelinks to Wikipedia, the paper also looked at CAS registry numbers in Wikipedia. A similar effort was made to check CAS registry numbers in the HMDB [4]. The three scripts used for these analyses are provided in the supplementary information.
The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_drugbank_ids.R
The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_dsstox_ids.R
The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_hmdb_ids.R
No open dump available
No open dump available
The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_npatlas_ids.R
The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_pubchem_ids.R
The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_surechembl_ids.R
The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_swisslipids_ids.R
The same methodology as described above was used. The script used is available at https://github.com/Adafede/wd-curation-r/tree/main/inst/scripts/wd_add_unii_ids.R