Chemistry Development Kit-based
Because SPARQL makes it easy to extract data from Wikidata, it also makes it easy to find
inconsistencies in that data. For example, we can download all SMILES strings and try to
parse them with a cheminformatics library such as the Chemistry Development Kit [1],
for instance via Bacting [2].
Unparsable SMILES
Example code is available from
checkSMILES.groovy.
This script runs a SPARQL query to get all SMILES strings and then tries to parse each one.
Strings that the parser rejects are flagged, which identifies many unparsable SMILES.
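The query-then-parse workflow described above can be sketched in a toolkit-independent way. The following is a minimal Python illustration, not the referenced Groovy script. It assumes the public Wikidata endpoint at https://query.wikidata.org/sparql and the "canonical SMILES" property P233, and it caps the query with a LIMIT to keep the example small. The function names (`fetch_smiles`, `rows_from_json`, `find_unparsable`) are hypothetical, and `parse` is a stand-in for whatever SMILES parser is used: any callable that raises an exception on invalid input.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Wikidata SPARQL endpoint.
ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: P233 ("canonical SMILES"); the LIMIT only keeps the sketch small.
QUERY = """
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P233 ?smiles .
} LIMIT 1000
"""


def rows_from_json(data):
    """Turn a SPARQL JSON result into (compound IRI, SMILES) pairs."""
    return [
        (binding["compound"]["value"], binding["smiles"]["value"])
        for binding in data["results"]["bindings"]
    ]


def fetch_smiles(query=QUERY, endpoint=ENDPOINT):
    """Run the SPARQL query over HTTP and return (compound IRI, SMILES) pairs."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query, "format": "json"})
    request = urllib.request.Request(
        url,
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": "smiles-check-sketch/0.1",  # hypothetical client name
        },
    )
    with urllib.request.urlopen(request) as response:
        return rows_from_json(json.load(response))


def find_unparsable(rows, parse):
    """Return (compound, smiles, error message) for every SMILES that parse rejects.

    parse is a hypothetical hook: any callable that raises on invalid SMILES.
    """
    failures = []
    for compound, smiles in rows:
        try:
            parse(smiles)
        except Exception as error:
            failures.append((compound, smiles, str(error)))
    return failures
```

With a cheminformatics toolkit available, `find_unparsable(fetch_smiles(), parse)` would return the entries worth inspecting by hand, where `parse` wraps that toolkit's SMILES parser. Keeping the parser as a parameter means the same loop can be reused with different toolkits, which may disagree on which strings are valid.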
RDKit-based
…
References
- Willighagen E, Mayfield JW, Alvarsson J, Berg A, Carlsson L, Jeliazkova N, et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J Cheminform. 2017 Jun 6;9(1). doi:10.1186/S13321-017-0220-4 (Scholia)
- Willighagen E. Bacting: a next generation, command line version of Bioclipse. JOSS. 2021 Jun 23;6(62):2558. doi:10.21105/JOSS.02558 (Scholia)