ChemSpider, a New Kind of Chemical Search Engine

Harry E. Pence
SUNY Distinguished Teaching Professor Emeritus

What kind of questions does a chemist expect a search engine to answer? Probably, she wants to ask about physical and chemical properties, molecular structure, spectral data, synthetic methods, reactions, safety information, or alternate names for a compound of interest. It would, of course, be helpful if the search engine could reply to queries based on molecular structure, since that is a basic language of chemistry. Even the best general-purpose search engine, like Google or Bing, can respond to only some of these queries, and getting even those answers may require visiting several different web sites. There is, however, a search engine that is designed to answer exactly these kinds of questions: ChemSpider.

ChemSpider is a free, online chemical database, which offers the type of information described in the previous paragraph for over twenty-two million compounds. In some ways ChemSpider resembles Wikipedia, since it depends upon users, so-called crowdsourcing, to provide much of the information for the index. ChemSpider was originally created by Antony Williams and a few associates, but it was recently purchased by the Royal Society of Chemistry (RSC). This purchase makes good sense. ChemSpider gains major financial and infrastructure support and will include material from the wealth of supplemental material that is collected by RSC journals. The RSC can use the ChemSpider database to continue to expand the features in Project Prospect, a relatively new initiative in journal publication which has already won the 2007 ALPSP/Charlesworth Award for Publishing Innovation.

At the recent ACS meeting in Washington, DC, while listening to a presentation about ChemSpider by Antony Williams, I was reminded of the old adage that, "You've got to have connections!" The difference in the 21st century is that the important connection is not necessarily an uncle who knows the company president, but rather a set of hyperlinks that connect you to the required information. Williams explained that ChemSpider is linked to a number of other sources for chemical data. These include chemical databases, such as Wikipedia, PubChem, ChEBI (Chemical Entities of Biological Interest), and KEGG (the Kyoto Encyclopedia of Genes and Genomes), as well as chemical vendors, a patent database, and open access chemistry journals. Patent searches are provided by SureChem, which supports not only text search but also searches by chemical structure and substructure. Basic SureChem searches are free, and an advanced version, SureChem Pro, is available for a modest fee. At the conference Antony announced a new integration with PubMed, whereby chemical records on ChemSpider are directly linked to PubMed articles referencing that compound.

Curation, ensuring the accuracy of the data in a digital database, is an essential problem for any reference source that is based on crowdsourcing. ChemSpider allows anyone to enter information and annotate the records, but to prevent anonymous vandalism one must log in to make changes. Currently the standard chemical resource is the Chemical Abstracts Service (CAS), which has been in the business of aggregating chemistry-related data for 102 years in order to create the CAS registry. CAS just recorded the 50 millionth chemical structure (C&E News, Sept. 14, 2009, pg. 3). ChemSpider has aggregated over 20 million unique chemical entities in less than 3 years and, according to Antony's comments at the ACS meeting, over 10 million new compounds from various sources are presently being deduplicated and added to the database. Many of the compounds in the current database have already been curated, and the process is ongoing. CAS includes ChemSpider content, with hundreds of thousands of unique compounds indexed already, suggesting that the curation already done by the ChemSpider team has been impressive.

What free online sources might be compared to ChemSpider? Searching the WWW with Google, or even one of the new types of search engines, like Wolfram Alpha or Google Squared, is clearly less reliable than ChemSpider, since there is no overall curation of the web pages that serve as a basis for all these services. Perhaps the closest comparison is the Chemistry listings in Wikipedia, but Wikipedia has information on far fewer compounds and it is based only on text searching.

Chemical structures are a particular curation problem, since humans often assign incorrect structures in articles and even patents. Structural chemical searches on ChemSpider are based on the use of the International Chemical Identifier (InChI, pronounced "Inchee"), developed jointly by IUPAC and NIST. InChIs are preferred to the older SMILES notation because every molecular structure has a unique InChI representation, which is not true of SMILES. InChIs have some advantages over CAS registry numbers because InChIs are open source and are calculated by a software program rather than being assigned by some group. Although humans may be able to read an InChI after some practice, there is also an InChIKey (or hashed InChI), which has a fixed length and is only readable by machine. The majority of structure drawing packages, both commercial and open source, offer the facility to convert chemical structures to InChIs and vice versa. One popular resource for interconverting structures and InChIs is the ACD/ChemSketch program, which is available as a free download.
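To make the difference between a readable InChI and a fixed-length InChIKey concrete, here is a minimal Python sketch. It is not part of ChemSpider or of the official InChI software, and the function names split_inchi and looks_like_inchikey are hypothetical helpers invented for this illustration. It assumes the layout of the current standard InChIKey (27 characters arranged as 14 letters, 10 letters, and 1 letter, separated by hyphens) and uses the standard InChI and InChIKey for ethanol as sample data. It only splits and checks strings; computing an InChI from a structure requires the InChI software or a structure drawing package.

```python
import re

# Assumed layout of the current standard InChIKey:
# 14 letters, hyphen, 10 letters, hyphen, 1 letter (27 characters in all).
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def split_inchi(inchi):
    """Split an InChI string into (version, formula, layers).

    Hypothetical helper for illustration: layers are keyed by their
    one-letter prefix, e.g. 'c' (connections) and 'h' (hydrogens).
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError("InChI has no formula layer")
    version, formula = parts[0], parts[1]
    layers = {part[0]: part[1:] for part in parts[2:] if part}
    return version, formula, layers


def looks_like_inchikey(key):
    """Check only the shape of the key, not whether it matches a real compound."""
    return bool(INCHIKEY_PATTERN.match(key))


# Sample data: the standard InChI and InChIKey for ethanol.
ETHANOL_INCHI = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
ETHANOL_KEY = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"

version, formula, layers = split_inchi(ETHANOL_INCHI)
print(version, formula, layers)
print(len(ETHANOL_KEY), looks_like_inchikey(ETHANOL_KEY))
```

The InChI can be taken apart into a formula and named layers that a chemist can learn to read, while the InChIKey offers nothing to interpret beyond its fixed shape, which is what makes it convenient for indexing and web searching.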

The latest addition to the ChemSpider Universe is ChemMantis, where Mantis stands for Markup And Nomenclature Transformation Integrated System. ChemMantis was used as a platform to host the ChemSpider Journal of Chemistry to provide access to "chemistry-related materials including synthesis, analytical chemistry, solid state chemistry, chemical biology, cheminformatics, and molecular modeling." ChemMantis is a semantic markup platform for chemistry-related documents. It supports both manual and automated markup of the documents according to a series of dictionaries and algorithms. In particular, ChemMantis identifies chemical names in chemistry articles, converts them to structures, and links them to the ChemSpider database, creating a document markup system that connects users to the rich content in ChemSpider. Thus, any compound name that appears in an article can be automatically linked to ChemSpider, where a reader can obtain further information. Since ChemMantis is a publishing platform, the resulting articles can be very semantically rich. The approach parallels that of Project Prospect, introduced previously by the RSC, but with the added advantage of the ChemSpider integration. Project Prospect and ChemMantis will probably become more integrated in the near future.
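The dictionary-driven part of this kind of markup can be illustrated with a toy Python sketch. This is not the ChemMantis implementation: the name dictionary, the mark_up function, and the link template are all hypothetical, the link points at a placeholder address rather than a real ChemSpider URL, and the sketch does plain name lookup only, with none of the name-to-structure conversion described above.

```python
import re

# Hypothetical name dictionary for illustration: compound name -> standard InChIKey.
NAME_TO_KEY = {
    "ethanol": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    "water": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
}

# Placeholder link template (an assumption, not a real ChemSpider address).
LINK_TEMPLATE = "https://example.org/compound/{key}"


def mark_up(text, names=NAME_TO_KEY):
    """Wrap every known compound name in the text with a hyperlink."""
    # Try longer names first so a long name wins over a shorter name it contains.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in ordered) + r")\b",
        re.IGNORECASE,
    )

    def link(match):
        word = match.group(0)
        url = LINK_TEMPLATE.format(key=names[word.lower()])
        return f'<a href="{url}">{word}</a>'

    return pattern.sub(link, text)


print(mark_up("Ethanol is miscible with water."))
```

Matching on word boundaries keeps a word such as "wastewater" from being linked as "water", which hints at why a production system needs curated dictionaries and additional algorithms rather than simple string replacement.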

ChemSpider has a number of advantages over a simple Google search. The variety of information about a compound provided at ChemSpider is hard to match on any other free web site, and the user can be sure that the data is provided by practicing chemists and in many cases has been reviewed for accuracy. Frequently, ChemSpider provides links to other online sources for further information. As the association with the RSC develops, the site should be increasingly linked to supplemental data in the RSC archives, which contain a wealth of chemical information that has not previously been readily accessible. In summary, ChemSpider appears to be an excellent resource that chemists should consider adding to their short list of favorite web resources.

Acknowledgement: The author wishes to thank Antony Williams for his help in clarifying a number of points about ChemSpider.

Return to Fall 2009 CCCE Newsletter