Introduction¶
The ChemDataExtractor Toolkit¶
The ChemDataExtractor(CDE) toolkit is an advanced natural language processing pipeline for extracting chemical property information from the scientific literature. The theory behind the toolkit can be found in the following papers:
Original paper outlining the CDE workflow:
Swain, M. C., & Cole, J. M. “ChemDataExtractor: A Toolkit for Automated Extraction of Chemical Information from the Scientific Literature”, J. Chem. Inf. Model. 2016, 56 (10), pp 1894–1904 10.1021/acs.jcim.6b00207.
Paper describing the Snowball algorithm:
Court, C. J., & Cole, J. M. (2018). Auto-generated materials database of Curie and Néel temperatures via semi-supervised relationship extraction. Scientific data, 5, 180111. 10.1038/sdata.2018.111
Paper describing the enhancements made in CDE 2.0 and TableDataExtractor:
Juraj Mavračić, Callum J. Court, Taketomo Isazawa, Stephen R. Elliott, Jacqueline M. Cole: ChemDataExtractor 2.0: Auto-Populated Ontologies for Materials Science”, J. Chem. Inf. Model. 2021, 61(9), pp 4280–4289 10.1021/acs.jcim.1c00446
Paper describing the new Named Entity Recognition system:
Taketomo Isazawa, Jacqueline M. Cole: “Single Model for Organic and Inorganic Chemical Named Entity Recognition in ChemDataExtractor”, J. Chem. Inf. Model. 2022, 62 (5), pp 1207-1213 10.1021/acs.jcim.1c01199
and the associated website https://chemdataextractor2.org.
The general process for extracting information from scientific text is as follows:
Break a document down into its consituent elements (Title, Paragraphs, Sentences, Tables, Figures…)
Tokenize the text to isolate individual tokens
Apply Part-of-speech tagging to identify the semantic role of each token
Detect Chemical Named Entites using machine learning
Parse text and tables with nested rules to identify chemical relationships
Resolve the interdependencies between the different elements
Output a set of mutually consistent chemical records
This pipeline enables ChemDataExtractor to extract chemical information in a completely domain-independent manner.
Installation¶
To get up and running with ChemDataExtractor, you will need to install the python toolkit and then download the necessary data files. There are a few different ways to download and install the ChemDataExtractor toolkit.
Note
A Python version earlier than 3.8 must be used with ChemDataExtractor.
Option 1. Using pip
If you already have Python installed, it’s easiest to install the ChemDataExtractor package using pip. At the command line, run:
$ pip install chemdataextractor2
On Windows, this will require the Microsoft Visual C++ Build Tools to be installed. If you don’t already have pip installed, you can install it using get-pip.py
.
Option 2. Download the Latest Release
Alternatively, download the latest release manually from github or the ChemDataExtractor website and install it yourself by running:
$ cd chemdataextractor
$ python setup.py install
The setup.py
command will install ChemDataExtractor in your site-packages folder so it is automatically available to all your python scripts.
You can also get the latest release by cloning the git source code repository from https://github.com/CambridgeMolecularEngineering/chemdataextractor2/.
Getting the Data Files
In order to function, ChemDataExtractor requires a variety of data files, such as machine learning models, dictionaries, and word clusters. While previous versions of ChemDataExtractor required you to run a command to download these files, version 2.1 and above will automatically download only those files which are necessary when required by ChemDataExtractor, so you do not need to run any commands to download data. To disable this option, you should set:
chemdataextractor.data.AUTO_DOWNLOAD = False
To check where the files are and how many files have been downloaded, run:
$ cde data where
Updating
Upgrade your installation to the latest version at any time using conda or pip, matching the method you used originally to install it. For conda, run:
$ conda update -c chemdataextractor chemdataextractor
For pip, run:
$ pip install --upgrade ChemDataExtractor
“Where do I find…?”¶
The most common questions about ChemDataExtractor usually involve trying to find functionality or asking where best to put new functionality. Below is a list of the general roles each of the packages perform:
biblio
: Misc tools for parsing bibliographic information such as bibtex files, author names etc.
cli
: Command line interfact tools
doc
: Logic for reading/creating documents. That is, splitting documents down into its various elements
model
: Tools for model representation
nlp
: Tools for performing the NLP stages, such as POS tagging, Word clustering, CNER, Abbreviation detection
parse
: Chemical property parsers
reader
: Document readers
relex
: Anything related to Snowball
scrape
: Scrapers for the various data sources
text
: Useful tools for processing text