Natural Language Processing

ChemDataExtractor also includes state-of-the-art Natural Language Processing (NLP) facilities: tokenization, part-of-speech tagging, a lexicon, and abbreviation detection. Each is described below.

Tokenization

Sentence Tokenization

Use the sentences property on a text-based document element to perform sentence segmentation:

>>> from chemdataextractor.doc import Paragraph
>>> para = Paragraph('1,4-Dibromoanthracene was prepared from 1,4-diaminoanthraquinone. 1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer.')
>>> para.sentences
[Sentence('1,4-Dibromoanthracene was prepared from 1,4-diaminoanthraquinone.', 0, 65),
 Sentence('1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer.', 66, 135)]

Each sentence object is a document element in itself, and additionally contains the start and end character offsets within its parent element.
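The offsets can be read as ordinary Python slice indices into the parent text. The following is a minimal sketch in plain Python, with no ChemDataExtractor import; the variable names are illustrative, and the spans are copied from the output above:

```python
# Parent text of the paragraph used in the example above.
text = (
    "1,4-Dibromoanthracene was prepared from 1,4-diaminoanthraquinone. "
    "1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer."
)

# (start, end) character offsets as reported for the two sentences.
sentence_spans = [(0, 65), (66, 135)]

# Slicing the parent text with each span recovers the sentence text.
sentences = [text[start:end] for start, end in sentence_spans]

for (start, end), sentence in zip(sentence_spans, sentences):
    print(start, end, repr(sentence))
```

The end offset is exclusive, so the single space between the two sentences (character 65) belongs to neither span.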

Word Tokenization

Use the tokens property to get the word tokens:

>>> para.tokens
[[Token('1,4-Dibromoanthracene', 0, 21),
  Token('was', 22, 25),
  Token('prepared', 26, 34),
  Token('from', 35, 39),
  Token('1,4-diaminoanthraquinone', 40, 64),
  Token('.', 64, 65)],
 [Token('1H', 66, 68),
  Token('NMR', 69, 72),
  Token('spectra', 73, 80),
  Token('were', 81, 85),
  Token('recorded', 86, 94),
  Token('on', 95, 97),
  Token('a', 98, 99),
  Token('300', 100, 103),
  Token('MHz', 104, 107),
  Token('BRUKER', 108, 114),
  Token('DPX300', 115, 121),
  Token('spectrometer', 122, 134),
  Token('.', 134, 135)]]

This also works on an individual sentence:

>>> para.sentences[0].tokens
[Token('1,4-Dibromoanthracene', 0, 21),
 Token('was', 22, 25),
 Token('prepared', 26, 34),
 Token('from', 35, 39),
 Token('1,4-diaminoanthraquinone', 40, 64),
 Token('.', 64, 65)]

There are also raw_sentences and raw_tokens properties that return strings instead of Sentence and Token objects.

Using Tokenizers Directly

All tokenizers have a tokenize method that takes a text string and returns a list of tokens:

>>> from chemdataextractor.nlp.tokenize import ChemWordTokenizer
>>> cwt = ChemWordTokenizer()
>>> cwt.tokenize('1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer.')
['1H', 'NMR', 'spectra', 'were', 'recorded', 'on', 'a', '300', 'MHz', 'BRUKER', 'DPX300', 'spectrometer', '.']

There is also a span_tokenize method that returns the start and end character offsets of each token in the original string:

>>> cwt.span_tokenize('1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer.')
[(0, 2), (3, 6), (7, 14), (15, 19), (20, 28), (29, 31), (32, 33), (34, 37), (38, 41), (42, 48), (49, 55), (56, 68), (68, 69)]
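To illustrate how tokenize and span_tokenize relate, here is a minimal sketch of a tokenizer that derives its tokens from its spans. It is a simplified stand-in built on a regular expression, not the ChemWordTokenizer implementation; the pattern and function bodies are assumptions made for illustration only.

```python
import re

# Simplified stand-in pattern: a run of word characters, or one punctuation mark.
# The real ChemWordTokenizer applies chemistry-aware rules instead.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def span_tokenize(text):
    """Return (start, end) character offsets for each token in text."""
    return [match.span() for match in TOKEN_PATTERN.finditer(text)]


def tokenize(text):
    """Return token strings by slicing text with the spans."""
    return [text[start:end] for start, end in span_tokenize(text)]


text = "1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer."
spans = span_tokenize(text)
tokens = tokenize(text)
print(tokens)
print(spans)
```

On this sentence the naive pattern happens to yield the same tokens and spans as shown above. It would, however, break a name such as 1,4-Dibromoanthracene apart at the comma and hyphen, which is the kind of case a chemistry-aware tokenizer is designed to handle.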

Part-of-speech Tagging

ChemDataExtractor contains a chemistry-aware part-of-speech tagger. Use the pos_tagged_tokens property on a document element to get the tagged tokens:

>>> from chemdataextractor.doc import Sentence
>>> s = Sentence('1H NMR spectra were recorded on a 300 MHz BRUKER DPX300 spectrometer.')
>>> s.pos_tagged_tokens
[('1H', 'NN'),
 ('NMR', 'NN'),
 ('spectra', 'NNS'),
 ('were', 'VBD'),
 ('recorded', 'VBN'),
 ('on', 'IN'),
 ('a', 'DT'),
 ('300', 'CD'),
 ('MHz', 'NNP'),
 ('BRUKER', 'NNP'),
 ('DPX300', 'NNP'),
 ('spectrometer', 'NN'),
 ('.', '.')]

Using Taggers Directly

All taggers have a tag method that takes a list of RichToken instances and returns a list of (token, tag) tuples. For more information on how to use these taggers directly, see the documentation for BaseTagger.

Lexicon

As ChemDataExtractor processes documents, it adds each unique word that it encounters to the Lexicon as a Lexeme. Each Lexeme stores various word features, so they don’t have to be re-calculated for every occurrence of that word.

You can access the Lexeme for a token using the lex property:

>>> s = Sentence('Sulphur and Oxygen.')
>>> s.tokens[0]
Token('Sulphur', 0, 7)
>>> s.tokens[0].lex.normalized
'sulfur'
>>> s.tokens[0].lex.is_hyphenated
False
>>> s.tokens[0].lex.cluster
'11011101100110'
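The caching idea behind the Lexicon can be sketched in a few lines of plain Python. The SimpleLexicon class, its computed counter, and the three features it stores are hypothetical and chosen for illustration; they do not reproduce ChemDataExtractor's Lexicon or Lexeme classes.

```python
class SimpleLexicon:
    """Hypothetical stand-in: computes word features once and caches them."""

    def __init__(self):
        self._lexemes = {}
        self.computed = 0  # how many times features were actually calculated

    def _features(self, word):
        # Illustrative features only; the real Lexeme stores different ones.
        self.computed += 1
        return {
            "lower": word.lower(),
            "length": len(word),
            "is_hyphenated": "-" in word,
        }

    def __getitem__(self, word):
        # Calculate features on first sight of a word, then reuse them.
        if word not in self._lexemes:
            self._lexemes[word] = self._features(word)
        return self._lexemes[word]


lexicon = SimpleLexicon()
for word in "Sulphur and Oxygen and Sulphur".split():
    lexicon[word]

print(lexicon.computed)  # number of unique words seen
```

Five words are looked up, but features are calculated only for the three unique ones.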

Abbreviation Detection

Abbreviation detection is done using a method based on the algorithm in Schwartz & Hearst 2003:

>>> p = Paragraph(u'Dye-sensitized solar cells (DSSCs) with ZnTPP = Zinc tetraphenylporphyrin.')
>>> p.abbreviation_definitions
[([u'ZnTPP'], [u'Zinc', u'tetraphenylporphyrin'], u'CM'),
 ([u'DSSCs'], [u'Dye', u'-', u'sensitized', u'solar', u'cells'], None)]

Abbreviation definitions are returned as tuples containing the abbreviation, the long name, and an entity tag. The entity tag is CM if the abbreviation is for a chemical entity, otherwise it is None.
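For orientation, the core long-form matching step of the Schwartz & Hearst approach can be sketched as follows. This is a simplified reading of the published algorithm, not ChemDataExtractor's implementation: the function name find_best_long_form is illustrative, and the sketch omits candidate-pair detection, the "=" pattern shown above, and entity tagging.

```python
def find_best_long_form(short_form, candidate_text):
    """Match short_form characters right-to-left against candidate_text.

    Returns the shortest trailing part of candidate_text that contains the
    characters of short_form in order, with the first character of short_form
    at the start of a word, or None if no such match exists.
    """
    s_index = len(short_form) - 1
    l_index = len(candidate_text) - 1
    while s_index >= 0:
        char = short_form[s_index].lower()
        if not char.isalnum():
            s_index -= 1  # skip punctuation inside the short form
            continue
        # Move left until the character matches; the first short-form
        # character must additionally sit at the beginning of a word.
        while (l_index >= 0 and candidate_text[l_index].lower() != char) or (
            s_index == 0 and l_index > 0 and candidate_text[l_index - 1].isalnum()
        ):
            l_index -= 1
        if l_index < 0:
            return None
        l_index -= 1
        s_index -= 1
    # Extend back to the start of the word containing the first matched character.
    start = candidate_text.rfind(" ", 0, l_index + 1) + 1
    return candidate_text[start:]


print(find_best_long_form("DSSCs", "We fabricated dye-sensitized solar cells"))
print(find_best_long_form("ZnTPP", "Zinc tetraphenylporphyrin"))
print(find_best_long_form("XYZ", "solar cells"))
```

Matching from the right keeps the long form as short as possible, which is why the leading words "We fabricated" are dropped in the first call.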