Using the Config file

Config files are managed by the config.py module.

First import the module and create a Config instance. The constructor takes the path to a YAML config file that you have created.

In [1]:

from chemdataextractor.config import Config
import os
example_path = os.path.join('tests', 'data', 'test_config.yml')
c = Config(example_path)
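The path above is relative, so it is resolved against the current working directory. The following is a minimal sketch of a guard that fails early when the file is missing. `require_file` is a hypothetical helper written for this illustration; it is not part of chemdataextractor.

```python
import os


def require_file(path):
    """Return the absolute form of `path`, or raise if no such file exists.

    Hypothetical helper for illustration; not part of chemdataextractor.
    """
    absolute = os.path.abspath(path)
    if not os.path.isfile(absolute):
        raise FileNotFoundError('Config file not found: ' + absolute)
    return absolute


example_path = os.path.join('tests', 'data', 'test_config.yml')
try:
    checked_path = require_file(example_path)
except FileNotFoundError as err:
    # Running from a different directory is the usual cause.
    checked_path = None
    print(err)
```

If `checked_path` is not None, it can be passed to Config in place of `example_path`.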

The example file looks like this:

# Example Config for testing

PARSERS:
    Paragraph: [CompoundParser, ChemicalLabelParser, NmrParser]
    Table: [CompoundTableParser, UvvisAbsQuantumYieldTableParser, UvvisEmiQuantumYieldTableParser]
    Title: [NmrParser]
    Heading: [NmrParser]
    Footnote: [ChemicalLabelParser]
    Caption: [NmrParser]

POS_TAGGER: CrfPosTagger

NER_TAGGER: CiDictCemTagger

LEXICON: Lexicon

SENTENCE_TOKENIZER: SentenceTokenizer

WORD_TOKENIZER: WordTokenizer

Parsers are specified under PARSERS, where each entry maps a CDE element type to the list of parsers used for it. For example, this file uses only CompoundParser, ChemicalLabelParser and NmrParser to interpret Paragraph objects. Choosing only the parsers relevant to your work can help speed up the CDE extraction process.

The file also lets you specify the tokenizer and tagger objects used in CDE. For example, WORD_TOKENIZER is set here to the standard WordTokenizer object, which does not support tokenization of chemistry-specific text.
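As an illustration of trimming the parser lists, the sketch below writes a reduced copy of the example config to a temporary file. It keeps every key from the example and shortens only the Paragraph and Table lists. The file name `trimmed_config.yml` and the nested indentation under PARSERS are assumptions made for this sketch, and the final step loads the file only if chemdataextractor is installed.

```python
import os
import tempfile

# Reduced copy of the example config. Only the Paragraph and Table lists
# are shortened; the indentation under PARSERS is an assumption.
TRIMMED_CONFIG = """\
PARSERS:
    Paragraph: [CompoundParser, ChemicalLabelParser]
    Table: [CompoundTableParser]
    Title: [NmrParser]
    Heading: [NmrParser]
    Footnote: [ChemicalLabelParser]
    Caption: [NmrParser]
POS_TAGGER: CrfPosTagger
NER_TAGGER: CiDictCemTagger
LEXICON: Lexicon
SENTENCE_TOKENIZER: SentenceTokenizer
WORD_TOKENIZER: WordTokenizer
"""

# Write the config to a temporary directory (hypothetical file name).
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, 'trimmed_config.yml')
with open(config_path, 'w', encoding='utf8') as f:
    f.write(TRIMMED_CONFIG)

# Load it with Config when the library is available; otherwise skip.
try:
    from chemdataextractor.config import Config
except ImportError:
    Config = None

trimmed = Config(config_path) if Config is not None else None
```

Whether a shorter parser list gives a noticeable speed-up depends on your documents, so it is worth timing both configurations on a representative sample.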

You can apply these settings when you create a document by passing the Config instance:

In [2]:

from chemdataextractor import Document
from chemdataextractor.doc.text import Paragraph
doc = Document(Paragraph('Testing'), config=c)

You can check that your parsers have been loaded correctly by inspecting the corresponding attributes. For example:

In [3]:

print(doc.paragraphs[0].parsers)
print(doc.paragraphs[0].word_tokenizer)

[<chemdataextractor.parse.cem.CompoundParser object at 0x7fb9773f70f0>, <chemdataextractor.parse.cem.ChemicalLabelParser object at 0x7fb9773f7128>, <chemdataextractor.parse.nmr.NmrParser object at 0x7fb9773f71d0>]
<chemdataextractor.nlp.tokenize.WordTokenizer object at 0x7fb9773ebac8>
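The printed representations include memory addresses, which change between runs, so comparing class names can be an easier check. The sketch below shows one way to do that. `class_names` is a hypothetical helper, and the three empty classes are stand-ins so that the example runs without chemdataextractor installed.

```python
def class_names(objects):
    """Return the class name of each object, in order."""
    return [type(obj).__name__ for obj in objects]


# Stand-in classes for illustration only; they are NOT the real CDE parsers.
class CompoundParser:
    pass


class ChemicalLabelParser:
    pass


class NmrParser:
    pass


loaded = [CompoundParser(), ChemicalLabelParser(), NmrParser()]
expected = ['CompoundParser', 'ChemicalLabelParser', 'NmrParser']

# True when the loaded parsers match the config entry, in the same order.
matches = class_names(loaded) == expected
print(matches)
```

With a real document, pass `doc.paragraphs[0].parsers` in place of `loaded` and compare against the Paragraph entry of your config file.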
