Using the Config file
Config files are managed by the config.py module.
First import the module and create a Config instance. The constructor takes the path to a YAML config file that you have written.
In [1]:
from chemdataextractor.config import Config
import os
example_path = os.path.join('tests', 'data', 'test_config.yml')
c = Config(example_path)
The example file looks like this:
# Example Config for testing
PARSERS:
    Paragraph: [CompoundParser, ChemicalLabelParser, NmrParser]
    Table: [CompoundTableParser, UvvisAbsQuantumYieldTableParser, UvvisEmiQuantumYieldTableParser]
    Title: [NmrParser]
    Heading: [NmrParser]
    Footnote: [ChemicalLabelParser]
    Caption: [NmrParser]
POS_TAGGER: CrfPosTagger
NER_TAGGER: CiDictCemTagger
LEXICON: Lexicon
SENTENCE_TOKENIZER: SentenceTokenizer
WORD_TOKENIZER: WordTokenizer
Parsers are given under the PARSERS key, where each entry names a CDE element type and lists the parsers to be used for it. For example, this file will use only CompoundParser, ChemicalLabelParser and NmrParser to interpret Paragraph objects. Choosing only those parsers relevant to your work can help speed up the CDE extraction process.
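As a rough sketch of how a trimmed config could be produced programmatically, the snippet below renders a file in the same layout as the example and writes it to a temporary location. The helper `render_config` and the file name `nmr_config.yml` are hypothetical and not part of CDE. Whether Config falls back to defaults for element types left out of PARSERS is not stated here, so check that before relying on it.

```python
import os
import tempfile


def render_config(parsers, settings):
    """Render a CDE-style config as YAML text (hypothetical helper, not part of CDE)."""
    lines = ["PARSERS:"]
    for element_type, parser_names in parsers.items():
        # One indented entry per element type, with the parsers as a flow-style list.
        lines.append("    {}: [{}]".format(element_type, ", ".join(parser_names)))
    for key, value in settings.items():
        # Top-level tagger/tokenizer settings, one per line.
        lines.append("{}: {}".format(key, value))
    return "\n".join(lines) + "\n"


# Assumed use case: only NMR data from paragraphs and captions is of interest.
parsers = {
    "Paragraph": ["CompoundParser", "NmrParser"],
    "Caption": ["NmrParser"],
}
# Remaining settings copied from the example file.
settings = {
    "POS_TAGGER": "CrfPosTagger",
    "NER_TAGGER": "CiDictCemTagger",
    "LEXICON": "Lexicon",
    "SENTENCE_TOKENIZER": "SentenceTokenizer",
    "WORD_TOKENIZER": "WordTokenizer",
}

config_text = render_config(parsers, settings)
config_path = os.path.join(tempfile.mkdtemp(), "nmr_config.yml")
with open(config_path, "w") as f:
    f.write(config_text)
print(config_text)
```

The resulting `config_path` can then be passed to `Config(...)` in the same way as `example_path` in the first cell.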
The file also lets you specify the tokenizer and tagger objects used in CDE. For example, WORD_TOKENIZER is set here to the standard WordTokenizer object, which does not support tokenization of chemistry-specific text.
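To illustrate why chemistry-specific text can need special handling, the sketch below uses a deliberately naive regex tokenizer. It is an assumption made for illustration only: `naive_tokenize` is a hypothetical function and does not reproduce the behaviour of CDE's WordTokenizer.

```python
import re


def naive_tokenize(text):
    """Split text into runs of word characters and single punctuation marks.

    Illustration only: this is NOT the algorithm used by CDE's WordTokenizer.
    """
    return re.findall(r"\w+|[^\w\s]", text)


tokens = naive_tokenize("2,4-dinitrophenol was added")
print(tokens)
# The chemical name is broken into five pieces: '2', ',', '4', '-', 'dinitrophenol'
```

A tokenizer that treats commas and hyphens as separators breaks a systematic name apart, which makes it harder for later stages to recognise the name as a single chemical entity.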
You can apply these settings when you create a document:
In [2]:
from chemdataextractor import Document
from chemdataextractor.doc.text import Paragraph
doc = Document(Paragraph('Testing'), config=c)
You can check that your parsers and tokenizer have been loaded correctly by inspecting the corresponding attributes of a document element. For example:
In [3]:
print(doc.paragraphs[0].parsers)
print(doc.paragraphs[0].word_tokenizer)
[<chemdataextractor.parse.cem.CompoundParser object at 0x7fb9773f70f0>, <chemdataextractor.parse.cem.ChemicalLabelParser object at 0x7fb9773f7128>, <chemdataextractor.parse.nmr.NmrParser object at 0x7fb9773f71d0>]
<chemdataextractor.nlp.tokenize.WordTokenizer object at 0x7fb9773ebac8>
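The printed object representations include module paths and memory addresses, which makes them awkward to compare by eye. One possible approach, sketched below, is to compare class names against the names listed in the config file. The helpers `class_names` and `check_loaded` are hypothetical, and the stand-in classes exist only so that the sketch runs without CDE installed.

```python
def class_names(objects):
    """Return the class name of each object, dropping module paths and addresses."""
    return [type(obj).__name__ for obj in objects]


def check_loaded(objects, expected_names):
    """Return (missing, unexpected) class names, compared as sets."""
    loaded = set(class_names(objects))
    expected = set(expected_names)
    return sorted(expected - loaded), sorted(loaded - expected)


# Stand-in classes (assumption: they mimic only the names of the real parsers).
class CompoundParser:
    pass


class ChemicalLabelParser:
    pass


class NmrParser:
    pass


loaded = [CompoundParser(), ChemicalLabelParser(), NmrParser()]
expected = ["CompoundParser", "ChemicalLabelParser", "NmrParser"]
missing, unexpected = check_loaded(loaded, expected)
print(missing, unexpected)
```

In the notebook, `doc.paragraphs[0].parsers` would take the place of `loaded`, and two empty lists would indicate that the Paragraph parsers match the config.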