.nlp

Tools for performing the NLP stages, such as POS tagging, word clustering, chemical named entity recognition (CNER), and abbreviation detection.

Chemistry-aware natural language processing framework.

.nlp.abbrev

Abbreviation detection.

class chemdataextractor.nlp.abbrev.AbbreviationDetector(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]

Bases: object

Detect abbreviation definitions in a list of tokens.

Similar to the algorithm in Schwartz & Hearst 2003.

__init__(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]

Initialize self. See help(type(self)) for accurate signature.

abbr_min = 3

Minimum abbreviation length

abbr_max = 10

Maximum abbreviation length

abbr_equivs = []

String equivalents to use when detecting abbreviations.

detect(tokens)[source]

Return a (abbr, long) pair for each abbreviation definition.

detect_spans(tokens)[source]

Return an (abbr_span, long_span) pair for each abbreviation definition.

abbr_span and long_span are (int, int) spans defining token ranges.
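The core matching step of the Schwartz & Hearst approach can be sketched as follows. This is an illustrative re-implementation, not the library's exact code: given a candidate abbreviation and the words preceding its parenthesis, scan right to left for the shortest long form whose characters cover the abbreviation.

```python
def find_long_form(abbr, preceding_words):
    """Sketch of Schwartz & Hearst matching: return the long-form tokens
    covering ``abbr``, or None if no match is found."""
    text = ' '.join(preceding_words)
    i, j = len(text) - 1, len(abbr) - 1
    while j >= 0:
        c = abbr[j].lower()
        # the abbreviation's first character must start a word in the long form
        while i >= 0 and (text[i].lower() != c
                          or (j == 0 and i > 0 and text[i - 1].isalnum())):
            i -= 1
        if i < 0:
            return None  # no long form found
        i -= 1
        j -= 1
    return text[i + 1:].split()
```

For example, `find_long_form('HDAC', ['histone', 'deacetylase'])` recovers the full long form, while an abbreviation with no matching characters returns None.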

class chemdataextractor.nlp.abbrev.ChemAbbreviationDetector(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]

Bases: chemdataextractor.nlp.abbrev.AbbreviationDetector

Chemistry-aware abbreviation detector.

This abbreviation detector has an additional list of string equivalents (e.g. Silver = Ag) that improve abbreviation detection on chemistry texts.

abbr_min = 3

Minimum abbreviation length

abbr_max = 14
abbr_equivs = [('silver', 'Ag'), ('gold', 'Au'), ('mercury', 'Hg'), ('lead', 'Pb'), ('tin', 'Sn'), ('tungsten', 'W'), ('iron', 'Fe'), ('sodium', 'Na'), ('potassium', 'K'), ('copper', 'Cu'), ('sulfate', 'SO4'), ('methanol', 'MeOH'), ('ethanol', 'EtOH'), ('hydroxy', 'OH'), ('hexadecyltrimethylammonium bromide', 'CTAB'), ('cytarabine', 'Ara-C'), ('hydroxylated', 'OH'), ('hydrogen peroxide', 'H2O2'), ('quartz', 'SiO2'), ('amino', 'NH2'), ('amino', 'NH2'), ('ammonia', 'NH3'), ('ammonium', 'NH4'), ('methyl', 'CH3'), ('nitro', 'NO2'), ('potassium carbonate', 'K2CO3'), ('carbonate', 'CO3'), ('borohydride', 'BH4'), ('triethylamine', 'NEt3'), ('triethylamine', 'Et3N')]

String equivalents to use when detecting abbreviations.
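One way such equivalents can be applied, sketched here illustratively rather than as the library's exact mechanism, is to rewrite the long-form text with its chemical shorthand before character matching, so that e.g. "silver nanoparticles (AgNPs)" can be detected.

```python
# A small assumed subset of the equivalents list above.
ABBR_EQUIVS = [('silver', 'Ag'), ('methanol', 'MeOH'), ('ethanol', 'EtOH')]

def apply_equivs(text):
    """Substitute each long-form string with its chemical shorthand."""
    for long_form, short_form in ABBR_EQUIVS:
        text = text.replace(long_form, short_form)
    return text
```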

.nlp.cem

Named entity recognition (NER) for Chemical entity mentions (CEM).

This was the default NER system up to version 2.0; the new NER system is included in new_cem.

chemdataextractor.nlp.cem.IGNORE_SUFFIX = ['-', "'s", '-activated', '-adequate', '-affected'...

Token endings to ignore when considering stopwords and deriving spans

chemdataextractor.nlp.cem.IGNORE_PREFIX = ['fluorophore-', 'low-', 'high-', 'single-', 'odd-...

Token beginnings to ignore when considering stopwords and deriving spans

chemdataextractor.nlp.cem.STRIP_END = ['groups', 'group', 'colloidal', 'dyes', 'dye', 'p...

Final tokens to remove from entity matches

chemdataextractor.nlp.cem.STRIP_START = ['anhydrous', 'elemental', 'amorphous', 'conjugate...

First tokens to remove from entity matches

chemdataextractor.nlp.cem.STOP_TOKENS = {'.cdx', '.sk2', '10.1021', '10.1039', '10.1186', ...

Disallowed tokens in chemical entity mentions (discard if any single token has exact case-insensitive match)

chemdataextractor.nlp.cem.STOP_SUB = {' brand of ', ' oil', ' with ', '!', '%', ', ', '...

Disallowed substrings in chemical entity mentions (only used when filtering to construct the dictionary?)

chemdataextractor.nlp.cem.STOPLIST = {'(gaba)ergic', '1,3-dpma', '1,5-dpma', '12mg', '3...

Disallowed chemical entity mentions (discard if exact case-insensitive match)

chemdataextractor.nlp.cem.STOP_RES = ['^(http|ftp)://', '\\.(com|uk|eu|org|net)$', '^\\...

Regular expressions that define disallowed chemical entity mentions. Note: the entity text is passed as lowercase.
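The filtering behaviour can be sketched as follows (patterns abbreviated from the list above): a candidate mention is discarded if any pattern matches its lowercased text.

```python
import re

# Two patterns taken from the STOP_RES list above.
STOP_RES = ['^(http|ftp)://', r'\.(com|uk|eu|org|net)$']

def is_stopped(mention):
    """Return True if the mention matches any disallowed pattern."""
    text = mention.lower()
    return any(re.search(pattern, text) for pattern in STOP_RES)
```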

chemdataextractor.nlp.cem.SPLITS = ['^(actinium|aluminium|aluminum|americium|antimony...

Regular expressions defining collections of words that should be split if joined by hyphens or -to-
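An illustrative sketch of the splitting rule: break a token on hyphens when every part matches a SPLITS-style pattern. The element names below are a small assumed subset of the real list.

```python
import re

SPLITS = [re.compile('^(silver|gold|iron|copper|aluminium)$', re.I)]

def split_token(token):
    """Split a hyphen-joined token if all parts match a SPLITS pattern."""
    parts = token.split('-')
    if len(parts) > 1 and all(
            any(p.match(part) for p in SPLITS) for part in parts):
        return parts
    return [token]
```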

class chemdataextractor.nlp.cem.CiDictCemTagger(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]

Bases: chemdataextractor.nlp.tag.DictionaryTagger

Case-insensitive CEM dictionary tagger.

tag_type = 'ner_tag'
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
model = 'models/cem_dict-1.0.pickle'
class chemdataextractor.nlp.cem.CsDictCemTagger(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]

Bases: chemdataextractor.nlp.tag.DictionaryTagger

Case-sensitive CEM dictionary tagger.

tag_type = 'ner_tag'
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
model = 'models/cem_dict_cs-1.0.pickle'
case_sensitive = True
class chemdataextractor.nlp.cem.CrfCemTagger(model=None, lexicon=None, clusters=None, params=None)[source]

Bases: chemdataextractor.nlp.tag.CrfTagger

tag_type = 'ner_tag'
model = 'models/cem_crf_chemdner_cemp-1.0.pickle'
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
clusters = True
params = {'c1': 1.0, 'c2': 0.001, 'feature.possible_states': False, 'feature.possible_transitions': False, 'max_iterations': 200}
legacy_tag(tokens)[source]
Parameters:

tokens (list(obj)) – Tokens to tag

Returns (list(obj), obj):

legacy_tag corresponds to the tag method in ChemDataExtractor 2.0 and earlier. It has been renamed legacy_tag because it could be called with either a list of strings or a list of (token, PoS tag) pairs, which made it incompatible with the new taggers in their current form. ChemDataExtractor 2.1 will call this method with a list of strings instead of a list of (token, PoS tag) pairs. This method should only be used for converting previously written taggers with as few code changes as possible, as shown in the migration guide.

tag(tokens)[source]
class chemdataextractor.nlp.cem.LegacyCemTagger(*args, **kwargs)[source]

Bases: chemdataextractor.nlp.tag.EnsembleTagger

Return the combined output of a number of chemical entity taggers.

label_type = 'ner_tag'

The individual chemical entity taggers to use.

taggers = [<chemdataextractor.nlp.cem.CrfCemTagger object>, <chemdataextractor.nlp.cem.CiDictCemTagger object>, <chemdataextractor.nlp.cem.CsDictCemTagger object>]
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
legacy_tag(tokens)[source]
Parameters:

tokens (list(obj)) – Tokens to tag

Returns (list(obj), obj):

legacy_tag corresponds to the tag method in ChemDataExtractor 2.0 and earlier. It has been renamed legacy_tag because it could be called with either a list of strings or a list of (token, PoS tag) pairs, which made it incompatible with the new taggers in their current form. ChemDataExtractor 2.1 will call this method with a list of strings instead of a list of (token, PoS tag) pairs. This method should only be used for converting previously written taggers with as few code changes as possible, as shown in the migration guide.

tag(tokens)[source]

Run individual chemical entity mention taggers and return union of matches, with some postprocessing.
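A sketch of how an ensemble can combine per-token NER output: for each token, take the first non-None tag in tagger priority order. The library's actual postprocessing is more involved; this only illustrates the union idea.

```python
def combine_tags(tag_sequences):
    """Combine parallel tag sequences, preferring earlier taggers."""
    combined = []
    for tags in zip(*tag_sequences):
        combined.append(next((t for t in tags if t is not None), None))
    return combined
```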

.nlp.new_cem

New and improved named entity recognition (NER) for Chemical entity mentions (CEM).

class chemdataextractor.nlp.new_cem.BertFinetunedCRFCemTagger(indexers=None, weights_location=None, gpu_id=None, archive_location=None, tag_type=None, min_batch_size=None, max_batch_size=None, max_allowed_length=None)[source]

Bases: chemdataextractor.nlp.allennlpwrapper.AllenNlpWrapperTagger

A Chemical Entity Mention tagger using a finetuned BERT model with a CRF to constrain the outputs.

tag_type = 'ner_tag'
indexers = {'bert': <allennlp.data.token_indexers.wordpiece_indexer.PretrainedBertIndexer object>}
model = 'models/bert_finetuned_crf_model-1.0a'
overrides = {'model.text_field_embedder.token_embedders.bert.pretrained_model': '/home/docs/.local/share/ChemDataExtractor/models/scibert_cased_weights-1.0.tar.gz'}
process(tag)[source]

Process the given tag. This can be used for example if the names of tags in training are different from what ChemDataExtractor expects.

Parameters:

tag (str) – The raw string output from the predictor.

Returns:

A processed version of the tag

Return type:

str

class chemdataextractor.nlp.new_cem.CemTagger(*args, **kwargs)[source]

Bases: chemdataextractor.nlp.tag.EnsembleTagger

A state-of-the-art Named Entity Recognition tagger for both organic and inorganic materials that uses a tagger based on BERT with a Conditional Random Field to constrain the outputs. More details in the paper (https://pubs.acs.org/doi/full/10.1021/acs.jcim.1c01199).

taggers = [<chemdataextractor.nlp.allennlpwrapper._AllenNlpTokenTagger object>, <chemdataextractor.nlp.allennlpwrapper.ProcessedTextTagger object>, <chemdataextractor.nlp.new_cem.BertFinetunedCRFCemTagger object>]

.nlp.allennlpwrapper

Tagger wrappers that wrap AllenNLP functionality. Used for named entity recognition.

class chemdataextractor.nlp.allennlpwrapper.ProcessedTextTagger[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Class to process text before the text is fed into any other taggers. This class is designed to be used with AllenNlpWrapperTagger and replaces any single-number tokens with <nUm> in accordance with the training data.

tag_type = 'processed_text'
number_pattern = re.compile('([\\+\\-–−]?\\d+(([\\.・,\\d])+)?)')
number_string = '<nUm>'
tag(tokens)[source]
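The &lt;nUm&gt; substitution described above can be sketched with the same pattern and placeholder; note the real tagger operates on RichToken objects rather than plain strings.

```python
import re

# The pattern and placeholder from the class attributes above.
number_pattern = re.compile('([\\+\\-–−]?\\d+(([\\.・,\\d])+)?)')
number_string = '<nUm>'

def process_token(token):
    """Replace a token that is entirely a number with the placeholder."""
    if number_pattern.fullmatch(token):
        return number_string
    return token
```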
class chemdataextractor.nlp.allennlpwrapper.AllenNlpWrapperTagger(indexers=None, weights_location=None, gpu_id=None, archive_location=None, tag_type=None, min_batch_size=None, max_batch_size=None, max_allowed_length=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

A wrapper for an AllenNLP model. Tested with a CRF Tagger but should work with any sequence labeller trained in allennlp.

model = None
__init__(indexers=None, weights_location=None, gpu_id=None, archive_location=None, tag_type=None, min_batch_size=None, max_batch_size=None, max_allowed_length=None)[source]
Parameters:
  • indexers (dict(str, TokenIndexer), optional) – A dictionary of all the AllenNLP indexers to be used with the taggers. Please refer to their documentation for more detail.

  • weights_location (str, optional) – Location for weights. Corresponds to the weights_file parameter for the load_archive function from AllenNLP.

  • gpu_id (int, optional) – The ID for the GPU to be used. If None is passed in, ChemDataExtractor will automatically detect if a GPU is available and use that. To explicitly use the CPU, pass in a value of -1.

  • archive_location (str, optional) – The location where the model is archived. Corresponds to the archive_file parameter in the load_archive function from AllenNLP. Alternatively, you can set this parameter to None and set the class property model, which will then search for the model inside of ChemDataExtractor’s default model directory.

  • tag_type (obj, optional) – Override the class’s tag type. Refer to the documentation for BaseTagger for more information on how to use tag types.

  • min_batch_size (int, optional) – The minimum batch size to use when predicting. Default 100.

  • max_batch_size (int, optional) – The maximum batch size to use when predicting. Default 200.

  • max_allowed_length (int, optional) – The maximum allowed length of a sentence when predicting. Default 220. Any sentences longer than this will be split into multiple smaller sentences via a sliding window approach and the results will be collected. Needs to be a multiple of 4 for correct predictions.
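The sliding-window split for over-long sentences can be sketched as below. The exact stride the library uses is not documented here, so a half-window overlap is assumed for illustration.

```python
def sliding_window(tokens, max_len=220):
    """Split a token list into overlapping chunks of at most max_len."""
    if len(tokens) <= max_len:
        return [tokens]
    stride = max_len // 2
    chunks, start = [], 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            return chunks
        start += stride
```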

tag_type = None
indexers = None
overrides = None
process(tag)[source]

Process the given tag. This can be used for example if the names of tags in training are different from what ChemDataExtractor expects.

Parameters:

tag (str) – The raw string output from the predictor.

Returns:

A processed version of the tag

Return type:

str

predictor

The AllenNLP predictor for this tagger.

tag(tokens)[source]
batch_tag(sents)[source]
Parameters:

sents (list(list(chemdataextractor.doc.text.RichToken))) – A list of lists of all the tokens from all the elements in a document.

Returns:

list(list(~chemdataextractor.doc.text.RichToken, obj))

Take a list of lists of all the tokens from all the elements in a document, and return a list of lists of (token, tag) pairs. One thing to note is that the resulting list of lists of (token, tag) pairs need not be in the same order as the incoming list of lists of tokens, as sorting is done so that we can bucket sentences by their lengths. More information can be found in the BaseTagger documentation, and in this guide.
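The length-based bucketing described above can be sketched as follows: sort sentences by length so each batch contains similarly sized inputs, then restore the original order before returning. This is an illustration of the idea, not the library's implementation.

```python
def bucket_and_tag(sents, tag_fn, batch_size=2):
    """Tag sentences in length-sorted batches, restoring input order."""
    order = sorted(range(len(sents)), key=lambda i: len(sents[i]))
    results = [None] * len(sents)
    for start in range(0, len(order), batch_size):
        batch = order[start:start + batch_size]
        for i, tagged in zip(batch, tag_fn([sents[i] for i in batch])):
            results[i] = tagged
    return results
```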

.nlp.corpus

Tools for reading and writing text corpora.

class chemdataextractor.nlp.corpus.LazyCorpusLoader(name, reader_cls, *args, **kwargs)[source]

Bases: object

Derived from NLTK LazyCorpusLoader.

__init__(name, reader_cls, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

chemdataextractor.nlp.corpus.wsj = <BracketParseCorpusReader in '.../corpora/wsj_trai...

Entire WSJ corpus (English News Text Treebank: Penn Treebank Revised, LDC2015T13)

chemdataextractor.nlp.corpus.wsj_training = <BracketParseCorpusReader in '.../corpora/wsj_trai...

WSJ corpus sections 0-18 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)

chemdataextractor.nlp.corpus.wsj_development = <BracketParseCorpusReader in '.../corpora/wsj_deve...

WSJ corpus sections 19-21 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)

chemdataextractor.nlp.corpus.wsj_evaluation = <BracketParseCorpusReader in '.../corpora/wsj_eval...

WSJ corpus sections 22-24 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)

chemdataextractor.nlp.corpus.treebank2_training = <ChunkedCorpusReader in '.../corpora/treebank2_tra...

WSJ corpus sections 0-18 (treebank2)

chemdataextractor.nlp.corpus.treebank2_development = <ChunkedCorpusReader in '.../corpora/treebank2_dev...

WSJ corpus sections 19-21 (treebank2)

chemdataextractor.nlp.corpus.treebank2_evaluation = <ChunkedCorpusReader in '.../corpora/treebank2_eva...

WSJ corpus sections 22-24 (treebank2)

chemdataextractor.nlp.corpus.genia_training = <TaggedCorpusReader in '.../corpora/genia_training...

First 80% of GENIA POS-tagged corpus

chemdataextractor.nlp.corpus.genia_evaluation = <TaggedCorpusReader in '.../corpora/genia_evaluati...

Last 20% of GENIA POS-tagged corpus

chemdataextractor.nlp.corpus.medpost = <TaggedCorpusReader in '.../corpora/medpost' (not ...
chemdataextractor.nlp.corpus.medpost_training = <TaggedCorpusReader in '.../corpora/medpost_traini...
chemdataextractor.nlp.corpus.medpost_evaluation = <TaggedCorpusReader in '.../corpora/medpost_evalua...
chemdataextractor.nlp.corpus.cde_tokensc = <PlaintextCorpusReader in '.../corpora/cde_tokensc...
chemdataextractor.nlp.corpus.chemdner_training = <PlaintextCorpusReader in '.../corpora/chemdner_tr...

.nlp.lexicon

Cache features of previously seen words.

class chemdataextractor.nlp.lexicon.Lexeme(text, normalized, lower, first, suffix, shape, length, upper_count, lower_count, digit_count, is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title, is_punct, is_hyphenated, like_url, like_number, cluster)[source]

Bases: object

__init__(text, normalized, lower, first, suffix, shape, length, upper_count, lower_count, digit_count, is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title, is_punct, is_hyphenated, like_url, like_number, cluster)[source]

Initialize self. See help(type(self)) for accurate signature.

text

Original Lexeme text.

cluster

The Brown Word Cluster for this Lexeme.

normalized

Normalized text, using the Lexicon Normalizer.

lower

Lowercase text.

first

First character.

suffix

Three-character suffix

shape

Word shape. Derived by replacing every number with ‘d’, every greek letter with ‘g’, and every latin letter with ‘X’ or ‘x’ for uppercase and lowercase respectively.
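The shape derivation described above, sketched in plain Python. The library's handling of characters outside these classes may differ; non-letter, non-digit characters are passed through unchanged here.

```python
def word_shape(text):
    """Map digits to 'd', Greek letters to 'g', Latin letters to 'X'/'x'."""
    shape = []
    for char in text:
        if char.isdigit():
            shape.append('d')
        elif '\u0370' <= char <= '\u03ff':  # Greek and Coptic block
            shape.append('g')
        elif char.isalpha():
            shape.append('X' if char.isupper() else 'x')
        else:
            shape.append(char)
    return ''.join(shape)
```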

length

Lexeme length.

upper_count

Count of uppercase characters.

lower_count

Count of lowercase characters.

digit_count

Count of digits.

is_alpha

Whether the text is entirely alphabetical characters.

is_ascii

Whether the text is entirely ASCII characters.

is_digit

Whether the text is entirely digits.

is_lower

Whether the text is entirely lowercase.

is_upper

Whether the text is entirely uppercase.

is_title

Whether the text is title cased.

is_punct

Whether the text is entirely punctuation characters.

is_hyphenated

Whether the text is hyphenated.

like_url

Whether the text looks like a URL.

like_number

Whether the text looks like a number.

class chemdataextractor.nlp.lexicon.Lexicon[source]

Bases: object

normalizer = <chemdataextractor.text.normalize.Normalizer object>

The Normalizer for this Lexicon.

clusters_path = None

Path to the Brown clusters model file for this Lexicon.

__init__()[source]
add(text)[source]

Add text to the lexicon.

Parameters:

text (string) – The text to add.
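The caching idea behind the lexicon can be sketched minimally: compute features for each distinct token once and reuse the stored record on later lookups. The real Lexeme carries many more features than this illustration.

```python
class TinyLexicon:
    """Minimal sketch of a feature-caching lexicon."""

    def __init__(self):
        self._cache = {}

    def add(self, text):
        # compute features only on first sight of this token
        if text not in self._cache:
            self._cache[text] = {
                'lower': text.lower(),
                'length': len(text),
                'suffix': text[-3:],
                'is_digit': text.isdigit(),
            }
        return self._cache[text]
```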

cluster(text)[source]
normalized(text)[source]
lower(text)[source]
first(text)[source]
suffix(text)[source]
shape(text)[source]
length(text)[source]
digit_count(text)[source]
upper_count(text)[source]
lower_count(text)[source]
is_alpha(text)[source]
is_ascii(text)[source]
is_digit(text)[source]
is_lower(text)[source]
is_upper(text)[source]
is_title(text)[source]
is_punct(text)[source]
is_hyphenated(text)[source]
like_url(text)[source]
like_number(text)[source]
class chemdataextractor.nlp.lexicon.ChemLexicon[source]

Bases: chemdataextractor.nlp.lexicon.Lexicon

A Lexicon that is pre-configured with a Chemistry-aware Normalizer and Brown word clusters derived from a chemistry corpus.

normalizer = <chemdataextractor.text.normalize.ChemNormalizer object>
clusters_path = 'models/clusters_chem1500-1.0.pickle'

.nlp.pos

Part-of-speech tagging.

chemdataextractor.nlp.pos.TAGS = ['NN', 'IN', 'NNP', 'DT', 'NNS', 'JJ', ',', '.', '...

Complete set of POS tags. Ordered by decreasing frequency in WSJ corpus.

class chemdataextractor.nlp.pos.ApPosTagger(model=None, lexicon=None, clusters=None)[source]

Bases: chemdataextractor.nlp.tag.ApTagger

Greedy Averaged Perceptron POS tagger trained on WSJ corpus.

model = 'models/pos_ap_wsj_nocluster-1.0.pickle'
tag_type = 'pos_tag'
clusters = False
class chemdataextractor.nlp.pos.ChemApPosTagger(model=None, lexicon=None, clusters=None)[source]

Bases: chemdataextractor.nlp.pos.ApPosTagger

Greedy Averaged Perceptron POS tagger trained on both WSJ and GENIA corpora.

Uses features based on word clusters from chemistry text.

model = 'models/pos_ap_wsj_genia-1.0.pickle'
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
tag_type = 'pos_tag'
clusters = True
class chemdataextractor.nlp.pos.CrfPosTagger(model=None, lexicon=None, clusters=None, params=None)[source]

Bases: chemdataextractor.nlp.tag.CrfTagger

model = 'models/pos_crf_wsj_nocluster-1.0.pickle'
tag_type = 'pos_tag'
clusters = False
class chemdataextractor.nlp.pos.ChemCrfPosTagger(model=None, lexicon=None, clusters=None, params=None)[source]

Bases: chemdataextractor.nlp.pos.CrfPosTagger

model = 'models/pos_crf_wsj_genia-1.0.pickle'
tag_type = 'pos_tag'
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
clusters = True

.nlp.tag

Tagger implementations. Used for part-of-speech tagging and named entity recognition.

class chemdataextractor.nlp.tag.BaseTagger[source]

Bases: object

Abstract tagger class from which all taggers inherit.

Subclasses must implement at least one of the following sets of methods for tagging:

  • legacy_tag()

  • tag()

  • batch_tag()

  • can_tag() and tag_for_type()

  • can_tag() and can_batch_tag() and batch_tag_for_type()

The above interface is called as required by classes such as Sentence or Document, depending on whether tags are required for a single sentence or for the whole document.

If the user has implemented more than one of the combinations above, the order of precedence for the tagging methods is as follows:

  • batch_tag_for_type()

  • tag_for_type()

  • batch_tag()

  • tag()

  • legacy_tag()

Most users should not have to implement the top two options, and the default implementations are discussed in the documentation for EnsembleTagger instead of here.

An implementation of the other tagging methods should have the following signatures and should be implemented in the following cases:

  • tag(self, list( RichToken ) tokens) -> list( RichToken , obj) Take a list of all the tokens from an element, and return a list of (token, tag) pairs. This should be the default implementation for any new tagger. More information on how to create a new tagger can be found in this guide.

  • batch_tag(self, list(list( RichToken )) sents) -> list(list( RichToken , obj)) Take a list of lists of all the tokens from all the elements in a document, and return a list of lists of (token, tag) pairs. One thing to note is that the resulting list of lists of (token, tag) pairs need not be in the same order as the incoming list of lists of tokens, so some sorting can be done if, for example, bucketing of sentences by their lengths is desired. The batch_tag method should be implemented instead of the tag method in cases where the tagger relies on a backend that is more performant when tagging multiple sentences, and the tagger will be called for every element. More information can be found in this guide.

    Note

    If a tagger only has batch_tag implemented, the tagger will fail when applied to an element that does not belong to a document.

  • legacy_tag(self, list(obj) tokens) -> (list(obj), obj) legacy_tag corresponds to the tag method in ChemDataExtractor 2.0 and earlier. It has been renamed legacy_tag because it could be called with either a list of strings or a list of (token, PoS tag) pairs, which made it incompatible with the new taggers in their current form. ChemDataExtractor 2.1 will call this method with a list of strings instead of a list of (token, PoS tag) pairs. This method should only be used for converting previously written taggers with as few code changes as possible, as shown in the migration guide.

To express intent to the ChemDataExtractor framework that the tagger can tag for a certain tag type, you should implement the can_tag method, which takes a tag type and returns a boolean. The default implementation, provided by this class, looks at the tag_type attribute of the tagger and returns True if it matches the tag type provided.
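The default can_tag behaviour described above reduces to a simple comparison, sketched here:

```python
class SketchTagger:
    """Minimal tagger illustrating the default can_tag implementation."""
    tag_type = 'pos_tag'

    def can_tag(self, tag_type):
        # True if the requested tag type matches this tagger's tag_type
        return tag_type == self.tag_type
```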

Warning

While the RichToken class maintains backwards compatibility in most cases (e.g. for parsers, by making the 1 key in dictionary-style lookup return the combined PoS and NER tag), calling this key within an NER or PoS tagger will cause your script to crash. To avoid this, please change any previous bits of code such as token[1] to token["ner_tag"] or token.ner_tag.

tag_type = ''

The tag type for this tagger. When this tag type is asked for from the token, as described in RichToken, this tagger will be called.

tag_sents(sentences)[source]

Apply the tag method to each sentence in sentences.

Deprecated since version 2.1: Deprecated in conjunction with the deprecation of the legacy_tag function. Please write equivalent functionality to use RichTokens.

evaluate(gold)[source]

Evaluate the accuracy of this tagger using a gold standard corpus.

Parameters:

gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.

Returns:

Tagger accuracy value.

Return type:

float

can_tag(tag_type)[source]

Whether this tagger can tag the given tag type.

Parameters:

tag_type (obj) – The tag type which the system wants to tag. Usually a string.

Returns:

True if this parser can tag the given tag type

Return type:

bool

can_batch_tag(tag_type)[source]

Whether this tagger can batch tag the given tag type.

Parameters:

tag_type (obj) – The tag type which the system wants to batch tag. Usually a string.

Returns:

True if this parser can tag the given tag type

Return type:

bool

class chemdataextractor.nlp.tag.EnsembleTagger(*args, **kwargs)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

A class for taggers which act on the results of multiple other taggers. This could also be done by simply adding each tagger to the sentence and having the taggers each act on the results from the other taggers by accessing RichToken attributes, but an EnsembleTagger allows for the user to add one tagger instead, cleaning up the interface.

The EnsembleTagger is also useful in collating the results from multiple taggers of the same type, as can be seen in the case of CemTagger which collects multiple types of NER labellers (a CRF and multiple dictionary taggers), to create a single coherent NER label.

tag_type = ''
taggers = []
__init__(*args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

tag_for_type(tokens, tag_type)[source]

This method will be called if the EnsembleTagger has previously claimed that it can tag the given tag type via the can_tag() method. The appropriate tagger within EnsembleTagger is called and the results returned.

Note

This method can handle having legacy taggers mixed in with newer taggers.

Parameters:
  • tokens (list(RichToken)) – The tokens which should be tagged

  • tag_type (obj) – The tag type for which EnsembleTagger should tag the tokens.

Returns:

A list of tuples of the given tokens and the corresponding tags.

Return type:

list(tuple(RichToken, obj))

batch_tag_for_type(sents, tag_type)[source]

This method will be called if the EnsembleTagger has previously claimed that it can batch tag the given tag type via the can_batch_tag() method. The appropriate tagger within EnsembleTagger is called and the results returned.

Parameters:
  • sents (list(list(RichToken))) – The lists of tokens which should be tagged

  • tag_type (obj) – The tag type for which EnsembleTagger should tag the tokens.

Returns:

A list of tuples of the given tokens and the corresponding tags.

Return type:

list(tuple(RichToken, obj))

can_batch_tag(tag_type)[source]

Whether this tagger can batch tag the given tag type.

Parameters:

tag_type (obj) – The tag type which the system wants to batch tag. Usually a string.

Returns:

True if this parser can tag the given tag type

Return type:

bool

can_tag(tag_type)[source]

Whether this tagger can tag the given tag type.

Parameters:

tag_type (obj) – The tag type which the system wants to tag. Usually a string.

Returns:

True if this parser can tag the given tag type

Return type:

bool

class chemdataextractor.nlp.tag.NoneTagger(tag_type=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Tag every token with None.

__init__(tag_type=None)[source]

Initialize self. See help(type(self)) for accurate signature.

tag(tokens)[source]
class chemdataextractor.nlp.tag.RegexTagger(patterns=None, lexicon=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Regular Expression Tagger.

__init__(patterns=None, lexicon=None)[source]
Parameters:

patterns (list(tuple(string, string))) – List of (regex, tag) pairs.

patterns = [('^-?[0-9]+(.[0-9]+)?$', 'CD'), ('(The|the|A|a|An|an)$', 'AT'), ('.*able$', 'JJ'), ('.*ness$', 'NN'), ('.*ly$', 'RB'), ('.*s$', 'NNS'), ('.*ing$', 'VBG'), ('.*ed$', 'VBD'), ('.*', 'NN')]

Regular expression patterns in (regex, tag) tuples.

lexicon = <chemdataextractor.nlp.lexicon.Lexicon object>

The lexicon to use

tag(tokens)[source]

Return a list of (token, tag) tuples for a given list of tokens.

class chemdataextractor.nlp.tag.AveragedPerceptron[source]

Bases: object

Averaged Perceptron implementation.

Based on implementation by Matthew Honnibal, released under the MIT license.

See more:

http://spacy.io/blog/part-of-speech-POS-tagger-in-python/ https://github.com/sloria/textblob-aptagger

__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

predict(features)[source]

Dot-product the features and current weights and return the best label.
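The predict step can be sketched as follows: score each label as the dot product of the active features with that label's weights and return the best label. Ties are broken lexicographically here for determinism; the real implementation's tie-breaking may differ.

```python
def predict(features, weights, labels):
    """Score labels by feature-weight dot product; return the best label."""
    scores = dict.fromkeys(labels, 0.0)
    for feat, value in features.items():
        for label, weight in weights.get(feat, {}).items():
            scores[label] += value * weight
    return max(labels, key=lambda label: (scores[label], label))
```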

update(truth, guess, features)[source]

Update the feature weights.

average_weights()[source]

Average weights from all iterations.

save(path)[source]

Save the pickled model weights.

load(path)[source]

Load the pickled model weights.

class chemdataextractor.nlp.tag.ApTagger(model=None, lexicon=None, clusters=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Greedy Averaged Perceptron tagger, based on implementation by Matthew Honnibal, released under the MIT license.

See more:

http://spacy.io/blog/part-of-speech-POS-tagger-in-python/ https://github.com/sloria/textblob-aptagger

START = ['-START-', '-START2-']
__init__(model=None, lexicon=None, clusters=None)[source]
lexicon = <chemdataextractor.nlp.lexicon.Lexicon object>
clusters = False
legacy_tag(tokens)[source]

Return a list of (token, tag) tuples for a given list of tokens.

train(sentences, nr_iter=5)[source]

Train a model from sentences.

Parameters:
  • sentences – A list of sentences, each of which is a list of (token, tag) tuples.

  • nr_iter – Number of training iterations.

save(f)[source]

Save pickled model to file.

load(model)[source]

Load pickled model.

class chemdataextractor.nlp.tag.CrfTagger(model=None, lexicon=None, clusters=None, params=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Tagger that uses Conditional Random Fields (CRF).

__init__(model=None, lexicon=None, clusters=None, params=None)[source]
lexicon = <chemdataextractor.nlp.lexicon.Lexicon object>
clusters = False
params = {'c1': 1.0, 'c2': 0.001, 'feature.possible_states': False, 'feature.possible_transitions': False, 'max_iterations': 50}

Parameters to pass to the training algorithm. See http://www.chokkan.org/software/crfsuite/manual.html

load(model)[source]
legacy_tag(tokens)[source]

Return a list of ((token, tag), label) tuples for a given list of (token, tag) tuples.

train(sentences, model)[source]

Train the CRF tagger using CRFSuite.

Params sentences:

Annotated sentences.

Params model:

Path to save pickled model.

class chemdataextractor.nlp.tag.DictionaryTagger(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Dictionary Tagger. Tag tokens based on inclusion in a DAWG.

delimiters = re.compile('(^.|\\b|\\s|\\W|.$)')

Delimiters that define where matches are allowed to start or end.

__init__(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]
Parameters:

words (list(list(string))) – list of words, each of which is a list of tokens.

model = None

DAWG model file path.

entity = 'CM'

Entity tag. Matches will be tagged like ‘B-CM’ and ‘I-CM’ according to the IOB scheme. TODO: optional no B/I?

case_sensitive = False

Whether dictionary matches are case sensitive.

lexicon = <chemdataextractor.nlp.lexicon.Lexicon object>

The lexicon to use.

load(model)[source]

Load pickled DAWG from disk.

save(path)[source]

Save pickled DAWG to disk.

build(words)[source]

Construct dictionary DAWG from tokenized words.

legacy_tag(tokens)[source]

Return a list of (token, tag) tuples for a given list of tokens.
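The dictionary tagging described above can be illustrated with a greedy longest-match tagger assigning IOB labels for the 'CM' entity tag. This is a sketch, not the DAWG-backed implementation, and it ignores the delimiter rules.

```python
def dict_tag(tokens, dictionary, entity='CM'):
    """Tag the longest dictionary phrase at each position with B-/I- labels."""
    tags = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        # try the longest phrase starting at i that is in the dictionary
        for j in range(len(tokens), i, -1):
            if tuple(tokens[i:j]) in dictionary:
                tags[i] = 'B-' + entity
                for k in range(i + 1, j):
                    tags[k] = 'I-' + entity
                i = j
                break
        else:
            i += 1
    return list(zip(tokens, tags))
```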

.nlp.tokenize

Word and sentence tokenizers.

class chemdataextractor.nlp.tokenize.BaseTokenizer[source]

Bases: object

Abstract base class from which all Tokenizer classes inherit.

Subclasses must implement a span_tokenize(text) method that returns a list of integer offset tuples that identify tokens in the text.

tokenize(s)[source]

Return a list of token strings from the given sentence.

Parameters:

s (string) – The sentence string to tokenize.

Return type:

iter(str)

Deprecated since version 2.0: Deprecated in favour of looking at the tokens from the Sentence object.

span_tokenize(s)[source]

Return a list of integer offsets that identify tokens in the given sentence.

Parameters:

s (string) – The sentence string to tokenize.

Return type:

iter(tuple(int, int))

chemdataextractor.nlp.tokenize.regex_span_tokenize(s, regex)[source]

Return spans that identify tokens in s split using regex.
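A sketch of what such a helper does (an illustrative re-implementation, not necessarily the library's exact code): yield (start, end) offsets for the stretches of text between matches of the splitting regex.

```python
import re

def regex_spans(s, regex):
    """Yield (start, end) spans of text between matches of regex."""
    left = 0
    for m in re.finditer(regex, s):
        if m.start() > left:
            yield (left, m.start())
        left = m.end()
    if left < len(s):
        yield (left, len(s))
```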

class chemdataextractor.nlp.tokenize.SentenceTokenizer(model=None)[source]

Bases: chemdataextractor.nlp.tokenize.BaseTokenizer

Sentence tokenizer that uses the Punkt algorithm by Kiss & Strunk (2006).

__init__(model=None)[source]

Initialize self. See help(type(self)) for accurate signature.

model = 'models/punkt_english.pickle'
get_sentences(text)[source]
span_tokenize(s)[source]

Return a list of integer offsets that identify sentences in the given text.

Parameters:

s (string) – The text to tokenize into sentences.

Return type:

iter(tuple(int, int))

class chemdataextractor.nlp.tokenize.ChemSentenceTokenizer(model=None)[source]

Bases: chemdataextractor.nlp.tokenize.SentenceTokenizer

Sentence tokenizer that uses the Punkt algorithm by Kiss & Strunk (2006), trained on chemistry text.

model = 'models/punkt_chem-1.0.pickle'
class chemdataextractor.nlp.tokenize.WordTokenizer(split_last_stop=True)[source]

Bases: chemdataextractor.nlp.tokenize.BaseTokenizer

Standard word tokenizer for generic English text.

SPLIT = ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '``', "''", '->', '<', '>', '–', '—', '―', '~', '⁓', '∼', '°', ';', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '→', '⇄', '"', '“', '”', '„', '‟', '‘', '‚', '‛', '`', '´', '′', '″', '‴', '‵', '‶', '‷', '⁗', '(', '[', '{', '}', ']', ')', '/', '⁄', '∕', '−', '‒', '+', '±']

Split before and after these sequences, wherever they occur, unless the entire token is one of these sequences

SPLIT_NO_DIGIT = [':', ',']

Split around these sequences unless they are followed by a digit

SPLIT_START_WORD = ["''", '``', "'"]

Split after these sequences if they start a word

SPLIT_END_WORD = ["'s", "'m", "'d", "'ll", "'re", "'ve", "n't", "''", "'", '’s', '’m', '’d', '’ll', '’re', '’ve', 'n’t', '’', '’’']

Split before these sequences if they end a word

NO_SPLIT_STOP = ['...', 'al.', 'Co.', 'Ltd.', 'Pvt.', 'A.D.', 'B.C.', 'B.V.', 'S.D.', 'U.K.', 'U.S.', 'r.t.']

Don’t split full stop off last token if it is one of these sequences

CONTRACTIONS = [('cannot', 3), ("d'ye", 1), ('d’ye', 1), ('gimme', 3), ('gonna', 3), ('gotta', 3), ('lemme', 3), ("mor'n", 3), ('mor’n', 3), ('wanna', 3), ("'tis", 2), ("'twas", 2)]

Split these contractions at the specified index
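The index in each pair marks where the token is cut. A sketch of how such a table could be applied (using a subset of the entries above; the helper name is illustrative):

```python
CONTRACTIONS = [("cannot", 3), ("gonna", 3), ("gotta", 3), ("wanna", 3)]

def split_contraction(token):
    """Split a known contraction at its recorded index, e.g. 'cannot' -> ['can', 'not']."""
    for word, i in CONTRACTIONS:
        if token.lower() == word:
            return [token[:i], token[i:]]
    return [token]
```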

NO_SPLIT = {'mm-hm', 'mm-mm', 'o-kay', 'uh-huh', 'uh-oh', 'wanna-be'}

Don’t split these sequences.

NO_SPLIT_PREFIX = {'a', 'agro', 'ante', 'anti', 'arch', 'be', 'bi', 'bio', 'co', 'counter', 'cross', 'cyber', 'de', 'e', 'eco', 'ex', 'extra', 'inter', 'intra', 'macro', 'mega', 'micro', 'mid', 'mini', 'multi', 'neo', 'non', 'over', 'pan', 'para', 'peri', 'post', 'pre', 'pro', 'pseudo', 'quasi', 're', 'semi', 'sub', 'super', 'tri', 'u', 'ultra', 'un', 'uni', 'vice', 'x'}

Don’t split around hyphens with these prefixes

NO_SPLIT_SUFFIX = {'-o-torium', 'esque', 'ette', 'fest', 'fold', 'gate', 'itis', 'less', 'most', 'rama', 'wise'}

Don’t split around hyphens with these suffixes.

NO_SPLIT_CHARS = '0123456789,\'"“”„‟‘’‚‛`´′″‴‵‶‷⁗'

Don’t split around hyphens if only these characters appear before or after.

__init__(split_last_stop=True)[source]

Initialize self. See help(type(self)) for accurate signature.

split_last_stop = None

Whether to split off the final full stop (unless preceded by NO_SPLIT_STOP). Default True.
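The interaction of split_last_stop with NO_SPLIT_STOP can be sketched as follows (a subset of the abbreviation list is used; the function name is illustrative):

```python
NO_SPLIT_STOP = {"...", "al.", "Co.", "Ltd.", "U.K.", "U.S.", "r.t."}

def split_final_stop(tokens, split_last_stop=True):
    """Detach a trailing full stop from the last token of a sentence,
    unless the whole token is a known abbreviation."""
    if not split_last_stop or not tokens:
        return tokens
    last = tokens[-1]
    if last.endswith(".") and last not in NO_SPLIT_STOP and len(last) > 1:
        return tokens[:-1] + [last[:-1], "."]
    return tokens
```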

get_word_tokens(sentence, additional_regex=None)[source]
get_additional_regex(sentence)[source]

Any additional regex used to further split the tokens. These regular expressions may be supplied contextually by the sentence, on the fly. For example, a sentence may have certain models associated with it, and those models may have dimensions associated with them. These dimensions tell the tokenizer what to do with high confidence: given a string like “12K”, if a temperature is expected, the tokenizer will automatically split the value from the unit.

Parameters:

sentence (chemdataextractor.doc.text.Sentence) – The sentence for which to get additional regex

Returns:

Expression to further split the tokens

Return type:

re.Pattern
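A hypothetical pattern in the same style as QUANTITY_RE below illustrates the idea: a named group marks the numeric value so a fused token like “12K” can be split into value and unit. The group name and helper are assumptions for illustration, not the CDE handler itself.

```python
import re

# Hypothetical temperature pattern: the named group captures the value
# so the remainder of the token (the unit) can be split off.
TEMP_RE = re.compile(r"^(?P<split>\d+(?:\.\d+)?)(K|°C)$")

def split_value_unit(token):
    m = TEMP_RE.match(token)
    if m:
        return [m.group("split"), token[m.end("split"):]]
    return [token]
```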

handle_additional_regex(s, span, nextspan, additional_regex)[source]
span_tokenize(s, additional_regex=None)[source]
class chemdataextractor.nlp.tokenize.ChemWordTokenizer(split_last_stop=True)[source]

Bases: chemdataextractor.nlp.tokenize.WordTokenizer

Word Tokenizer for chemistry text.

SPLIT = ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '<', ').', '.(', '–', '—', '―', '~', '⁓', '∼', '°', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '⇄', '"', '“', '”', '„', '‟', '‘', '‚', '‛', '`', '´']

Split before and after these sequences, wherever they occur, unless the entire token is one of these sequences

SPLIT_END = [':', ',', '(TM)', '(R)', '(®)', '(™)', '(■)', '(◼)', '(●)', '(▲)', '(○)', '(◆)', '(▼)', '(⧫)', '(△)', '(◇)', '(▽)', '(⬚)', '(×)', '(□)', '(•)', '’', '°C']

Split before these sequences if they end a token

SPLIT_END_NO_DIGIT = ['(aq)', '(aq.)', '(s)', '(l)', '(g)']

Split before these sequences if they end a token, unless preceded by a digit

NO_SPLIT_SLASH = ['+', '-', '−']

Don’t split around slash when both preceded and followed by these characters

QUANTITY_RE = re.compile('^((?P<split>\\d\\d\\d)g|(?P<_split1>[-−]?\\d+\\.\\d+|10[-−]\\d+)(g|s|m|N|V)([-−]?[1-4])?|(?P<_split2>\\d*[-−]?\\d+\\.?\\d*)([pnµμm]A|[µμmk]g|[kM]J|m[lL]|[nµμm]?M|[nµμmc]m|kN|[mk]V|[mkMG]?W|[mnpμµ]s|H)

Regular expression that matches a numeric quantity with units

NO_SPLIT_PREFIX_ENDING = re.compile('(^\\(.*\\)|^[\\d,\'"“”„‟‘’‚‛`´′″‴‵‶‷⁗Α-Ωα-ω]+|ano|ato|azo|boc|bromo|cbz|chloro|eno|fluoro|fmoc|ido|ino|io|iodo|mercapto|nitro|ono|oso|oxalo|oxo|oxy|phospho|telluro|tms|yl|ylen|ylene|yliden|ylidene|yl)

Don’t split on hyphen if the prefix matches this regular expression

NO_SPLIT_CHEM = re.compile('([\\-α-ω]|\\d+,\\d+|\\d+[A-Z]|^d\\d\\d?$|acetic|acetyl|acid|acyl|anol|azo|benz|bromo|carb|cbz|chlor|cyclo|ethan|ethyl|fluoro|fmoc|gluc|hydro|idyl|indol|iene|ione|iodo|mercapto|n,n|nitro|noic|o,o|oxal, re.IGNORECASE)

Don’t split on hyphen if prefix or suffix match this regular expression

NO_SPLIT_PREFIX = {'a', 'aci', 'adeno', 'agro', 'aldehydo', 'allo', 'alpha', 'altro', 'ambi', 'ante', 'anti', 'aorto', 'arachno', 'arch', 'as', 'be', 'beta', 'bi', 'bio', 'bis', 'catena', 'centi', 'chi', 'chiro', 'circum', 'cis', 'closo', 'co', 'colo', 'conjuncto', 'conta', 'contra', 'cortico', 'cosa', 'counter', 'cran', 'cross', 'crypto', 'cyber', 'cyclo', 'de', 'deca', 'deci', 'delta', 'demi', 'di', 'dis', 'dl', 'e', 'eco', 'electro', 'endo', 'ennea', 'ent', 'epi', 'epsilon', 'erythro', 'eta', 'ex', 'exo', 'extra', 'ferro', 'galacto', 'gamma', 'gastro', 'giga', 'gluco', 'glycero', 'graft', 'gulo', 'hemi', 'hepta', 'hexa', 'homo', 'hydro', 'hypho', 'hypo', 'ideo', 'idio', 'in', 'infra', 'inter', 'intra', 'iota', 'iso', 'judeo', 'kappa', 'keto', 'kis', 'lambda', 'lyxo', 'macro', 'manno', 'medi', 'mega', 'meso', 'meta', 'micro', 'mid', 'milli', 'mini', 'mono', 'mu', 'muco', 'multi', 'musculo', 'myo', 'nano', 'neo', 'neuro', 'nido', 'nitro', 'non', 'nona', 'nor', 'novem', 'novi', 'nu', 'octa', 'octi', 'octo', 'omega', 'omicron', 'ortho', 'over', 'paleo', 'pan', 'para', 'pelvi', 'penta', 'peri', 'pheno', 'phi', 'pi', 'pica', 'pneumo', 'poly', 'post', 'pre', 'preter', 'pro', 'pseudo', 'psi', 'quadri', 'quasi', 'quater', 'quinque', 're', 'recto', 'rho', 'ribo', 'salpingo', 'scyllo', 'sec', 'semi', 'sept', 'septi', 'sero', 'sesqui', 'sexi', 'sigma', 'sn', 'soci', 'sub', 'super', 'supra', 'sur', 'sym', 'syn', 'talo', 'tau', 'tele', 'ter', 'tera', 'tert', 'tetra', 'theta', 'threo', 'trans', 'tri', 'triangulo', 'tris', 'u', 'uber', 'ultra', 'un', 'uni', 'unsym', 'upsilon', 'veno', 'ventriculo', 'vice', 'x', 'xi', 'xylo', 'zeta'}

Don’t split on hyphen if the prefix is one of these sequences

SPLIT_SUFFIX = {'absorption', 'abstinent', 'abstraction', 'abuse', 'accelerated', 'accepting', 'acclimated', 'acclimation', 'acid', 'activated', 'activation', 'active', 'activity', 'addition', 'adducted', 'adducts', 'adequate', 'adjusted', 'administrated', 'adsorption', 'affected', 'aged', 'alcohol', 'alcoholic', 'algae', 'alginate', 'alkaline', 'alkylated', 'alkylation', 'alkyne', 'analogous', 'anesthetized', 'appended', 'armed', 'aromatic', 'assay', 'assemblages', 'assisted', 'associated', 'atom', 'atoms', 'attenuated', 'attributed', 'backbone', 'base', 'based', 'bearing', 'benzylation', 'binding', 'biomolecule', 'biotic', 'blocking', 'blood', 'bond', 'bonded', 'bonding', 'bonds', 'boosted', 'bottle', 'bottled', 'bound', 'bridge', 'bridged', 'buffer', 'buffered', 'caged', 'cane', 'capped', 'capturing', 'carrier', 'carrying', 'catalysed', 'catalyzed', 'cation', 'caused', 'centered', 'challenged', 'chelating', 'cleaving', 'coated', 'coating', 'coenzyme', 'competing', 'competitive', 'complex', 'complexes', 'compound', 'compounds', 'concentration', 'conditioned', 'conditions', 'conducting', 'configuration', 'confirmed', 'conjugate', 'conjugated', 'conjugates', 'connectivity', 'consuming', 'contained', 'containing', 'contaminated', 'control', 'converting', 'coordinate', 'coordinated', 'copolymer', 'copolymers', 'core', 'cored', 'cotransport', 'coupled', 'covered', 'crosslinked', 'cyclized', 'damaged', 'dealkylation', 'decocted', 'decorated', 'deethylation', 'deficiency', 'deficient', 'defined', 'degrading', 'demethylated', 'demethylation', 'dendrimer', 'density', 'dependant', 'dependence', 'dependent', 'deplete', 'depleted', 'depleting', 'depletion', 'depolarization', 'depolarized', 'deprived', 'derivatised', 'derivative', 'derivatives', 'derivatized', 'derived', 'desorption', 'detected', 'devalued', 'dextran', 'dextrans', 'diabetic', 'dimensional', 'dimer', 'distribution', 'divalent', 'domain', 'dominated', 'donating', 'donor', 'dopant', 'doped', 'doping', 'dosed', 
'dot', 'drinking', 'driven', 'drug', 'drugs', 'dye', 'edge', 'efficiency', 'electrodeposited', 'electrolyte', 'elevating', 'elicited', 'embedded', 'emersion', 'emitting', 'encapsulated', 'encapsulating', 'enclosed', 'enhanced', 'enhancing', 'enriched', 'enrichment', 'enzyme', 'epidermal', 'equivalents', 'etched', 'ethanolamine', 'evoked', 'exchange', 'excimer', 'excluder', 'expanded', 'experimental', 'exposed', 'exposure', 'expressing', 'extract', 'extraction', 'fed', 'finger', 'fixed', 'fixing', 'flanking', 'flavonoid', 'fluorescence', 'formation', 'forming', 'fortified', 'free', 'function', 'functionalised', 'functionalized', 'functionalyzed', 'fused', 'gas', 'gated', 'generating', 'glucuronidating', 'glycoprotein', 'glycosylated', 'glycosylation', 'gradient', 'grafted', 'group', 'groups', 'halogen', 'heterocyclic', 'homologues', 'hydrogel', 'hydrolyzing', 'hydroxylated', 'hydroxylation', 'hydroxysteroid', 'immersion', 'immobilized', 'immunoproteins', 'impregnated', 'imprinted', 'inactivated', 'increased', 'increasing', 'incubated', 'independent', 'induce', 'induced', 'inducible', 'inducing', 'induction', 'influx', 'inhibited', 'inhibitor', 'inhibitory', 'initiated', 'injected', 'insensitive', 'insulin', 'integrated', 'interlinked', 'intermediate', 'intolerant', 'intoxicated', 'ion', 'ions', 'island', 'isomer', 'isomers', 'knot', 'label', 'labeled', 'labeling', 'labelled', 'laden', 'lamp', 'laser', 'layer', 'layers', 'lesioned', 'ligand', 'ligated', 'like', 'limitation', 'limited', 'limiting', 'lined', 'linked', 'linker', 'lipid', 'lipids', 'lipoprotein', 'liposomal', 'liposomes', 'liquid', 'liver', 'loaded', 'loading', 'locked', 'loss', 'lowering', 'lubricants', 'luminance', 'luminescence', 'maintained', 'majority', 'making', 'mannosylated', 'material', 'mediated', 'metabolizing', 'metal', 'metallized', 'methylation', 'migrated', 'mimetic', 'mimicking', 'mixed', 'mixture', 'mode', 'model', 'modified', 'modifying', 'modulated', 'moiety', 'molecule', 
'monoadducts', 'monomer', 'mutated', 'nanogel', 'nanoparticle', 'nanotube', 'need', 'negative', 'nitrosated', 'nitrosation', 'nitrosylation', 'nmr', 'noncompetitive', 'normalized', 'nuclear', 'nucleoside', 'nucleosides', 'nucleotide', 'nucleotides', 'nutrition', 'olefin', 'olefins', 'oligomers', 'omitted', 'only', 'outcome', 'overload', 'oxidation', 'oxidized', 'oxo-mediated', 'oxygenation', 'page', 'paired', 'pathway', 'patterned', 'peptide', 'permeabilized', 'permeable', 'phase', 'phospholipids', 'phosphopeptide', 'phosphorylated', 'pillared', 'placebo', 'planted', 'plasma', 'polymer', 'polymers', 'poor', 'porous', 'position', 'positive', 'postlabeling', 'precipitated', 'preferring', 'pretreated', 'primed', 'produced', 'producing', 'production', 'promoted', 'promoting', 'protected', 'protein', 'proteomic', 'protonated', 'provoked', 'purified', 'radical', 'reacting', 'reaction', 'reactive', 'reagents', 'rearranged', 'receptor', 'receptors', 'recognition', 'redistribution', 'redox', 'reduced', 'reducing', 'reduction', 'refractory', 'refreshed', 'regenerating', 'regulated', 'regulating', 'regulatory', 'related', 'release', 'releasing', 'replete', 'requiring', 'resistance', 'resistant', 'resitant', 'response', 'responsive', 'responsiveness', 'restricted', 'resulted', 'retinal', 'reversible', 'ribosylated', 'ribosylating', 'ribosylation', 'rich', 'right', 'ring', 'saturated', 'scanning', 'scavengers', 'scavenging', 'sealed', 'secreting', 'secretion', 'seeking', 'selective', 'selectivity', 'semiconductor', 'sensing', 'sensitive', 'sensitized', 'soluble', 'solution', 'solvent', 'sparing', 'specific', 'spiked', 'stabilised', 'stabilized', 'stabilizing', 'stable', 'stained', 'steroidal', 'stimulated', 'stimulating', 'storage', 'stressed', 'stripped', 'substituent', 'substituted', 'substitution', 'substrate', 'sufficient', 'sugar', 'sugars', 'supplemented', 'supported', 'suppressed', 'surface', 'susceptible', 'sweetened', 'synthesizing', 'tagged', 'target', 'telopeptide', 
'terminal', 'terminally', 'terminated', 'termini', 'terminus', 'ternary', 'terpolymer', 'tertiary', 'tested', 'testes', 'tethered', 'tetrabrominated', 'tolerance', 'tolerant', 'toxicity', 'toxin', 'tracer', 'transfected', 'transfer', 'transition', 'transport', 'transporter', 'treated', 'treating', 'treatment', 'triggered', 'turn', 'type', 'unesterified', 'untreated', 'vacancies', 'vacancy', 'variable', 'water', 'yeast', 'yield', 'zwitterion'}

Split on hyphens followed by one of these sequences
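How the prefix and suffix tables interact around a hyphen can be sketched with small illustrative subsets of the sets above (the helper name and the reduced tables are assumptions):

```python
NO_SPLIT_PREFIX = {"tert", "meta", "cis", "trans"}
SPLIT_SUFFIX = {"based", "doped", "containing", "treated"}

def split_hyphen(token):
    """Split 'X-Y' into ['X', '-', 'Y'] when Y is a known splittable suffix,
    but keep chemistry prefixes such as 'tert-' attached."""
    if "-" not in token:
        return [token]
    prefix, _, suffix = token.partition("-")
    if prefix.lower() in NO_SPLIT_PREFIX:
        return [token]
    if suffix.lower() in SPLIT_SUFFIX:
        return [prefix, "-", suffix]
    return [token]
```

So “Pd-doped” is split because “doped” is a generic modifier, while “tert-butyl” stays whole because “tert” is a chemical prefix.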

NO_SPLIT = {'°c'}
get_additional_regex(sentence)[source]

Any additional regex used to further split the tokens. These regular expressions may be supplied contextually by the sentence, on the fly. For example, a sentence may have certain models associated with it, and those models may have dimensions associated with them. These dimensions tell the tokenizer what to do with high confidence: given a string like “12K”, if a temperature is expected, the tokenizer will automatically split the value from the unit.

Parameters:

sentence (chemdataextractor.doc.text.Sentence) – The sentence for which to get additional regex

Returns:

Expression to further split the tokens

Return type:

re.Pattern

class chemdataextractor.nlp.tokenize.FineWordTokenizer(split_last_stop=True)[source]

Bases: chemdataextractor.nlp.tokenize.WordTokenizer

Word Tokenizer that also splits around hyphens and all colons.

SPLIT = ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '``', "''", '->', '<', '>', '–', '—', '―', '~', '⁓', '∼', '°', ';', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '→', '⇄', '"', '“', '”', '„', '‟', '‘', '’', '‚', '‛', '`', '´', '′', '″', '‴', '‵', '‶', '‷', '⁗', '(', '[', '{', '}', ']', ')', '/', '⁄', '∕', '-', '−', '‒', '‐', '‑', '+', '±', ':']

Split before and after these sequences, wherever they occur, unless the entire token is one of these sequences

SPLIT_NO_DIGIT = [',']

Split around these sequences unless they are followed by a digit

NO_SPLIT = {}
NO_SPLIT_PREFIX = {}

Don’t split around hyphens with these prefixes

NO_SPLIT_SUFFIX = {}

Don’t split around hyphens with these suffixes.

class chemdataextractor.nlp.tokenize.BertWordTokenizer(split_last_stop=True, path=None, lowercase=True)[source]

Bases: chemdataextractor.nlp.tokenize.ChemWordTokenizer

A word tokenizer for BERT, with some additional allowances so that its choices can be overridden. Concrete overrides used in CDE include not splitting when a decimal point appears to sit in the middle of a number, and splitting values from units.

do_not_split = []
do_not_split_if_in_num = ['.', ',']
__init__(split_last_stop=True, path=None, lowercase=True)[source]

Initialize self. See help(type(self)) for accurate signature.

span_tokenize(s, additional_regex=None)[source]
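The effect of do_not_split_if_in_num can be sketched as a post-processing pass over a token stream: re-join tokens when '.' or ',' falls between two digits, so a wordpiece-style split of “1.5” does not survive. This is an illustrative approximation, not the BertWordTokenizer implementation.

```python
def merge_num_splits(tokens, do_not_split_if_in_num=(".", ",")):
    """Re-join tokens like ['1', '.', '5'] into ['1.5'] when a separator
    from do_not_split_if_in_num falls between two numeric tokens."""
    out = []
    i = 0
    while i < len(tokens):
        if (len(out) > 0 and i + 1 < len(tokens)
                and tokens[i] in do_not_split_if_in_num
                and out[-1][-1:].isdigit() and tokens[i + 1][:1].isdigit()):
            # Merge separator and following digits into the previous token.
            out[-1] = out[-1] + tokens[i] + tokens[i + 1]
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

A sentence-final full stop is unaffected, since it is not flanked by digits on both sides.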