.nlp¶
Tools for performing the NLP stages, such as POS tagging, word clustering, chemical named entity recognition (CNER), and abbreviation detection.
Chemistry-aware natural language processing framework.
.nlp.abbrev¶
Abbreviation detection.
-
class
chemdataextractor.nlp.abbrev.
AbbreviationDetector
(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]¶ Bases:
object
Detect abbreviation definitions in a list of tokens.
Similar to the algorithm in Schwartz & Hearst 2003.
-
__init__
(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
abbr_min
= 3¶ Minimum abbreviation length
-
abbr_max
= 10¶ Maximum abbreviation length
-
abbr_equivs
= []¶ String equivalents to use when detecting abbreviations.
-
-
class
chemdataextractor.nlp.abbrev.
ChemAbbreviationDetector
(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]¶ Bases:
chemdataextractor.nlp.abbrev.AbbreviationDetector
Chemistry-aware abbreviation detector.
This abbreviation detector has an additional list of string equivalents (e.g. Silver = Ag) that improve abbreviation detection on chemistry texts.
-
abbr_min
= 3¶ Minimum abbreviation length
-
abbr_max
= 14¶ Maximum abbreviation length
-
abbr_equivs
= [('silver', 'Ag'), ('gold', 'Au'), ('mercury', 'Hg'), ('lead', 'Pb'), ('tin', 'Sn'), ('tungsten', 'W'), ('iron', 'Fe'), ('sodium', 'Na'), ('potassium', 'K'), ('copper', 'Cu'), ('sulfate', 'SO4'), ('methanol', 'MeOH'), ('ethanol', 'EtOH'), ('hydroxy', 'OH'), ('hexadecyltrimethylammonium bromide', 'CTAB'), ('cytarabine', 'Ara-C'), ('hydroxylated', 'OH'), ('hydrogen peroxide', 'H2O2'), ('quartz', 'SiO2'), ('amino', 'NH2'), ('amino', 'NH2'), ('ammonia', 'NH3'), ('ammonium', 'NH4'), ('methyl', 'CH3'), ('nitro', 'NO2'), ('potassium carbonate', 'K2CO3'), ('carbonate', 'CO3'), ('borohydride', 'BH4'), ('triethylamine', 'NEt3'), ('triethylamine', 'Et3N')]¶ String equivalents to use when detecting abbreviations.
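The core matching step of Schwartz & Hearst can be sketched in a few lines: scan the abbreviation backwards through the candidate long form, requiring every character to appear in order, with the first character starting a word. The string equivalents above then let a symbol like Ag match "silver" by substitution before matching. This is a simplified illustration of the algorithm, not the library's exact implementation:

```python
def find_long_form(short, candidate):
    """Schwartz & Hearst-style check: scan the abbreviation backwards
    through the candidate text; every character must appear in order,
    and the first character must start a word."""
    s, l = len(short) - 1, len(candidate) - 1
    while s >= 0:
        c = short[s].lower()
        if not c.isalnum():
            s -= 1
            continue
        # Move backwards until the character matches; the abbreviation's
        # first character must additionally sit at the start of a word.
        while l >= 0 and (candidate[l].lower() != c or
                          (s == 0 and l > 0 and candidate[l - 1].isalnum())):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # Trim the long form to the word where matching stopped.
    start = candidate.rfind(' ', 0, l + 1) + 1
    return candidate[start:]


# Excerpt of the abbr_equivs pairs listed above.
ABBR_EQUIVS = [('silver', 'Ag'), ('gold', 'Au')]

def find_long_form_chem(short, candidate):
    """Retry the match after substituting chemistry-specific equivalents."""
    for long_s, abbr_s in ABBR_EQUIVS:
        if abbr_s in short:
            result = find_long_form(short.replace(abbr_s, long_s), candidate)
            if result:
                return result
    return find_long_form(short, candidate)
```

Without the equivalents, `find_long_form("Ag", "silver")` fails, which is precisely the case `ChemAbbreviationDetector` fixes.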
-
.nlp.cem¶
Named entity recognition (NER) for Chemical entity mentions (CEM).
This was the default NER system up to version 2.0, while the new NER is included in new_cem.
-
chemdataextractor.nlp.cem.
IGNORE_SUFFIX
= ['-', "'s", '-activated', '-adequate', '-affected'...¶ Token endings to ignore when considering stopwords and deriving spans
-
chemdataextractor.nlp.cem.
IGNORE_PREFIX
= ['fluorophore-', 'low-', 'high-', 'single-', 'odd-...¶ Token beginnings to ignore when considering stopwords and deriving spans
-
chemdataextractor.nlp.cem.
STRIP_END
= ['groups', 'group', 'colloidal', 'dyes', 'dye', 'p...¶ Final tokens to remove from entity matches
-
chemdataextractor.nlp.cem.
STRIP_START
= ['anhydrous', 'elemental', 'amorphous', 'conjugate...¶ First tokens to remove from entity matches
-
chemdataextractor.nlp.cem.
STOP_TOKENS
= {'.cdx', '.sk2', '10.1021', '10.1039', '10.1186', ...¶ Disallowed tokens in chemical entity mentions (discard if any single token has exact case-insensitive match)
-
chemdataextractor.nlp.cem.
STOP_SUB
= {' brand of ', ' oil', ' with ', '!', '%', ', ', '...¶ Disallowed substrings in chemical entity mentions (only used when filtering to construct the dictionary?)
-
chemdataextractor.nlp.cem.
STOPLIST
= {'(gaba)ergic', '1,3-dpma', '1,5-dpma', '12mg', '3...¶ Disallowed chemical entity mentions (discard if exact case-insensitive match)
-
chemdataextractor.nlp.cem.
STOP_RES
= ['^(http|ftp)://', '\\.(com|uk|eu|org|net)$', '^\\...¶ Regular expressions that define disallowed chemical entity mentions. Note the entity text is passed as lowercase.
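Applying these is straightforward; a minimal sketch using only the two patterns visible above (the full list is truncated in this documentation):

```python
import re

# Excerpt of STOP_RES; the full list in the source is longer.
STOP_RES = [r'^(http|ftp)://', r'\.(com|uk|eu|org|net)$']

def is_stop_entity(text):
    # The entity text is lowercased before matching.
    text = text.lower()
    return any(re.search(pattern, text) for pattern in STOP_RES)
```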
-
chemdataextractor.nlp.cem.
SPLITS
= ['^(actinium|aluminium|aluminum|americium|antimony...¶ Regular expressions defining collections of words that should be split if joined by hyphens or -to-
-
class
chemdataextractor.nlp.cem.
CiDictCemTagger
(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]¶ Bases:
chemdataextractor.nlp.tag.DictionaryTagger
Case-insensitive CEM dictionary tagger.
-
tag_type
= 'ner_tag'¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
model
= 'models/cem_dict-1.0.pickle'¶
-
-
class
chemdataextractor.nlp.cem.
CsDictCemTagger
(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]¶ Bases:
chemdataextractor.nlp.tag.DictionaryTagger
Case-sensitive CEM dictionary tagger.
-
tag_type
= 'ner_tag'¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
model
= 'models/cem_dict_cs-1.0.pickle'¶
-
case_sensitive
= True¶
-
-
class
chemdataextractor.nlp.cem.
CrfCemTagger
(model=None, lexicon=None, clusters=None, params=None)[source]¶ Bases:
chemdataextractor.nlp.tag.CrfTagger
-
tag_type
= 'ner_tag'¶
-
model
= 'models/cem_crf_chemdner_cemp-1.0.pickle'¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
clusters
= True¶
-
params
= {'c1': 1.0, 'c2': 0.001, 'feature.possible_states': False, 'feature.possible_transitions': False, 'max_iterations': 200}¶
-
legacy_tag
(tokens)[source]¶ - Parameters:
tokens (list(obj)) – Tokens to tag
- Returns (list(obj), obj):
legacy_tag
corresponds to the tag
method in ChemDataExtractor 2.0 and earlier. It has been renamed legacy_tag
because of its complexity: it could be called with either a list of strings or a list of (token, PoS tag) pairs, which made it incompatible with the new taggers in their current form. ChemDataExtractor 2.1 will call this method with a list of strings only. It should be used only for converting previously written taggers with as few code changes as possible, as shown in the migration guide.
-
-
class
chemdataextractor.nlp.cem.
LegacyCemTagger
(*args, **kwargs)[source]¶ Bases:
chemdataextractor.nlp.tag.EnsembleTagger
Return the combined output of a number of chemical entity taggers.
-
label_type
= 'ner_tag'¶
-
taggers
= [<chemdataextractor.nlp.cem.CrfCemTagger object>, <chemdataextractor.nlp.cem.CiDictCemTagger object>, <chemdataextractor.nlp.cem.CsDictCemTagger object>]¶ The individual chemical entity taggers to use.
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
legacy_tag
(tokens)[source]¶ - Parameters:
tokens (list(obj)) – Tokens to tag
- Returns (list(obj), obj):
legacy_tag
corresponds to the tag
method in ChemDataExtractor 2.0 and earlier. It has been renamed legacy_tag
because of its complexity: it could be called with either a list of strings or a list of (token, PoS tag) pairs, which made it incompatible with the new taggers in their current form. ChemDataExtractor 2.1 will call this method with a list of strings only. It should be used only for converting previously written taggers with as few code changes as possible, as shown in the migration guide.
-
.nlp.new_cem¶
New and improved named entity recognition (NER) for Chemical entity mentions (CEM).
-
class
chemdataextractor.nlp.new_cem.
BertFinetunedCRFCemTagger
(indexers=None, weights_location=None, gpu_id=None, archive_location=None, tag_type=None, min_batch_size=None, max_batch_size=None, max_allowed_length=None)[source]¶ Bases:
chemdataextractor.nlp.allennlpwrapper.AllenNlpWrapperTagger
A Chemical Entity Mention tagger using a finetuned BERT model with a CRF to constrain the outputs.
-
tag_type
= 'ner_tag'¶
-
indexers
= {'bert': <allennlp.data.token_indexers.wordpiece_indexer.PretrainedBertIndexer object>}¶
-
model
= 'models/bert_finetuned_crf_model-1.0a'¶
-
overrides
= {'model.text_field_embedder.token_embedders.bert.pretrained_model': '/home/docs/.local/share/ChemDataExtractor/models/scibert_cased_weights-1.0.tar.gz'}¶
-
-
class
chemdataextractor.nlp.new_cem.
CemTagger
(*args, **kwargs)[source]¶ Bases:
chemdataextractor.nlp.tag.EnsembleTagger
A state of the art Named Entity Recognition tagger for both organic and inorganic materials that uses a tagger based on BERT with a Conditional Random Field to constrain the outputs. More details in the paper (https://pubs.acs.org/doi/full/10.1021/acs.jcim.1c01199).
-
taggers
= [<chemdataextractor.nlp.allennlpwrapper._AllenNlpTokenTagger object>, <chemdataextractor.nlp.allennlpwrapper.ProcessedTextTagger object>, <chemdataextractor.nlp.new_cem.BertFinetunedCRFCemTagger object>]¶
-
.nlp.allennlpwrapper¶
Tagger wrappers that wrap AllenNLP functionality. Used for named entity recognition.
-
class
chemdataextractor.nlp.allennlpwrapper.
ProcessedTextTagger
[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Class to process text before the text is fed into any other taggers. This class is designed to be used with AllenNlpWrapperTagger and replaces any single-number tokens with <nUm> in accordance with the training data.
-
tag_type
= 'processed_text'¶
-
number_pattern
= re.compile('([\\+\\-–−]?\\d+(([\\.・,\\d])+)?)')¶
-
number_string
= '<nUm>'¶
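The effect of this tagger can be sketched with the pattern above: any token that is entirely a (possibly signed, decimal, or comma-separated) number is replaced by <nUm> before being passed to the model, while all other tokens pass through untouched.

```python
import re

# number_pattern and number_string as documented above.
number_pattern = re.compile(r'([\+\-–−]?\d+(([\.・,\d])+)?)')
number_string = '<nUm>'

def process_token(token):
    # Replace single-number tokens, leaving everything else untouched.
    return number_string if number_pattern.fullmatch(token) else token
```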
-
-
class
chemdataextractor.nlp.allennlpwrapper.
AllenNlpWrapperTagger
(indexers=None, weights_location=None, gpu_id=None, archive_location=None, tag_type=None, min_batch_size=None, max_batch_size=None, max_allowed_length=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
A wrapper for an AllenNLP model. Tested with a CRF Tagger but should work with any sequence labeller trained in allennlp.
-
model
= None¶
-
__init__
(indexers=None, weights_location=None, gpu_id=None, archive_location=None, tag_type=None, min_batch_size=None, max_batch_size=None, max_allowed_length=None)[source]¶ - Parameters:
indexers (dict(str, TokenIndexer), optional) – A dictionary of all the AllenNLP indexers to be used with the taggers. Please refer to their documentation for more detail.
weights_location (str, optional) – Location for weights. Corresponds to the weights_file parameter for the load_archive function from AllenNLP.
gpu_id (int, optional) – The ID for the GPU to be used. If None is passed in, ChemDataExtractor will automatically detect if a GPU is available and use that. To explicitly use the CPU, pass in a value of -1.
archive_location (str, optional) – The location where the model is archived. Corresponds to the archive_file parameter in the load_archive function from AllenNLP. Alternatively, you can set this parameter to None and set the class property
model
, which will then search for the model inside of ChemDataExtractor’s default model directory.
tag_type (obj, optional) – Override the class’s tag type. Refer to the documentation for
BaseTagger
for more information on how to use tag types.
min_batch_size (int, optional) – The minimum batch size to use when predicting. Default 100.
max_batch_size (int, optional) – The maximum batch size to use when predicting. Default 200.
max_allowed_length (int, optional) – The maximum allowed length of a sentence when predicting. Default 220. Any sentences longer than this will be split into multiple smaller sentences via a sliding window approach and the results will be collected. Needs to be a multiple of 4 for correct predictions.
-
tag_type
= None¶
-
indexers
= None¶
-
overrides
= None¶
-
process
(tag)[source]¶ Process the given tag. This can be used for example if the names of tags in training are different from what ChemDataExtractor expects.
- Parameters:
str (tag) – The raw string output from the predictor.
- Returns:
A processed version of the tag
- Return type:
-
predictor
¶ The AllenNLP predictor for this tagger.
-
batch_tag
(sents)[source]¶ - Parameters:
sents (list(list(chemdataextractor.doc.text.RichToken))) – The sentences to tag
- Returns:
list(list(~chemdataextractor.doc.text.RichToken, obj))
Take a list of lists of all the tokens from all the elements in a document, and return a list of lists of (token, tag) pairs. One thing to note is that the resulting list of lists of (token, tag) pairs need not be in the same order as the incoming list of lists of tokens, as sorting is done so that we can bucket sentences by their lengths. More information can be found in the
BaseTagger
documentation, and in this guide.
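The reordering described above can be sketched as sorting sentences by length before slicing them into batches, so each batch contains similarly sized sentences and padding is minimised. This is a simplified illustration; the real tagger also respects the minimum batch size and handles GPU placement:

```python
def bucket_into_batches(sents, max_batch_size=200):
    """Sort sentences by length, then slice into batches. The output
    order therefore differs from the input order."""
    ordered = sorted(sents, key=len)
    return [ordered[i:i + max_batch_size]
            for i in range(0, len(ordered), max_batch_size)]
```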
-
.nlp.corpus¶
Tools for reading and writing text corpora.
-
class
chemdataextractor.nlp.corpus.
LazyCorpusLoader
(name, reader_cls, *args, **kwargs)[source]¶ Bases:
object
Derived from NLTK LazyCorpusLoader.
-
chemdataextractor.nlp.corpus.
wsj
= <BracketParseCorpusReader in '.../corpora/wsj_trai...¶ Entire WSJ corpus (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
-
chemdataextractor.nlp.corpus.
wsj_training
= <BracketParseCorpusReader in '.../corpora/wsj_trai...¶ WSJ corpus sections 0-18 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
-
chemdataextractor.nlp.corpus.
wsj_development
= <BracketParseCorpusReader in '.../corpora/wsj_deve...¶ WSJ corpus sections 19-21 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
-
chemdataextractor.nlp.corpus.
wsj_evaluation
= <BracketParseCorpusReader in '.../corpora/wsj_eval...¶ WSJ corpus sections 22-24 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
-
chemdataextractor.nlp.corpus.
treebank2_training
= <ChunkedCorpusReader in '.../corpora/treebank2_tra...¶ WSJ corpus sections 0-18 (treebank2)
-
chemdataextractor.nlp.corpus.
treebank2_development
= <ChunkedCorpusReader in '.../corpora/treebank2_dev...¶ WSJ corpus sections 19-21 (treebank2)
-
chemdataextractor.nlp.corpus.
treebank2_evaluation
= <ChunkedCorpusReader in '.../corpora/treebank2_eva...¶ WSJ corpus sections 22-24 (treebank2)
-
chemdataextractor.nlp.corpus.
genia_training
= <TaggedCorpusReader in '.../corpora/genia_training...¶ First 80% of GENIA POS-tagged corpus
-
chemdataextractor.nlp.corpus.
genia_evaluation
= <TaggedCorpusReader in '.../corpora/genia_evaluati...¶ Last 20% of GENIA POS-tagged corpus
-
chemdataextractor.nlp.corpus.
medpost
= <TaggedCorpusReader in '.../corpora/medpost' (not ...¶
-
chemdataextractor.nlp.corpus.
medpost_training
= <TaggedCorpusReader in '.../corpora/medpost_traini...¶
-
chemdataextractor.nlp.corpus.
medpost_evaluation
= <TaggedCorpusReader in '.../corpora/medpost_evalua...¶
-
chemdataextractor.nlp.corpus.
cde_tokensc
= <PlaintextCorpusReader in '.../corpora/cde_tokensc...¶
-
chemdataextractor.nlp.corpus.
chemdner_training
= <PlaintextCorpusReader in '.../corpora/chemdner_tr...¶
.nlp.lexicon¶
Cache features of previously seen words.
-
class
chemdataextractor.nlp.lexicon.
Lexeme
(text, normalized, lower, first, suffix, shape, length, upper_count, lower_count, digit_count, is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title, is_punct, is_hyphenated, like_url, like_number, cluster)[source]¶ Bases:
object
-
__init__
(text, normalized, lower, first, suffix, shape, length, upper_count, lower_count, digit_count, is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title, is_punct, is_hyphenated, like_url, like_number, cluster)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
text
¶ Original Lexeme text.
-
cluster
¶ The Brown Word Cluster for this Lexeme.
-
normalized
¶ Normalized text, using the Lexicon Normalizer.
-
lower
¶ Lowercase text.
-
first
¶ First character.
-
suffix
¶ Three-character suffix
-
shape
¶ Word shape. Derived by replacing every number with ‘d’, every greek letter with ‘g’, and every latin letter with ‘X’ or ‘x’ for uppercase and lowercase respectively.
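A sketch of such a shape function under the mapping described above (digits → 'd', Greek letters → 'g', Latin letters → 'X'/'x'); the treatment of other characters, kept as-is here, is an assumption:

```python
def word_shape(text):
    shape = []
    for ch in text:
        if ch.isdigit():
            shape.append('d')
        elif '\u0370' <= ch <= '\u03ff':   # Greek and Coptic block
            shape.append('g')
        elif ch.isalpha():
            shape.append('X' if ch.isupper() else 'x')
        else:
            shape.append(ch)                # assumption: pass through
    return ''.join(shape)
```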
-
length
¶ Lexeme length.
-
upper_count
¶ Count of uppercase characters.
-
lower_count
¶ Count of lowercase characters.
-
digit_count
¶ Count of digits.
-
is_alpha
¶ Whether the text is entirely alphabetical characters.
-
is_ascii
¶ Whether the text is entirely ASCII characters.
-
is_digit
¶ Whether the text is entirely digits.
-
is_lower
¶ Whether the text is entirely lowercase.
-
is_upper
¶ Whether the text is entirely uppercase.
-
is_title
¶ Whether the text is title cased.
-
is_punct
¶ Whether the text is entirely punctuation characters.
-
is_hyphenated
¶ Whether the text is hyphenated.
-
like_url
¶ Whether the text looks like a URL.
-
like_number
¶ Whether the text looks like a number.
-
-
class
chemdataextractor.nlp.lexicon.
Lexicon
[source]¶ Bases:
object
-
normalizer
= <chemdataextractor.text.normalize.Normalizer object>¶ The Normalizer for this Lexicon.
-
clusters_path
= None¶ Path to the Brown clusters model file for this Lexicon.
-
-
class
chemdataextractor.nlp.lexicon.
ChemLexicon
[source]¶ Bases:
chemdataextractor.nlp.lexicon.Lexicon
A Lexicon that is pre-configured with a Chemistry-aware Normalizer and Brown word clusters derived from a chemistry corpus.
-
normalizer
= <chemdataextractor.text.normalize.ChemNormalizer object>¶
-
clusters_path
= 'models/clusters_chem1500-1.0.pickle'¶
-
.nlp.pos¶
Part-of-speech tagging.
-
chemdataextractor.nlp.pos.
TAGS
= ['NN', 'IN', 'NNP', 'DT', 'NNS', 'JJ', ',', '.', '...¶ Complete set of POS tags. Ordered by decreasing frequency in WSJ corpus.
-
class
chemdataextractor.nlp.pos.
ApPosTagger
(model=None, lexicon=None, clusters=None)[source]¶ Bases:
chemdataextractor.nlp.tag.ApTagger
Greedy Averaged Perceptron POS tagger trained on WSJ corpus.
-
model
= 'models/pos_ap_wsj_nocluster-1.0.pickle'¶
-
tag_type
= 'pos_tag'¶
-
clusters
= False¶
-
-
class
chemdataextractor.nlp.pos.
ChemApPosTagger
(model=None, lexicon=None, clusters=None)[source]¶ Bases:
chemdataextractor.nlp.pos.ApPosTagger
Greedy Averaged Perceptron POS tagger trained on both WSJ and GENIA corpora.
Uses features based on word clusters from chemistry text.
-
model
= 'models/pos_ap_wsj_genia-1.0.pickle'¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
tag_type
= 'pos_tag'¶
-
clusters
= True¶
-
-
class
chemdataextractor.nlp.pos.
CrfPosTagger
(model=None, lexicon=None, clusters=None, params=None)[source]¶ Bases:
chemdataextractor.nlp.tag.CrfTagger
-
model
= 'models/pos_crf_wsj_nocluster-1.0.pickle'¶
-
tag_type
= 'pos_tag'¶
-
clusters
= False¶
-
-
class
chemdataextractor.nlp.pos.
ChemCrfPosTagger
(model=None, lexicon=None, clusters=None, params=None)[source]¶ Bases:
chemdataextractor.nlp.pos.CrfPosTagger
-
model
= 'models/pos_crf_wsj_genia-1.0.pickle'¶
-
tag_type
= 'pos_tag'¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
clusters
= True¶
-
.nlp.tag¶
Tagger implementations. Used for part-of-speech tagging and named entity recognition.
-
class
chemdataextractor.nlp.tag.
BaseTagger
[source]¶ Bases:
object
Abstract tagger class from which all taggers inherit.
Subclasses must implement at least one of the following sets of methods for tagging:
legacy_tag()
tag()
batch_tag()
can_tag()
andtag_for_type()
can_tag()
andcan_batch_tag()
andbatch_tag_for_type()
The above interface is called when required by classes including
Sentence
or Document
, depending on whether only the tag for a sentence is required or for the whole document. If the user has implemented more than one of the combinations above, the order of precedence for the tagging methods is as follows:
batch_tag_for_type()
tag_for_type()
batch_tag()
tag()
legacy_tag()
Most users should not have to implement the top two options, and the default implementations are discussed in the documentation for
EnsembleTagger
instead of here. An implementation of the other tagging methods should have the following signatures and should be implemented in the following cases:
tag(self, list(
RichToken
) tokens) -> list(RichToken
, obj) Take a list of all the tokens from an element, and return a list of (token, tag) pairs. This should be the default implementation for any new tagger. More information on how to create a new tagger can be found in this guide. batch_tag(self, list(list(
RichToken
)) sents) -> list(list(RichToken
, obj)) Take a list of lists of all the tokens from all the elements in a document, and return a list of lists of (token, tag) pairs. One thing to note is that the resulting list of lists of (token, tag) pairs need not be in the same order as the incoming list of lists of tokens, so some sorting can be done if, for example, bucketing of sentences by their lengths is desired. In addition to tag
, the batch_tag
method should be implemented instead of the tag
method in cases where the taggers rely on backends that are more performant when tagging multiple sentences, and the tagger will be called for every element. More information can be found in this guide. Note
If a tagger only has
batch_tag
implemented, the tagger will fail when applied to an element that does not belong to a document. legacy_tag(self, list(obj) tokens) -> (list(obj), obj)
legacy_tag
corresponds to the tag
method in ChemDataExtractor 2.0 and earlier. It has been renamed legacy_tag
because of its complexity: it could be called with either a list of strings or a list of (token, PoS tag) pairs, which made it incompatible with the new taggers in their current form. ChemDataExtractor 2.1 will call this method with a list of strings only. It should be used only for converting previously written taggers with as few code changes as possible, as shown in the migration guide. To express intent to the ChemDataExtractor framework that the tagger can tag for a certain tag type, you should implement the
can_tag
method, which takes a tag type and returns a boolean. The default implementation, provided by this class, looks at the tag_type
attribute of the tagger and returns True if it matches the tag type provided. Warning
While the
RichToken
class maintains backwards compatibility in most cases, e.g. in parsers, by assigning the 1
key in dictionary-style lookup with the combined PoS and NER tag, calling this key in an NER or PoS tagger will cause your script to crash. To avoid this, please change any previous bits of code such as token[1]
to token["ner_tag"]
or token.ner_tag
.-
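The warning above can be illustrated with a stand-in class (FakeRichToken is hypothetical and not part of the library; the real RichToken lives in chemdataextractor.doc.text and computes tags lazily):

```python
class FakeRichToken:
    """Hypothetical stand-in illustrating RichToken-style access."""
    def __init__(self, text, ner_tag):
        self.text = text
        self.ner_tag = ner_tag

    def __getitem__(self, key):
        # Positional lookup is what crashes inside NER/PoS taggers.
        if key == 1:
            raise TypeError('token[1] is not supported inside new-style taggers')
        return getattr(self, key)


token = FakeRichToken('benzene', 'B-CM')

# Old style (ChemDataExtractor 2.0 and earlier): token[1]  -- now crashes.
# New style: ask for the tag type by name.
label = token['ner_tag']   # or token.ner_tag
```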
tag_type
= ''¶ The tag type for this tagger. When this tag type is asked for from the token, as described in
RichToken
, this tagger will be called.
-
tag_sents
(sentences)[source]¶ Apply the tag method to each sentence in sentences. Deprecated since version 2.1: deprecated in conjunction with the deprecation of the legacy_tag function. Please write equivalent functionality to use RichTokens.
-
can_tag
(tag_type)[source]¶ Whether this tagger can tag the given tag type.
- Parameters:
tag_type (obj) – The tag type which the system wants to tag. Usually a string.
- Returns:
True if this parser can tag the given tag type
- Return type: bool
-
can_batch_tag
(tag_type)[source]¶ Whether this tagger can batch tag the given tag type.
- Parameters:
tag_type (obj) – The tag type which the system wants to batch tag. Usually a string.
- Returns:
True if this parser can tag the given tag type
- Return type: bool
-
class
chemdataextractor.nlp.tag.
EnsembleTagger
(*args, **kwargs)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
A class for taggers which act on the results of multiple other taggers. This could also be done by simply adding each tagger to the sentence and having the taggers each act on the results from the other taggers by accessing RichToken attributes, but an EnsembleTagger allows for the user to add one tagger instead, cleaning up the interface.
The EnsembleTagger is also useful in collating the results from multiple taggers of the same type, as can be seen in the case of
CemTagger
which collects multiple types of NER labellers (a CRF and multiple dictionary taggers), to create a single coherent NER label.-
tag_type
= ''¶
-
taggers
= []¶
-
tag_for_type
(tokens, tag_type)[source]¶ This method will be called if the EnsembleTagger has previously claimed that it can tag the given tag type via the
can_tag()
method. The appropriate tagger within EnsembleTagger is called and the results returned. Note
This method can handle having legacy taggers mixed in with newer taggers.
- Parameters:
tokens (list(chemdataextractor.doc.text.RichToken)) – The tokens which should be tagged
tag_type (obj) – The tag type for which EnsembleTagger should tag the tokens.
- Returns:
A list of tuples of the given tokens and the corresponding tags.
- Return type:
-
batch_tag_for_type
(sents, tag_type)[source]¶ This method will be called if the EnsembleTagger has previously claimed that it can batch tag the given tag type via the
can_batch_tag()
method. The appropriate tagger within EnsembleTagger is called and the results returned.
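The dispatch described above can be sketched as follows. This is a simplified illustration that assumes each sub-tagger exposes can_tag and tag, and that delegation goes to the first matching tagger; the real class also handles legacy taggers and batch tagging:

```python
class UpperTagger:
    """Toy sub-tagger: labels each token with whether it is uppercase."""
    tag_type = 'case_tag'

    def can_tag(self, tag_type):
        return tag_type == self.tag_type

    def tag(self, tokens):
        return [(tok, tok.isupper()) for tok in tokens]


class MiniEnsembleTagger:
    """Simplified sketch of EnsembleTagger-style dispatch."""
    def __init__(self, taggers):
        self.taggers = taggers

    def can_tag(self, tag_type):
        return any(t.can_tag(tag_type) for t in self.taggers)

    def tag_for_type(self, tokens, tag_type):
        # Delegate to the first sub-tagger that claims the tag type.
        for tagger in self.taggers:
            if tagger.can_tag(tag_type):
                return tagger.tag(tokens)
        raise ValueError('no tagger for tag type: %s' % tag_type)


ensemble = MiniEnsembleTagger([UpperTagger()])
result = ensemble.tag_for_type(['DNA', 'strand'], 'case_tag')
```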
-
can_batch_tag
(tag_type)[source]¶ Whether this tagger can batch tag the given tag type.
- Parameters:
tag_type (obj) – The tag type which the system wants to batch tag. Usually a string.
- Returns:
True if this parser can tag the given tag type
- Return type: bool
-
can_tag
(tag_type)[source]¶ Whether this tagger can tag the given tag type.
- Parameters:
tag_type (obj) – The tag type which the system wants to tag. Usually a string.
- Returns:
True if this parser can tag the given tag type
- Return type: bool
-
class
chemdataextractor.nlp.tag.
NoneTagger
(tag_type=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Tag every token with None.
-
class
chemdataextractor.nlp.tag.
RegexTagger
(patterns=None, lexicon=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Regular Expression Tagger.
-
patterns
= [('^-?[0-9]+(.[0-9]+)?$', 'CD'), ('(The|the|A|a|An|an)$', 'AT'), ('.*able$', 'JJ'), ('.*ness$', 'NN'), ('.*ly$', 'RB'), ('.*s$', 'NNS'), ('.*ing$', 'VBG'), ('.*ed$', 'VBD'), ('.*', 'NN')]¶ Regular expression patterns in (regex, tag) tuples.
-
lexicon
= <chemdataextractor.nlp.lexicon.Lexicon object>¶ The lexicon to use
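First-match-wins application of these patterns can be sketched as follows, using a subset of the (regex, tag) pairs above; the final '.*' pattern acts as the fallback:

```python
import re

# Subset of the patterns listed above.
PATTERNS = [
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'.*', 'NN'),                     # default: noun
]

def regex_tag(tokens):
    tagged = []
    for token in tokens:
        for pattern, tag in PATTERNS:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
    return tagged
```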
-
-
class
chemdataextractor.nlp.tag.
AveragedPerceptron
[source]¶ Bases:
object
Averaged Perceptron implementation.
Based on implementation by Matthew Honnibal, released under the MIT license.
-
class
chemdataextractor.nlp.tag.
ApTagger
(model=None, lexicon=None, clusters=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Greedy Averaged Perceptron tagger, based on implementation by Matthew Honnibal, released under the MIT license.
- See more:
http://spacy.io/blog/part-of-speech-POS-tagger-in-python/ https://github.com/sloria/textblob-aptagger
-
START
= ['-START-', '-START2-']¶
-
lexicon
= <chemdataextractor.nlp.lexicon.Lexicon object>¶
-
clusters
= False¶
-
class
chemdataextractor.nlp.tag.
CrfTagger
(model=None, lexicon=None, clusters=None, params=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Tagger that uses Conditional Random Fields (CRF).
-
lexicon
= <chemdataextractor.nlp.lexicon.Lexicon object>¶
-
clusters
= False¶
-
params
= {'c1': 1.0, 'c2': 0.001, 'feature.possible_states': False, 'feature.possible_transitions': False, 'max_iterations': 50}¶ Parameters to pass to the training algorithm. See http://www.chokkan.org/software/crfsuite/manual.html
-
-
class
chemdataextractor.nlp.tag.
DictionaryTagger
(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Dictionary Tagger. Tag tokens based on inclusion in a DAWG.
-
delimiters
= re.compile('(^.|\\b|\\s|\\W|.$)')¶ Delimiters that define where matches are allowed to start or end.
-
model
= None¶ DAWG model file path.
-
entity
= 'CM'¶ Entity tag. Matches will be tagged like 'B-CM' and 'I-CM' according to the IOB scheme. TODO: Optional no B/I?
-
case_sensitive
= False¶ Whether dictionary matches are case sensitive.
-
lexicon
= <chemdataextractor.nlp.lexicon.Lexicon object>¶ The lexicon to use.
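Greedy longest-match dictionary tagging with IOB labels can be sketched as below. This is a simplified, whitespace-joined version: the real tagger matches against a DAWG and uses the delimiters above to decide where matches may start and end:

```python
def iob_tag(tokens, dictionary, entity='CM'):
    """Greedy longest-match dictionary tagging with B-/I- IOB labels.
    Matching is case-insensitive, mirroring the default case_sensitive=False."""
    tags = ['O'] * len(tokens)
    i = 0
    while i < len(tokens):
        match_end = None
        # Try the longest candidate span first.
        for j in range(len(tokens), i, -1):
            if ' '.join(tokens[i:j]).lower() in dictionary:
                match_end = j
                break
        if match_end is None:
            i += 1
        else:
            tags[i] = 'B-' + entity
            for k in range(i + 1, match_end):
                tags[k] = 'I-' + entity
            i = match_end
    return list(zip(tokens, tags))
```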
-
.nlp.tokenize¶
Word and sentence tokenizers.
-
class
chemdataextractor.nlp.tokenize.
BaseTokenizer
[source]¶ Bases:
object
Abstract base class from which all Tokenizer classes inherit.
Subclasses must implement a
span_tokenize(text)
method that returns a list of integer offset tuples that identify tokens in the text.
-
chemdataextractor.nlp.tokenize.
regex_span_tokenize
(s, regex)[source]¶ Return spans that identify tokens in s split using regex.
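A sketch consistent with this behaviour, splitting on regex matches and yielding the offsets of what remains (similar in spirit to NLTK's regexp span tokenizer):

```python
import re

def regex_span_tokenize(s, regex):
    """Yield (start, end) offsets of the spans left between regex matches,
    i.e. the tokens produced by splitting s on the regex."""
    left = 0
    for match in re.finditer(regex, s):
        right, next_left = match.span()
        if right != left:
            yield (left, right)
        left = next_left
    if left != len(s):
        yield (left, len(s))
```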
-
class
chemdataextractor.nlp.tokenize.
SentenceTokenizer
(model=None)[source]¶ Bases:
chemdataextractor.nlp.tokenize.BaseTokenizer
Sentence tokenizer that uses the Punkt algorithm by Kiss & Strunk (2006).
-
model
= 'models/punkt_english.pickle'¶
-
-
class
chemdataextractor.nlp.tokenize.
ChemSentenceTokenizer
(model=None)[source]¶ Bases:
chemdataextractor.nlp.tokenize.SentenceTokenizer
Sentence tokenizer that uses the Punkt algorithm by Kiss & Strunk (2006), trained on chemistry text.
-
model
= 'models/punkt_chem-1.0.pickle'¶
-
-
class
chemdataextractor.nlp.tokenize.
WordTokenizer
(split_last_stop=True)[source]¶ Bases:
chemdataextractor.nlp.tokenize.BaseTokenizer
Standard word tokenizer for generic English text.
-
SPLIT
= ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '``', "''", '->', '<', '>', '–', '—', '―', '~', '⁓', '∼', '°', ';', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '→', '⇄', '"', '“', '”', '„', '‟', '‘', '‚', '‛', '`', '´', '′', '″', '‴', '‵', '‶', '‷', '⁗', '(', '[', '{', '}', ']', ')', '/', '⁄', '∕', '−', '‒', '+', '±']¶ Split before and after these sequences, wherever they occur, unless entire token is one of these sequences
-
SPLIT_NO_DIGIT
= [':', ',']¶ Split around these sequences unless they are followed by a digit
-
SPLIT_START_WORD
= ["''", '``', "'"]¶ Split after these sequences if they start a word
-
SPLIT_END_WORD
= ["'s", "'m", "'d", "'ll", "'re", "'ve", "n't", "''", "'", '’s', '’m', '’d', '’ll', '’re', '’ve', 'n’t', '’', '’’']¶ Split before these sequences if they end a word
-
NO_SPLIT_STOP
= ['...', 'al.', 'Co.', 'Ltd.', 'Pvt.', 'A.D.', 'B.C.', 'B.V.', 'S.D.', 'U.K.', 'U.S.', 'r.t.']¶ Don’t split full stop off last token if it is one of these sequences
-
CONTRACTIONS
= [('cannot', 3), ("d'ye", 1), ('d’ye', 1), ('gimme', 3), ('gonna', 3), ('gotta', 3), ('lemme', 3), ("mor'n", 3), ('mor’n', 3), ('wanna', 3), ("'tis", 2), ("'twas", 2)]¶ Split these contractions at the specified index
-
NO_SPLIT
= {'mm-hm', 'mm-mm', 'o-kay', 'uh-huh', 'uh-oh', 'wanna-be'}¶ Don’t split these sequences.
-
NO_SPLIT_PREFIX
= {'a', 'agro', 'ante', 'anti', 'arch', 'be', 'bi', 'bio', 'co', 'counter', 'cross', 'cyber', 'de', 'e', 'eco', 'ex', 'extra', 'inter', 'intra', 'macro', 'mega', 'micro', 'mid', 'mini', 'multi', 'neo', 'non', 'over', 'pan', 'para', 'peri', 'post', 'pre', 'pro', 'pseudo', 'quasi', 're', 'semi', 'sub', 'super', 'tri', 'u', 'ultra', 'un', 'uni', 'vice', 'x'}¶ Don’t split around hyphens with these prefixes
-
NO_SPLIT_SUFFIX
= {'-o-torium', 'esque', 'ette', 'fest', 'fold', 'gate', 'itis', 'less', 'most', 'rama', 'wise'}¶ Don’t split around hyphens with these suffixes.
-
NO_SPLIT_CHARS
= '0123456789,\'"“”„‟‘’‚‛`´′″‴‵‶‷⁗'¶ Don’t split around hyphens if only these characters before or after.
-
__init__
(split_last_stop=True)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
split_last_stop
= None¶ Whether to split off the final full stop (unless preceded by NO_SPLIT_STOP). Default True.
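The final-stop rule can be sketched as: split off a trailing full stop unless the token is one of the NO_SPLIT_STOP sequences (or is just a full stop itself).

```python
# NO_SPLIT_STOP as documented above.
NO_SPLIT_STOP = ['...', 'al.', 'Co.', 'Ltd.', 'Pvt.', 'A.D.', 'B.C.',
                 'B.V.', 'S.D.', 'U.K.', 'U.S.', 'r.t.']

def split_final_stop(token):
    # Split off a trailing full stop unless the token is a known abbreviation.
    if token.endswith('.') and len(token) > 1 and token not in NO_SPLIT_STOP:
        return [token[:-1], '.']
    return [token]
```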
-
get_additional_regex
(sentence)[source]¶ Any additional regex to further split the tokens. These regular expressions may be supplied contextually by the sentence, on the fly. For example, a sentence may have certain models associated with it, and dimensions associated with those models. These dimensions tell the tokenizer what to do with high confidence: if given a string like “12K” when a temperature is desired, the tokenizer will automatically split it given the information provided.
- Parameters:
sentence (chemdataextractor.doc.text.Sentence) – The sentence for which to get additional regex
- Returns:
Expression to further split the tokens
- Return type:
re.expression
-
-
class
chemdataextractor.nlp.tokenize.
ChemWordTokenizer
(split_last_stop=True)[source]¶ Bases:
chemdataextractor.nlp.tokenize.WordTokenizer
Word Tokenizer for chemistry text.
-
SPLIT
= ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '<', ').', '.(', '–', '—', '―', '~', '⁓', '∼', '°', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '⇄', '"', '“', '”', '„', '‟', '‘', '‚', '‛', '`', '´']¶ Split before and after these sequences, wherever they occur, unless entire token is one of these sequences
-
SPLIT_END
= [':', ',', '(TM)', '(R)', '(®)', '(™)', '(■)', '(◼)', '(●)', '(▲)', '(○)', '(◆)', '(▼)', '(⧫)', '(△)', '(◇)', '(▽)', '(⬚)', '(×)', '(□)', '(•)', '’', '°C']¶ Split before these sequences if they end a token
-
SPLIT_END_NO_DIGIT
= ['(aq)', '(aq.)', '(s)', '(l)', '(g)']¶ Split before these sequences if they end a token, unless preceded by a digit
-
NO_SPLIT_SLASH
= ['+', '-', '−']¶ Don’t split around slash when both preceded and followed by these characters
-
QUANTITY_RE
= re.compile('^((?P<split>\\d\\d\\d)g|(?P<_split1>[-−]?\\d+\\.\\d+|10[-−]\\d+)(g|s|m|N|V)([-−]?[1-4])?|(?P<_split2>\\d*[-−]?\\d+\\.?\\d*)([pnµμm]A|[µμmk]g|[kM]J|m[lL]|[nµμm]?M|[nµμmc]m|kN|[mk]V|[mkMG]?W|[mnpμµ]s|H))$')¶ Regular expression that matches a numeric quantity with units
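To illustrate the idea behind `QUANTITY_RE` with a deliberately simplified pattern (this is a sketch, not the full expression above; `quantity_re` and `split_quantity` are hypothetical names): a group whose name starts with `_split` captures the numeric part, so a value can be separated from its unit:

```python
import re

# Simplified stand-in for QUANTITY_RE: a number captured in a group whose
# name starts with "_split", followed by a small sample of units.
quantity_re = re.compile(r'^(?P<_split2>\d+\.?\d*)(mV|kg|mL|nm)$')

def split_quantity(token):
    """Return [value, unit] when the token is a recognised quantity."""
    m = quantity_re.match(token)
    if not m:
        return [token]
    end = m.end('_split2')
    return [token[:end], token[end:]]

print(split_quantity('100mV'))  # ['100', 'mV']
print(split_quantity('3.5kg'))  # ['3.5', 'kg']
print(split_quantity('hello'))  # ['hello']
```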
-
NO_SPLIT_PREFIX_ENDING
= re.compile('(^\\(.*\\)|^[\\d,\'"“”„‟‘’‚‛`´′″‴‵‶‷⁗Α-Ωα-ω]+|ano|ato|azo|boc|bromo|cbz|chloro|eno|fluoro|fmoc|ido|ino|io|iodo|mercapto|nitro|ono|oso|oxalo|oxo|oxy|phospho|telluro|tms|yl|ylen|ylene|yliden|ylidene|yl)')¶ Don’t split on hyphen if the prefix matches this regular expression
-
NO_SPLIT_CHEM
= re.compile('([\\-α-ω]|\\d+,\\d+|\\d+[A-Z]|^d\\d\\d?$|acetic|acetyl|acid|acyl|anol|azo|benz|bromo|carb|cbz|chlor|cyclo|ethan|ethyl|fluoro|fmoc|gluc|hydro|idyl|indol|iene|ione|iodo|mercapto|n,n|nitro|noic|o,o|oxal)', re.IGNORECASE)¶ Don’t split on hyphen if prefix or suffix match this regular expression
-
NO_SPLIT_PREFIX
= {'a', 'aci', 'adeno', 'agro', 'aldehydo', 'allo', 'alpha', 'altro', 'ambi', 'ante', 'anti', 'aorto', 'arachno', 'arch', 'as', 'be', 'beta', 'bi', 'bio', 'bis', 'catena', 'centi', 'chi', 'chiro', 'circum', 'cis', 'closo', 'co', 'colo', 'conjuncto', 'conta', 'contra', 'cortico', 'cosa', 'counter', 'cran', 'cross', 'crypto', 'cyber', 'cyclo', 'de', 'deca', 'deci', 'delta', 'demi', 'di', 'dis', 'dl', 'e', 'eco', 'electro', 'endo', 'ennea', 'ent', 'epi', 'epsilon', 'erythro', 'eta', 'ex', 'exo', 'extra', 'ferro', 'galacto', 'gamma', 'gastro', 'giga', 'gluco', 'glycero', 'graft', 'gulo', 'hemi', 'hepta', 'hexa', 'homo', 'hydro', 'hypho', 'hypo', 'ideo', 'idio', 'in', 'infra', 'inter', 'intra', 'iota', 'iso', 'judeo', 'kappa', 'keto', 'kis', 'lambda', 'lyxo', 'macro', 'manno', 'medi', 'mega', 'meso', 'meta', 'micro', 'mid', 'milli', 'mini', 'mono', 'mu', 'muco', 'multi', 'musculo', 'myo', 'nano', 'neo', 'neuro', 'nido', 'nitro', 'non', 'nona', 'nor', 'novem', 'novi', 'nu', 'octa', 'octi', 'octo', 'omega', 'omicron', 'ortho', 'over', 'paleo', 'pan', 'para', 'pelvi', 'penta', 'peri', 'pheno', 'phi', 'pi', 'pica', 'pneumo', 'poly', 'post', 'pre', 'preter', 'pro', 'pseudo', 'psi', 'quadri', 'quasi', 'quater', 'quinque', 're', 'recto', 'rho', 'ribo', 'salpingo', 'scyllo', 'sec', 'semi', 'sept', 'septi', 'sero', 'sesqui', 'sexi', 'sigma', 'sn', 'soci', 'sub', 'super', 'supra', 'sur', 'sym', 'syn', 'talo', 'tau', 'tele', 'ter', 'tera', 'tert', 'tetra', 'theta', 'threo', 'trans', 'tri', 'triangulo', 'tris', 'u', 'uber', 'ultra', 'un', 'uni', 'unsym', 'upsilon', 'veno', 'ventriculo', 'vice', 'x', 'xi', 'xylo', 'zeta'}¶ Don’t split on hyphen if the prefix is one of these sequences
-
SPLIT_SUFFIX
= {'absorption', 'abstinent', 'abstraction', 'abuse', 'accelerated', 'accepting', 'acclimated', 'acclimation', 'acid', 'activated', 'activation', 'active', 'activity', 'addition', 'adducted', 'adducts', 'adequate', 'adjusted', 'administrated', 'adsorption', 'affected', 'aged', 'alcohol', 'alcoholic', 'algae', 'alginate', 'alkaline', 'alkylated', 'alkylation', 'alkyne', 'analogous', 'anesthetized', 'appended', 'armed', 'aromatic', 'assay', 'assemblages', 'assisted', 'associated', 'atom', 'atoms', 'attenuated', 'attributed', 'backbone', 'base', 'based', 'bearing', 'benzylation', 'binding', 'biomolecule', 'biotic', 'blocking', 'blood', 'bond', 'bonded', 'bonding', 'bonds', 'boosted', 'bottle', 'bottled', 'bound', 'bridge', 'bridged', 'buffer', 'buffered', 'caged', 'cane', 'capped', 'capturing', 'carrier', 'carrying', 'catalysed', 'catalyzed', 'cation', 'caused', 'centered', 'challenged', 'chelating', 'cleaving', 'coated', 'coating', 'coenzyme', 'competing', 'competitive', 'complex', 'complexes', 'compound', 'compounds', 'concentration', 'conditioned', 'conditions', 'conducting', 'configuration', 'confirmed', 'conjugate', 'conjugated', 'conjugates', 'connectivity', 'consuming', 'contained', 'containing', 'contaminated', 'control', 'converting', 'coordinate', 'coordinated', 'copolymer', 'copolymers', 'core', 'cored', 'cotransport', 'coupled', 'covered', 'crosslinked', 'cyclized', 'damaged', 'dealkylation', 'decocted', 'decorated', 'deethylation', 'deficiency', 'deficient', 'defined', 'degrading', 'demethylated', 'demethylation', 'dendrimer', 'density', 'dependant', 'dependence', 'dependent', 'deplete', 'depleted', 'depleting', 'depletion', 'depolarization', 'depolarized', 'deprived', 'derivatised', 'derivative', 'derivatives', 'derivatized', 'derived', 'desorption', 'detected', 'devalued', 'dextran', 'dextrans', 'diabetic', 'dimensional', 'dimer', 'distribution', 'divalent', 'domain', 'dominated', 'donating', 'donor', 'dopant', 'doped', 'doping', 'dosed', 'dot', 
'drinking', 'driven', 'drug', 'drugs', 'dye', 'edge', 'efficiency', 'electrodeposited', 'electrolyte', 'elevating', 'elicited', 'embedded', 'emersion', 'emitting', 'encapsulated', 'encapsulating', 'enclosed', 'enhanced', 'enhancing', 'enriched', 'enrichment', 'enzyme', 'epidermal', 'equivalents', 'etched', 'ethanolamine', 'evoked', 'exchange', 'excimer', 'excluder', 'expanded', 'experimental', 'exposed', 'exposure', 'expressing', 'extract', 'extraction', 'fed', 'finger', 'fixed', 'fixing', 'flanking', 'flavonoid', 'fluorescence', 'formation', 'forming', 'fortified', 'free', 'function', 'functionalised', 'functionalized', 'functionalyzed', 'fused', 'gas', 'gated', 'generating', 'glucuronidating', 'glycoprotein', 'glycosylated', 'glycosylation', 'gradient', 'grafted', 'group', 'groups', 'halogen', 'heterocyclic', 'homologues', 'hydrogel', 'hydrolyzing', 'hydroxylated', 'hydroxylation', 'hydroxysteroid', 'immersion', 'immobilized', 'immunoproteins', 'impregnated', 'imprinted', 'inactivated', 'increased', 'increasing', 'incubated', 'independent', 'induce', 'induced', 'inducible', 'inducing', 'induction', 'influx', 'inhibited', 'inhibitor', 'inhibitory', 'initiated', 'injected', 'insensitive', 'insulin', 'integrated', 'interlinked', 'intermediate', 'intolerant', 'intoxicated', 'ion', 'ions', 'island', 'isomer', 'isomers', 'knot', 'label', 'labeled', 'labeling', 'labelled', 'laden', 'lamp', 'laser', 'layer', 'layers', 'lesioned', 'ligand', 'ligated', 'like', 'limitation', 'limited', 'limiting', 'lined', 'linked', 'linker', 'lipid', 'lipids', 'lipoprotein', 'liposomal', 'liposomes', 'liquid', 'liver', 'loaded', 'loading', 'locked', 'loss', 'lowering', 'lubricants', 'luminance', 'luminescence', 'maintained', 'majority', 'making', 'mannosylated', 'material', 'mediated', 'metabolizing', 'metal', 'metallized', 'methylation', 'migrated', 'mimetic', 'mimicking', 'mixed', 'mixture', 'mode', 'model', 'modified', 'modifying', 'modulated', 'moiety', 'molecule', 'monoadducts', 
'monomer', 'mutated', 'nanogel', 'nanoparticle', 'nanotube', 'need', 'negative', 'nitrosated', 'nitrosation', 'nitrosylation', 'nmr', 'noncompetitive', 'normalized', 'nuclear', 'nucleoside', 'nucleosides', 'nucleotide', 'nucleotides', 'nutrition', 'olefin', 'olefins', 'oligomers', 'omitted', 'only', 'outcome', 'overload', 'oxidation', 'oxidized', 'oxo-mediated', 'oxygenation', 'page', 'paired', 'pathway', 'patterned', 'peptide', 'permeabilized', 'permeable', 'phase', 'phospholipids', 'phosphopeptide', 'phosphorylated', 'pillared', 'placebo', 'planted', 'plasma', 'polymer', 'polymers', 'poor', 'porous', 'position', 'positive', 'postlabeling', 'precipitated', 'preferring', 'pretreated', 'primed', 'produced', 'producing', 'production', 'promoted', 'promoting', 'protected', 'protein', 'proteomic', 'protonated', 'provoked', 'purified', 'radical', 'reacting', 'reaction', 'reactive', 'reagents', 'rearranged', 'receptor', 'receptors', 'recognition', 'redistribution', 'redox', 'reduced', 'reducing', 'reduction', 'refractory', 'refreshed', 'regenerating', 'regulated', 'regulating', 'regulatory', 'related', 'release', 'releasing', 'replete', 'requiring', 'resistance', 'resistant', 'resitant', 'response', 'responsive', 'responsiveness', 'restricted', 'resulted', 'retinal', 'reversible', 'ribosylated', 'ribosylating', 'ribosylation', 'rich', 'right', 'ring', 'saturated', 'scanning', 'scavengers', 'scavenging', 'sealed', 'secreting', 'secretion', 'seeking', 'selective', 'selectivity', 'semiconductor', 'sensing', 'sensitive', 'sensitized', 'soluble', 'solution', 'solvent', 'sparing', 'specific', 'spiked', 'stabilised', 'stabilized', 'stabilizing', 'stable', 'stained', 'steroidal', 'stimulated', 'stimulating', 'storage', 'stressed', 'stripped', 'substituent', 'substituted', 'substitution', 'substrate', 'sufficient', 'sugar', 'sugars', 'supplemented', 'supported', 'suppressed', 'surface', 'susceptible', 'sweetened', 'synthesizing', 'tagged', 'target', 'telopeptide', 'terminal', 
'terminally', 'terminated', 'termini', 'terminus', 'ternary', 'terpolymer', 'tertiary', 'tested', 'testes', 'tethered', 'tetrabrominated', 'tolerance', 'tolerant', 'toxicity', 'toxin', 'tracer', 'transfected', 'transfer', 'transition', 'transport', 'transporter', 'treated', 'treating', 'treatment', 'triggered', 'turn', 'type', 'unesterified', 'untreated', 'vacancies', 'vacancy', 'variable', 'water', 'yeast', 'yield', 'zwitterion'}¶ Split on hyphens followed by one of these sequences
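Taken together, the hyphen rules above amount to a decision like the following sketch. The two sets here are tiny excerpts of the full `NO_SPLIT_PREFIX` and `SPLIT_SUFFIX` collections, and `should_split_hyphen` is a hypothetical helper illustrating the logic, not the tokenizer's actual method:

```python
# Tiny excerpts of the collections documented above.
NO_SPLIT_PREFIX = {'cis', 'trans', 'cross', 'meta', 'tetra'}
SPLIT_SUFFIX = {'based', 'doped', 'free', 'soluble', 'induced'}

def should_split_hyphen(token):
    """Decide whether to split a token around its first hyphen."""
    if '-' not in token:
        return False
    prefix, _, suffix = token.partition('-')
    if prefix.lower() in NO_SPLIT_PREFIX:
        return False   # e.g. 'cis-platin' stays one token
    if suffix.lower() in SPLIT_SUFFIX:
        return True    # e.g. 'water-soluble' is split apart
    return False

print(should_split_hyphen('cis-platin'))     # False
print(should_split_hyphen('water-soluble'))  # True
```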
-
NO_SPLIT
= {'°c'}¶
-
get_additional_regex
(sentence)[source]¶ Any additional regex to further split the tokens. These regular expressions may be supplied contextually by the sentence, on the fly. For example, a sentence may have certain models associated with it, and those models may have associated dimensions. These dimensions tell the tokenizer what to do with high confidence; for example, given a string like “12K”, if a temperature is expected, the tokenizer will automatically split it using the information provided.
- Parameters:
sentence (chemdataextractor.doc.text.Sentence) – The sentence for which to get additional regex
- Returns:
Expression to further split the tokens
- Return type:
re.Pattern
-
-
class
chemdataextractor.nlp.tokenize.
FineWordTokenizer
(split_last_stop=True)[source]¶ Bases:
chemdataextractor.nlp.tokenize.WordTokenizer
Word Tokenizer that also splits around hyphens and all colons.
-
SPLIT
= ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '``', "''", '->', '<', '>', '–', '—', '―', '~', '⁓', '∼', '°', ';', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶', '≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '→', '⇄', '"', '“', '”', '„', '‟', '‘', '’', '‚', '‛', '`', '´', '′', '″', '‴', '‵', '‶', '‷', '⁗', '(', '[', '{', '}', ']', ')', '/', '⁄', '∕', '-', '−', '‒', '‐', '‑', '+', '±', ':']¶ Split before and after these sequences, wherever they occur, unless entire token is one of these sequences
-
SPLIT_NO_DIGIT
= [',']¶ Split around these sequences unless they are followed by a digit
-
NO_SPLIT
= {}¶
-
NO_SPLIT_PREFIX
= {}¶ Don’t split around hyphens with these prefixes
-
NO_SPLIT_SUFFIX
= {}¶ Don’t split around hyphens with these suffixes.
-
-
class
chemdataextractor.nlp.tokenize.
BertWordTokenizer
(split_last_stop=True, path=None, lowercase=True)[source]¶ Bases:
chemdataextractor.nlp.tokenize.ChemWordTokenizer
A word tokenizer for BERT, with some additional allowances so that its choices can be overridden. Concrete overrides used in CDE include not splitting when a decimal point appears to fall in the middle of a number, and splitting values from their units.
-
do_not_split
= []¶
-
do_not_split_if_in_num
= ['.', ',']¶
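The decimal-point allowance can be sketched as follows. `keep_together` is a hypothetical helper illustrating the `do_not_split_if_in_num` idea, not the tokenizer's real method: a `'.'` or `','` is kept inside a token only when digits sit on both sides of it.

```python
def keep_together(token, index, do_not_split_if_in_num=('.', ',')):
    """Return True when the character at `index` should NOT be split off,
    i.e. it is a '.' or ',' sitting between two digits (a decimal number)."""
    ch = token[index]
    if ch not in do_not_split_if_in_num:
        return False
    return (0 < index < len(token) - 1
            and token[index - 1].isdigit()
            and token[index + 1].isdigit())

print(keep_together('3.5', 1))   # True  -> '3.5' stays one token
print(keep_together('end.', 3))  # False -> the final stop is split off
```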
-
-