.nlp

Tools for performing the NLP stages, such as POS tagging, word clustering, chemical named entity recognition (CNER), and abbreviation detection.

Chemistry-aware natural language processing framework.

.nlp.abbrev

Abbreviation detection.

class chemdataextractor.nlp.abbrev.AbbreviationDetector(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]

Bases: object

Detect abbreviation definitions in a list of tokens.

Similar to the algorithm in Schwartz & Hearst 2003.

__init__(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]

Initialize self. See help(type(self)) for accurate signature.

abbr_min = 3

Minimum abbreviation length

abbr_max = 10

Maximum abbreviation length

abbr_equivs = []

String equivalents to use when detecting abbreviations.

detect(tokens)[source]

Return an (abbr, long) pair for each abbreviation definition.

detect_spans(tokens)[source]

Return an (abbr_span, long_span) pair for each abbreviation definition.

abbr_span and long_span are (int, int) spans defining token ranges.
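
To illustrate the span format, here is a minimal sketch of a Schwartz & Hearst style matcher. It is written from the general idea of that algorithm and is not the library's implementation. The helper names (find_long_start, detect_spans_sketch) and the candidate window size are assumptions made for this example, and only the "long form ( ABBR )" pattern is handled.

```python
def find_long_start(abbr, candidate_tokens):
    """Return the index of the first long-form token, or None if no match."""
    long_text = " ".join(candidate_tokens)
    s = len(abbr) - 1       # position in the abbreviation, scanning right to left
    l = len(long_text) - 1  # position in the candidate text, scanning right to left
    while s >= 0:
        ch = abbr[s].lower()
        if not ch.isalnum():
            s -= 1
            continue
        # Move left until the character matches; the first abbreviation
        # character must additionally sit at the start of a word.
        while l >= 0 and (
            long_text[l].lower() != ch
            or (s == 0 and l > 0 and long_text[l - 1].isalnum())
        ):
            l -= 1
        if l < 0:
            return None
        l -= 1
        s -= 1
    # l + 1 is the character offset where the first character matched.
    return long_text[: l + 1].count(" ")


def detect_spans_sketch(tokens, abbr_min=3, abbr_max=10):
    """Return ((abbr_start, abbr_end), (long_start, long_end)) token spans."""
    spans = []
    for i in range(1, len(tokens) - 2):
        if tokens[i] != "(" or tokens[i + 2] != ")":
            continue
        abbr = tokens[i + 1]
        if not abbr_min <= len(abbr) <= abbr_max:
            continue
        # Assumed window: only look a few tokens back for the long form.
        window = min(len(abbr) + 5, len(abbr) * 2)
        first = max(0, i - window)
        offset = find_long_start(abbr, tokens[first:i])
        if offset is not None:
            spans.append(((i + 1, i + 2), (first + offset, i)))
    return spans


tokens = ["The", "solvent", "was", "tetrahydrofuran", "(", "THF", ")", "."]
for (a0, a1), (l0, l1) in detect_spans_sketch(tokens):
    print(tokens[a0:a1], tokens[l0:l1])
```

Each span is a half-open (start, end) token range in this sketch, so slicing the token list with it recovers the abbreviation and long-form tokens.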

class chemdataextractor.nlp.abbrev.ChemAbbreviationDetector(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]

Bases: chemdataextractor.nlp.abbrev.AbbreviationDetector

Chemistry-aware abbreviation detector.

This abbreviation detector has an additional list of string equivalents (e.g. Silver = Ag) that improve abbreviation detection on chemistry texts.

abbr_min = 3

Minimum abbreviation length

abbr_max = 10

Maximum abbreviation length

abbr_equivs = [('silver', 'Ag'), ('gold', 'Au'), ('mercury', 'Hg'), ('lead', 'Pb'), ('tin', 'Sn'), ('tungsten', 'W'), ('iron', 'Fe'), ('sodium', 'Na'), ('potassium', 'K'), ('copper', 'Cu'), ('sulfate', 'SO4'), ('methanol', 'MeOH'), ('ethanol', 'EtOH'), ('hydroxy', 'OH'), ('hexadecyltrimethylammonium bromide', 'CTAB'), ('cytarabine', 'Ara-C'), ('hydroxylated', 'OH'), ('hydrogen peroxide', 'H2O2'), ('quartz', 'SiO2'), ('amino', 'NH2'), ('amino', 'NH2'), ('ammonia', 'NH3'), ('ammonium', 'NH4'), ('methyl', 'CH3'), ('nitro', 'NO2'), ('potassium carbonate', 'K2CO3'), ('carbonate', 'CO3'), ('borohydride', 'BH4'), ('triethylamine', 'NEt3'), ('triethylamine', 'Et3N')]

String equivalents to use when detecting abbreviations.
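
One way to picture how string equivalents can help: before matching, an abbreviation candidate could be expanded into variants in which a chemical symbol is replaced by its name. The sketch below shows only that idea. The function name equivalent_variants and the substitution strategy are assumptions for illustration, and the library may apply the equivalents differently.

```python
# A small subset of the (long, short) pairs listed in abbr_equivs.
ABBR_EQUIVS = [("silver", "Ag"), ("gold", "Au"), ("methanol", "MeOH")]


def equivalent_variants(abbr, equivs=ABBR_EQUIVS):
    """Return the abbreviation plus variants with symbols spelled out."""
    variants = [abbr]
    for long_form, short_form in equivs:
        if short_form in abbr:
            variants.append(abbr.replace(short_form, long_form))
    return variants


print(equivalent_variants("AgNPs"))
```

A variant such as "silverNPs" shares its leading characters with the long form "silver nanoparticles", which a letter-matching detector can align even though "Ag" itself does not occur in "silver".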

.nlp.cem

Named entity recognition (NER) for Chemical entity mentions (CEM).

chemdataextractor.nlp.cem.IGNORE_SUFFIX = ['-', "'s", '-activated', '-adequate', '-affected', '-anesthetized', '-based', '-binding', '-boosted', '-cane', '-conditioned', '-containing', '-covered', '-deficient', '-dependent', '-derived', '-electrolyte', '-enriched', '-exposed', '-flanking', '-free', '-fused', '-gated', '-glucuronosyltransferases', '-increasing', '-induced', '-inducible', '-l-tyrosine', '-labeled', '-lesioned', '-loaded', '-mediated', '-patterned', '-primed', '-reducing', '-regulated', '-releasing', '-resistant', '-response', '-rich', '-s-transferase', '-sensitive', '-soluble', '-stimulated', '-stressed', '-supplemented', '-terminal', '-transferase', '-treated', '-type', '-blood', '-specific', '-like', '-elicited', '-stripped', '-transfer', '-conjugate', '-coated', '-producing', '-oxidized', '-associated', '-related', '-converting', '-ligand', '-on-glass', '-seeking', '-hydrolyzing', '-o-deethylase', '-deethylase', '-o-depentylase', '-depentylase', '-n-demethylase', '-demethylase', '-o-methyltransferase', '-c-oxidase', '-oxidase', '-n-biosidase', '-biosidase', '-immunoproteins', '-spiked', '-lowering', '-page', '-depletion', '-formation', '-dealkylation', '-deethylation', '-alkylation', '-ribosylation', '-production', '-demethylation', '-oxidation', '-transition', '-glycosylation', '-zwitterion', '-benzylation', '-reduction', '-oxygenation', '-nitrosylation', '-evoked', '-mutated', '-doped', '-aged', '-increased', '-triggered', '-linked', '-fixed', '-injected', '-contaminated', '-depleted', '-enhanced', '-stained', '-modified', '-fed', '-demethylated', '-catalyzed', '-etched', '-labelled', '-conjugated', '-pretreated', '-ribosylated', '-phosphorylated', '-reduced', '-bonded', '-stabilised', '-crosslinked', '-mannosylated', '-capped', '-supported', '-initiated', '-integrated', '-accelerated', '-encapsulated', '-untreated', '-expanded', '-coupled', '-terminated', '-assisted', '-permeabilized', '-resulted', '-alkylated', '-functionalized', '-contained', 
'-buffered', '-caused', '-cyclized', '-substituted', '-modulated', '-inhibited', '-centered', '-promoted', '-confirmed', '-provoked', '-dominated', '-limited', '-challenged', '-tetrabrominated', '-unesterified', '-refreshed', '-bottled', '-protonated', '-incubated', '-tagged', '-damaged', '-bridged', '-maintained', '-impregnated', '-metabolizing', '-deprived', '-insensitive', '-dendrimer', '-receptor', '-tolerant', '-influx', '-administrated', '-requiring', '-permeable', '-transport', '-intoxicated', '-overload', '-derivatives', '-derivative', '-sweetened', '-transporter', '-bound', '-extract', '-bonding', '-bond', '-trna', '-redistribution', '-copolymers', '-copolymer', '-appended', '-susceptible', '-transfected', '-bearing', '-regenerating', '-induction', '-conducting', '-decorated', '-encapsulating', '-consuming', '-bridge', '-dependence', '-Pdots', '-only', '-carrying', '-treating', '-isomerase', '-ion', '-ions', '-coordinated', '-saturated', '-sparing', '-enclosed', '-stabilized', '-polymer', '-yeast', '-making', '-porous', '-independent', '-metallized', '-attenuated', '-liquid', '-caged', '-deficiency', '-sensing', '-recognition', '-responsiveness', '-embedded', '-connectivity', '-abuse', '-chelating', '-decocted', '-forming', '-nutrition', '-scavenging', '-preferring', '-mimicking', '-drugs', '-drug', '-lubricants', '-adsorption', '-ligated', '-detected', '-responsive', '-reacting', '-defined', '-capturing', '-group', '-abstinent', '-paired', '-devalued', '-need', '-cellulose', '-atpase', '-inactivated', '-β-glucosaminidase', '-glucosaminidase', '-dosed', '-imprinted', '-precipitated', '-monoadducts', '-vacancies', '-vacancy', '-attributed', '-depolarization', '-depolarized', '-liver', '-testes', '-reversible', '-active', '-reactive', '-dextran', '-fixing', '-synthesizing', '-inhibitory', '-cleaving', '-positive', '-activity', '-fluorescence', '-regulating', '-NPs', '-scanning', '-water', '-nmr', '-limiting', '-refractory', '-knot', '-variable', 
'-biomolecule', '-backbone', '-exchange', '-donating', '-coating', '-hydrogenase', '-hydrogenases', '-intolerant', '-deplete', '-poor', '-loading', '-enrichment', '-elevating', '-resitant', '-stabilizing', '-pathway', '-fortified', '-adjusted', '-restricted', '-dependant', '-locked', '-normalized', '-aromatic', '-hydroxylation', '-intermediate', '-6-phosphatase', '-phosphatase', '-linker', '-proteomic', '-mimetic', '-lipid', '-radical', '-receptors', '-substrate', '-conjugates', '-promoting', '-dye', '-functionalyzed', '-catalysed', '-reductase', '-QDs', '-complexes', '-placebo', '-transferases', '-alginate', '-competing', '-depleting', '-sensitized', '-protein', '-regulatory', '-target', '-toxin', '-yield', '-planted', '-produced', '-derivatized', '-secreting', '-modifying', '-DNA', '-bonds', '-assemblages', '-exposure', '-negative', '-sealed', '-atom', '-atoms', '-abstraction', '-concentration', '-doping', '-competitive', '-acclimation', '-acclimated', '-interlinked', '-suppressed', '-postlabeling', '-labeling', '-diabetic', '-omitted', '-sufficient', '-generating', '-terminus', '-adducts', '-compound', '-compounds', '-γ-lyase', '-γ-synthase', '-lyase', '-synthase', '-inhibitor', '-protected', '-multiwall', '-stripping', '-plasma', '-evolving']

Token endings to ignore when considering stopwords and deriving spans

chemdataextractor.nlp.cem.IGNORE_PREFIX = ['fluorophore-', 'low-', 'high-', 'single-', 'odd-', 'non-', 'high-', 'cross-', 'cellulose-', 'anti-', '-multiwall', 'globular-', 'plasma-', 'hybrid-', 'protein-', 'explicit-', 'cation-', 'water-', 'through-', 'starch-', 'rigid-', 'conjugated-', 'photoactivatable-', 'alginate-', 'nano-', 'dye-', 'ligand-', 'enzyme-', 'platelet-', 'photo-', 'total-', 'drug-', 'nanoparticle-', 'nanomaterial-', 'inter-', 'ion-', 'post-', 'one-']

Token beginnings to ignore when considering stopwords and deriving spans
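
As an illustration of what ignoring token endings and beginnings can mean, the sketch below strips one listed suffix and one listed prefix from a token to expose the core text. The function name core_text and the "first match wins" behaviour are assumptions for this example. Only a few entries from IGNORE_SUFFIX and IGNORE_PREFIX are reproduced.

```python
IGNORE_SUFFIX = ["-induced", "-based", "-free"]
IGNORE_PREFIX = ["non-", "anti-", "water-"]


def core_text(token):
    """Return the token with one ignorable suffix and prefix removed."""
    lowered = token.lower()
    start, end = 0, len(token)
    for suffix in IGNORE_SUFFIX:
        if lowered.endswith(suffix):
            end -= len(suffix)
            break
    for prefix in IGNORE_PREFIX:
        if lowered.startswith(prefix):
            start += len(prefix)
            break
    return token[start:end]


print(core_text("glucose-induced"))
print(core_text("non-glucose"))
```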

chemdataextractor.nlp.cem.STRIP_END = ['groups', 'group', 'colloidal', 'dyes', 'dye', 'products', 'product', 'substances', 'substance', 'solution', 'derivatives', 'derivative', 'analog', 'salts', 'salt', 'minerals', 'mineral', 'anesthetic', 'tablet', 'tablets', 'preparation', 'atoms', 'atom', 'monomers', 'monomer', 'nanoparticles', 'nanoparticle', 'radicals', 'radical', 'dendrimers', 'dendrimer', 'ions', 'ion', 'particles', 'particle', 'anion', 'cation', 'foam', 'cellulose', 'dextran', '(', 'dust', 'herbicide', 'disease', 'diseases', 'and', 'or', ';', ',', '.']

Final tokens to remove from entity matches

chemdataextractor.nlp.cem.STRIP_START = ['anhydrous', 'elemental', 'amorphous', 'conjugated', 'colloidal', 'activated', 'water-soluble', 'total', 'superparamagnetic', 'molecular', 'high-density', 'synthetic', 'low-density', 'long-chain', 'fused', 'radioactive', 'reduced', 'anatase', 'dextran', ')', 'trisubstituted', 'deposited', 'herbicide', 'antagonist', 'agonist', 'and', 'or', 'metallic', 'embryotoxic', 'monoclinic']

First tokens to remove from entity matches
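
The effect of STRIP_START and STRIP_END can be sketched as trimming the first and last tokens of a match. The function name trim_entity is hypothetical, and repeating the trim until no listed token remains at either end is an assumption of this example. Only a few entries from each list are reproduced.

```python
STRIP_START = {"anhydrous", "elemental", "metallic"}
STRIP_END = {"solution", "nanoparticles", "salt", "ions"}


def trim_entity(tokens):
    """Drop listed tokens from the start and end of an entity match."""
    start, end = 0, len(tokens)
    while start < end and tokens[start].lower() in STRIP_START:
        start += 1
    while end > start and tokens[end - 1].lower() in STRIP_END:
        end -= 1
    return tokens[start:end]


print(trim_entity(["anhydrous", "sodium", "sulfate", "solution"]))
```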

chemdataextractor.nlp.cem.STOP_TOKENS = {'.cdx', '.sk2', '10.1021', '10.1039', '10.1186', 'account', 'adenovirus', 'affiliation', 'affiliations', 'aldrich', 'allphar', 'alpharma', 'america', 'angeles', 'apotex', 'approach', 'april', 'article', 'articles', 'astrazeneca', 'august', 'aventis', 'azərbaycanca', 'background', 'bayer', 'behringer', 'berlin', 'bibliography', 'bibtex', 'biochemistry', 'bioniche', 'bipharma', 'books', 'bovine', 'bristol', 'bristol-myers', 'cambridge', 'ccdc', 'chauvin', 'chemistry', 'chemspider', 'chemworx', 'chicago', 'chicken', 'children', 'china', 'chocolate', 'chromatography', 'ciba-geigy', 'citation', 'citing', 'claim', 'claims', 'claire', 'cm–1', 'coffee', 'colored', 'conclusion', 'conclusions', 'contact', 'crossref', 'cytochrome', 'danielle', 'december', 'dielectric', 'discussion', 'docking', 'doctrine', 'doi', 'download', 'edinburgh', 'edit', 'editor', 'editorial', 'editors', 'ekins', 'email', 'energy', 'english', 'esi', 'español', 'esperanto', 'ethical', 'euskara', 'external', 'february', 'fig.', 'file', 'fluorochem', 'francisco', 'gene', 'genetical', 'genevrier', 'genzyme', 'glaxo', 'glaxosmithkline', 'glycoprotein', 'google', 'guidelines', 'having', 'help', 'horse', 'human', 'imaging', 'index', 'inhibitor', 'interpharm', 'introduction', 'ireland', 'isbn', 'italiano', 'january', 'journal', 'journals', 'july', 'june', 'latviešu', 'letters', 'libraries', 'link', 'linkedin', 'links', 'literature', 'london', 'magazine', 'mammalian', 'march', 'marinlit', 'masthead', 'measurements', 'medline', 'members', 'menu', 'merck', 'method', 'methods', 'more', 'nano-beads', 'nanobeads', 'napoleon', 'navigation', 'nordfriisk', 'novartis', 'november', 'novopharm', 'october', 'overdose', 'oxford', 'palestine', 'parameters', 'paris', 'permissions', 'personal', 'pfizer', 'pharmacia', 'pharmacology', 'phenomena', 'pig', 'policy', 'prior', 'priority', 'privacy', 'procter', 'production', 'profile', 'rachel', 'ratiopharm', 'recombinant', 'recombination', 
'references', 'research', 'results', 'retention', 'roche', 'safety', 'salmon', 'schering', 'scientifique', 'september', 'sheep', 'sigma-aldrich', 'southampton', 'squibb', 'studies', 'syntheticpage', 'systematic', 'technical', 'test', 'tobacco', 'tokyo', 'upload', 'visfarm', 'volume', 'wikimedia', 'wiskott', 'york', 'zhang', '§', 'नेपाल भाषा', '†'}

Disallowed tokens in chemical entity mentions (discard if any single token has exact case-insensitive match)

chemdataextractor.nlp.cem.STOP_SUB = {' brand of ', ' oil', ' with ', '!', '%', ', ', ';', '?', '@', '\\', 'activating factor', 'adrenocorticotropic', 'anticodon', 'botulinum', 'coagulation factor', 'concanavalin', 'conductance', 'corticotrophin', 'corticotropin', 'exciton', 'factor ', 'fibroblast', 'follicle', 'freund', 'gene-related', 'glucagon', 'glucan', 'gramicidin', 'growth factor', 'hemoglobin', 'insulin', 'intercellular', 'interferon', 'interleukin', 'luteinizing', 'melanin', 'natriuretic', 'necrosis', 'necrosis factor', 'neurofilament', 'neuropeptide', 'oil of ', 'plasminogen', 'platelet', 'reactive', 'regulator', 'releasing factor', 'selectin', 'stimulating factor', 'transcription factor', 'transmembrane', '|'}

Disallowed substrings in chemical entity mentions (only used when filtering to construct the dictionary?)

chemdataextractor.nlp.cem.STOPLIST = {'(gaba)ergic', '1,3-dpma', '1,5-dpma', '12mg', '3 ps', '3ps', "5'-amp", '90th', 'Nucleophosmin', '[h2o2]', 'a', 'a chlorophyll', 'abbott', 'about', 'absolute ethanol', 'ac187', 'acacia', 'accelerate', 'accent', 'acs mobile', 'acs nano', 'acs omega', 'acth', 'actinin-4', 'activated carbon', 'activated charcoal', 'active carbon', 'activin', 'actomyosin', 'adage', 'adalimumab', 'adept', 'adipsin', 'adma', 'admire', 'adrenocorticotrophic hormone', 'adrenocorticotropic hormone', 'adrenodoxin', 'advance', 'advantage', 'aero', 'af-2', 'again', 'agar', 'agarose', 'agcg', 'agglutinin', 'akron', 'alamethicin', 'alcoholic', 'aldrich', 'alginate', 'all', 'allay', 'alliance', 'almost', 'alpen', 'alpha-actinin-4', 'alpha-t', 'also', 'although', 'alto', 'alum', 'always', 'am1', 'amaze', 'amberlite', 'ambush', 'amen', 'amitraz', 'ammo', 'among', 'amorphous carbon', 'amorphous silica', 'amphiregulin', 'amylin', 'amylopectin', 'amylose', 'an', 'an-152', 'and', 'android', 'angiotensin', 'angiotensin i', 'angiotensinogen', 'anion', 'anna', 'anon', 'another', 'anterior pituitary hormone', 'anti-stress', 'antidiuretic hormone', 'antitussive', 'any', 'aopp', 'apex', 'apolar', 'applaud', 'apron', 'aprotinin', 'aqua', 'arabinogalactan', 'arac', 'are', 'arena', 'aria', 'aromatic amine', 'arrow', 'arsenal', 'artemisinin', 'artist', 'as', 'ascophyllum', 'assert', 'assure', 'at', 'atpγs', 'atrium', 'aurora', 'auroxanthin', 'austin', 'authority', 'avastin', 'avenge', 'aversion', 'avicel', 'avicel cl611', 'avicel ph101', 'b(+)', 'b-dna', 'b13', 'bacp-2', 'bacteriorhodopsin', 'balance', 'banner', 'bantu', 'barnase', 'barnase-barstar', 'baron', 'baroque', 'barrage', 'barrels', 'barricade', 'barstar', 'battalion', 'bazooka', 'be', 'beast', 'because', 'been', 'before', 'being', 'belatacept', 'belt', 'benchmark', 'benet', 'bengal', 'beret', 'bernice', 'betaine', 'betula', 'between', 'bevacizumab', 'bide', 'bile', 'bionic', 'biopterin', 'bishop', 'bishop-kirtman', 
'blazer', 'blizzard', 'bloc', 'blood coagulation factor x', 'blood sugar', 'bloom', 'blow', 'bnp-32', 'bont/a', 'borneo', 'both', 'botox', 'brace', 'brake', 'brass', 'bridal', 'brigade', 'briton', 'bromelain', 'bromelia', 'brs-3-ap', 'btx-a', 'bumetanide', 'bump', 'but', 'butter', 'by', 'c-15', 'c-peptide', 'c-reactive protein', 'cadherin 11', 'cadmium chloride (cdcl2)', 'calcined', 'calcitonin', 'calibre', 'calypso', 'cameo', 'campaign', 'can', 'candidate molecules', 'candy', 'cannon', 'canopy', 'capmul', 'caprine', 'capture', 'caramel', 'carbomer', 'carboxymethylcellulose', 'carob', 'carol', 'carotene', 'carotenoid', 'carotenoids', 'carrageenan', 'carrageenin', 'carrie', 'cascade', 'casein', 'castor oil', 'caviar', 'ccl3', 'ccl3(-/-)', 'cd2', 'cd2+', 'cd3(+)', 'cd3ε', 'cd4+', 'cd68', 'cecil', 'cellulase', 'cellulose', 'centurion', 'cetuximab', 'chamomile', 'charged', 'charlie', 'chemokine', 'chess', 'chitin', 'chitosan', 'cholecystokinin', 'cholera toxin', 'chondroitin', 'chondroitin sulfate', 'chondroitin sulphate', 'chopper', 'chorionic gonadotrophin', 'chorionic gonadotropin', 'chymotrypsin', 'cinch', 'citation', 'citizen', 'citrus pectin', 'classic', 'clathrin heavy chain', 'clay', 'clin', 'clipper', 'clout', 'clove oil', 'coca', 'cochineal', 'cocktail', 'cocoa butter', 'coke', 'cola', 'collagen', 'collagenase', 'collagens', 'colt', 'combat', 'comet', 'comfort', 'command', 'commando', 'commodore', 'compass', 'compendium', 'complement proteins', 'compound', 'concanavalin a', 'concise', 'conclusion', 'concord', 'confront', 'conjugated estrogens', 'conjugated linoleic acid', 'conserve', 'consist', 'cont', 'contest', 'cope-bd', 'coral', 'corn oil', 'corn starch', 'cornstarch', 'corsair', 'corticotrophin-releasing hormone', 'corticotropin', 'could', 'counter', 'counter-anion', 'counter-ion', 'crack', 'crackdown', 'crank', 'crap', 'cremophor el', 'crest', 'crossbow', 'crotoxin', 'crunch', 'crystal', 'crystallography', 'cubes', 'cultivate', 'curb', 'curcuma', 
'cutlass', 'cyclin d1', 'cyclin d3', 'cyclones', 'cytochrome c', 'cytochrome p450', 'd250', 'daclizumab', 'dagger', 'dalteparin', 'dams', 'danshen', 'darbepoetin alfa', 'daren', 'dart', 'dash', 'ddds', 'defibrotide', 'deionized water', 'demon', 'denosumab', 'deoxyribonucleic acid', 'dept', 'dermatan sulfate', 'desmethyl-olanzapine', 'dextran', 'dextran sulfate sodium', 'dextrin', 'dial', 'diana', 'diane', 'dibs', 'did', 'dihydro', 'dinucleotide', 'discover', 'discussion', 'distilled water', 'diurnal', 'dividend', 'dixon', 'dm-10', 'dna double strand', 'dnase', 'dnase-i', 'do', 'does', 'dolly', 'done', 'dorado', 'dorm', 'dot-silica', 'double stranded dna', 'double-stranded dna', 'doyle', 'dpma', 'dragnet', 'drago', 'dragon', 'dreamer', 'due', 'duet', 'during', 'dwell', 'dynorphin', 'dynorphins', 'e-selectin', 'e-ssa', 'e3330', 'each', 'ecstasy', 'eculizumab', 'edge', 'either', 'elastin', 'elastomers', 'elevate', 'elite', 'elon', 'embark', 'emblem', 'emerald', 'eminent', 'emotion', 'empire', 'endeavour', 'endothelin-1', 'endurance', 'enforcer', 'enough', 'enoxaparin', 'epic', 'epidermal growth factor', 'epidermal growth factor (egf)', 'epoetin', 'epoetin alfa', 'epoetin beta', 'equity', 'eristostatin', 'erythropoietin', 'escort', 'especially', 'essex', 'estate', 'etc', 'ethylcellulose', 'excel', 'exciton', 'exenatide', 'exendin-4', 'exotoxin', 'expand', 'experimental', 'experimental procedures', 'facet', 'factor iia', 'factor v', 'factor vii', 'factor x', 'fenton', 'fenugreek', 'ferredoxin', 'ferritin', 'fetal hemoglobin', 'fgf2', 'fibrinogen', 'fibroin', 'finale', 'finish', 'first sign', 'flair', 'flake', 'flaxseed oil', 'flex', 'flint', 'flonase', 'flue gas', 'fly ash', 'follicle-stimulating hormone', 'for', 'fore', 'formulation', 'fortress', 'found', 'foxo1', 'fp-2', 'fractal', 'freedom', 'french green', 'fret-capture', 'from', 'fructose corn syrup', 'fucoidan', 'fulfill', 'furfural-water', 'furosemide', 'further', 'fury', 'fusarium toxin', 'galanin', 'gallant', 
'gallery', 'galsulfase', 'gana', 'gastrin', 'gelatin', 'gelatine', 'gemini', 'general experimental', 'genesis', 'ghrp-2', 'ginseng', 'gleevec', 'glide', 'glp-1', 'glucagon', 'glucagon-like peptide-1', 'glucans', 'glucomannan', 'glucophage', 'glut', 'gluten proteins', 'glycerin', 'glycine', 'glycogen', 'glycopeptide', 'glycoprotein', 'glycoproteins', 'gm-csf', 'gold', 'gonadotropin releasing hormone', 'goon', 'gradual', 'gramicidin a', 'granite', 'grasp', 'green tea leaves', 'grenade', 'groundnut oil', 'growth hormone', 'growth hormone releasing hormone', 'gsno', 'gst-p(+)', 'gtpγs', 'guardian', 'gum arabic', 'gypsum', 'gypsum fibrosum', 'h3n2', 'had', 'hairy', 'halt', 'happy', 'harness', 'harry', 'has', 'have', 'having', 'headline', 'heat pre', 'heaven', 'hell', 'hemocyanin', 'hemoglobin', 'hemopexin', 'hemozoin', 'henna', 'heparan sulfate', 'heparin', 'herald', 'herceptin', 'here', 'heteroatoms', 'heterocyclic', 'hirudin', 'histone', 'hmqc', 'hocus', 'homogentisate', 'honey', 'horizon', 'how', 'however', 'human serum albumin', 'hyalgan', 'hyaluronan', 'hyaluronic acid', 'hyaluronidase', 'hydro', 'hydrogel', 'hydrolyzed polyacrylamide', 'hydroxyethylcellulose', 'hydroxypropyl methylcellulose', 'hydroxypropylcellulose', 'hydroxypropylmethylcellulose', 'hyperoxia', 'hypo', 'i', 'iberiotoxin', 'icatibant', 'icon', 'if', 'ifn-gamma', 'ifn-β', 'ifn-γ', 'igaba', 'igf-1', 'ignite', 'il-11', 'il-2', 'il10', 'il12', 'imperator', 'in', 'inas', 'indigo', 'inhalable', 'insular', 'insulin', 'insulin glargine', 'integrin', 'intense blue', 'interceptor', 'interferon', 'interferon-gamma', 'interferon-γ', 'into', 'introduction', 'inulin', 'invader', 'ion', 'ip-10', 'is', 'iscu', 'isomaltosaccharide', 'it', 'its', 'itself', 'iκb-α', 'iκbα', 'jasmonate', 'joker', 'jolt', 'joust', 'jumbo', 'junk', 'just', 'k-12', 'karate', 'kelp', 'keratin', 'kestrel', 'kg', 'km', 'kokan', 'kollidon', 'kudos', 'laba', 'lady', 'lama', 'lance', 'lancer', 'lasso', 'latex', 'latex particles', 'lats', 
'lawson', 'lead', 'lead ion', 'leader', 'legend', 'liberty', 'light yellow', 'lignin', 'lignins', 'lignocellulose', 'limber', 'lime', 'linseed oil', 'linseed oils', 'lipid a', 'liposomal doxorubicin', 'lipoteichoic acid', 'liraglutide', 'lmwh', 'log in', 'lotion', 'lrp1', 'lsopc', 'ltb4', 'luteinising hormone', 'luteinizing hormone', 'm1-glucuronide', 'm41.4', 'maba', 'made', 'mag2', "mag2's", 'magnum', 'mainly', 'maintain a', 'make', 'malo', 'maltodextrin', 'manage', 'mandate', 'maneb', 'mannan', 'margarine', 'marksman', 'marshal', 'marshall', 'mascot', 'matador', 'match', 'may', 'maya', 'mega', 'melanin', 'melody', 'menopur', 'merit', 'merlin', 'meta', 'meta-analysis', 'metal-oxide', 'metallothionein', 'metallothioneins', 'methb', 'methemoglobin', 'methocel', 'method', 'methods', 'methylcellulose', 'metric', 'mg', 'mg-1', 'mibc', 'mica', 'microcrystalline cellulose', 'microdots', 'might', 'mighty', 'milk thistle', 'millie', 'minus', 'miracle', 'mirage', 'mist', 'ml', 'mm', 'mobile site', 'moesin', 'mol', 'molten', 'molybdate', 'moment', 'momentum', 'monarch', 'mops', 'morph', 'morpho', 'most', 'mostly', 'motilin', 'mpo-anca', 'mtcc', 'multiple', 'murabutide', 'musk', 'must', 'mustang', 'mustard', 'mustard oil', 'mutagen', 'n17', 'naglazyme', 'natalizumab', 'natural rubber', 'ndma', 'ndp-α-msh', 'nearly', 'neither', 'nerve agent', 'neurokinin a', 'neuropeptide y', 'new titles', 'nexus', 'nida', 'no', 'nociceptin', 'nodular', 'nor', 'nor-1', 'noxa', 'nucleobase', 'nucleophosmin', 'nucleotide', 'obtained', 'octadecaneuropeptide', 'octanol-air', 'octave', 'octreotide', 'of', 'often', 'oil', 'oil red', 'oil-in-water', 'oligonucleotide', 'olive oil', 'oliver', 'olympus', 'omalizumab', 'omega', 'on', 'opium', 'optimizer', 'orange', 'orbit', 'organometallic', 'organometallics', 'organometalloidal', 'orion', 'orphan', 'orphanin fq', 'osteocalcin', 'osteopontin', 'our', 'outflank', 'ovalbumin', 'overall', 'ovomucoid', 'p-selectin', 'p300', 'p450', 'pak1', 'paladin', 
'pampa', 'pancreatin', 'papp-a', 'paraffin', 'parathyroid hormone', 'parkin', 'parlay', 'part 2', 'partially hydrolyzed polyacrylamide', 'pat4', 'patrol', 'pc-12', 'pc1', 'pc12', 'peace', 'pectin', 'pectins', 'pegasus', 'pegsunercept', 'pensive', 'peon', 'peony', 'peppermint oil', 'peptide e', 'percolate', 'perhaps', 'perk', 'perna', 'persian', 'petroleum ether', 'pgc1α', 'phaseolin', 'phosphor', 'phycocyanin', 'picket', 'picrate', 'pima', 'pink', 'piper', 'pirate', 'pivot', 'pla2', 'placental growth hormone', 'plasminogen', 'pledge', 'plumbago', 'pmid', 'polo', 'poloxamer', 'poly', 'poly(a)-poly(t)', 'poly(i:c)', 'polygon', 'polypeptide', 'polysaccharide', 'polystyrene latex', 'polyubiquitin', 'posse', 'potato starch', 'pounce', 'prep', 'preparation', 'press', 'preview', 'pride', 'prism', 'pristine', 'pro-opiomelanocortin', 'probate', 'probiotic', 'procure', 'prolactin', 'proopiomelanocortin', 'propolis', 'prosper', 'protamine', 'protanal', 'protein hydrolysate', 'prothrombin', 'prothrombinase', 'protide', 'protio', 'proton', 'provitamin a', 'prowl', 'pser-stat3', 'pseudo', 'psychogenic', 'puerarin', 'pullulan', 'punch', 'pursuit', 'pylon', 'pyrethrum', 'quark', 'quench', 'quite', 'racer', 'radar', 'radio', 'radixin', 'raid', 'raiser', 'rally', 'rampart', 'raptor', 'rather', 'ravage', 'raven', 'reactions', 'reactivity', 'really', 'recoil', 'reconcile', 'recruit', 'redeem', 'redskin', 'reduced hemoglobin', 'redux', 'references', 'regarding', 'regent', 'regulon', 'relax', 'relaxant', 'res', 'resilin', 'resovist', 'restful', 'results', 'retard', 'reticulin', 'revolution', 'rgd peptide', 'rhombic', 'ribonucleic acid', 'rice starch', 'ricin', 'rifle', 'ripcord', 'rival', 'rock', 'rogue', 'rosin', 'rotate', 'roundup', 'rubber', 'rufus', 'rugby', 'rutile', 's100', 'saber', 'saccharina', 'saccharum', 'safari', 'safety', 'saffron', 'saline', 'salix', 'salute', 'samp', 'sanction', 'sceptre', 'schisandra chinensis', 'scopolamine', 'scot', 'scourge', 'scout', 'scpa', 
'scuffle', 'se-selectin', 'section', 'seem', 'seen', 'senna', 'sentry', 'sephadex g-75', 'sephadex lh-20', 'sepharose', 'serum albumin', 'sesame oil', 'several', 'shellac', 'sherpa', 'shiga toxin', 'shogun', 'should', 'show', 'showed', 'shown', 'shows', 'siamycin', 'significantly', 'silence', 'silicone', 'silybum marianum', 'since', 'singlet oxygen', 'sirius', 'slam', 'smack', 'smash', 'smear', 'snap', 'snap-25', 'snip', 'sniper', 'snort', 'snow', 'so', 'soda', 'solo', 'solvent', 'somatostatine', 'some', 'sonar', 'sonata', 'sophia', 'sp1', 'spectrin', 'spiegel', 'spirit', 'splendor', 'spme', 'spotless', 'spotlight', 'sprinkle', 'squad', 'squalamine', 'stability', 'stainless steel', 'stalker', 'stanza', 'staple', 'star', 'starch', 'starches', 'steel', 'stim', 'stipend', 'stomp', 'storm', 'streptavidin', 'strike', 'stuff', 'subdue', 'substance p', 'such', 'sultan', 'summit', 'sunshine', 'supra', 'supreme', 'surfer', 'surpass', 'suspend', 'sv2', 'sword', 'synacthen', 'syntheses', 'synthesis', 'synthol', 'synvisc', 't-47', 't-pa', 't140', 'ta98', 'tabloid', 'tace', 'tackle', 'talin', 'talon', 'tame', 'tara', 'tarragon', 'tattoo', 'taxus', 'tea catechin', 'tea polyphenol', 'teac', 'tell', 'telomerase', 'tenax', 'teriparatide', 'terminator', 'terpolymer', 'test mixture', 'textile', 'than', 'that', 'the', 'the-7', 'their', 'theirs', 'them', 'then', 'there', 'therefore', 'these', 'they', 'thioredoxin', 'thioredoxins', 'this', 'those', 'thrombin', 'thromboplastin', 'through', 'thus', 'thymosin β4', 'thyroglobulin', 'thyroid stimulating hormone', 'tilt', 'tindal', 'titan', 'titus', 'tm-74', 'tnfα', 'to', 'toke', 'tomahawk', 'tonal', 'toot', 'tops', 'torpedo', 'total bilirubin', 'touchdown', 'tough', 'trails', 'tranquil', 'transdermal patch', 'transfer rna', 'transferrin', 'trastuzumab', 'triangle', 'trim', 'triticum', 'triumph', 'trypsin', 'trypsinogen', 'tsar', 'tsst-1', 'tunic', 'turbo', 'turmeric', 'turpentine', 'ubiquinone', 'ubiquitin', 'ubr2', 'ultimate', 'upon', 
'ural', 'uranyl nitrate', 'urea nitrogen', 'urokinase', 'use', 'used', 'using', 'vacate', 'valiant', 'valosin-containing protein', 'vanilla', 'vanquish', 'various', 'vas1', 'vasal', 'vaseline', 'vengeance', 'verdict', 'vermin', 'versed', 'vertex', 'very', 'vicilin', 'vigil', 'vinca', 'vinculin', 'vinegar', 'vishnu', 'visor', 'vitamin', 'vitamins', 'volley', 'vortex', 'wander', 'wang', 'warf', 'was', 'water', 'water vapor', 'water-in-oil', 'waters', 'we', 'were', 'whack', 'what', 'wheat starch', 'when', 'which', 'while', 'whip', 'white light', 'with', 'within', 'without', 'would', 'x close', 'xanthan', 'xanthan gum', 'xanthium', 'xylan', 'xyloglucan', 'yellow', 'yellows', 'zest', 'ziconotide', 'zodiac', 'zymosan', 'Ω127', 'α-lactalbumin', 'α-msh', 'β-endorphin', 'β-nf', 'δr(1)', 'ω', '∑pcbs'}

Disallowed chemical entity mentions (discard if exact case-insensitive match)
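
STOP_TOKENS and STOPLIST differ in scope: the first discards a mention if any single token matches, while the second discards it only if the whole mention matches. The sketch below contrasts the two rules using a handful of entries from each collection. The function names and the choice to join tokens with a single space are assumptions of this example.

```python
STOP_TOKENS = {"journal", "doi", "merck"}        # token-level rule
STOPLIST = {"water", "activated carbon", "the"}  # whole-mention rule


def has_stop_token(tokens):
    """True if any single token is a case-insensitive stop token."""
    return any(token.lower() in STOP_TOKENS for token in tokens)


def is_stoplisted(tokens):
    """True if the entire mention is a case-insensitive stoplist entry."""
    return " ".join(tokens).lower() in STOPLIST


def keep_mention(tokens):
    """Apply both filters to a candidate mention."""
    return not has_stop_token(tokens) and not is_stoplisted(tokens)


for mention in (["Activated", "carbon"], ["carbon", "nanotubes"], ["Merck", "compound"]):
    print(mention, keep_mention(mention))
```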

chemdataextractor.nlp.cem.STOP_RES = ['^(http|ftp)://', '\\.(com|uk|eu|org|net)$', '^\\d{4}-\\d{3}[\\dx]$', '^[\\w\\-\\.\\+%]{4,} @ \\w[\\w\\-\\.]+\\.(com?|edu|gov|ac)(\\.[\\w\\-\\.]+)?$', '^[\\d,:\\- ]*\\d{4,}[\\d,:\\- ]*$', '\\d{3,} , \\d{3,}', '(\\d\\d+\\.\\d+|\\d\\.\\d\\d+)', '\\d and \\d', '^(\\[\\d+\\]\\s*)+$', '^\\d+$', '= \\d', '^\\+?\\d[ \\d-]$', 'cm-1', '^(compound|ligand|chemical|dye|derivative|complex|example|intermediate|product|formulae?)s? [a-z\\d]{1,3}', '(b3lyp|31g\\(d,p\\)|td-dft)', 'et al\\.?$', '^(ep|wo|us)\\s*\\d\\s*\\d\\d[\\d\\s]*([AB]\\d)?($|\\s*and)', '^(pre|post)-\\d\\d\\d\\d', '\\d ml$', '\\.(png|gif|jpg|txt|html|docx?|xlsx?)$', '^(tel|fax)\\s*:?\\s*\\+?\\s*\\d']

Regular expressions that define disallowed chemical entity mentions. Note: the entity text is passed as lowercase.

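
Because the entity text is lowercased before matching, the patterns themselves can stay lowercase. The sketch below applies four of the patterns from STOP_RES in that way. The function name is_disallowed and the use of re.search are assumptions of this example, not a description of the library's code.

```python
import re

# Four patterns copied from STOP_RES.
STOP_RES = [r"^(http|ftp)://", r"^\d+$", r"et al\.?$", r"\d ml$"]
COMPILED = [re.compile(pattern) for pattern in STOP_RES]


def is_disallowed(entity_text):
    """True if the lowercased entity text matches any stop pattern."""
    lowered = entity_text.lower()
    return any(pattern.search(lowered) for pattern in COMPILED)


for text in ("HTTP://example.org", "Smith et al.", "5 mL", "ethanol"):
    print(text, is_disallowed(text))
```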
chemdataextractor.nlp.cem.SPLITS = ['^(actinium|aluminium|aluminum|americium|antimony|argon|arsenic|astatine|barium|berkelium|beryllium|bismuth|bohrium|boron|bromine|cadmium|caesium|calcium|californium|carbon|cerium|cesium|chlorine|chromium|cobalt|copernicium|copper|curium|darmstadtium|dubnium|dysprosium|einsteinium|erbium|europium|fermium|flerovium|fluorine|francium|gadolinium|gallium|germanium|gold|hafnium|hassium|helium|holmium|hydrargyrum|hydrogen|indium|iodine|iridium|iron|kalium|krypton|lanthanum|lawrencium|lead|lithium|livermorium|lutetium|magnesium|manganese|meitnerium|mendelevium|mercury|molybdenum|natrium|neodymium|neon|neptunium|nickel|niobium|nitrogen|nobelium|osmium|oxygen|palladium|phosphorus|platinum|plumbum|plutonium|polonium|potassium|praseodymium|promethium|protactinium|radium|radon|rhenium|rhodium|roentgenium|rubidium|ruthenium|rutherfordium|samarium|scandium|seaborgium|selenium|silicon|silver|sodium|stannum|stibium|strontium|sulfur|tantalum|technetium|tellurium|terbium|thallium|thorium|thulium|tin|titanium|tungsten|ununoctium|ununpentium|ununseptium|ununtrium|uranium|vanadium|wolfram|xenon|ytterbium|yttrium|zinc|zirconium)$', '^(Ag|Al|Ar|Au|Br|Cd|Cl|Co|Cu|Fe|Gd|Ge|Hg|Kr|Li|Mg|Na|Ne|Ni|Pb|Pd|Pt|Ru|Sb|Si|Sn|Ti|Xe|Zn|Zr|Zn)$', '^(iodide|triiodide|nitrite|nitrate)$', '^(graphane|graphene|carbon|silica|glucose)$', '^(sugar|phospate)$', '^(azide|alkyne|alkene|alkane)$', '^(arginine|cysteine|glycine|aspartic acid|glutamate|dopamine|serotonin|acetone|methanol|ethanol|EtOH|MeOH|AcOEt|melatonin|leucine|alanine|histidine|isoleucine|lysine|threonine|tryptophan|nicotine|gentamicin|ATP|FITC|biotin|tamoxifen|catechin|asparagine)$', '^(Ala|Arg|Asn|Asp|Cys|Glu|Gln|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)(?:\\(?\\d+\\)?)?$', '^(\\(?1\\)?H|\\(?1[45]\\)?N|\\(?1[234]\\)?C|\\(?19\\)?F)$', '^(F|Cl|Zn[OS]|H\\(?2\\)?O(\\(?2\\)?)?|Ni\\(OH\\)\\(?2\\)?|(NiF|SnO|TiO|NO)\\(?2\\)?|(Al|Y|Fe)\\(?2\\)?O\\(?3\\)?|CaCO\\(?3\\)?)$', '^(ester|amide)$']

Regular expressions defining collections of words that should be split if joined by hyphens or -to-
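
A minimal sketch of the splitting rule described above: a mention joined by hyphens or "-to-" is split only when every part matches one of the patterns. The function name split_mention is hypothetical, the patterns are shortened versions of those in SPLITS, and case-sensitive matching is an assumption of this example.

```python
import re

# Shortened stand-ins for three of the SPLITS patterns.
SPLITS = [
    r"^(copper|gold|iron|zinc)$",
    r"^(Ag|Au|Cu|Fe|Zn)$",
    r"^(azide|alkyne|alkene|alkane)$",
]
COMPILED = [re.compile(pattern) for pattern in SPLITS]


def split_mention(text):
    """Split on '-to-' or '-' if every resulting part is a known word."""
    parts = re.split(r"-to-|-", text)
    if len(parts) > 1 and all(
        any(pattern.match(part) for pattern in COMPILED) for part in parts
    ):
        return parts
    return [text]


for text in ("azide-alkyne", "iron-to-copper", "tert-butanol"):
    print(text, split_mention(text))
```

A name such as "tert-butanol" stays intact because "tert" and "butanol" are not in any of the listed collections.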

class chemdataextractor.nlp.cem.CiDictCemTagger(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]

Bases: chemdataextractor.nlp.tag.DictionaryTagger

Case-insensitive CEM dictionary tagger.

lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>

model = 'models/cem_dict-1.0.pickle'

class chemdataextractor.nlp.cem.CsDictCemTagger(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]

Bases: chemdataextractor.nlp.tag.DictionaryTagger

Case-sensitive CEM dictionary tagger.

lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>

model = 'models/cem_dict_cs-1.0.pickle'

case_sensitive = True

class chemdataextractor.nlp.cem.CrfCemTagger(model=None, lexicon=None, clusters=None, params=None)[source]

Bases: chemdataextractor.nlp.tag.CrfTagger

model = 'models/cem_crf_chemdner_cemp-1.0.pickle'

lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>

clusters = True

params = {'c1': 1.0, 'c2': 0.001, 'feature.possible_states': False, 'feature.possible_transitions': False, 'max_iterations': 200}

class chemdataextractor.nlp.cem.CemTagger[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Return the combined output of a number of chemical entity taggers.

taggers = [<chemdataextractor.nlp.cem.CrfCemTagger object>, <chemdataextractor.nlp.cem.CiDictCemTagger object>, <chemdataextractor.nlp.cem.CsDictCemTagger object>]

The individual chemical entity taggers to use.

lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>

tag(tokens)[source]

Run individual chemical entity mention taggers and return union of matches, with some postprocessing.
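
The "union of matches" can be pictured as merging per-token tags from several taggers, keeping the first non-empty tag for each token. The sketch below shows only that merge step, without any postprocessing. The function name combine_tags, the tag labels "B-CM" and "I-CM", and the precedence by tagger order are assumptions of this example.

```python
def combine_tags(tokens, tagger_outputs):
    """Merge per-token tag lists; earlier taggers take precedence."""
    combined = [None] * len(tokens)
    for output in tagger_outputs:
        for i, tag in enumerate(output):
            if combined[i] is None and tag is not None:
                combined[i] = tag
    return list(zip(tokens, combined))


tokens = ["Dissolve", "sodium", "chloride", "in", "EtOH"]
crf_tags = [None, "B-CM", "I-CM", None, None]   # stand-in for a CRF tagger
dict_tags = [None, None, None, None, "B-CM"]    # stand-in for a dictionary tagger
print(combine_tags(tokens, [crf_tags, dict_tags]))
```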

.nlp.corpus

Tools for reading and writing text corpora.

class chemdataextractor.nlp.corpus.LazyCorpusLoader(name, reader_cls, *args, **kwargs)[source]

Bases: object

Derived from NLTK LazyCorpusLoader.

__init__(name, reader_cls, *args, **kwargs)[source]

Initialize self. See help(type(self)) for accurate signature.

chemdataextractor.nlp.corpus.wsj = <BracketParseCorpusReader in '.../corpora/wsj_training' (not loaded yet)>

Entire WSJ corpus (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
chemdataextractor.nlp.corpus.wsj_training = <BracketParseCorpusReader in '.../corpora/wsj_training' (not loaded yet)>

WSJ corpus sections 0-18 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
chemdataextractor.nlp.corpus.wsj_development = <BracketParseCorpusReader in '.../corpora/wsj_development' (not loaded yet)>

WSJ corpus sections 19-21 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
chemdataextractor.nlp.corpus.wsj_evaluation = <BracketParseCorpusReader in '.../corpora/wsj_evaluation' (not loaded yet)>

WSJ corpus sections 22-24 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
chemdataextractor.nlp.corpus.treebank2_training = <ChunkedCorpusReader in '.../corpora/treebank2_training' (not loaded yet)>

WSJ corpus sections 0-18 (treebank2)

chemdataextractor.nlp.corpus.treebank2_development = <ChunkedCorpusReader in '.../corpora/treebank2_development' (not loaded yet)>

WSJ corpus sections 19-21 (treebank2)

chemdataextractor.nlp.corpus.treebank2_evaluation = <ChunkedCorpusReader in '.../corpora/treebank2_evaluation' (not loaded yet)>

WSJ corpus sections 22-24 (treebank2)

chemdataextractor.nlp.corpus.genia_training = <TaggedCorpusReader in '.../corpora/genia_training' (not loaded yet)>

First 80% of GENIA POS-tagged corpus

chemdataextractor.nlp.corpus.genia_evaluation = <TaggedCorpusReader in '.../corpora/genia_evaluation' (not loaded yet)>

Last 20% of GENIA POS-tagged corpus

chemdataextractor.nlp.corpus.medpost = <TaggedCorpusReader in '.../corpora/medpost' (not loaded yet)>
chemdataextractor.nlp.corpus.medpost_training = <TaggedCorpusReader in '.../corpora/medpost_training' (not loaded yet)>
chemdataextractor.nlp.corpus.medpost_evaluation = <TaggedCorpusReader in '.../corpora/medpost_evaluation' (not loaded yet)>
chemdataextractor.nlp.corpus.cde_tokensc = <PlaintextCorpusReader in '.../corpora/cde_tokensc' (not loaded yet)>
chemdataextractor.nlp.corpus.chemdner_training = <PlaintextCorpusReader in '.../corpora/chemdner_training' (not loaded yet)>

.nlp.lexicon

Cache features of previously seen words.

class chemdataextractor.nlp.lexicon.Lexeme(text, normalized, lower, first, suffix, shape, length, upper_count, lower_count, digit_count, is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title, is_punct, is_hyphenated, like_url, like_number, cluster)[source]

Bases: object

__init__(text, normalized, lower, first, suffix, shape, length, upper_count, lower_count, digit_count, is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title, is_punct, is_hyphenated, like_url, like_number, cluster)[source]

Initialize self. See help(type(self)) for accurate signature.

text

Original Lexeme text.

cluster

The Brown Word Cluster for this Lexeme.

normalized

Normalized text, using the Lexicon Normalizer.

lower

Lowercase text.

first

First character.

suffix

Three-character suffix.

shape

Word shape. Derived by replacing every number with ‘d’, every Greek letter with ‘g’, and every Latin letter with ‘X’ (uppercase) or ‘x’ (lowercase).
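A word shape of this kind could be computed roughly as follows. This is a sketch under assumptions: the function name `word_shape` is made up, replacement is done character by character, and Greek letters are detected through the Unicode Greek and Coptic block; the library may differ in these details.

```python
def word_shape(text):
    """Hypothetical re-implementation of the shape feature described above."""
    shape = []
    for ch in text:
        if ch.isdigit():
            shape.append('d')
        elif '\u0370' <= ch <= '\u03ff':   # Greek and Coptic block (assumption)
            shape.append('g')
        elif ch.isascii() and ch.isalpha():
            shape.append('X' if ch.isupper() else 'x')
        else:
            shape.append(ch)               # punctuation etc. is kept as-is
    return ''.join(shape)
```

Under these assumptions `H2O` maps to `XdX`, and hyphens or brackets pass through unchanged.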

length

Lexeme length.

upper_count

Count of uppercase characters.

lower_count

Count of lowercase characters.

digit_count

Count of digits.

is_alpha

Whether the text is entirely alphabetical characters.

is_ascii

Whether the text is entirely ASCII characters.

is_digit

Whether the text is entirely digits.

is_lower

Whether the text is entirely lowercase.

is_upper

Whether the text is entirely uppercase.

is_title

Whether the text is title cased.

is_punct

Whether the text is entirely punctuation characters.

is_hyphenated

Whether the text is hyphenated.

like_url

Whether the text looks like a URL.

like_number

Whether the text looks like a number.

class chemdataextractor.nlp.lexicon.Lexicon[source]

Bases: object

normalizer = <chemdataextractor.text.normalize.Normalizer object>

The Normalizer for this Lexicon.

clusters_path = None

Path to the Brown clusters model file for this Lexicon.

__init__()[source]
add(text)[source]

Add text to the lexicon.

Parameters:text (string) – The text to add.
cluster(text)[source]
normalized(text)[source]
lower(text)[source]
first(text)[source]
suffix(text)[source]
shape(text)[source]
length(text)[source]
digit_count(text)[source]
upper_count(text)[source]
lower_count(text)[source]
is_alpha(text)[source]
is_ascii(text)[source]
is_digit(text)[source]
is_lower(text)[source]
is_upper(text)[source]
is_title(text)[source]
is_punct(text)[source]
is_hyphenated(text)[source]
like_url(text)[source]
like_number(text)[source]
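The feature-caching idea behind the lexicon can be illustrated with a small stand-in class. `TinyLexicon` is hypothetical, stores plain dictionaries rather than Lexeme objects, and covers only a handful of the features listed above.

```python
class TinyLexicon:
    """Hypothetical cache of word features, computed once per distinct word."""

    def __init__(self):
        self._cache = {}

    def add(self, text):
        # Compute features only the first time a word is seen.
        if text not in self._cache:
            self._cache[text] = {
                'lower': text.lower(),
                'first': text[:1],
                'suffix': text[-3:],
                'length': len(text),
                'digit_count': sum(ch.isdigit() for ch in text),
                'is_title': text.istitle(),
            }
        return self._cache[text]

    def suffix(self, text):
        return self.add(text)['suffix']

    def length(self, text):
        return self.add(text)['length']

    def digit_count(self, text):
        return self.add(text)['digit_count']


lexicon = TinyLexicon()
```

Each accessor goes through `add`, so looking up a feature for an unseen word populates the cache as a side effect.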
class chemdataextractor.nlp.lexicon.ChemLexicon[source]

Bases: chemdataextractor.nlp.lexicon.Lexicon

A Lexicon that is pre-configured with a Chemistry-aware Normalizer and Brown word clusters derived from a chemistry corpus.

normalizer = <chemdataextractor.text.normalize.ChemNormalizer object>
clusters_path = 'models/clusters_chem1500-1.0.pickle'

.nlp.pos

Part-of-speech tagging.

chemdataextractor.nlp.pos.TAGS = ['NN', 'IN', 'NNP', 'DT', 'NNS', 'JJ', ',', '.', 'CD', 'RB', 'VBD', 'VB', 'CC', 'VBN', 'VBZ', 'PRP', 'VBG', 'TO', 'VBP', 'HYPH', 'MD', 'POS', 'PRP$', '$', '``', "''", ':', 'WDT', 'JJR', 'RP', 'NNPS', 'WP', 'WRB', 'RBR', 'JJS', '-RRB-', '-LRB-', 'EX', 'RBS', 'PDT', 'SYM', 'FW', 'WP$', 'UH', 'LS', 'NFP', 'AFX']

Complete set of POS tags. Ordered by decreasing frequency in WSJ corpus.

class chemdataextractor.nlp.pos.ApPosTagger(model=None, lexicon=None, clusters=None)[source]

Bases: chemdataextractor.nlp.tag.ApTagger

Greedy Averaged Perceptron POS tagger trained on WSJ corpus.

model = 'models/pos_ap_wsj_nocluster-1.0.pickle'
clusters = False
class chemdataextractor.nlp.pos.ChemApPosTagger(model=None, lexicon=None, clusters=None)[source]

Bases: chemdataextractor.nlp.pos.ApPosTagger

Greedy Averaged Perceptron POS tagger trained on both WSJ and GENIA corpora.

Uses features based on word clusters from chemistry text.

model = 'models/pos_ap_wsj_genia-1.0.pickle'
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
clusters = True
class chemdataextractor.nlp.pos.CrfPosTagger(model=None, lexicon=None, clusters=None, params=None)[source]

Bases: chemdataextractor.nlp.tag.CrfTagger

model = 'models/pos_crf_wsj_nocluster-1.0.pickle'
clusters = False
class chemdataextractor.nlp.pos.ChemCrfPosTagger(model=None, lexicon=None, clusters=None, params=None)[source]

Bases: chemdataextractor.nlp.pos.CrfPosTagger

model = 'models/pos_crf_wsj_genia-1.0.pickle'
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
clusters = True

.nlp.tag

Tagger implementations. Used for part-of-speech tagging and named entity recognition.

class chemdataextractor.nlp.tag.BaseTagger[source]

Bases: object

Abstract tagger class from which all taggers inherit.

Subclasses must implement a tag() method.

tag(tokens)[source]

Return a list of (token, tag) tuples for the given list of token strings.

Parameters:tokens (list(str)) – The list of tokens to tag.
Return type:list(tuple(str, str))
tag_sents(sentences)[source]

Apply the tag method to each sentence in sentences.

evaluate(gold)[source]

Evaluate the accuracy of this tagger using a gold standard corpus.

Parameters:gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
Returns:Tagger accuracy value.
Return type:float
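The relationship between tag(), tag_sents() and evaluate() can be sketched as below. The class names are hypothetical, and token-level accuracy (correct tags divided by total tags) is an assumption about how the score is defined.

```python
class SketchTagger:
    """Hypothetical base class mirroring the three methods described above."""

    def tag(self, tokens):
        raise NotImplementedError

    def tag_sents(self, sentences):
        return [self.tag(sentence) for sentence in sentences]

    def evaluate(self, gold):
        # Strip the gold tags, re-tag the tokens, then compare tag by tag.
        token_lists = [[token for token, _ in sent] for sent in gold]
        predicted = self.tag_sents(token_lists)
        gold_tags = [tag for sent in gold for _, tag in sent]
        pred_tags = [tag for sent in predicted for _, tag in sent]
        correct = sum(g == p for g, p in zip(gold_tags, pred_tags))
        return correct / len(gold_tags) if gold_tags else 0.0


class AlwaysNounTagger(SketchTagger):
    """Toy subclass that tags every token as 'NN'."""

    def tag(self, tokens):
        return [(token, 'NN') for token in tokens]


gold = [[('Heat', 'VB'), ('water', 'NN')], [('ethanol', 'NN'), ('boils', 'VBZ')]]
accuracy = AlwaysNounTagger().evaluate(gold)
```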
class chemdataextractor.nlp.tag.NoneTagger[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Tag every token with None.

tag(tokens)[source]
class chemdataextractor.nlp.tag.RegexTagger(patterns=None, lexicon=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Regular Expression Tagger.

__init__(patterns=None, lexicon=None)[source]
Parameters:patterns (list(tuple(string, string))) – List of (regex, tag) pairs.
patterns = [('^-?[0-9]+(.[0-9]+)?$', 'CD'), ('(The|the|A|a|An|an)$', 'AT'), ('.*able$', 'JJ'), ('.*ness$', 'NN'), ('.*ly$', 'RB'), ('.*s$', 'NNS'), ('.*ing$', 'VBG'), ('.*ed$', 'VBD'), ('.*', 'NN')]

Regular expression patterns in (regex, tag) tuples.

lexicon = <chemdataextractor.nlp.lexicon.Lexicon object>

The lexicon to use.

tag(tokens)[source]

Return a list of (token, tag) tuples for a given list of tokens.
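A regular-expression tagger over the default patterns listed above might work along these lines. The function `regex_tag` is a hypothetical stand-in, and two points are assumptions: patterns are tried in order with the first match winning, and matching is anchored at the start of the token via `re.match`.

```python
import re

# The default (regex, tag) pairs from the patterns attribute above.
PATTERNS = [
    (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),
    (r'(The|the|A|a|An|an)$', 'AT'),
    (r'.*able$', 'JJ'),
    (r'.*ness$', 'NN'),
    (r'.*ly$', 'RB'),
    (r'.*s$', 'NNS'),
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*', 'NN'),
]


def regex_tag(tokens, patterns=PATTERNS):
    """Hypothetical helper: tag each token with the first matching pattern."""
    compiled = [(re.compile(regex), tag) for regex, tag in patterns]
    tagged = []
    for token in tokens:
        tag = next((t for rx, t in compiled if rx.match(token)), None)
        tagged.append((token, tag))
    return tagged
```

Because the final pattern `.*` matches anything, every token receives a tag, with 'NN' acting as the fallback.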

class chemdataextractor.nlp.tag.AveragedPerceptron[source]

Bases: object

Averaged Perceptron implementation.

Based on implementation by Matthew Honnibal, released under the MIT license.

See more:
  • http://spacy.io/blog/part-of-speech-POS-tagger-in-python/
  • https://github.com/sloria/textblob-aptagger
__init__()[source]

Initialize self. See help(type(self)) for accurate signature.

predict(features)[source]

Dot-product the features and current weights and return the best label.

update(truth, guess, features)[source]

Update the feature weights.

average_weights()[source]

Average weights from all iterations.

save(path)[source]

Save the pickled model weights.

load(path)[source]

Load the pickled model weights.
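For orientation, a compact averaged perceptron in the spirit of the predict, update and average_weights methods above is sketched here. It is not the library's code: `TinyAveragedPerceptron` is a hypothetical class, features are a dict of name to value, and ties are broken by label name.

```python
from collections import defaultdict


class TinyAveragedPerceptron:
    """Hypothetical minimal averaged perceptron (illustration only)."""

    def __init__(self):
        self.weights = {}                  # feature -> {label: weight}
        self.classes = set()
        self._totals = defaultdict(float)  # accumulated weight per (feature, label)
        self._tstamps = defaultdict(int)   # step of last change per (feature, label)
        self.i = 0                         # number of updates seen

    def predict(self, features):
        scores = defaultdict(float)
        for feat, value in features.items():
            for label, weight in self.weights.get(feat, {}).items():
                scores[label] += value * weight
        # Highest score wins; the label name breaks ties deterministically.
        return max(self.classes, key=lambda label: (scores[label], label))

    def update(self, truth, guess, features):
        self.i += 1
        if truth == guess:
            return
        for feat in features:
            weights = self.weights.setdefault(feat, {})
            for label, delta in ((truth, 1.0), (guess, -1.0)):
                key = (feat, label)
                current = weights.get(label, 0.0)
                # Credit the old weight for the steps it was in effect.
                self._totals[key] += (self.i - self._tstamps[key]) * current
                self._tstamps[key] = self.i
                weights[label] = current + delta

    def average_weights(self):
        for feat, weights in self.weights.items():
            for label, weight in weights.items():
                key = (feat, label)
                total = self._totals[key] + (self.i - self._tstamps[key]) * weight
                weights[label] = total / self.i if self.i else weight


model = TinyAveragedPerceptron()
model.classes = {'NN', 'VB'}
data = [({'suffix=ion': 1}, 'NN'), ({'suffix=ate': 1}, 'VB')]
for _ in range(5):
    for feats, truth in data:
        model.update(truth, model.predict(feats), feats)
model.average_weights()
```

Averaging replaces each weight with its mean over all update steps, which tends to make the final model less sensitive to the order of the training examples.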

class chemdataextractor.nlp.tag.ApTagger(model=None, lexicon=None, clusters=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Greedy Averaged Perceptron tagger, based on implementation by Matthew Honnibal, released under the MIT license.

See more:
  • http://spacy.io/blog/part-of-speech-POS-tagger-in-python/
  • https://github.com/sloria/textblob-aptagger
START = ['-START-', '-START2-']
__init__(model=None, lexicon=None, clusters=None)[source]
lexicon = <chemdataextractor.nlp.lexicon.Lexicon object>
clusters = False
tag(tokens)[source]

Return a list of (token, tag) tuples for a given list of tokens.

train(sentences, nr_iter=5)[source]

Train a model from sentences.

Parameters:
  • sentences – A list of sentences, each of which is a list of (token, tag) tuples.
  • nr_iter – Number of training iterations.
save(f)[source]

Save pickled model to file.

load(model)[source]

Load pickled model.

class chemdataextractor.nlp.tag.CrfTagger(model=None, lexicon=None, clusters=None, params=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Tagger that uses Conditional Random Fields (CRF).

__init__(model=None, lexicon=None, clusters=None, params=None)[source]
lexicon = <chemdataextractor.nlp.lexicon.Lexicon object>
clusters = False
params = {'c1': 1.0, 'c2': 0.001, 'feature.possible_states': False, 'feature.possible_transitions': False, 'max_iterations': 50}

Parameters to pass to training algorithm. See http://www.chokkan.org/software/crfsuite/manual.html
load(model)[source]
tag(tokens)[source]

Return a list of ((token, tag), label) tuples for a given list of (token, tag) tuples.

train(sentences, model)[source]

Train the CRF tagger using CRFSuite.

Parameters:
  • sentences – Annotated sentences.
  • model – Path to save pickled model.
class chemdataextractor.nlp.tag.DictionaryTagger(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]

Bases: chemdataextractor.nlp.tag.BaseTagger

Dictionary Tagger. Tag tokens based on inclusion in a DAWG.

delimiters = re.compile('(^.|\\b|\\s|\\W|.$)')

Delimiters that define where matches are allowed to start or end.

__init__(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]
Parameters:words (list(list(string))) – list of words, each of which is a list of tokens.
model = None

DAWG model file path.

entity = 'CM'

Entity tag. Matches will be tagged like ‘B-CM’ and ‘I-CM’ according to IOB scheme. TODO: Optional no B/I?
case_sensitive = False

Whether dictionary matches are case sensitive.

lexicon = <chemdataextractor.nlp.lexicon.Lexicon object>

The lexicon to use.

load(model)[source]

Load pickled DAWG from disk.

save(path)[source]

Save pickled DAWG to disk.

build(words)[source]

Construct dictionary DAWG from tokenized words.

tag(tokens)[source]

Return a list of (token, tag) tuples for a given list of tokens.
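The IOB tagging behaviour of a dictionary tagger can be approximated with the sketch below. It substitutes a plain Python set of token tuples for the DAWG, and it assumes that the longest match starting at each position wins; `dictionary_tag` is a hypothetical function, not part of the library.

```python
def dictionary_tag(tokens, entries, entity='CM', case_sensitive=False):
    """Hypothetical helper: IOB-tag token sequences found in entries."""
    def norm(token):
        return token if case_sensitive else token.lower()

    known = {tuple(norm(t) for t in entry) for entry in entries}
    max_len = max((len(entry) for entry in known), default=0)
    tags = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so multi-token names win.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(norm(t) for t in tokens[i:i + n]) in known:
                tags[i] = 'B-' + entity
                for j in range(i + 1, i + n):
                    tags[j] = 'I-' + entity
                i += n
                break
        else:
            i += 1
    return list(zip(tokens, tags))


words = [['sodium', 'chloride'], ['ethanol']]
tagged = dictionary_tag(['Add', 'sodium', 'chloride', 'and', 'Ethanol'], words)
```

Passing `case_sensitive=True` would leave `Ethanol` untagged in this example, since only the lowercase form is in the word list.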

.nlp.tokenize

Word and sentence tokenizers.

class chemdataextractor.nlp.tokenize.BaseTokenizer[source]

Bases: object

Abstract base class from which all Tokenizer classes inherit.

Subclasses must implement a span_tokenize(text) method that returns a list of integer offset tuples that identify tokens in the text.

tokenize(s)[source]

Return a list of token strings from the given sentence.

Parameters:s (string) – The sentence string to tokenize.
Return type:iter(str)
span_tokenize(s)[source]

Return a list of integer offsets that identify tokens in the given sentence.

Parameters:s (string) – The sentence string to tokenize.
Return type:iter(tuple(int, int))
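The contract between the two methods can be sketched as follows: a subclass supplies span_tokenize, and tokenize slices the input with those spans. Both class names are hypothetical, and the whitespace-based subclass exists only to make the example concrete.

```python
class SketchTokenizer:
    """Hypothetical base class: tokenize() is derived from span_tokenize()."""

    def span_tokenize(self, s):
        raise NotImplementedError

    def tokenize(self, s):
        return [s[start:end] for start, end in self.span_tokenize(s)]


class WhitespaceSketchTokenizer(SketchTokenizer):
    """Toy subclass that treats runs of non-whitespace as tokens."""

    def span_tokenize(self, s):
        spans = []
        start = None
        for i, ch in enumerate(s):
            if ch.isspace():
                if start is not None:
                    spans.append((start, i))
                    start = None
            elif start is None:
                start = i
        if start is not None:
            spans.append((start, len(s)))
        return spans


tokenizer = WhitespaceSketchTokenizer()
```

Working with offsets rather than strings keeps every token traceable to its position in the original text.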
chemdataextractor.nlp.tokenize.regex_span_tokenize(s, regex)[source]

Return spans that identify tokens in s split using regex.
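One way to produce such spans is to treat every regex match as a separator and yield the gaps between matches. The generator below is a sketch along those lines; `regex_spans` is a hypothetical name, and skipping empty spans is an assumption about the intended behaviour.

```python
import re


def regex_spans(s, regex):
    """Hypothetical helper: yield (start, end) spans of text between matches."""
    left = 0
    for match in re.finditer(regex, s):
        right, next_left = match.span()
        if right != left:
            yield left, right
        left = next_left
    # Emit whatever follows the last separator, if anything.
    if left != len(s):
        yield left, len(s)


spans = list(regex_spans('a b  c', r'\s+'))
```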

class chemdataextractor.nlp.tokenize.SentenceTokenizer(model=None)[source]

Bases: chemdataextractor.nlp.tokenize.BaseTokenizer

Sentence tokenizer that uses the Punkt algorithm by Kiss & Strunk (2006).

__init__(model=None)[source]

Initialize self. See help(type(self)) for accurate signature.

model = 'models/punkt_english.pickle'
get_sentences(text)[source]
span_tokenize(s)[source]

Return a list of integer offsets that identify sentences in the given text.

Parameters:s (string) – The text to tokenize into sentences.
Return type:iter(tuple(int, int))
class chemdataextractor.nlp.tokenize.ChemSentenceTokenizer(model=None)[source]

Bases: chemdataextractor.nlp.tokenize.SentenceTokenizer

Sentence tokenizer that uses the Punkt algorithm by Kiss & Strunk (2006), trained on chemistry text.

model = 'models/punkt_chem-1.0.pickle'
class chemdataextractor.nlp.tokenize.WordTokenizer(split_last_stop=True)[source]

Bases: chemdataextractor.nlp.tokenize.BaseTokenizer

Standard word tokenizer for generic English text.

SPLIT = ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '``', "''", '->', '<', '>', '–', '—', '―', '~', '⁓', '∼', '°', ';', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '→', '⇄', '"', '“', '”', '„', '‟', '‘', '‚', '‛', '`', '´', '′', '″', '‴', '‵', '‶', '‷', '⁗', '(', '[', '{', '}', ']', ')', '/', '⁄', '∕', '−', '‒', '+', '±']

Split before and after these sequences, wherever they occur, unless entire token is one of these sequences

SPLIT_NO_DIGIT = [':', ',']

Split around these sequences unless they are followed by a digit
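The digit exception keeps values such as `1,000` or `10:30` intact. A sketch of the rule using a negative lookahead is shown here; `split_no_digit` is a hypothetical helper and not the tokenizer's own code.

```python
import re

SPLIT_NO_DIGIT = [':', ',']


def split_no_digit(token):
    """Hypothetical helper: split around ':' and ',' unless a digit follows."""
    alternatives = '|'.join(re.escape(seq) for seq in SPLIT_NO_DIGIT)
    pattern = '(' + alternatives + r')(?!\d)'
    # The capturing group keeps the separators as their own pieces.
    return [piece for piece in re.split(pattern, token) if piece]
```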

SPLIT_START_WORD = ["''", '``', "'"]

Split after these sequences if they start a word

SPLIT_END_WORD = ["'s", "'m", "'d", "'ll", "'re", "'ve", "n't", "''", "'", '’s', '’m', '’d', '’ll', '’re', '’ve', 'n’t', '’', '’’']

Split before these sequences if they end a word

NO_SPLIT_STOP = ['...', 'al.', 'Co.', 'Ltd.', 'Pvt.', 'A.D.', 'B.C.', 'B.V.', 'S.D.', 'U.K.', 'U.S.', 'r.t.']

Don’t split full stop off last token if it is one of these sequences

CONTRACTIONS = [('cannot', 3), ("d'ye", 1), ('d’ye', 1), ('gimme', 3), ('gonna', 3), ('gotta', 3), ('lemme', 3), ("mor'n", 3), ('mor’n', 3), ('wanna', 3), ("'tis", 2), ("'twas", 2)]

Split these contractions at the specified index
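Each entry pairs a contraction with the character index at which it is cut. A minimal sketch, assuming case-insensitive lookup and using a reduced list, follows; `split_contraction` is a hypothetical helper.

```python
# Subset of the CONTRACTIONS entries listed above.
CONTRACTIONS = [('cannot', 3), ("d'ye", 1), ('gimme', 3), ('gonna', 3), ("'tis", 2)]


def split_contraction(token):
    """Hypothetical helper: cut a known contraction at its stored index."""
    for contraction, index in CONTRACTIONS:
        if token.lower() == contraction:
            return [token[:index], token[index:]]
    return [token]
```

For `cannot` with index 3 this gives `can` and `not`; tokens that are not in the list are returned unchanged.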

NO_SPLIT = {'mm-hm', 'mm-mm', 'o-kay', 'uh-huh', 'uh-oh', 'wanna-be'}

Don’t split these sequences.

NO_SPLIT_PREFIX = {'a', 'agro', 'ante', 'anti', 'arch', 'be', 'bi', 'bio', 'co', 'counter', 'cross', 'cyber', 'de', 'e', 'eco', 'ex', 'extra', 'inter', 'intra', 'macro', 'mega', 'micro', 'mid', 'mini', 'multi', 'neo', 'non', 'over', 'pan', 'para', 'peri', 'post', 'pre', 'pro', 'pseudo', 'quasi', 're', 'semi', 'sub', 'super', 'tri', 'u', 'ultra', 'un', 'uni', 'vice', 'x'}

Don’t split around hyphens with these prefixes

NO_SPLIT_SUFFIX = {'-o-torium', 'esque', 'ette', 'fest', 'fold', 'gate', 'itis', 'less', 'most', 'rama', 'wise'}

Don’t split around hyphens with these suffixes.

NO_SPLIT_CHARS = '0123456789,\'"“”„‟‘’‚‛`´′″‴‵‶‷⁗'

Don’t split around hyphens if only these characters before or after.

__init__(split_last_stop=True)[source]

Initialize self. See help(type(self)) for accurate signature.

split_last_stop = None

Whether to split off the final full stop (unless preceded by NO_SPLIT_STOP). Default True.
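The interaction between split_last_stop and NO_SPLIT_STOP can be sketched as below. The helper `split_final_stop` is hypothetical, operates on an already tokenized sentence, and assumes the exception applies when the whole last token equals one of the listed sequences.

```python
NO_SPLIT_STOP = ['...', 'al.', 'Co.', 'Ltd.', 'Pvt.', 'A.D.', 'B.C.',
                 'B.V.', 'S.D.', 'U.K.', 'U.S.', 'r.t.']


def split_final_stop(tokens, split_last_stop=True):
    """Hypothetical helper: detach a trailing full stop from the last token."""
    if not split_last_stop or not tokens:
        return list(tokens)
    last = tokens[-1]
    if len(last) > 1 and last.endswith('.') and last not in NO_SPLIT_STOP:
        return list(tokens[:-1]) + [last[:-1], '.']
    return list(tokens)
```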

get_word_tokens(sentence, additional_regex=None)[source]
get_additional_regex(sentence)[source]
handle_additional_regex(s, span, nextspan, additional_regex)[source]
span_tokenize(s, additional_regex=None)[source]
class chemdataextractor.nlp.tokenize.ChemWordTokenizer(split_last_stop=True)[source]

Bases: chemdataextractor.nlp.tokenize.WordTokenizer

Word Tokenizer for chemistry text.

SPLIT = ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '<', ').', '.(', '–', '—', '―', '~', '⁓', '∼', '°', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '⇄', '"', '“', '”', '„', '‟', '‘', '‚', '‛', '`', '´']

Split before and after these sequences, wherever they occur, unless entire token is one of these sequences

SPLIT_END = [':', ',', '(TM)', '(R)', '(®)', '(™)', '(■)', '(◼)', '(●)', '(▲)', '(○)', '(◆)', '(▼)', '(⧫)', '(△)', '(◇)', '(▽)', '(⬚)', '(×)', '(□)', '(•)', '’', '°C']

Split before these sequences if they end a token

SPLIT_END_NO_DIGIT = ['(aq)', '(aq.)', '(s)', '(l)', '(g)']

Split before these sequences if they end a token, unless preceded by a digit

NO_SPLIT_SLASH = ['+', '-', '−']

Don’t split around slash when both preceded and followed by these characters

QUANTITY_RE = re.compile('^((?P<split>\\d\\d\\d)g|(?P<_split1>[-−]?\\d+\\.\\d+|10[-−]\\d+)(g|s|m|N|V)([-−]?[1-4])?|(?P<_split2>\\d*[-−]?\\d+\\.?\\d*)([pnµμm]A|[µμmk]g|[kM]J|m[lL]|[nµμm]?M|[nµμmc]m|kN|[mk]V|[mkMG]?W|[mnpμµ]s|H)

Regular expression that matches a numeric quantity with units

NO_SPLIT_PREFIX_ENDING = re.compile('(^\\(.*\\)|^[\\d,\'"“”„‟‘’‚‛`´′″‴‵‶‷⁗Α-Ωα-ω]+|ano|ato|azo|boc|bromo|cbz|chloro|eno|fluoro|fmoc|ido|ino|io|iodo|mercapto|nitro|ono|oso|oxalo|oxo|oxy|phospho|telluro|tms|yl|ylen|ylene|yliden|ylidene|yl)

Don’t split on hyphen if the prefix matches this regular expression

NO_SPLIT_CHEM = re.compile('([\\-α-ω]|\\d+,\\d+|\\d+[A-Z]|^d\\d\\d?$|acetic|acetyl|acid|acyl|anol|azo|benz|bromo|carb|cbz|chlor|cyclo|ethan|ethyl|fluoro|fmoc|gluc|hydro|idyl|indol|iene|ione|iodo|mercapto|n,n|nitro|noic|o,o|oxal, re.IGNORECASE)

Don’t split on hyphen if prefix or suffix match this regular expression

NO_SPLIT_PREFIX = {'a', 'aci', 'adeno', 'agro', 'aldehydo', 'allo', 'alpha', 'altro', 'ambi', 'ante', 'anti', 'aorto', 'arachno', 'arch', 'as', 'be', 'beta', 'bi', 'bio', 'bis', 'catena', 'centi', 'chi', 'chiro', 'circum', 'cis', 'closo', 'co', 'colo', 'conjuncto', 'conta', 'contra', 'cortico', 'cosa', 'counter', 'cran', 'cross', 'crypto', 'cyber', 'cyclo', 'de', 'deca', 'deci', 'delta', 'demi', 'di', 'dis', 'dl', 'e', 'eco', 'electro', 'endo', 'ennea', 'ent', 'epi', 'epsilon', 'erythro', 'eta', 'ex', 'exo', 'extra', 'ferro', 'galacto', 'gamma', 'gastro', 'giga', 'gluco', 'glycero', 'graft', 'gulo', 'hemi', 'hepta', 'hexa', 'homo', 'hydro', 'hypho', 'hypo', 'ideo', 'idio', 'in', 'infra', 'inter', 'intra', 'iota', 'iso', 'judeo', 'kappa', 'keto', 'kis', 'lambda', 'lyxo', 'macro', 'manno', 'medi', 'mega', 'meso', 'meta', 'micro', 'mid', 'milli', 'mini', 'mono', 'mu', 'muco', 'multi', 'musculo', 'myo', 'nano', 'neo', 'neuro', 'nido', 'nitro', 'non', 'nona', 'nor', 'novem', 'novi', 'nu', 'octa', 'octi', 'octo', 'omega', 'omicron', 'ortho', 'over', 'paleo', 'pan', 'para', 'pelvi', 'penta', 'peri', 'pheno', 'phi', 'pi', 'pica', 'pneumo', 'poly', 'post', 'pre', 'preter', 'pro', 'pseudo', 'psi', 'quadri', 'quasi', 'quater', 'quinque', 're', 'recto', 'rho', 'ribo', 'salpingo', 'scyllo', 'sec', 'semi', 'sept', 'septi', 'sero', 'sesqui', 'sexi', 'sigma', 'sn', 'soci', 'sub', 'super', 'supra', 'sur', 'sym', 'syn', 'talo', 'tau', 'tele', 'ter', 'tera', 'tert', 'tetra', 'theta', 'threo', 'trans', 'tri', 'triangulo', 'tris', 'u', 'uber', 'ultra', 'un', 'uni', 'unsym', 'upsilon', 'veno', 'ventriculo', 'vice', 'x', 'xi', 'xylo', 'zeta'}

Don’t split on hyphen if the prefix is one of these sequences

SPLIT_SUFFIX = {'absorption', 'abstinent', 'abstraction', 'abuse', 'accelerated', 'accepting', 'acclimated', 'acclimation', 'acid', 'activated', 'activation', 'active', 'activity', 'addition', 'adducted', 'adducts', 'adequate', 'adjusted', 'administrated', 'adsorption', 'affected', 'aged', 'alcohol', 'alcoholic', 'algae', 'alginate', 'alkaline', 'alkylated', 'alkylation', 'alkyne', 'analogous', 'anesthetized', 'appended', 'armed', 'aromatic', 'assay', 'assemblages', 'assisted', 'associated', 'atom', 'atoms', 'attenuated', 'attributed', 'backbone', 'base', 'based', 'bearing', 'benzylation', 'binding', 'biomolecule', 'biotic', 'blocking', 'blood', 'bond', 'bonded', 'bonding', 'bonds', 'boosted', 'bottle', 'bottled', 'bound', 'bridge', 'bridged', 'buffer', 'buffered', 'caged', 'cane', 'capped', 'capturing', 'carrier', 'carrying', 'catalysed', 'catalyzed', 'cation', 'caused', 'centered', 'challenged', 'chelating', 'cleaving', 'coated', 'coating', 'coenzyme', 'competing', 'competitive', 'complex', 'complexes', 'compound', 'compounds', 'concentration', 'conditioned', 'conditions', 'conducting', 'configuration', 'confirmed', 'conjugate', 'conjugated', 'conjugates', 'connectivity', 'consuming', 'contained', 'containing', 'contaminated', 'control', 'converting', 'coordinate', 'coordinated', 'copolymer', 'copolymers', 'core', 'cored', 'cotransport', 'coupled', 'covered', 'crosslinked', 'cyclized', 'damaged', 'dealkylation', 'decocted', 'decorated', 'deethylation', 'deficiency', 'deficient', 'defined', 'degrading', 'demethylated', 'demethylation', 'dendrimer', 'density', 'dependant', 'dependence', 'dependent', 'deplete', 'depleted', 'depleting', 'depletion', 'depolarization', 'depolarized', 'deprived', 'derivatised', 'derivative', 'derivatives', 'derivatized', 'derived', 'desorption', 'detected', 'devalued', 'dextran', 'dextrans', 'diabetic', 'dimensional', 'dimer', 'distribution', 'divalent', 'domain', 'dominated', 'donating', 'donor', 'dopant', 'doped', 'doping', 'dosed', 
'dot', 'drinking', 'driven', 'drug', 'drugs', 'dye', 'edge', 'efficiency', 'electrodeposited', 'electrolyte', 'elevating', 'elicited', 'embedded', 'emersion', 'emitting', 'encapsulated', 'encapsulating', 'enclosed', 'enhanced', 'enhancing', 'enriched', 'enrichment', 'enzyme', 'epidermal', 'equivalents', 'etched', 'ethanolamine', 'evoked', 'exchange', 'excimer', 'excluder', 'expanded', 'experimental', 'exposed', 'exposure', 'expressing', 'extract', 'extraction', 'fed', 'finger', 'fixed', 'fixing', 'flanking', 'flavonoid', 'fluorescence', 'formation', 'forming', 'fortified', 'free', 'function', 'functionalised', 'functionalized', 'functionalyzed', 'fused', 'gas', 'gated', 'generating', 'glucuronidating', 'glycoprotein', 'glycosylated', 'glycosylation', 'gradient', 'grafted', 'group', 'groups', 'halogen', 'heterocyclic', 'homologues', 'hydrogel', 'hydrolyzing', 'hydroxylated', 'hydroxylation', 'hydroxysteroid', 'immersion', 'immobilized', 'immunoproteins', 'impregnated', 'imprinted', 'inactivated', 'increased', 'increasing', 'incubated', 'independent', 'induce', 'induced', 'inducible', 'inducing', 'induction', 'influx', 'inhibited', 'inhibitor', 'inhibitory', 'initiated', 'injected', 'insensitive', 'insulin', 'integrated', 'interlinked', 'intermediate', 'intolerant', 'intoxicated', 'ion', 'ions', 'island', 'isomer', 'isomers', 'knot', 'label', 'labeled', 'labeling', 'labelled', 'laden', 'lamp', 'laser', 'layer', 'layers', 'lesioned', 'ligand', 'ligated', 'like', 'limitation', 'limited', 'limiting', 'lined', 'linked', 'linker', 'lipid', 'lipids', 'lipoprotein', 'liposomal', 'liposomes', 'liquid', 'liver', 'loaded', 'loading', 'locked', 'loss', 'lowering', 'lubricants', 'luminance', 'luminescence', 'maintained', 'majority', 'making', 'mannosylated', 'material', 'mediated', 'metabolizing', 'metal', 'metallized', 'methylation', 'migrated', 'mimetic', 'mimicking', 'mixed', 'mixture', 'mode', 'model', 'modified', 'modifying', 'modulated', 'moiety', 'molecule', 
'monoadducts', 'monomer', 'mutated', 'nanogel', 'nanoparticle', 'nanotube', 'need', 'negative', 'nitrosated', 'nitrosation', 'nitrosylation', 'nmr', 'noncompetitive', 'normalized', 'nuclear', 'nucleoside', 'nucleosides', 'nucleotide', 'nucleotides', 'nutrition', 'olefin', 'olefins', 'oligomers', 'omitted', 'only', 'outcome', 'overload', 'oxidation', 'oxidized', 'oxo-mediated', 'oxygenation', 'page', 'paired', 'pathway', 'patterned', 'peptide', 'permeabilized', 'permeable', 'phase', 'phospholipids', 'phosphopeptide', 'phosphorylated', 'pillared', 'placebo', 'planted', 'plasma', 'polymer', 'polymers', 'poor', 'porous', 'position', 'positive', 'postlabeling', 'precipitated', 'preferring', 'pretreated', 'primed', 'produced', 'producing', 'production', 'promoted', 'promoting', 'protected', 'protein', 'proteomic', 'protonated', 'provoked', 'purified', 'radical', 'reacting', 'reaction', 'reactive', 'reagents', 'rearranged', 'receptor', 'receptors', 'recognition', 'redistribution', 'redox', 'reduced', 'reducing', 'reduction', 'refractory', 'refreshed', 'regenerating', 'regulated', 'regulating', 'regulatory', 'related', 'release', 'releasing', 'replete', 'requiring', 'resistance', 'resistant', 'resitant', 'response', 'responsive', 'responsiveness', 'restricted', 'resulted', 'retinal', 'reversible', 'ribosylated', 'ribosylating', 'ribosylation', 'rich', 'right', 'ring', 'saturated', 'scanning', 'scavengers', 'scavenging', 'sealed', 'secreting', 'secretion', 'seeking', 'selective', 'selectivity', 'semiconductor', 'sensing', 'sensitive', 'sensitized', 'soluble', 'solution', 'solvent', 'sparing', 'specific', 'spiked', 'stabilised', 'stabilized', 'stabilizing', 'stable', 'stained', 'steroidal', 'stimulated', 'stimulating', 'storage', 'stressed', 'stripped', 'substituent', 'substituted', 'substitution', 'substrate', 'sufficient', 'sugar', 'sugars', 'supplemented', 'supported', 'suppressed', 'surface', 'susceptible', 'sweetened', 'synthesizing', 'tagged', 'target', 'telopeptide', 
'terminal', 'terminally', 'terminated', 'termini', 'terminus', 'ternary', 'terpolymer', 'tertiary', 'tested', 'testes', 'tethered', 'tetrabrominated', 'tolerance', 'tolerant', 'toxicity', 'toxin', 'tracer', 'transfected', 'transfer', 'transition', 'transport', 'transporter', 'treated', 'treating', 'treatment', 'triggered', 'turn', 'type', 'unesterified', 'untreated', 'vacancies', 'vacancy', 'variable', 'water', 'yeast', 'yield', 'zwitterion'}

Split on hyphens followed by one of these sequences
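The suffix rule separates descriptive compounds such as `gold-coated` while leaving chemical names such as `tert-butyl` whole. The sketch below illustrates the rule in isolation; `split_on_suffix` is hypothetical, uses a small subset of SPLIT_SUFFIX, assumes the hyphen becomes its own token, and ignores how this rule interacts with the NO_SPLIT settings.

```python
# Small subset of SPLIT_SUFFIX, chosen for illustration.
SPLIT_SUFFIX_SUBSET = {'coated', 'doped', 'based', 'free', 'containing'}


def split_on_suffix(token):
    """Hypothetical helper: split at the last hyphen if a known suffix follows."""
    prefix, hyphen, suffix = token.rpartition('-')
    if hyphen and prefix and suffix.lower() in SPLIT_SUFFIX_SUBSET:
        return [prefix, hyphen, suffix]
    return [token]
```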

NO_SPLIT = {'°c'}
get_additional_regex(sentence)[source]
class chemdataextractor.nlp.tokenize.FineWordTokenizer(split_last_stop=True)[source]

Bases: chemdataextractor.nlp.tokenize.WordTokenizer

Word Tokenizer that also splits around hyphens and all colons.

SPLIT = ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '``', "''", '->', '<', '>', '–', '—', '―', '~', '⁓', '∼', '°', ';', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '→', '⇄', '"', '“', '”', '„', '‟', '‘', '’', '‚', '‛', '`', '´', '′', '″', '‴', '‵', '‶', '‷', '⁗', '(', '[', '{', '}', ']', ')', '/', '⁄', '∕', '-', '−', '‒', '‐', '‑', '+', '±', ':']

Split before and after these sequences, wherever they occur, unless entire token is one of these sequences

SPLIT_NO_DIGIT = [',']

Split around these sequences unless they are followed by a digit

NO_SPLIT = {}
NO_SPLIT_PREFIX = {}

Don’t split around hyphens with these prefixes

NO_SPLIT_SUFFIX = {}

Don’t split around hyphens with these suffixes.