.nlp¶
Tools for the individual NLP stages, such as part-of-speech (POS) tagging, word clustering, chemical named entity recognition (CNER), and abbreviation detection.
Chemistry-aware natural language processing framework.
.nlp.abbrev¶
Abbreviation detection.
-
class
chemdataextractor.nlp.abbrev.
AbbreviationDetector
(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]¶ Bases:
object
Detect abbreviation definitions in a list of tokens.
Similar to the algorithm in Schwartz & Hearst 2003.
-
__init__
(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]¶
-
abbr_min
= 3¶ Minimum abbreviation length
-
abbr_max
= 10¶ Maximum abbreviation length
-
abbr_equivs
= []¶ String equivalents to use when detecting abbreviations.
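The class is described as similar to Schwartz & Hearst (2003). As a rough illustration of that family of algorithms, and not the library's own implementation, the sketch below looks for a single parenthesised token whose length lies between `abbr_min` and `abbr_max`, then matches its characters right to left against the preceding words. The function names `best_long_form` and `detect_abbreviations` are hypothetical. Only the defaults 3 and 10 come from the attributes above, and the window size `min(len + 5, len * 2)` follows the heuristic in the Schwartz & Hearst paper.

```python
def best_long_form(short_form, long_form):
    """Return the shortest run of trailing words in long_form that matches
    short_form character by character (right to left), or None."""
    s_idx = len(short_form) - 1
    l_idx = len(long_form) - 1
    while s_idx >= 0:
        ch = short_form[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1
            continue
        # Move left until the character matches. The first character of the
        # abbreviation must additionally sit at the start of a word.
        while (l_idx >= 0 and long_form[l_idx].lower() != ch) or (
            s_idx == 0 and l_idx > 0 and long_form[l_idx - 1].isalnum()
        ):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    # Extend back to the beginning of the word that holds the first match.
    start = long_form.rfind(" ", 0, max(l_idx + 1, 0)) + 1
    return long_form[start:]


def detect_abbreviations(tokens, abbr_min=3, abbr_max=10):
    """Return (abbreviation, long form) pairs found in a list of tokens."""
    results = []
    for i in range(1, len(tokens) - 2):
        if tokens[i] == "(" and tokens[i + 2] == ")":
            abbr = tokens[i + 1]
            if abbr_min <= len(abbr) <= abbr_max:
                # Candidate window size from Schwartz & Hearst (2003).
                max_words = min(len(abbr) + 5, len(abbr) * 2)
                candidate = " ".join(tokens[max(0, i - max_words):i])
                long_form = best_long_form(abbr, candidate)
                if long_form:
                    results.append((abbr, long_form))
    return results
```

With the tokens of "Samples were dissolved in dimethyl sulfoxide ( DMSO ) before use", this sketch pairs `DMSO` with `dimethyl sulfoxide`. A two-letter candidate such as `UV` is skipped because it is shorter than `abbr_min`.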
-
-
class
chemdataextractor.nlp.abbrev.
ChemAbbreviationDetector
(abbr_min=None, abbr_max=None, abbr_equivs=None)[source]¶ Bases:
chemdataextractor.nlp.abbrev.AbbreviationDetector
Chemistry-aware abbreviation detector.
This abbreviation detector has an additional list of string equivalents (e.g. Silver = Ag) that improve abbreviation detection on chemistry texts.
-
abbr_min
= 3¶ Minimum abbreviation length
-
abbr_max
= 10¶ Maximum abbreviation length
-
abbr_equivs
= [('silver', 'Ag'), ('gold', 'Au'), ('mercury', 'Hg'), ('lead', 'Pb'), ('tin', 'Sn'), ('tungsten', 'W'), ('iron', 'Fe'), ('sodium', 'Na'), ('potassium', 'K'), ('copper', 'Cu'), ('sulfate', 'SO4'), ('methanol', 'MeOH'), ('ethanol', 'EtOH'), ('hydroxy', 'OH'), ('hexadecyltrimethylammonium bromide', 'CTAB'), ('cytarabine', 'Ara-C'), ('hydroxylated', 'OH'), ('hydrogen peroxide', 'H2O2'), ('quartz', 'SiO2'), ('amino', 'NH2'), ('amino', 'NH2'), ('ammonia', 'NH3'), ('ammonium', 'NH4'), ('methyl', 'CH3'), ('nitro', 'NO2'), ('potassium carbonate', 'K2CO3'), ('carbonate', 'CO3'), ('borohydride', 'BH4'), ('triethylamine', 'NEt3'), ('triethylamine', 'Et3N')]¶ String equivalents to use when detecting abbreviations.
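How the detector applies these pairs internally is not stated here. One plausible use, shown purely as an assumption, is to rewrite a candidate long form so that each long string is replaced by its short equivalent before character matching, which lets "silver nanoparticles" line up with "AgNPs". The helper name `apply_equivs` is hypothetical.

```python
import re


def apply_equivs(text, equivs):
    """Replace each long string in text with its short equivalent.

    Matching is case-insensitive and restricted to whole words.
    """
    # Longest names first, so 'potassium carbonate' wins over 'potassium'.
    for long_name, short in sorted(equivs, key=lambda pair: -len(pair[0])):
        pattern = r"\b" + re.escape(long_name) + r"\b"
        # A function replacement avoids any backslash handling in `short`.
        text = re.sub(pattern, lambda m, s=short: s, text, flags=re.IGNORECASE)
    return text
```

Sorting by length is a design choice of this sketch: without it, the result would depend on the order of overlapping entries such as `('potassium', 'K')` and `('potassium carbonate', 'K2CO3')`.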
-
.nlp.cem¶
Named entity recognition (NER) for Chemical entity mentions (CEM).
-
chemdataextractor.nlp.cem.
IGNORE_SUFFIX
= ['-', "'s", '-activated', '-adequate', '-affected', '-anesthetized', '-based', '-binding', '-boosted', '-cane', '-conditioned', '-containing', '-covered', '-deficient', '-dependent', '-derived', '-electrolyte', '-enriched', '-exposed', '-flanking', '-free', '-fused', '-gated', '-glucuronosyltransferases', '-increasing', '-induced', '-inducible', '-l-tyrosine', '-labeled', '-lesioned', '-loaded', '-mediated', '-patterned', '-primed', '-reducing', '-regulated', '-releasing', '-resistant', '-response', '-rich', '-s-transferase', '-sensitive', '-soluble', '-stimulated', '-stressed', '-supplemented', '-terminal', '-transferase', '-treated', '-type', '-blood', '-specific', '-like', '-elicited', '-stripped', '-transfer', '-conjugate', '-coated', '-producing', '-oxidized', '-associated', '-related', '-converting', '-ligand', '-on-glass', '-seeking', '-hydrolyzing', '-o-deethylase', '-deethylase', '-o-depentylase', '-depentylase', '-n-demethylase', '-demethylase', '-o-methyltransferase', '-c-oxidase', '-oxidase', '-n-biosidase', '-biosidase', '-immunoproteins', '-spiked', '-lowering', '-page', '-depletion', '-formation', '-dealkylation', '-deethylation', '-alkylation', '-ribosylation', '-production', '-demethylation', '-oxidation', '-transition', '-glycosylation', '-zwitterion', '-benzylation', '-reduction', '-oxygenation', '-nitrosylation', '-evoked', '-mutated', '-doped', '-aged', '-increased', '-triggered', '-linked', '-fixed', '-injected', '-contaminated', '-depleted', '-enhanced', '-stained', '-modified', '-fed', '-demethylated', '-catalyzed', '-etched', '-labelled', '-conjugated', '-pretreated', '-ribosylated', '-phosphorylated', '-reduced', '-bonded', '-stabilised', '-crosslinked', '-mannosylated', '-capped', '-supported', '-initiated', '-integrated', '-accelerated', '-encapsulated', '-untreated', '-expanded', '-coupled', '-terminated', '-assisted', '-permeabilized', '-resulted', '-alkylated', '-functionalized', '-contained', '-buffered', '-caused', '-cyclized', 
'-substituted', '-modulated', '-inhibited', '-centered', '-promoted', '-confirmed', '-provoked', '-dominated', '-limited', '-challenged', '-tetrabrominated', '-unesterified', '-refreshed', '-bottled', '-protonated', '-incubated', '-tagged', '-damaged', '-bridged', '-maintained', '-impregnated', '-metabolizing', '-deprived', '-insensitive', '-dendrimer', '-receptor', '-tolerant', '-influx', '-administrated', '-requiring', '-permeable', '-transport', '-intoxicated', '-overload', '-derivatives', '-derivative', '-sweetened', '-transporter', '-bound', '-extract', '-bonding', '-bond', '-trna', '-redistribution', '-copolymers', '-copolymer', '-appended', '-susceptible', '-transfected', '-bearing', '-regenerating', '-induction', '-conducting', '-decorated', '-encapsulating', '-consuming', '-bridge', '-dependence', '-Pdots', '-only', '-carrying', '-treating', '-isomerase', '-ion', '-ions', '-coordinated', '-saturated', '-sparing', '-enclosed', '-stabilized', '-polymer', '-yeast', '-making', '-porous', '-independent', '-metallized', '-attenuated', '-liquid', '-caged', '-deficiency', '-sensing', '-recognition', '-responsiveness', '-embedded', '-connectivity', '-abuse', '-chelating', '-decocted', '-forming', '-nutrition', '-scavenging', '-preferring', '-mimicking', '-drugs', '-drug', '-lubricants', '-adsorption', '-ligated', '-detected', '-responsive', '-reacting', '-defined', '-capturing', '-group', '-abstinent', '-paired', '-devalued', '-need', '-cellulose', '-atpase', '-inactivated', '-β-glucosaminidase', '-glucosaminidase', '-dosed', '-imprinted', '-precipitated', '-monoadducts', '-vacancies', '-vacancy', '-attributed', '-depolarization', '-depolarized', '-liver', '-testes', '-reversible', '-active', '-reactive', '-dextran', '-fixing', '-synthesizing', '-inhibitory', '-cleaving', '-positive', '-activity', '-fluorescence', '-regulating', '-NPs', '-scanning', '-water', '-nmr', '-limiting', '-refractory', '-knot', '-variable', '-biomolecule', '-backbone', '-exchange', 
'-donating', '-coating', '-hydrogenase', '-hydrogenases', '-intolerant', '-deplete', '-poor', '-loading', '-enrichment', '-elevating', '-resitant', '-stabilizing', '-pathway', '-fortified', '-adjusted', '-restricted', '-dependant', '-locked', '-normalized', '-aromatic', '-hydroxylation', '-intermediate', '-6-phosphatase', '-phosphatase', '-linker', '-proteomic', '-mimetic', '-lipid', '-radical', '-receptors', '-substrate', '-conjugates', '-promoting', '-dye', '-functionalyzed', '-catalysed', '-reductase', '-QDs', '-complexes', '-placebo', '-transferases', '-alginate', '-competing', '-depleting', '-sensitized', '-protein', '-regulatory', '-target', '-toxin', '-yield', '-planted', '-produced', '-derivatized', '-secreting', '-modifying', '-DNA', '-bonds', '-assemblages', '-exposure', '-negative', '-sealed', '-atom', '-atoms', '-abstraction', '-concentration', '-doping', '-competitive', '-acclimation', '-acclimated', '-interlinked', '-suppressed', '-postlabeling', '-labeling', '-diabetic', '-omitted', '-sufficient', '-generating', '-terminus', '-adducts', '-compound', '-compounds', '-γ-lyase', '-γ-synthase', '-lyase', '-synthase', '-inhibitor', '-protected', '-multiwall', '-stripping', '-plasma', '-evolving']¶ Token endings to ignore when considering stopwords and deriving spans
-
chemdataextractor.nlp.cem.
IGNORE_PREFIX
= ['fluorophore-', 'low-', 'high-', 'single-', 'odd-', 'non-', 'high-', 'cross-', 'cellulose-', 'anti-', '-multiwall', 'globular-', 'plasma-', 'hybrid-', 'protein-', 'explicit-', 'cation-', 'water-', 'through-', 'starch-', 'rigid-', 'conjugated-', 'photoactivatable-', 'alginate-', 'nano-', 'dye-', 'ligand-', 'enzyme-', 'platelet-', 'photo-', 'total-', 'drug-', 'nanoparticle-', 'nanomaterial-', 'inter-', 'ion-', 'post-', 'one-']¶ Token beginnings to ignore when considering stopwords and deriving spans
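To make the role of `IGNORE_SUFFIX` and `IGNORE_PREFIX` concrete, here is a minimal sketch that strips one ignorable prefix and one ignorable suffix from a token. The lists are abridged copies of the values above; the function name `strip_affixes` and the case-insensitive comparison are assumptions, not the library's documented behaviour.

```python
# Abridged from the full lists documented above.
IGNORE_SUFFIX = ["-induced", "-doped", "-based", "-free"]
IGNORE_PREFIX = ["non-", "anti-", "nano-"]


def strip_affixes(token):
    """Remove at most one ignorable prefix and one ignorable suffix."""
    lowered = token.lower()
    for prefix in IGNORE_PREFIX:
        if lowered.startswith(prefix) and len(token) > len(prefix):
            token = token[len(prefix):]
            lowered = token.lower()
            break
    for suffix in IGNORE_SUFFIX:
        if lowered.endswith(suffix) and len(token) > len(suffix):
            token = token[:-len(suffix)]
            break
    return token
```

Under these assumptions, `glucose-induced` reduces to `glucose` and `non-fluorinated` to `fluorinated`, so the remaining text can be checked against stopwords or used as the entity span.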
-
chemdataextractor.nlp.cem.
STRIP_END
= ['groups', 'group', 'colloidal', 'dyes', 'dye', 'products', 'product', 'substances', 'substance', 'solution', 'derivatives', 'derivative', 'analog', 'salts', 'salt', 'minerals', 'mineral', 'anesthetic', 'tablet', 'tablets', 'preparation', 'atoms', 'atom', 'monomers', 'monomer', 'nanoparticles', 'nanoparticle', 'radicals', 'radical', 'dendrimers', 'dendrimer', 'ions', 'ion', 'particles', 'particle', 'anion', 'cation', 'foam', 'cellulose', 'dextran', '(', 'dust', 'herbicide', 'disease', 'diseases', 'and', 'or', ';', ',', '.']¶ Final tokens to remove from entity matches
-
chemdataextractor.nlp.cem.
STRIP_START
= ['anhydrous', 'elemental', 'amorphous', 'conjugated', 'colloidal', 'activated', 'water-soluble', 'total', 'superparamagnetic', 'molecular', 'high-density', 'synthetic', 'low-density', 'long-chain', 'fused', 'radioactive', 'reduced', 'anatase', 'dextran', ')', 'trisubstituted', 'deposited', 'herbicide', 'antagonist', 'agonist', 'and', 'or', 'metallic', 'embryotoxic', 'monoclinic']¶ First tokens to remove from entity matches
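The effect of `STRIP_START` and `STRIP_END` can be sketched as trimming tokens from both ends of a matched entity until a token outside the lists is reached. The sets below are abridged from the values above; `trim_entity` is a hypothetical name, and lowercasing each token before the lookup is an assumption.

```python
# Abridged from the full lists documented above.
STRIP_START = {"anhydrous", "elemental", "colloidal", "and", "or"}
STRIP_END = {"solution", "nanoparticles", "ions", "ion", ",", "."}


def trim_entity(tokens):
    """Drop leading STRIP_START tokens and trailing STRIP_END tokens."""
    start, end = 0, len(tokens)
    while start < end and tokens[start].lower() in STRIP_START:
        start += 1
    while end > start and tokens[end - 1].lower() in STRIP_END:
        end -= 1
    return tokens[start:end]
```

For example, `['anhydrous', 'sodium', 'sulfate', 'solution']` is trimmed to `['sodium', 'sulfate']`.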
-
chemdataextractor.nlp.cem.
STOP_TOKENS
= {'.cdx', '.sk2', '10.1021', '10.1039', '10.1186', 'account', 'adenovirus', 'affiliation', 'affiliations', 'aldrich', 'allphar', 'alpharma', 'america', 'angeles', 'apotex', 'approach', 'april', 'article', 'articles', 'astrazeneca', 'august', 'aventis', 'azərbaycanca', 'background', 'bayer', 'behringer', 'berlin', 'bibliography', 'bibtex', 'biochemistry', 'bioniche', 'bipharma', 'books', 'bovine', 'bristol', 'bristol-myers', 'cambridge', 'ccdc', 'chauvin', 'chemistry', 'chemspider', 'chemworx', 'chicago', 'chicken', 'children', 'china', 'chocolate', 'chromatography', 'ciba-geigy', 'citation', 'citing', 'claim', 'claims', 'claire', 'cm–1', 'coffee', 'colored', 'conclusion', 'conclusions', 'contact', 'crossref', 'cytochrome', 'danielle', 'december', 'dielectric', 'discussion', 'docking', 'doctrine', 'doi', 'download', 'edinburgh', 'edit', 'editor', 'editorial', 'editors', 'ekins', 'email', 'energy', 'english', 'esi', 'español', 'esperanto', 'ethical', 'euskara', 'external', 'february', 'fig.', 'file', 'fluorochem', 'francisco', 'gene', 'genetical', 'genevrier', 'genzyme', 'glaxo', 'glaxosmithkline', 'glycoprotein', 'google', 'guidelines', 'having', 'help', 'horse', 'human', 'imaging', 'index', 'inhibitor', 'interpharm', 'introduction', 'ireland', 'isbn', 'italiano', 'january', 'journal', 'journals', 'july', 'june', 'latviešu', 'letters', 'libraries', 'link', 'linkedin', 'links', 'literature', 'london', 'magazine', 'mammalian', 'march', 'marinlit', 'masthead', 'measurements', 'medline', 'members', 'menu', 'merck', 'method', 'methods', 'more', 'nano-beads', 'nanobeads', 'napoleon', 'navigation', 'nordfriisk', 'novartis', 'november', 'novopharm', 'october', 'overdose', 'oxford', 'palestine', 'parameters', 'paris', 'permissions', 'personal', 'pfizer', 'pharmacia', 'pharmacology', 'phenomena', 'pig', 'policy', 'prior', 'priority', 'privacy', 'procter', 'production', 'profile', 'rachel', 'ratiopharm', 'recombinant', 'recombination', 'references', 'research', 'results', 
'retention', 'roche', 'safety', 'salmon', 'schering', 'scientifique', 'september', 'sheep', 'sigma-aldrich', 'southampton', 'squibb', 'studies', 'syntheticpage', 'systematic', 'technical', 'test', 'tobacco', 'tokyo', 'upload', 'visfarm', 'volume', 'wikimedia', 'wiskott', 'york', 'zhang', '§', 'नेपाल भाषा', '†'}¶ Disallowed tokens in chemical entity mentions (discard if any single token has exact case-insensitive match)
-
chemdataextractor.nlp.cem.
STOP_SUB
= {' brand of ', ' oil', ' with ', '!', '%', ', ', ';', '?', '@', '\\', 'activating factor', 'adrenocorticotropic', 'anticodon', 'botulinum', 'coagulation factor', 'concanavalin', 'conductance', 'corticotrophin', 'corticotropin', 'exciton', 'factor ', 'fibroblast', 'follicle', 'freund', 'gene-related', 'glucagon', 'glucan', 'gramicidin', 'growth factor', 'hemoglobin', 'insulin', 'intercellular', 'interferon', 'interleukin', 'luteinizing', 'melanin', 'natriuretic', 'necrosis', 'necrosis factor', 'neurofilament', 'neuropeptide', 'oil of ', 'plasminogen', 'platelet', 'reactive', 'regulator', 'releasing factor', 'selectin', 'stimulating factor', 'transcription factor', 'transmembrane', '|'}¶ Disallowed substrings in chemical entity mentions (only used when filtering to construct the dictionary?)
-
chemdataextractor.nlp.cem.
STOPLIST
= {'(gaba)ergic', '1,3-dpma', '1,5-dpma', '12mg', '3 ps', '3ps', "5'-amp", '90th', 'Nucleophosmin', '[h2o2]', 'a', 'a chlorophyll', 'abbott', 'about', 'absolute ethanol', 'ac187', 'acacia', 'accelerate', 'accent', 'acs mobile', 'acs nano', 'acs omega', 'acth', 'actinin-4', 'activated carbon', 'activated charcoal', 'active carbon', 'activin', 'actomyosin', 'adage', 'adalimumab', 'adept', 'adipsin', 'adma', 'admire', 'adrenocorticotrophic hormone', 'adrenocorticotropic hormone', 'adrenodoxin', 'advance', 'advantage', 'aero', 'af-2', 'again', 'agar', 'agarose', 'agcg', 'agglutinin', 'akron', 'alamethicin', 'alcoholic', 'aldrich', 'alginate', 'all', 'allay', 'alliance', 'almost', 'alpen', 'alpha-actinin-4', 'alpha-t', 'also', 'although', 'alto', 'alum', 'always', 'am1', 'amaze', 'amberlite', 'ambush', 'amen', 'amitraz', 'ammo', 'among', 'amorphous carbon', 'amorphous silica', 'amphiregulin', 'amylin', 'amylopectin', 'amylose', 'an', 'an-152', 'and', 'android', 'angiotensin', 'angiotensin i', 'angiotensinogen', 'anion', 'anna', 'anon', 'another', 'anterior pituitary hormone', 'anti-stress', 'antidiuretic hormone', 'antitussive', 'any', 'aopp', 'apex', 'apolar', 'applaud', 'apron', 'aprotinin', 'aqua', 'arabinogalactan', 'arac', 'are', 'arena', 'aria', 'aromatic amine', 'arrow', 'arsenal', 'artemisinin', 'artist', 'as', 'ascophyllum', 'assert', 'assure', 'at', 'atpγs', 'atrium', 'aurora', 'auroxanthin', 'austin', 'authority', 'avastin', 'avenge', 'aversion', 'avicel', 'avicel cl611', 'avicel ph101', 'b(+)', 'b-dna', 'b13', 'bacp-2', 'bacteriorhodopsin', 'balance', 'banner', 'bantu', 'barnase', 'barnase-barstar', 'baron', 'baroque', 'barrage', 'barrels', 'barricade', 'barstar', 'battalion', 'bazooka', 'be', 'beast', 'because', 'been', 'before', 'being', 'belatacept', 'belt', 'benchmark', 'benet', 'bengal', 'beret', 'bernice', 'betaine', 'betula', 'between', 'bevacizumab', 'bide', 'bile', 'bionic', 'biopterin', 'bishop', 'bishop-kirtman', 'blazer', 'blizzard', 'bloc', 
'blood coagulation factor x', 'blood sugar', 'bloom', 'blow', 'bnp-32', 'bont/a', 'borneo', 'both', 'botox', 'brace', 'brake', 'brass', 'bridal', 'brigade', 'briton', 'bromelain', 'bromelia', 'brs-3-ap', 'btx-a', 'bumetanide', 'bump', 'but', 'butter', 'by', 'c-15', 'c-peptide', 'c-reactive protein', 'cadherin 11', 'cadmium chloride (cdcl2)', 'calcined', 'calcitonin', 'calibre', 'calypso', 'cameo', 'campaign', 'can', 'candidate molecules', 'candy', 'cannon', 'canopy', 'capmul', 'caprine', 'capture', 'caramel', 'carbomer', 'carboxymethylcellulose', 'carob', 'carol', 'carotene', 'carotenoid', 'carotenoids', 'carrageenan', 'carrageenin', 'carrie', 'cascade', 'casein', 'castor oil', 'caviar', 'ccl3', 'ccl3(-/-)', 'cd2', 'cd2+', 'cd3(+)', 'cd3ε', 'cd4+', 'cd68', 'cecil', 'cellulase', 'cellulose', 'centurion', 'cetuximab', 'chamomile', 'charged', 'charlie', 'chemokine', 'chess', 'chitin', 'chitosan', 'cholecystokinin', 'cholera toxin', 'chondroitin', 'chondroitin sulfate', 'chondroitin sulphate', 'chopper', 'chorionic gonadotrophin', 'chorionic gonadotropin', 'chymotrypsin', 'cinch', 'citation', 'citizen', 'citrus pectin', 'classic', 'clathrin heavy chain', 'clay', 'clin', 'clipper', 'clout', 'clove oil', 'coca', 'cochineal', 'cocktail', 'cocoa butter', 'coke', 'cola', 'collagen', 'collagenase', 'collagens', 'colt', 'combat', 'comet', 'comfort', 'command', 'commando', 'commodore', 'compass', 'compendium', 'complement proteins', 'compound', 'concanavalin a', 'concise', 'conclusion', 'concord', 'confront', 'conjugated estrogens', 'conjugated linoleic acid', 'conserve', 'consist', 'cont', 'contest', 'cope-bd', 'coral', 'corn oil', 'corn starch', 'cornstarch', 'corsair', 'corticotrophin-releasing hormone', 'corticotropin', 'could', 'counter', 'counter-anion', 'counter-ion', 'crack', 'crackdown', 'crank', 'crap', 'cremophor el', 'crest', 'crossbow', 'crotoxin', 'crunch', 'crystal', 'crystallography', 'cubes', 'cultivate', 'curb', 'curcuma', 'cutlass', 'cyclin d1', 'cyclin d3', 
'cyclones', 'cytochrome c', 'cytochrome p450', 'd250', 'daclizumab', 'dagger', 'dalteparin', 'dams', 'danshen', 'darbepoetin alfa', 'daren', 'dart', 'dash', 'ddds', 'defibrotide', 'deionized water', 'demon', 'denosumab', 'deoxyribonucleic acid', 'dept', 'dermatan sulfate', 'desmethyl-olanzapine', 'dextran', 'dextran sulfate sodium', 'dextrin', 'dial', 'diana', 'diane', 'dibs', 'did', 'dihydro', 'dinucleotide', 'discover', 'discussion', 'distilled water', 'diurnal', 'dividend', 'dixon', 'dm-10', 'dna double strand', 'dnase', 'dnase-i', 'do', 'does', 'dolly', 'done', 'dorado', 'dorm', 'dot-silica', 'double stranded dna', 'double-stranded dna', 'doyle', 'dpma', 'dragnet', 'drago', 'dragon', 'dreamer', 'due', 'duet', 'during', 'dwell', 'dynorphin', 'dynorphins', 'e-selectin', 'e-ssa', 'e3330', 'each', 'ecstasy', 'eculizumab', 'edge', 'either', 'elastin', 'elastomers', 'elevate', 'elite', 'elon', 'embark', 'emblem', 'emerald', 'eminent', 'emotion', 'empire', 'endeavour', 'endothelin-1', 'endurance', 'enforcer', 'enough', 'enoxaparin', 'epic', 'epidermal growth factor', 'epidermal growth factor (egf)', 'epoetin', 'epoetin alfa', 'epoetin beta', 'equity', 'eristostatin', 'erythropoietin', 'escort', 'especially', 'essex', 'estate', 'etc', 'ethylcellulose', 'excel', 'exciton', 'exenatide', 'exendin-4', 'exotoxin', 'expand', 'experimental', 'experimental procedures', 'facet', 'factor iia', 'factor v', 'factor vii', 'factor x', 'fenton', 'fenugreek', 'ferredoxin', 'ferritin', 'fetal hemoglobin', 'fgf2', 'fibrinogen', 'fibroin', 'finale', 'finish', 'first sign', 'flair', 'flake', 'flaxseed oil', 'flex', 'flint', 'flonase', 'flue gas', 'fly ash', 'follicle-stimulating hormone', 'for', 'fore', 'formulation', 'fortress', 'found', 'foxo1', 'fp-2', 'fractal', 'freedom', 'french green', 'fret-capture', 'from', 'fructose corn syrup', 'fucoidan', 'fulfill', 'furfural-water', 'furosemide', 'further', 'fury', 'fusarium toxin', 'galanin', 'gallant', 'gallery', 'galsulfase', 'gana', 
'gastrin', 'gelatin', 'gelatine', 'gemini', 'general experimental', 'genesis', 'ghrp-2', 'ginseng', 'gleevec', 'glide', 'glp-1', 'glucagon', 'glucagon-like peptide-1', 'glucans', 'glucomannan', 'glucophage', 'glut', 'gluten proteins', 'glycerin', 'glycine', 'glycogen', 'glycopeptide', 'glycoprotein', 'glycoproteins', 'gm-csf', 'gold', 'gonadotropin releasing hormone', 'goon', 'gradual', 'gramicidin a', 'granite', 'grasp', 'green tea leaves', 'grenade', 'groundnut oil', 'growth hormone', 'growth hormone releasing hormone', 'gsno', 'gst-p(+)', 'gtpγs', 'guardian', 'gum arabic', 'gypsum', 'gypsum fibrosum', 'h3n2', 'had', 'hairy', 'halt', 'happy', 'harness', 'harry', 'has', 'have', 'having', 'headline', 'heat pre', 'heaven', 'hell', 'hemocyanin', 'hemoglobin', 'hemopexin', 'hemozoin', 'henna', 'heparan sulfate', 'heparin', 'herald', 'herceptin', 'here', 'heteroatoms', 'heterocyclic', 'hirudin', 'histone', 'hmqc', 'hocus', 'homogentisate', 'honey', 'horizon', 'how', 'however', 'human serum albumin', 'hyalgan', 'hyaluronan', 'hyaluronic acid', 'hyaluronidase', 'hydro', 'hydrogel', 'hydrolyzed polyacrylamide', 'hydroxyethylcellulose', 'hydroxypropyl methylcellulose', 'hydroxypropylcellulose', 'hydroxypropylmethylcellulose', 'hyperoxia', 'hypo', 'i', 'iberiotoxin', 'icatibant', 'icon', 'if', 'ifn-gamma', 'ifn-β', 'ifn-γ', 'igaba', 'igf-1', 'ignite', 'il-11', 'il-2', 'il10', 'il12', 'imperator', 'in', 'inas', 'indigo', 'inhalable', 'insular', 'insulin', 'insulin glargine', 'integrin', 'intense blue', 'interceptor', 'interferon', 'interferon-gamma', 'interferon-γ', 'into', 'introduction', 'inulin', 'invader', 'ion', 'ip-10', 'is', 'iscu', 'isomaltosaccharide', 'it', 'its', 'itself', 'iκb-α', 'iκbα', 'jasmonate', 'joker', 'jolt', 'joust', 'jumbo', 'junk', 'just', 'k-12', 'karate', 'kelp', 'keratin', 'kestrel', 'kg', 'km', 'kokan', 'kollidon', 'kudos', 'laba', 'lady', 'lama', 'lance', 'lancer', 'lasso', 'latex', 'latex particles', 'lats', 'lawson', 'lead', 'lead ion', 
'leader', 'legend', 'liberty', 'light yellow', 'lignin', 'lignins', 'lignocellulose', 'limber', 'lime', 'linseed oil', 'linseed oils', 'lipid a', 'liposomal doxorubicin', 'lipoteichoic acid', 'liraglutide', 'lmwh', 'log in', 'lotion', 'lrp1', 'lsopc', 'ltb4', 'luteinising hormone', 'luteinizing hormone', 'm1-glucuronide', 'm41.4', 'maba', 'made', 'mag2', "mag2's", 'magnum', 'mainly', 'maintain a', 'make', 'malo', 'maltodextrin', 'manage', 'mandate', 'maneb', 'mannan', 'margarine', 'marksman', 'marshal', 'marshall', 'mascot', 'matador', 'match', 'may', 'maya', 'mega', 'melanin', 'melody', 'menopur', 'merit', 'merlin', 'meta', 'meta-analysis', 'metal-oxide', 'metallothionein', 'metallothioneins', 'methb', 'methemoglobin', 'methocel', 'method', 'methods', 'methylcellulose', 'metric', 'mg', 'mg-1', 'mibc', 'mica', 'microcrystalline cellulose', 'microdots', 'might', 'mighty', 'milk thistle', 'millie', 'minus', 'miracle', 'mirage', 'mist', 'ml', 'mm', 'mobile site', 'moesin', 'mol', 'molten', 'molybdate', 'moment', 'momentum', 'monarch', 'mops', 'morph', 'morpho', 'most', 'mostly', 'motilin', 'mpo-anca', 'mtcc', 'multiple', 'murabutide', 'musk', 'must', 'mustang', 'mustard', 'mustard oil', 'mutagen', 'n17', 'naglazyme', 'natalizumab', 'natural rubber', 'ndma', 'ndp-α-msh', 'nearly', 'neither', 'nerve agent', 'neurokinin a', 'neuropeptide y', 'new titles', 'nexus', 'nida', 'no', 'nociceptin', 'nodular', 'nor', 'nor-1', 'noxa', 'nucleobase', 'nucleophosmin', 'nucleotide', 'obtained', 'octadecaneuropeptide', 'octanol-air', 'octave', 'octreotide', 'of', 'often', 'oil', 'oil red', 'oil-in-water', 'oligonucleotide', 'olive oil', 'oliver', 'olympus', 'omalizumab', 'omega', 'on', 'opium', 'optimizer', 'orange', 'orbit', 'organometallic', 'organometallics', 'organometalloidal', 'orion', 'orphan', 'orphanin fq', 'osteocalcin', 'osteopontin', 'our', 'outflank', 'ovalbumin', 'overall', 'ovomucoid', 'p-selectin', 'p300', 'p450', 'pak1', 'paladin', 'pampa', 'pancreatin', 'papp-a', 
'paraffin', 'parathyroid hormone', 'parkin', 'parlay', 'part 2', 'partially hydrolyzed polyacrylamide', 'pat4', 'patrol', 'pc-12', 'pc1', 'pc12', 'peace', 'pectin', 'pectins', 'pegasus', 'pegsunercept', 'pensive', 'peon', 'peony', 'peppermint oil', 'peptide e', 'percolate', 'perhaps', 'perk', 'perna', 'persian', 'petroleum ether', 'pgc1α', 'phaseolin', 'phosphor', 'phycocyanin', 'picket', 'picrate', 'pima', 'pink', 'piper', 'pirate', 'pivot', 'pla2', 'placental growth hormone', 'plasminogen', 'pledge', 'plumbago', 'pmid', 'polo', 'poloxamer', 'poly', 'poly(a)-poly(t)', 'poly(i:c)', 'polygon', 'polypeptide', 'polysaccharide', 'polystyrene latex', 'polyubiquitin', 'posse', 'potato starch', 'pounce', 'prep', 'preparation', 'press', 'preview', 'pride', 'prism', 'pristine', 'pro-opiomelanocortin', 'probate', 'probiotic', 'procure', 'prolactin', 'proopiomelanocortin', 'propolis', 'prosper', 'protamine', 'protanal', 'protein hydrolysate', 'prothrombin', 'prothrombinase', 'protide', 'protio', 'proton', 'provitamin a', 'prowl', 'pser-stat3', 'pseudo', 'psychogenic', 'puerarin', 'pullulan', 'punch', 'pursuit', 'pylon', 'pyrethrum', 'quark', 'quench', 'quite', 'racer', 'radar', 'radio', 'radixin', 'raid', 'raiser', 'rally', 'rampart', 'raptor', 'rather', 'ravage', 'raven', 'reactions', 'reactivity', 'really', 'recoil', 'reconcile', 'recruit', 'redeem', 'redskin', 'reduced hemoglobin', 'redux', 'references', 'regarding', 'regent', 'regulon', 'relax', 'relaxant', 'res', 'resilin', 'resovist', 'restful', 'results', 'retard', 'reticulin', 'revolution', 'rgd peptide', 'rhombic', 'ribonucleic acid', 'rice starch', 'ricin', 'rifle', 'ripcord', 'rival', 'rock', 'rogue', 'rosin', 'rotate', 'roundup', 'rubber', 'rufus', 'rugby', 'rutile', 's100', 'saber', 'saccharina', 'saccharum', 'safari', 'safety', 'saffron', 'saline', 'salix', 'salute', 'samp', 'sanction', 'sceptre', 'schisandra chinensis', 'scopolamine', 'scot', 'scourge', 'scout', 'scpa', 'scuffle', 'se-selectin', 'section', 
'seem', 'seen', 'senna', 'sentry', 'sephadex g-75', 'sephadex lh-20', 'sepharose', 'serum albumin', 'sesame oil', 'several', 'shellac', 'sherpa', 'shiga toxin', 'shogun', 'should', 'show', 'showed', 'shown', 'shows', 'siamycin', 'significantly', 'silence', 'silicone', 'silybum marianum', 'since', 'singlet oxygen', 'sirius', 'slam', 'smack', 'smash', 'smear', 'snap', 'snap-25', 'snip', 'sniper', 'snort', 'snow', 'so', 'soda', 'solo', 'solvent', 'somatostatine', 'some', 'sonar', 'sonata', 'sophia', 'sp1', 'spectrin', 'spiegel', 'spirit', 'splendor', 'spme', 'spotless', 'spotlight', 'sprinkle', 'squad', 'squalamine', 'stability', 'stainless steel', 'stalker', 'stanza', 'staple', 'star', 'starch', 'starches', 'steel', 'stim', 'stipend', 'stomp', 'storm', 'streptavidin', 'strike', 'stuff', 'subdue', 'substance p', 'such', 'sultan', 'summit', 'sunshine', 'supra', 'supreme', 'surfer', 'surpass', 'suspend', 'sv2', 'sword', 'synacthen', 'syntheses', 'synthesis', 'synthol', 'synvisc', 't-47', 't-pa', 't140', 'ta98', 'tabloid', 'tace', 'tackle', 'talin', 'talon', 'tame', 'tara', 'tarragon', 'tattoo', 'taxus', 'tea catechin', 'tea polyphenol', 'teac', 'tell', 'telomerase', 'tenax', 'teriparatide', 'terminator', 'terpolymer', 'test mixture', 'textile', 'than', 'that', 'the', 'the-7', 'their', 'theirs', 'them', 'then', 'there', 'therefore', 'these', 'they', 'thioredoxin', 'thioredoxins', 'this', 'those', 'thrombin', 'thromboplastin', 'through', 'thus', 'thymosin β4', 'thyroglobulin', 'thyroid stimulating hormone', 'tilt', 'tindal', 'titan', 'titus', 'tm-74', 'tnfα', 'to', 'toke', 'tomahawk', 'tonal', 'toot', 'tops', 'torpedo', 'total bilirubin', 'touchdown', 'tough', 'trails', 'tranquil', 'transdermal patch', 'transfer rna', 'transferrin', 'trastuzumab', 'triangle', 'trim', 'triticum', 'triumph', 'trypsin', 'trypsinogen', 'tsar', 'tsst-1', 'tunic', 'turbo', 'turmeric', 'turpentine', 'ubiquinone', 'ubiquitin', 'ubr2', 'ultimate', 'upon', 'ural', 'uranyl nitrate', 'urea nitrogen', 
'urokinase', 'use', 'used', 'using', 'vacate', 'valiant', 'valosin-containing protein', 'vanilla', 'vanquish', 'various', 'vas1', 'vasal', 'vaseline', 'vengeance', 'verdict', 'vermin', 'versed', 'vertex', 'very', 'vicilin', 'vigil', 'vinca', 'vinculin', 'vinegar', 'vishnu', 'visor', 'vitamin', 'vitamins', 'volley', 'vortex', 'wander', 'wang', 'warf', 'was', 'water', 'water vapor', 'water-in-oil', 'waters', 'we', 'were', 'whack', 'what', 'wheat starch', 'when', 'which', 'while', 'whip', 'white light', 'with', 'within', 'without', 'would', 'x close', 'xanthan', 'xanthan gum', 'xanthium', 'xylan', 'xyloglucan', 'yellow', 'yellows', 'zest', 'ziconotide', 'zodiac', 'zymosan', 'Ω127', 'α-lactalbumin', 'α-msh', 'β-endorphin', 'β-nf', 'δr(1)', 'ω', '∑pcbs'}¶ Disallowed chemical entity mentions (discard if exact case-insensitive match)
-
chemdataextractor.nlp.cem.
STOP_RES
= ['^(http|ftp)://', '\\.(com|uk|eu|org|net)$', '^\\d{4}-\\d{3}[\\dx]$', '^[\\w\\-\\.\\+%]{4,} @ \\w[\\w\\-\\.]+\\.(com?|edu|gov|ac)(\\.[\\w\\-\\.]+)?$', '^[\\d,:\\- ]*\\d{4,}[\\d,:\\- ]*$', '\\d{3,} , \\d{3,}', '(\\d\\d+\\.\\d+|\\d\\.\\d\\d+)', '\\d and \\d', '^(\\[\\d+\\]\\s*)+$', '^\\d+$', '= \\d', '^\\+?\\d[ \\d-]$', 'cm-1', '^(compound|ligand|chemical|dye|derivative|complex|example|intermediate|product|formulae?)s? [a-z\\d]{1,3}', '(b3lyp|31g\\(d,p\\)|td-dft)', 'et al\\.?$', '^(ep|wo|us)\\s*\\d\\s*\\d\\d[\\d\\s]*([AB]\\d)?($|\\s*and)', '^(pre|post)-\\d\\d\\d\\d', '\\d ml$', '\\.(png|gif|jpg|txt|html|docx?|xlsx?)$', '^(tel|fax)\\s*:?\\s*\\+?\\s*\\d']¶ Regular expressions that define disallowed chemical entity mentions. Note: the entity text is passed as lowercase.
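The stop lists in this module (`STOP_TOKENS`, `STOP_SUB`, `STOPLIST` and `STOP_RES`) can be read as successive filters on a candidate entity. The sketch below combines them in one hypothetical function, `is_disallowed`, using a few entries copied from each list. The order of the checks and the use of `re.search` are assumptions. As its description notes, `STOP_SUB` may only be applied when the dictionary is built, so its inclusion here is illustrative.

```python
import re

# A few entries copied from each of the full lists documented above.
STOP_TOKENS = {"doi", "journal", "sigma-aldrich"}
STOP_SUB = {" with ", "growth factor"}
STOPLIST = {"water", "gold", "the"}
STOP_RES = [re.compile(p) for p in (r"^(http|ftp)://", r"^\d+$", r"et al\.?$")]


def is_disallowed(entity_tokens):
    """Return True if the candidate entity is rejected by any stop list."""
    text = " ".join(entity_tokens).lower()
    # Whole-entity exact match, case-insensitive.
    if text in STOPLIST:
        return True
    # Any single token matching a stop token, case-insensitive.
    if any(tok.lower() in STOP_TOKENS for tok in entity_tokens):
        return True
    # Any disallowed substring.
    if any(sub in text for sub in STOP_SUB):
        return True
    # Regular expressions are applied to the lowercased entity text.
    return any(rx.search(text) for rx in STOP_RES)
```

With these abridged lists, `['Sigma-Aldrich', 'ethanol']` and `['2014']` are rejected, while `['sodium', 'chloride']` passes.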
-
chemdataextractor.nlp.cem.
SPLITS
= ['^(actinium|aluminium|aluminum|americium|antimony|argon|arsenic|astatine|barium|berkelium|beryllium|bismuth|bohrium|boron|bromine|cadmium|caesium|calcium|californium|carbon|cerium|cesium|chlorine|chromium|cobalt|copernicium|copper|curium|darmstadtium|dubnium|dysprosium|einsteinium|erbium|europium|fermium|flerovium|fluorine|francium|gadolinium|gallium|germanium|gold|hafnium|hassium|helium|holmium|hydrargyrum|hydrogen|indium|iodine|iridium|iron|kalium|krypton|lanthanum|lawrencium|lead|lithium|livermorium|lutetium|magnesium|manganese|meitnerium|mendelevium|mercury|molybdenum|natrium|neodymium|neon|neptunium|nickel|niobium|nitrogen|nobelium|osmium|oxygen|palladium|phosphorus|platinum|plumbum|plutonium|polonium|potassium|praseodymium|promethium|protactinium|radium|radon|rhenium|rhodium|roentgenium|rubidium|ruthenium|rutherfordium|samarium|scandium|seaborgium|selenium|silicon|silver|sodium|stannum|stibium|strontium|sulfur|tantalum|technetium|tellurium|terbium|thallium|thorium|thulium|tin|titanium|tungsten|ununoctium|ununpentium|ununseptium|ununtrium|uranium|vanadium|wolfram|xenon|ytterbium|yttrium|zinc|zirconium)$', '^(Ag|Al|Ar|Au|Br|Cd|Cl|Co|Cu|Fe|Gd|Ge|Hg|Kr|Li|Mg|Na|Ne|Ni|Pb|Pd|Pt|Ru|Sb|Si|Sn|Ti|Xe|Zn|Zr|Zn)$', '^(iodide|triiodide|nitrite|nitrate)$', '^(graphane|graphene|carbon|silica|glucose)$', '^(sugar|phospate)$', '^(azide|alkyne|alkene|alkane)$', '^(arginine|cysteine|glycine|aspartic acid|glutamate|dopamine|serotonin|acetone|methanol|ethanol|EtOH|MeOH|AcOEt|melatonin|leucine|alanine|histidine|isoleucine|lysine|threonine|tryptophan|nicotine|gentamicin|ATP|FITC|biotin|tamoxifen|catechin|asparagine)$', '^(Ala|Arg|Asn|Asp|Cys|Glu|Gln|Gly|His|Ile|Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)(?:\\(?\\d+\\)?)?$', '^(\\(?1\\)?H|\\(?1[45]\\)?N|\\(?1[234]\\)?C|\\(?19\\)?F)$', '^(F|Cl|Zn[OS]|H\\(?2\\)?O(\\(?2\\)?)?|Ni\\(OH\\)\\(?2\\)?|(NiF|SnO|TiO|NO)\\(?2\\)?|(Al|Y|Fe)\\(?2\\)?O\\(?3\\)?|CaCO\\(?3\\)?)$', '^(ester|amide)$']¶ Regular expressions defining collections of words 
that should be split if joined by hyphens or -to-
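As an illustration of how such patterns might be used, the sketch below splits a token on `-` or `-to-` and keeps the split only when every part matches one of the patterns. The three patterns are copied or abridged from the list above. The helper name `split_if_joined` and the "every part must match" rule are assumptions about the intended behaviour rather than the library's code.

```python
import re

SPLITS = [re.compile(p) for p in (
    r"^(graphane|graphene|carbon|silica|glucose)$",
    r"^(iodide|triiodide|nitrite|nitrate)$",
    # Abridged from the longer small-molecule pattern above.
    r"^(acetone|methanol|ethanol|EtOH|MeOH)$",
)]


def split_if_joined(token):
    """Split token on '-to-' or '-' if every part matches a SPLITS pattern."""
    parts = re.split(r"-to-|-", token)
    if len(parts) > 1 and all(any(rx.match(p) for rx in SPLITS) for p in parts):
        return parts
    return [token]
```

Under this rule `methanol-ethanol` becomes two entities, whereas `tert-butanol` is left intact because neither part is a listed word.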
-
class
chemdataextractor.nlp.cem.
CiDictCemTagger
(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]¶ Bases:
chemdataextractor.nlp.tag.DictionaryTagger
Case-insensitive CEM dictionary tagger.
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
model
= 'models/cem_dict-1.0.pickle'¶
-
-
class
chemdataextractor.nlp.cem.
CsDictCemTagger
(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]¶ Bases:
chemdataextractor.nlp.tag.DictionaryTagger
Case-sensitive CEM dictionary tagger.
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
model
= 'models/cem_dict_cs-1.0.pickle'¶
-
case_sensitive
= True¶
-
-
class
chemdataextractor.nlp.cem.
CrfCemTagger
(model=None, lexicon=None, clusters=None, params=None)[source]¶ Bases:
chemdataextractor.nlp.tag.CrfTagger
-
model
= 'models/cem_crf_chemdner_cemp-1.0.pickle'¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
clusters
= True¶
-
params
= {'c1': 1.0, 'c2': 0.001, 'feature.possible_states': False, 'feature.possible_transitions': False, 'max_iterations': 200}¶
-
-
class
chemdataextractor.nlp.cem.
CemTagger
[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Return the combined output of a number of chemical entity taggers.
-
taggers
= [<chemdataextractor.nlp.cem.CrfCemTagger object>, <chemdataextractor.nlp.cem.CiDictCemTagger object>, <chemdataextractor.nlp.cem.CsDictCemTagger object>]¶ The individual chemical entity taggers to use.
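One way to combine the output of several taggers is to merge their per-token tag sequences. The sketch below assumes BIO-style labels `'B-CM'` and `'I-CM'`, with `None` for untagged tokens. Both the labels and the merge rule are assumptions made for illustration, and `merge_tags` is a hypothetical helper rather than a method of `CemTagger`.

```python
def merge_tags(tag_sequences):
    """Merge equal-length tag sequences from several taggers.

    An inside tag ('I-CM') always overwrites, so a longer span from one
    tagger extends a shorter span from another. A begin tag ('B-CM') only
    fills positions that are still untagged.
    """
    if not tag_sequences:
        return []
    merged = [None] * len(tag_sequences[0])
    for tags in tag_sequences:
        for i, tag in enumerate(tags):
            if tag == "I-CM":
                merged[i] = "I-CM"
            elif tag == "B-CM" and merged[i] is None:
                merged[i] = "B-CM"
    return merged
```

If one tagger marks only the second token of a two-token name and another marks both, the merged result is `['B-CM', 'I-CM']`, which keeps the longer span.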
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
.nlp.corpus¶
Tools for reading and writing text corpora.
-
class
chemdataextractor.nlp.corpus.
LazyCorpusLoader
(name, reader_cls, *args, **kwargs)[source]¶ Bases:
object
Derived from NLTK LazyCorpusLoader.
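The lazy-loading idea behind this class can be sketched as a proxy that stores the constructor arguments and only builds the reader on first attribute access, which is why the module-level corpora below print as "not loaded yet". The names `LazyLoader` and `DemoReader` are hypothetical. NLTK's own loader works differently in detail (it transforms itself into the reader object), so this shows the pattern only.

```python
class DemoReader:
    """Stand-in for a corpus reader; a real reader would read files."""

    def __init__(self, name):
        self.name = name

    def words(self):
        return ["sodium", "chloride"]


class LazyLoader:
    """Defer construction of reader_cls until an attribute is needed."""

    def __init__(self, name, reader_cls, *args, **kwargs):
        self._name = name
        self._reader_cls = reader_cls
        self._args = args
        self._kwargs = kwargs
        self._reader = None

    def _load(self):
        if self._reader is None:
            self._reader = self._reader_cls(self._name, *self._args, **self._kwargs)
        return self._reader

    def __getattr__(self, attr):
        # Only called when normal lookup fails, i.e. for reader attributes.
        return getattr(self._load(), attr)

    def __repr__(self):
        state = "loaded" if self._reader is not None else "not loaded yet"
        return "<%s in %r (%s)>" % (self._reader_cls.__name__, self._name, state)
```

Creating `LazyLoader("demo", DemoReader)` costs nothing until a method such as `words()` is called, at which point the reader is built once and reused.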
-
chemdataextractor.nlp.corpus.
wsj
= <BracketParseCorpusReader in '.../corpora/wsj_training' (not loaded yet)>¶ Entire WSJ corpus (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
-
chemdataextractor.nlp.corpus.
wsj_training
= <BracketParseCorpusReader in '.../corpora/wsj_training' (not loaded yet)>¶ WSJ corpus sections 0-18 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
-
chemdataextractor.nlp.corpus.
wsj_development
= <BracketParseCorpusReader in '.../corpora/wsj_development' (not loaded yet)>¶ WSJ corpus sections 19-21 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
-
chemdataextractor.nlp.corpus.
wsj_evaluation
= <BracketParseCorpusReader in '.../corpora/wsj_evaluation' (not loaded yet)>¶ WSJ corpus sections 22-24 (English News Text Treebank: Penn Treebank Revised, LDC2015T13)
-
chemdataextractor.nlp.corpus.
treebank2_training
= <ChunkedCorpusReader in '.../corpora/treebank2_training' (not loaded yet)>¶ WSJ corpus sections 0-18 (treebank2)
-
chemdataextractor.nlp.corpus.
treebank2_development
= <ChunkedCorpusReader in '.../corpora/treebank2_development' (not loaded yet)>¶ WSJ corpus sections 19-21 (treebank2)
-
chemdataextractor.nlp.corpus.
treebank2_evaluation
= <ChunkedCorpusReader in '.../corpora/treebank2_evaluation' (not loaded yet)>¶ WSJ corpus sections 22-24 (treebank2)
-
chemdataextractor.nlp.corpus.
genia_training
= <TaggedCorpusReader in '.../corpora/genia_training' (not loaded yet)>¶ First 80% of GENIA POS-tagged corpus
-
chemdataextractor.nlp.corpus.
genia_evaluation
= <TaggedCorpusReader in '.../corpora/genia_evaluation' (not loaded yet)>¶ Last 20% of GENIA POS-tagged corpus
-
chemdataextractor.nlp.corpus.
medpost
= <TaggedCorpusReader in '.../corpora/medpost' (not loaded yet)>¶
-
chemdataextractor.nlp.corpus.
medpost_training
= <TaggedCorpusReader in '.../corpora/medpost_training' (not loaded yet)>¶
-
chemdataextractor.nlp.corpus.
medpost_evaluation
= <TaggedCorpusReader in '.../corpora/medpost_evaluation' (not loaded yet)>¶
-
chemdataextractor.nlp.corpus.
cde_tokensc
= <PlaintextCorpusReader in '.../corpora/cde_tokensc' (not loaded yet)>¶
-
chemdataextractor.nlp.corpus.
chemdner_training
= <PlaintextCorpusReader in '.../corpora/chemdner_training' (not loaded yet)>¶
.nlp.lexicon¶
Cache features of previously seen words.
-
class
chemdataextractor.nlp.lexicon.
Lexeme
(text, normalized, lower, first, suffix, shape, length, upper_count, lower_count, digit_count, is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title, is_punct, is_hyphenated, like_url, like_number, cluster)[source]¶ Bases:
object
-
__init__
(text, normalized, lower, first, suffix, shape, length, upper_count, lower_count, digit_count, is_alpha, is_ascii, is_digit, is_lower, is_upper, is_title, is_punct, is_hyphenated, like_url, like_number, cluster)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
text
¶ Original Lexeme text.
-
cluster
¶ The Brown Word Cluster for this Lexeme.
-
normalized
¶ Normalized text, using the Lexicon Normalizer.
-
lower
¶ Lowercase text.
-
first
¶ First character.
-
suffix
¶ Three-character suffix
-
shape
¶ Word shape. Derived by replacing every number with ‘d’, every greek letter with ‘g’, and every latin letter with ‘X’ or ‘x’ for uppercase and lowercase respectively.
-
length
¶ Lexeme length.
-
upper_count
¶ Count of uppercase characters.
-
lower_count
¶ Count of lowercase characters.
-
digit_count
¶ Count of digits.
-
is_alpha
¶ Whether the text is entirely alphabetical characters.
-
is_ascii
¶ Whether the text is entirely ASCII characters.
-
is_digit
¶ Whether the text is entirely digits.
-
is_lower
¶ Whether the text is entirely lowercase.
-
is_upper
¶ Whether the text is entirely uppercase.
-
is_title
¶ Whether the text is title cased.
-
is_punct
¶ Whether the text is entirely punctuation characters.
-
is_hyphenated
¶ Whether the text is hyphenated.
-
like_url
¶ Whether the text looks like a URL.
-
like_number
¶ Whether the text looks like a number.
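Several of the Lexeme attributes above are simple functions of the text. The sketch below follows the descriptions given for `shape`, `suffix` and the count attributes; `word_shape` and `simple_features` are hypothetical helpers, and the library's own feature extraction may differ in details such as which Unicode ranges count as Greek.

```python
def word_shape(text):
    """Map each character to a shape symbol, following the description of `shape`."""
    symbols = []
    for ch in text:
        if "0" <= ch <= "9":
            symbols.append("d")      # digit
        elif "α" <= ch <= "ω" or "Α" <= ch <= "Ω":
            symbols.append("g")      # Greek letter (basic block only, an assumption)
        elif "a" <= ch <= "z":
            symbols.append("x")      # lowercase Latin letter
        elif "A" <= ch <= "Z":
            symbols.append("X")      # uppercase Latin letter
        else:
            symbols.append(ch)       # everything else is kept as-is
    return "".join(symbols)


def simple_features(text):
    """Compute a handful of the per-word features listed for Lexeme."""
    return {
        "lower": text.lower(),
        "first": text[:1],
        "suffix": text[-3:],
        "shape": word_shape(text),
        "length": len(text),
        "upper_count": sum(ch.isupper() for ch in text),
        "digit_count": sum(ch.isdigit() for ch in text),
    }


features = simple_features("α-NaOH2")
```

Caching such features per distinct word, as the module description suggests, avoids recomputing them each time the same token is seen.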
-
-
class
chemdataextractor.nlp.lexicon.
Lexicon
[source]¶ Bases:
object
-
normalizer
= <chemdataextractor.text.normalize.Normalizer object>¶ The Normalizer for this Lexicon.
-
clusters_path
= None¶ Path to the Brown clusters model file for this Lexicon.
-
-
class
chemdataextractor.nlp.lexicon.
ChemLexicon
[source]¶ Bases:
chemdataextractor.nlp.lexicon.Lexicon
A Lexicon that is pre-configured with a Chemistry-aware Normalizer and Brown word clusters derived from a chemistry corpus.
-
normalizer
= <chemdataextractor.text.normalize.ChemNormalizer object>¶
-
clusters_path
= 'models/clusters_chem1500-1.0.pickle'¶
-
.nlp.pos¶
Part-of-speech tagging.
-
chemdataextractor.nlp.pos.
TAGS
= ['NN', 'IN', 'NNP', 'DT', 'NNS', 'JJ', ',', '.', 'CD', 'RB', 'VBD', 'VB', 'CC', 'VBN', 'VBZ', 'PRP', 'VBG', 'TO', 'VBP', 'HYPH', 'MD', 'POS', 'PRP$', '$', '``', "''", ':', 'WDT', 'JJR', 'RP', 'NNPS', 'WP', 'WRB', 'RBR', 'JJS', '-RRB-', '-LRB-', 'EX', 'RBS', 'PDT', 'SYM', 'FW', 'WP$', 'UH', 'LS', 'NFP', 'AFX']¶ Complete set of POS tags. Ordered by decreasing frequency in WSJ corpus.
-
class
chemdataextractor.nlp.pos.
ApPosTagger
(model=None, lexicon=None, clusters=None)[source]¶ Bases:
chemdataextractor.nlp.tag.ApTagger
Greedy Averaged Perceptron POS tagger trained on WSJ corpus.
-
model
= 'models/pos_ap_wsj_nocluster-1.0.pickle'¶
-
clusters
= False¶
-
-
class
chemdataextractor.nlp.pos.
ChemApPosTagger
(model=None, lexicon=None, clusters=None)[source]¶ Bases:
chemdataextractor.nlp.pos.ApPosTagger
Greedy Averaged Perceptron POS tagger trained on both WSJ and GENIA corpora.
Uses features based on word clusters from chemistry text.
-
model
= 'models/pos_ap_wsj_genia-1.0.pickle'¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
clusters
= True¶
-
-
class
chemdataextractor.nlp.pos.
CrfPosTagger
(model=None, lexicon=None, clusters=None, params=None)[source]¶ Bases:
chemdataextractor.nlp.tag.CrfTagger
-
model
= 'models/pos_crf_wsj_nocluster-1.0.pickle'¶
-
clusters
= False¶
-
.nlp.tag¶
Tagger implementations. Used for part-of-speech tagging and named entity recognition.
-
class
chemdataextractor.nlp.tag.
BaseTagger
[source]¶ Bases:
object
Abstract tagger class from which all taggers inherit.
Subclasses must implement a
tag()
method.
-
class
chemdataextractor.nlp.tag.
NoneTagger
[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Tag every token with None.
-
class
chemdataextractor.nlp.tag.
RegexTagger
(patterns=None, lexicon=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Regular Expression Tagger.
-
__init__
(patterns=None, lexicon=None)[source]¶ Parameters: patterns (list(tuple(string, string))) – List of (regex, tag) pairs.
-
patterns
= [('^-?[0-9]+(.[0-9]+)?$', 'CD'), ('(The|the|A|a|An|an)$', 'AT'), ('.*able$', 'JJ'), ('.*ness$', 'NN'), ('.*ly$', 'RB'), ('.*s$', 'NNS'), ('.*ing$', 'VBG'), ('.*ed$', 'VBD'), ('.*', 'NN')]¶ Regular expression patterns in (regex, tag) tuples.
-
lexicon
= <chemdataextractor.nlp.lexicon.Lexicon object>¶ The lexicon to use
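The default `patterns` list reads naturally as an ordered rule set in which the first matching pattern decides the tag. The sketch below applies the documented patterns that way using `re.match`; the first-match-wins behaviour and the `regex_tag` helper are assumptions for illustration, not the class's confirmed implementation.

```python
import re

# Patterns copied from the documented default; the final '.*' acts as a catch-all.
PATTERNS = [
    (r"^-?[0-9]+(.[0-9]+)?$", "CD"),
    (r"(The|the|A|a|An|an)$", "AT"),
    (r".*able$", "JJ"),
    (r".*ness$", "NN"),
    (r".*ly$", "RB"),
    (r".*s$", "NNS"),
    (r".*ing$", "VBG"),
    (r".*ed$", "VBD"),
    (r".*", "NN"),
]
COMPILED = [(re.compile(regex), tag) for regex, tag in PATTERNS]


def regex_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        tag = next((t for rx, t in COMPILED if rx.match(token)), None)
        tagged.append((token, tag))
    return tagged


result = regex_tag(["The", "reaction", "was", "quickly", "heated", "3.5", "times"])
```

Under this reading the order of the list matters: "was" is tagged `NNS` by the `.*s$` rule before the catch-all is reached, which shows how coarse such a tagger is.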
-
-
class
chemdataextractor.nlp.tag.
AveragedPerceptron
[source]¶ Bases:
object
Averaged Perceptron implementation.
Based on implementation by Matthew Honnibal, released under the MIT license.
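For readers unfamiliar with the technique, a compact averaged perceptron can be sketched as follows. This is an independent minimal version written for illustration; `TinyAveragedPerceptron` and the toy suffix features are hypothetical and do not reproduce this class's code or interface.

```python
from collections import defaultdict


class TinyAveragedPerceptron:
    """Multiclass perceptron whose final weights are averaged over all update steps."""

    def __init__(self, classes):
        self.classes = sorted(classes)
        self.weights = defaultdict(lambda: defaultdict(float))  # feature -> class -> weight
        self._totals = defaultdict(float)  # (feature, class) -> accumulated weight
        self._stamps = defaultdict(int)    # (feature, class) -> step of last change
        self.steps = 0

    def predict(self, features):
        scores = {c: 0.0 for c in self.classes}
        for f in features:
            for c, w in self.weights.get(f, {}).items():
                scores[c] += w
        # Ties are broken by class name so that results are deterministic.
        return max(self.classes, key=lambda c: (scores[c], c))

    def update(self, truth, guess, features):
        self.steps += 1
        if truth == guess:
            return
        for f in features:
            self._bump(f, truth, 1.0)
            self._bump(f, guess, -1.0)

    def _bump(self, f, c, delta):
        key = (f, c)
        w = self.weights[f][c]
        # Credit the old weight for every step it was in force, then change it.
        self._totals[key] += (self.steps - self._stamps[key]) * w
        self._stamps[key] = self.steps
        self.weights[f][c] = w + delta

    def average(self):
        for f, per_class in self.weights.items():
            for c, w in per_class.items():
                key = (f, c)
                total = self._totals[key] + (self.steps - self._stamps[key]) * w
                per_class[c] = total / self.steps if self.steps else 0.0


data = [({"suffix=ly"}, "RB"), ({"suffix=ed"}, "VBD"), ({"suffix=ing"}, "VBG")]
model = TinyAveragedPerceptron({"RB", "VBD", "VBG"})
for _ in range(5):
    for features, truth in data:
        model.update(truth, model.predict(features), features)
model.average()
prediction_ly = model.predict({"suffix=ly"})
prediction_ed = model.predict({"suffix=ed"})
```

Averaging the weights over all steps, rather than keeping only the final values, is what makes the model less sensitive to the order of the last few training examples.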
-
class
chemdataextractor.nlp.tag.
ApTagger
(model=None, lexicon=None, clusters=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Greedy Averaged Perceptron tagger, based on implementation by Matthew Honnibal, released under the MIT license.
See more: http://spacy.io/blog/part-of-speech-POS-tagger-in-python/ and https://github.com/sloria/textblob-aptagger
-
START
= ['-START-', '-START2-']¶
-
lexicon
= <chemdataextractor.nlp.lexicon.Lexicon object>¶
-
clusters
= False¶
-
class
chemdataextractor.nlp.tag.
CrfTagger
(model=None, lexicon=None, clusters=None, params=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Tagger that uses Conditional Random Fields (CRF).
-
lexicon
= <chemdataextractor.nlp.lexicon.Lexicon object>¶
-
clusters
= False¶
-
params
= {'c1': 1.0, 'c2': 0.001, 'feature.possible_states': False, 'feature.possible_transitions': False, 'max_iterations': 50}¶ Parameters to pass to training algorithm. See http://www.chokkan.org/software/crfsuite/manual.html
-
-
class
chemdataextractor.nlp.tag.
DictionaryTagger
(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]¶ Bases:
chemdataextractor.nlp.tag.BaseTagger
Dictionary Tagger. Tag tokens based on inclusion in a DAWG.
-
delimiters
= re.compile('(^.|\\b|\\s|\\W|.$)')¶ Delimiters that define where matches are allowed to start or end.
-
__init__
(words=None, model=None, entity=None, case_sensitive=None, lexicon=None)[source]¶ Parameters: words (list(list(string))) – list of words, each of which is a list of tokens.
-
model
= None¶ DAWG model file path.
-
entity
= 'CM'¶ Entity tag. Matches will be tagged like ‘B-CM’ and ‘I-CM’ according to IOB scheme. TODO: Optional no B/I?
-
case_sensitive
= False¶ Whether dictionary matches are case sensitive.
-
lexicon
= <chemdataextractor.nlp.lexicon.Lexicon object>¶ The lexicon to use.
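The IOB tagging of multi-token dictionary entries can be sketched with a plain Python set standing in for the DAWG. The `iob_dictionary_tag` helper and its longest-match-first strategy are assumptions for illustration; only the `entity` and `case_sensitive` parameter names come from the class above.

```python
def iob_dictionary_tag(tokens, entries, entity="CM", case_sensitive=False):
    """Tag tokens that match a dictionary entry with B-/I- tags, longest match first."""
    norm = (lambda s: s) if case_sensitive else (lambda s: s.lower())
    lookup = {tuple(norm(t) for t in entry) for entry in entries}
    max_len = max((len(entry) for entry in lookup), default=0)
    normed = [norm(t) for t in tokens]
    tags = [None] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(normed[i:i + n]) in lookup:
                matched = n
                break
        if matched:
            tags[i] = "B-" + entity
            for j in range(i + 1, i + matched):
                tags[j] = "I-" + entity
            i += matched
        else:
            i += 1
    return list(zip(tokens, tags))


words = [["sodium", "chloride"], ["ethanol"]]
tokens = ["Add", "Sodium", "chloride", "and", "ethanol"]
insensitive = iob_dictionary_tag(tokens, words)
sensitive = iob_dictionary_tag(tokens, words, case_sensitive=True)
```

The two calls show the effect of `case_sensitive`: "Sodium chloride" is only matched when case is ignored, which mirrors the split between the case-insensitive and case-sensitive CEM dictionary taggers.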
-
.nlp.tokenize¶
Word and sentence tokenizers.
-
class
chemdataextractor.nlp.tokenize.
BaseTokenizer
[source]¶ Bases:
object
Abstract base class from which all Tokenizer classes inherit.
Subclasses must implement a
span_tokenize(text)
method that returns a list of integer offset tuples that identify tokens in the text.
-
chemdataextractor.nlp.tokenize.
regex_span_tokenize
(s, regex)[source]¶ Return spans that identify tokens in s split using regex.
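A span-based tokenizer returns character offsets rather than strings, so tokens can always be mapped back to the original text. The sketch below shows one way such a function could work; `regex_span_tokenize_sketch` is a hypothetical stand-in, and dropping empty spans is an assumption rather than documented behaviour.

```python
import re


def regex_span_tokenize_sketch(s, regex):
    """Yield (start, end) offsets of the pieces of `s` lying between matches of `regex`."""
    left = 0
    for match in re.finditer(regex, s):
        right, next_left = match.span()
        if right > left:
            yield left, right
        left = next_left
    if left < len(s):
        yield left, len(s)


text = "BH4 in  MeOH"
spans = list(regex_span_tokenize_sketch(text, r"\s+"))
tokens = [text[start:end] for start, end in spans]
```

A `span_tokenize(text)` method in a tokenizer subclass could delegate to a function like this and derive the token strings by slicing, as the last line does.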
-
class
chemdataextractor.nlp.tokenize.
SentenceTokenizer
(model=None)[source]¶ Bases:
chemdataextractor.nlp.tokenize.BaseTokenizer
Sentence tokenizer that uses the Punkt algorithm by Kiss & Strunk (2006).
-
model
= 'models/punkt_english.pickle'¶
-
-
class
chemdataextractor.nlp.tokenize.
ChemSentenceTokenizer
(model=None)[source]¶ Bases:
chemdataextractor.nlp.tokenize.SentenceTokenizer
Sentence tokenizer that uses the Punkt algorithm by Kiss & Strunk (2006), trained on chemistry text.
-
model
= 'models/punkt_chem-1.0.pickle'¶
-
-
class
chemdataextractor.nlp.tokenize.
WordTokenizer
(split_last_stop=True)[source]¶ Bases:
chemdataextractor.nlp.tokenize.BaseTokenizer
Standard word tokenizer for generic English text.
-
SPLIT
= ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '``', "''", '->', '<', '>', '–', '—', '―', '~', '⁓', '∼', '°', ';', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '→', '⇄', '"', '“', '”', '„', '‟', '‘', '‚', '‛', '`', '´', '′', '″', '‴', '‵', '‶', '‷', '⁗', '(', '[', '{', '}', ']', ')', '/', '⁄', '∕', '−', '‒', '+', '±']¶ Split before and after these sequences, wherever they occur, unless entire token is one of these sequences
-
SPLIT_NO_DIGIT
= [':', ',']¶ Split around these sequences unless they are followed by a digit
-
SPLIT_START_WORD
= ["''", '``', "'"]¶ Split after these sequences if they start a word
-
SPLIT_END_WORD
= ["'s", "'m", "'d", "'ll", "'re", "'ve", "n't", "''", "'", '’s', '’m', '’d', '’ll', '’re', '’ve', 'n’t', '’', '’’']¶ Split before these sequences if they end a word
-
NO_SPLIT_STOP
= ['...', 'al.', 'Co.', 'Ltd.', 'Pvt.', 'A.D.', 'B.C.', 'B.V.', 'S.D.', 'U.K.', 'U.S.', 'r.t.']¶ Don’t split full stop off last token if it is one of these sequences
-
CONTRACTIONS
= [('cannot', 3), ("d'ye", 1), ('d’ye', 1), ('gimme', 3), ('gonna', 3), ('gotta', 3), ('lemme', 3), ("mor'n", 3), ('mor’n', 3), ('wanna', 3), ("'tis", 2), ("'twas", 2)]¶ Split these contractions at the specified index
-
NO_SPLIT
= {'mm-hm', 'mm-mm', 'o-kay', 'uh-huh', 'uh-oh', 'wanna-be'}¶ Don’t split these sequences.
-
NO_SPLIT_PREFIX
= {'a', 'agro', 'ante', 'anti', 'arch', 'be', 'bi', 'bio', 'co', 'counter', 'cross', 'cyber', 'de', 'e', 'eco', 'ex', 'extra', 'inter', 'intra', 'macro', 'mega', 'micro', 'mid', 'mini', 'multi', 'neo', 'non', 'over', 'pan', 'para', 'peri', 'post', 'pre', 'pro', 'pseudo', 'quasi', 're', 'semi', 'sub', 'super', 'tri', 'u', 'ultra', 'un', 'uni', 'vice', 'x'}¶ Don’t split around hyphens with these prefixes
-
NO_SPLIT_SUFFIX
= {'-o-torium', 'esque', 'ette', 'fest', 'fold', 'gate', 'itis', 'less', 'most', 'rama', 'wise'}¶ Don’t split around hyphens with these suffixes.
-
NO_SPLIT_CHARS
= '0123456789,\'"“”„‟‘’‚‛`´′″‴‵‶‷⁗'¶ Don’t split around hyphens if only these characters before or after.
-
__init__
(split_last_stop=True)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
split_last_stop
= None¶ Whether to split off the final full stop (unless preceded by NO_SPLIT_STOP). Default True.
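Three of the rules above are easy to illustrate in isolation: `CONTRACTIONS`, `SPLIT_NO_DIGIT`, and `split_last_stop` with `NO_SPLIT_STOP`. The helpers below are hypothetical and use shortened copies of the documented lists; how the tokenizer combines and orders these rules, and whether comparisons ignore case, are assumptions.

```python
import re

# Shortened copies of the documented lists, for illustration only.
CONTRACTIONS = [("cannot", 3), ("gimme", 3), ("gonna", 3), ("'tis", 2)]
NO_SPLIT_STOP = ["...", "al.", "Co.", "Ltd.", "U.K.", "r.t."]


def split_contraction(token):
    """Split a known contraction at its listed index (case-insensitive, an assumption)."""
    for text, index in CONTRACTIONS:
        if token.lower() == text:
            return [token[:index], token[index:]]
    return [token]


def split_no_digit(token):
    """Split around ':' and ',' unless the character is followed by a digit."""
    parts = re.split(r"([:,])(?!\d)", token)
    return [part for part in parts if part]


def split_last_stop(tokens):
    """Split a trailing full stop off the last token unless it is in NO_SPLIT_STOP."""
    if not tokens:
        return tokens
    last = tokens[-1]
    if last.endswith(".") and len(last) > 1 and last not in NO_SPLIT_STOP:
        return tokens[:-1] + [last[:-1], "."]
    return tokens


contraction = split_contraction("Cannot")
thousands = split_no_digit("1,000")
listing = split_no_digit("water,ethanol")
kept = split_last_stop(["stirred", "at", "r.t."])
split = split_last_stop(["was", "heated."])
```

Keeping "1,000" and "r.t." intact while still splitting "water,ethanol" and "heated." is the point of having exception lists alongside the general split rules.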
-
-
class
chemdataextractor.nlp.tokenize.
ChemWordTokenizer
(split_last_stop=True)[source]¶ Bases:
chemdataextractor.nlp.tokenize.WordTokenizer
Word Tokenizer for chemistry text.
-
SPLIT
= ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '<', ').', '.(', '–', '—', '―', '~', '⁓', '∼', '°', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '⇄', '"', '“', '”', '„', '‟', '‘', '‚', '‛', '`', '´']¶ Split before and after these sequences, wherever they occur, unless entire token is one of these sequences
-
SPLIT_END
= [':', ',', '(TM)', '(R)', '(®)', '(™)', '(■)', '(◼)', '(●)', '(▲)', '(○)', '(◆)', '(▼)', '(⧫)', '(△)', '(◇)', '(▽)', '(⬚)', '(×)', '(□)', '(•)', '’', '°C']¶ Split before these sequences if they end a token
-
SPLIT_END_NO_DIGIT
= ['(aq)', '(aq.)', '(s)', '(l)', '(g)']¶ Split before these sequences if they end a token, unless preceded by a digit
-
NO_SPLIT_SLASH
= ['+', '-', '−']¶ Don’t split around slash when both preceded and followed by these characters
-
QUANTITY_RE
= re.compile('^((?P<split>\\d\\d\\d)g|(?P<_split1>[-−]?\\d+\\.\\d+|10[-−]\\d+)(g|s|m|N|V)([-−]?[1-4])?|(?P<_split2>\\d*[-−]?\\d+\\.?\\d*)([pnµμm]A|[µμmk]g|[kM]J|m[lL]|[nµμm]?M|[nµμmc]m|kN|[mk]V|[mkMG]?W|[mnpμµ]s|H)¶ Regular expression that matches a numeric quantity with units
-
NO_SPLIT_PREFIX_ENDING
= re.compile('(^\\(.*\\)|^[\\d,\'"“”„‟‘’‚‛`´′″‴‵‶‷⁗Α-Ωα-ω]+|ano|ato|azo|boc|bromo|cbz|chloro|eno|fluoro|fmoc|ido|ino|io|iodo|mercapto|nitro|ono|oso|oxalo|oxo|oxy|phospho|telluro|tms|yl|ylen|ylene|yliden|ylidene|yl)¶ Don’t split on hyphen if the prefix matches this regular expression
-
NO_SPLIT_CHEM
= re.compile('([\\-α-ω]|\\d+,\\d+|\\d+[A-Z]|^d\\d\\d?$|acetic|acetyl|acid|acyl|anol|azo|benz|bromo|carb|cbz|chlor|cyclo|ethan|ethyl|fluoro|fmoc|gluc|hydro|idyl|indol|iene|ione|iodo|mercapto|n,n|nitro|noic|o,o|oxal, re.IGNORECASE)¶ Don’t split on hyphen if prefix or suffix match this regular expression
-
NO_SPLIT_PREFIX
= {'a', 'aci', 'adeno', 'agro', 'aldehydo', 'allo', 'alpha', 'altro', 'ambi', 'ante', 'anti', 'aorto', 'arachno', 'arch', 'as', 'be', 'beta', 'bi', 'bio', 'bis', 'catena', 'centi', 'chi', 'chiro', 'circum', 'cis', 'closo', 'co', 'colo', 'conjuncto', 'conta', 'contra', 'cortico', 'cosa', 'counter', 'cran', 'cross', 'crypto', 'cyber', 'cyclo', 'de', 'deca', 'deci', 'delta', 'demi', 'di', 'dis', 'dl', 'e', 'eco', 'electro', 'endo', 'ennea', 'ent', 'epi', 'epsilon', 'erythro', 'eta', 'ex', 'exo', 'extra', 'ferro', 'galacto', 'gamma', 'gastro', 'giga', 'gluco', 'glycero', 'graft', 'gulo', 'hemi', 'hepta', 'hexa', 'homo', 'hydro', 'hypho', 'hypo', 'ideo', 'idio', 'in', 'infra', 'inter', 'intra', 'iota', 'iso', 'judeo', 'kappa', 'keto', 'kis', 'lambda', 'lyxo', 'macro', 'manno', 'medi', 'mega', 'meso', 'meta', 'micro', 'mid', 'milli', 'mini', 'mono', 'mu', 'muco', 'multi', 'musculo', 'myo', 'nano', 'neo', 'neuro', 'nido', 'nitro', 'non', 'nona', 'nor', 'novem', 'novi', 'nu', 'octa', 'octi', 'octo', 'omega', 'omicron', 'ortho', 'over', 'paleo', 'pan', 'para', 'pelvi', 'penta', 'peri', 'pheno', 'phi', 'pi', 'pica', 'pneumo', 'poly', 'post', 'pre', 'preter', 'pro', 'pseudo', 'psi', 'quadri', 'quasi', 'quater', 'quinque', 're', 'recto', 'rho', 'ribo', 'salpingo', 'scyllo', 'sec', 'semi', 'sept', 'septi', 'sero', 'sesqui', 'sexi', 'sigma', 'sn', 'soci', 'sub', 'super', 'supra', 'sur', 'sym', 'syn', 'talo', 'tau', 'tele', 'ter', 'tera', 'tert', 'tetra', 'theta', 'threo', 'trans', 'tri', 'triangulo', 'tris', 'u', 'uber', 'ultra', 'un', 'uni', 'unsym', 'upsilon', 'veno', 'ventriculo', 'vice', 'x', 'xi', 'xylo', 'zeta'}¶ Don’t split on hyphen if the prefix is one of these sequences
-
SPLIT_SUFFIX
= {'absorption', 'abstinent', 'abstraction', 'abuse', 'accelerated', 'accepting', 'acclimated', 'acclimation', 'acid', 'activated', 'activation', 'active', 'activity', 'addition', 'adducted', 'adducts', 'adequate', 'adjusted', 'administrated', 'adsorption', 'affected', 'aged', 'alcohol', 'alcoholic', 'algae', 'alginate', 'alkaline', 'alkylated', 'alkylation', 'alkyne', 'analogous', 'anesthetized', 'appended', 'armed', 'aromatic', 'assay', 'assemblages', 'assisted', 'associated', 'atom', 'atoms', 'attenuated', 'attributed', 'backbone', 'base', 'based', 'bearing', 'benzylation', 'binding', 'biomolecule', 'biotic', 'blocking', 'blood', 'bond', 'bonded', 'bonding', 'bonds', 'boosted', 'bottle', 'bottled', 'bound', 'bridge', 'bridged', 'buffer', 'buffered', 'caged', 'cane', 'capped', 'capturing', 'carrier', 'carrying', 'catalysed', 'catalyzed', 'cation', 'caused', 'centered', 'challenged', 'chelating', 'cleaving', 'coated', 'coating', 'coenzyme', 'competing', 'competitive', 'complex', 'complexes', 'compound', 'compounds', 'concentration', 'conditioned', 'conditions', 'conducting', 'configuration', 'confirmed', 'conjugate', 'conjugated', 'conjugates', 'connectivity', 'consuming', 'contained', 'containing', 'contaminated', 'control', 'converting', 'coordinate', 'coordinated', 'copolymer', 'copolymers', 'core', 'cored', 'cotransport', 'coupled', 'covered', 'crosslinked', 'cyclized', 'damaged', 'dealkylation', 'decocted', 'decorated', 'deethylation', 'deficiency', 'deficient', 'defined', 'degrading', 'demethylated', 'demethylation', 'dendrimer', 'density', 'dependant', 'dependence', 'dependent', 'deplete', 'depleted', 'depleting', 'depletion', 'depolarization', 'depolarized', 'deprived', 'derivatised', 'derivative', 'derivatives', 'derivatized', 'derived', 'desorption', 'detected', 'devalued', 'dextran', 'dextrans', 'diabetic', 'dimensional', 'dimer', 'distribution', 'divalent', 'domain', 'dominated', 'donating', 'donor', 'dopant', 'doped', 'doping', 'dosed', 'dot', 
'drinking', 'driven', 'drug', 'drugs', 'dye', 'edge', 'efficiency', 'electrodeposited', 'electrolyte', 'elevating', 'elicited', 'embedded', 'emersion', 'emitting', 'encapsulated', 'encapsulating', 'enclosed', 'enhanced', 'enhancing', 'enriched', 'enrichment', 'enzyme', 'epidermal', 'equivalents', 'etched', 'ethanolamine', 'evoked', 'exchange', 'excimer', 'excluder', 'expanded', 'experimental', 'exposed', 'exposure', 'expressing', 'extract', 'extraction', 'fed', 'finger', 'fixed', 'fixing', 'flanking', 'flavonoid', 'fluorescence', 'formation', 'forming', 'fortified', 'free', 'function', 'functionalised', 'functionalized', 'functionalyzed', 'fused', 'gas', 'gated', 'generating', 'glucuronidating', 'glycoprotein', 'glycosylated', 'glycosylation', 'gradient', 'grafted', 'group', 'groups', 'halogen', 'heterocyclic', 'homologues', 'hydrogel', 'hydrolyzing', 'hydroxylated', 'hydroxylation', 'hydroxysteroid', 'immersion', 'immobilized', 'immunoproteins', 'impregnated', 'imprinted', 'inactivated', 'increased', 'increasing', 'incubated', 'independent', 'induce', 'induced', 'inducible', 'inducing', 'induction', 'influx', 'inhibited', 'inhibitor', 'inhibitory', 'initiated', 'injected', 'insensitive', 'insulin', 'integrated', 'interlinked', 'intermediate', 'intolerant', 'intoxicated', 'ion', 'ions', 'island', 'isomer', 'isomers', 'knot', 'label', 'labeled', 'labeling', 'labelled', 'laden', 'lamp', 'laser', 'layer', 'layers', 'lesioned', 'ligand', 'ligated', 'like', 'limitation', 'limited', 'limiting', 'lined', 'linked', 'linker', 'lipid', 'lipids', 'lipoprotein', 'liposomal', 'liposomes', 'liquid', 'liver', 'loaded', 'loading', 'locked', 'loss', 'lowering', 'lubricants', 'luminance', 'luminescence', 'maintained', 'majority', 'making', 'mannosylated', 'material', 'mediated', 'metabolizing', 'metal', 'metallized', 'methylation', 'migrated', 'mimetic', 'mimicking', 'mixed', 'mixture', 'mode', 'model', 'modified', 'modifying', 'modulated', 'moiety', 'molecule', 'monoadducts', 
'monomer', 'mutated', 'nanogel', 'nanoparticle', 'nanotube', 'need', 'negative', 'nitrosated', 'nitrosation', 'nitrosylation', 'nmr', 'noncompetitive', 'normalized', 'nuclear', 'nucleoside', 'nucleosides', 'nucleotide', 'nucleotides', 'nutrition', 'olefin', 'olefins', 'oligomers', 'omitted', 'only', 'outcome', 'overload', 'oxidation', 'oxidized', 'oxo-mediated', 'oxygenation', 'page', 'paired', 'pathway', 'patterned', 'peptide', 'permeabilized', 'permeable', 'phase', 'phospholipids', 'phosphopeptide', 'phosphorylated', 'pillared', 'placebo', 'planted', 'plasma', 'polymer', 'polymers', 'poor', 'porous', 'position', 'positive', 'postlabeling', 'precipitated', 'preferring', 'pretreated', 'primed', 'produced', 'producing', 'production', 'promoted', 'promoting', 'protected', 'protein', 'proteomic', 'protonated', 'provoked', 'purified', 'radical', 'reacting', 'reaction', 'reactive', 'reagents', 'rearranged', 'receptor', 'receptors', 'recognition', 'redistribution', 'redox', 'reduced', 'reducing', 'reduction', 'refractory', 'refreshed', 'regenerating', 'regulated', 'regulating', 'regulatory', 'related', 'release', 'releasing', 'replete', 'requiring', 'resistance', 'resistant', 'resitant', 'response', 'responsive', 'responsiveness', 'restricted', 'resulted', 'retinal', 'reversible', 'ribosylated', 'ribosylating', 'ribosylation', 'rich', 'right', 'ring', 'saturated', 'scanning', 'scavengers', 'scavenging', 'sealed', 'secreting', 'secretion', 'seeking', 'selective', 'selectivity', 'semiconductor', 'sensing', 'sensitive', 'sensitized', 'soluble', 'solution', 'solvent', 'sparing', 'specific', 'spiked', 'stabilised', 'stabilized', 'stabilizing', 'stable', 'stained', 'steroidal', 'stimulated', 'stimulating', 'storage', 'stressed', 'stripped', 'substituent', 'substituted', 'substitution', 'substrate', 'sufficient', 'sugar', 'sugars', 'supplemented', 'supported', 'suppressed', 'surface', 'susceptible', 'sweetened', 'synthesizing', 'tagged', 'target', 'telopeptide', 'terminal', 
'terminally', 'terminated', 'termini', 'terminus', 'ternary', 'terpolymer', 'tertiary', 'tested', 'testes', 'tethered', 'tetrabrominated', 'tolerance', 'tolerant', 'toxicity', 'toxin', 'tracer', 'transfected', 'transfer', 'transition', 'transport', 'transporter', 'treated', 'treating', 'treatment', 'triggered', 'turn', 'type', 'unesterified', 'untreated', 'vacancies', 'vacancy', 'variable', 'water', 'yeast', 'yield', 'zwitterion'}¶ Split on hyphens followed by one of these sequences
-
NO_SPLIT
= {'°c'}¶
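The interplay between `SPLIT_SUFFIX` and `NO_SPLIT_PREFIX` for hyphenated tokens can be sketched as a small decision function. The sets below are short excerpts of the documented ones, `hyphen_decision` is a hypothetical helper, and checking the suffix before the prefix is an assumption; the regular-expression rules (`NO_SPLIT_CHEM`, `NO_SPLIT_PREFIX_ENDING`) are left out entirely.

```python
# Short excerpts of the documented sets, for illustration only.
NO_SPLIT_PREFIX = {"anti", "cis", "non", "tert", "trans"}
SPLIT_SUFFIX = {"based", "containing", "doped", "induced", "like"}


def hyphen_decision(token):
    """Return 'split', 'keep' or 'undecided' for a token containing one hyphen."""
    if token.count("-") != 1:
        return "undecided"
    prefix, _, suffix = token.partition("-")
    if suffix.lower() in SPLIT_SUFFIX:
        return "split"      # e.g. a chemical name followed by a generic word
    if prefix.lower() in NO_SPLIT_PREFIX:
        return "keep"       # e.g. a stereo or structural prefix that is part of the name
    return "undecided"      # other rules would have to decide


decisions = {t: hyphen_decision(t) for t in ["TiO2-based", "tert-butyl", "iron-sulfur"]}
```

Returning "undecided" rather than a default makes it explicit that these two lists cover only part of the hyphen handling.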
-
-
class
chemdataextractor.nlp.tokenize.
FineWordTokenizer
(split_last_stop=True)[source]¶ Bases:
chemdataextractor.nlp.tokenize.WordTokenizer
Word Tokenizer that also splits around hyphens and all colons.
-
SPLIT
= ['----', '––––', '————', '<--->', '---', '–––', '———', '<-->', '-->', '...', '--', '––', '——', '``', "''", '->', '<', '>', '–', '—', '―', '~', '⁓', '∼', '°', ';', '@', '#', '$', '£', '€', '%', '&', '?', '!', '™', '®', '…', '⋯', '†', '‡', '§', '¶≠', '≡', '≢', '≣', '≤', '≥', '≦', '≧', '≨', '≩', '≪', '≫', '≈', '=', '÷', '×', '→', '⇄', '"', '“', '”', '„', '‟', '‘', '’', '‚', '‛', '`', '´', '′', '″', '‴', '‵', '‶', '‷', '⁗', '(', '[', '{', '}', ']', ')', '/', '⁄', '∕', '-', '−', '‒', '‐', '‑', '+', '±', ':']¶ Split before and after these sequences, wherever they occur, unless entire token is one of these sequences
-
SPLIT_NO_DIGIT
= [',']¶ Split before these sequences if they end a token
-
NO_SPLIT
= {}¶
-
NO_SPLIT_PREFIX
= {}¶ Don’t split around hyphens with these prefixes
-
NO_SPLIT_SUFFIX
= {}¶ Don’t split around hyphens with these suffixes.
-