.text¶
Tools for processing text.
-
chemdataextractor.text.
CONTROLS
= {'\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '...¶ Control characters.
-
chemdataextractor.text.
HYPHENS
= {'-', '‐', '‑', '‒', '–', '—', '―', '⁃'}¶ Hyphen and dash characters.
-
chemdataextractor.text.
MINUSES
= {'-', '⁻', '−', '－'}¶ Minus characters.
-
chemdataextractor.text.
PLUSES
= {'+', '⁺', '＋'}¶ Plus characters.
-
chemdataextractor.text.
SLASHES
= {'/', '⁄', '∕'}¶ Slash characters.
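One way such character sets can be used is to fold typographic variants onto a single ASCII character before further processing. The following is a minimal sketch, not the library's own code: the two sets are copied from the listing above (written as escapes), while `build_table` and `to_ascii_punctuation` are hypothetical helper names introduced only for illustration.

```python
# Local copies of two of the sets listed above, written with escapes.
HYPHENS = {'-', '\u2010', '\u2011', '\u2012', '\u2013', '\u2014', '\u2015', '\u2043'}
SLASHES = {'/', '\u2044', '\u2215'}


def build_table(mapping):
    """Build a str.translate table from {ascii_target: set_of_variants}."""
    table = {}
    for target, variants in mapping.items():
        for ch in variants:
            table[ord(ch)] = target
    return table


# Hypothetical helper table: every hyphen variant -> '-', every slash variant -> '/'.
TABLE = build_table({'-': HYPHENS, '/': SLASHES})


def to_ascii_punctuation(text):
    """Replace hyphen and slash variants with their ASCII equivalents."""
    return text.translate(TABLE)


# An en dash (U+2013) and a division slash (U+2215) are folded to ASCII.
print(to_ascii_punctuation('1\u20132 mol\u2215L'))
```

A translation table is built once and reused, which keeps the per-string cost to a single pass.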
-
chemdataextractor.text.
TILDES
= {'~', '˜', '⁓', '∼', '∽', '∿', '〜', '～'}¶ Tilde characters.
-
chemdataextractor.text.
APOSTROPHES
= {"'", '՚', '’', 'Ꞌ', 'ꞌ', '＇'}¶ Apostrophe characters.
-
chemdataextractor.text.
SINGLE_QUOTES
= {"'", '‘', '’', '‚', '‛'}¶ Single quote characters.
-
chemdataextractor.text.
DOUBLE_QUOTES
= {'"', '“', '”', '„', '‟'}¶ Double quote characters.
-
chemdataextractor.text.
ACCENTS
= {'`', '´'}¶ Accent characters.
-
chemdataextractor.text.
PRIMES
= {'′', '″', '‴', '‵', '‶', '‷', '⁗'}¶ Prime characters.
-
chemdataextractor.text.
QUOTES
= {'"', "'", '`', '´', '՚', '‘', '’', '‚', '‛', '“',...¶ Quote characters, including apostrophes, single quotes, double quotes, accents and primes.
-
chemdataextractor.text.
GREEK
= {'Α', 'Β', 'Γ', 'Δ', 'Ε', 'Ζ', 'Η', 'Θ', 'Ι', 'Κ',...¶ Uppercase and lowercase Greek letters.
-
chemdataextractor.text.
GREEK_WORDS
= {'Alpha', 'Beta', 'Chi', 'Delta', 'Epsilon', 'Eta'...¶ Names of Greek letters spelled out as words.
-
chemdataextractor.text.
SMALL
= {'a', 'an', 'and', 'as', 'at', 'but', 'by', 'en', ...¶ Words that should not be capitalized in titles.
-
chemdataextractor.text.
NAME_SMALL
= {'abu', 'bin', 'bon', 'da', 'dal', 'de', 'del', 'd...¶ Words that should not be capitalized in names.
-
chemdataextractor.text.
NUMBERS
= {'billion', 'eight', 'eighteen', 'eighty', 'eleven...¶ A variety of numbers, spelled out as words.
-
chemdataextractor.text.
EMAIL_RE
= re.compile('([\\w\\-\\.\\+%]+@(\\w[\\w\\-]+\\.)+[\...¶ Regular expression that matches email addresses.
-
chemdataextractor.text.
DOI_RE
= re.compile('^10\\.\\d{4,9}/[-\\._;()/:A-Z0-9]+$')¶ Regular expression that matches DOIs.
-
chemdataextractor.text.
ISSN_RE
= re.compile('^\\d{4}-\\d{3}[\\dX]$')¶ Regular expression that matches ISSNs.
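The DOI and ISSN patterns above are anchored at both ends, so they validate a whole string rather than search inside one. The sketch below recompiles the two patterns exactly as displayed; no flags are shown in the listing, so none are assumed, which means lowercase DOI suffixes do not match. The function names `looks_like_doi` and `looks_like_issn` are hypothetical.

```python
import re

# Patterns copied from the listing above, with no extra flags.
DOI_RE = re.compile(r'^10\.\d{4,9}/[-\._;()/:A-Z0-9]+$')
ISSN_RE = re.compile(r'^\d{4}-\d{3}[\dX]$')


def looks_like_doi(s):
    """True if the whole string has the shape of a DOI (uppercase suffix only)."""
    return DOI_RE.match(s) is not None


def looks_like_issn(s):
    """True if the whole string has the shape NNNN-NNNC; the check digit is not verified."""
    return ISSN_RE.match(s) is not None


print(looks_like_doi('10.1039/C6SC01234A'))
print(looks_like_issn('0317-847X'))
```

Because the DOI character class lists only `A-Z`, callers working with mixed-case input may want to uppercase the string first; that step is an assumption about usage, not something the listing states.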
-
chemdataextractor.text.
CONTROL_RE
= re.compile('[^ -\ud7ff\t\n\r\ue000-\ufffd\u10000-\u10ffFF]+')¶ Regular expression that matches control characters not allowed in XML.
-
chemdataextractor.text.
get_encoding
(input_string, guesses=None, is_html=False)[source]¶ Return the encoding of a byte string. Uses bs4 UnicodeDammit.
-
chemdataextractor.text.
levenshtein
(s1, s2, allow_substring=False)[source]¶ Return the Levenshtein distance between two strings.
The Levenshtein distance (also known as the edit distance) is the minimum number of single-character substitutions, insertions or deletions needed to transform s1 into s2.
Setting the allow_substring parameter to True allows s1 to be a substring of s2, so that, for example, “hello” and “hello there” would have a distance of zero.
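The behaviour described above can be illustrated with a standard dynamic-programming sketch. This is not the library's implementation; `edit_distance` is a hypothetical name, and the substring mode is modelled here by letting a match start anywhere in s2 at no cost and end anywhere in s2.

```python
def edit_distance(s1, s2, allow_substring=False):
    """Levenshtein distance between s1 and s2 (illustrative sketch).

    With allow_substring=True, s1 is compared against the best-matching
    substring of s2 instead of the whole of s2.
    """
    # Row 0: cost of turning '' into s2[:j]. In substring mode, skipping
    # a prefix of s2 is free, so the whole row is zero.
    previous = [0 if allow_substring else j for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into '' costs i deletions
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, or free match
            ))
        previous = current
    # In substring mode the match may end anywhere in s2.
    return min(previous) if allow_substring else previous[-1]


print(edit_distance('kitten', 'sitting'))
print(edit_distance('hello', 'hello there', allow_substring=True))
```

Only two rows of the table are kept at a time, so memory use grows with the length of s2 rather than with the product of both lengths.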
-
chemdataextractor.text.
bracket_level
(text, open={'(', '[', '{'}, close={')', ']', '}'})[source]¶ Return 0 if string contains balanced brackets or no brackets.
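The description only states the result for balanced or bracket-free input. A minimal sketch of one plausible reading is a running count of opening minus closing brackets; the non-zero return values shown here are an assumption, and `net_bracket_count` is a hypothetical name, not the library function.

```python
def net_bracket_count(text, open_chars=frozenset('([{'), close_chars=frozenset(')]}')):
    """Return (number of opening brackets) - (number of closing brackets).

    Assumption: a plain counter, so bracket types are not matched against
    each other and ordering is not checked.
    """
    level = 0
    for ch in text:
        if ch in open_chars:
            level += 1
        elif ch in close_chars:
            level -= 1
    return level


print(net_bracket_count('Fe(acac)3'))
print(net_bracket_count('poly(vinyl chloride'))
```

Under this reading a result of 0 is necessary but not sufficient for well-formed nesting: a string such as `')('` also yields 0.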
.text.chem¶
Chemistry text handling tools.
-
chemdataextractor.text.chem.
extract_inchis
(s)[source]¶ Return a list of InChI identifiers extracted from the string.
-
chemdataextractor.text.chem.
extract_inchikeys
(s)[source]¶ Return a list of InChIKey identifiers extracted from the string.
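As an illustration of this kind of extraction, the sketch below finds strings with the standard InChIKey layout (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, 1 uppercase letter) using a regular expression. The pattern and the name `find_inchikeys` are assumptions made for this example; the library's own matching rules may differ.

```python
import re

# 27-character InChIKey layout: XXXXXXXXXXXXXX-XXXXXXXXXX-X (uppercase letters).
INCHIKEY_RE = re.compile(r'\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b')


def find_inchikeys(s):
    """Return all substrings of s that have the InChIKey layout, in order."""
    return INCHIKEY_RE.findall(s)


text = 'Compound A (BSYNRYMUTXBXSQ-UHFFFAOYSA-N) and compound B XLYOFNOQVPJJNP-UHFFFAOYSA-N.'
print(find_inchikeys(text))
```

The word boundaries keep the pattern from matching inside longer runs of letters, but the sketch checks only the layout, not whether a key corresponds to a real structure.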
.text.latex¶
Tools for converting LaTeX to unicode.
.text.normalize¶
Tools for normalizing text.
-
class
chemdataextractor.text.normalize.
BaseNormalizer
[source]¶ Bases:
chemdataextractor.text.processors.BaseProcessor
Abstract normalizer class from which all normalizers inherit.
Subclasses must implement a
normalize()
method.
-
class
chemdataextractor.text.normalize.
Normalizer
(form='NFKC', strip=True, collapse=True, hyphens=False, quotes=False, ellipsis=False, slashes=False, tildes=False)[source]¶ Bases:
chemdataextractor.text.normalize.BaseNormalizer
Main Normalizer class for generic English text.
Normalize unicode, hyphens, quotes, whitespace.
By default, the normal form NFKC is used for unicode normalization. This applies a compatibility decomposition, under which equivalent characters are unified, followed by a canonical composition. See Python docs for information on normal forms: http://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
-
__init__
(form='NFKC', strip=True, collapse=True, hyphens=False, quotes=False, ellipsis=False, slashes=False, tildes=False)[source]¶ - Parameters:
form (string) – Normal form for unicode normalization.
strip (bool) – Whether to strip whitespace from start and end.
collapse (bool) – Whether to collapse all whitespace (tabs, newlines) down to single spaces.
hyphens (bool) – Whether to normalize all hyphens, minuses and dashes to the ASCII hyphen-minus character.
quotes (bool) – Whether to normalize all apostrophes, quotes and primes to the ASCII quote character.
ellipsis (bool) – Whether to normalize ellipses to three full stops.
slashes (bool) – Whether to normalize slash characters to the ASCII slash character.
tildes (bool) – Whether to normalize tilde characters to the ASCII tilde character.
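The first three parameters can be illustrated with the standard library alone. The sketch below is not the Normalizer class; `simple_normalize` is a hypothetical function that applies a Unicode normal form, collapses whitespace and strips the ends, with the order of those steps chosen here as an assumption.

```python
import re
import unicodedata

WHITESPACE_RE = re.compile(r'\s+')


def simple_normalize(text, form='NFKC', strip=True, collapse=True):
    """Apply a Unicode normal form, then tidy whitespace (illustrative sketch)."""
    text = unicodedata.normalize(form, text)
    if collapse:
        # Runs of spaces, tabs and newlines become a single space.
        text = WHITESPACE_RE.sub(' ', text)
    if strip:
        text = text.strip()
    return text


# U+FB01 (the "fi" ligature) is a compatibility character that NFKC expands.
print(simple_normalize('  \ufb01ne\t text \n'))
```

NFKC is lossy by design: superscripts and ligatures are replaced by their plain equivalents, which is convenient for matching but discards typographic distinctions.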
-
-
chemdataextractor.text.normalize.
normalize
= <chemdataextractor.text.normalize.Normalizer objec...¶ Default normalizer instance that canonicalizes unicode and fixes whitespace.
-
chemdataextractor.text.normalize.
strict_normalize
= <chemdataextractor.text.normalize.Normalizer objec...¶ More aggressive normalizer instance that also standardizes hyphens and quotes.
-
class
chemdataextractor.text.normalize.
ExcessNormalizer
(form='NFKC', strip=True, collapse=True, hyphens=True, quotes=True, ellipsis=True, tildes=True)[source]¶ Bases:
chemdataextractor.text.normalize.Normalizer
Excessive string normalization.
This is useful when doing fuzzy string comparisons. A common use case is to run this before calculating the Levenshtein distance between two strings, so that only “important” differences are counted.
-
class
chemdataextractor.text.normalize.
ChemNormalizer
(form='NFKC', strip=True, collapse=True, hyphens=True, quotes=True, ellipsis=True, tildes=True, chem_spell=True)[source]¶ Bases:
chemdataextractor.text.normalize.Normalizer
Normalizer that also unifies chemical spelling.
.text.processors¶
Text processors.
-
class
chemdataextractor.text.processors.
BaseProcessor
[source]¶ Bases:
object
Abstract processor class from which all processors inherit. Subclasses must implement a
__call__()
method.
-
class
chemdataextractor.text.processors.
Chain
(*callables)[source]¶ Bases:
object
Apply a series of processors in turn. Stops if a processor returns None.
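The chaining behaviour can be sketched as a small closure. This is an illustration under stated assumptions, not the Chain class itself: `chain` is a hypothetical function, and the processors in the example are plain callables invented for the demonstration.

```python
def chain(*callables):
    """Return a function that applies each callable in turn.

    Processing stops as soon as one callable returns None, and None is
    then the overall result.
    """
    def run(value):
        for func in callables:
            value = func(value)
            if value is None:
                break
        return value
    return run


def discard_placeholder(value):
    """Hypothetical processor: treat 'N/A' as a missing value."""
    return None if value == 'N/A' else value


pipeline = chain(str.strip, discard_placeholder, str.upper)
print(pipeline('  abc '))
print(pipeline(' N/A '))
```

Stopping on None means later processors never have to guard against a missing value themselves.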
-
class
chemdataextractor.text.processors.
Discard
(*match)[source]¶ Bases:
object
Return None if value matches a string.
-
class
chemdataextractor.text.processors.
LAdd
(substring)[source]¶ Bases:
object
Add a substring to the start of a value.
-
class
chemdataextractor.text.processors.
RAdd
(substring)[source]¶ Bases:
object
Add a substring to the end of a value.
-
class
chemdataextractor.text.processors.
LStrip
(*substrings)[source]¶ Bases:
object
Remove a substring from the start of a value.
-
class
chemdataextractor.text.processors.
RStrip
(*substrings)[source]¶ Bases:
object
Remove a substring from the end of a value.
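The strip processors differ from `str.lstrip` and `str.rstrip`, which remove any of a set of characters rather than a whole substring. The sketch below shows substring removal under one assumption that the descriptions above do not settle: when several substrings are given, only the first one that matches is removed. The names `lstrip_first` and `rstrip_first` are hypothetical.

```python
def lstrip_first(value, *substrings):
    """Remove the first matching substring from the start of value."""
    for substring in substrings:
        if substring and value.startswith(substring):
            return value[len(substring):]
    return value


def rstrip_first(value, *substrings):
    """Remove the first matching substring from the end of value."""
    for substring in substrings:
        if substring and value.endswith(substring):
            return value[:-len(substring)]
    return value


print(lstrip_first('doi:10.1000/X', 'doi:', 'DOI:'))
print(rstrip_first('Figure 1.', '.'))
# str.lstrip treats its argument as a character set, so it over-strips here:
print('dioxide'.lstrip('doi:'))
```

The last line shows why a whole-substring helper is useful: the built-in method strips the leading `d`, `i` and `o` from `dioxide` even though the prefix `doi:` is not present.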
-
chemdataextractor.text.processors.
floats
(s)[source]¶ Convert string to float. Handles more string formats than the standard Python conversion.
-
chemdataextractor.text.processors.
strip_querystring
(url)[source]¶ Remove the querystring from the end of a URL.
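A comparable result can be obtained with the standard library's URL parser. The sketch below is not the library function; `drop_querystring` is a hypothetical name, and dropping the fragment along with the query is an assumption made here for simplicity.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_querystring(url):
    """Return url without its query string (and, by assumption, its fragment)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


print(drop_querystring('http://example.com/a/b?x=1&y=2'))
```

Parsing the URL rather than splitting on `?` avoids surprises when a question mark appears inside the fragment.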
-
class
chemdataextractor.text.processors.
Substitutor
(substitutions)[source]¶ Bases:
object
Perform a list of substitutions defined by regex on text.
Useful to clean up text where placeholders are used in place of actual unicode characters.
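The idea can be sketched as a callable that holds a list of (pattern, replacement) pairs and applies them in order. This is an illustration, not the Substitutor class: `RegexSubstitutor`, the pair format and the bracketed placeholders in the example are all assumptions.

```python
import re


class RegexSubstitutor:
    """Apply a list of (regex pattern, replacement) pairs to text, in order."""

    def __init__(self, substitutions):
        # Compile once so repeated calls do not re-parse the patterns.
        self.substitutions = [(re.compile(p), r) for p, r in substitutions]

    def __call__(self, text):
        for pattern, replacement in self.substitutions:
            text = pattern.sub(replacement, text)
        return text


# Hypothetical placeholders standing in for actual unicode characters.
fix_placeholders = RegexSubstitutor([
    (r'\[alpha\]', '\u03b1'),
    (r'\[deg\]', '\u00b0'),
])
print(fix_placeholders('[alpha]-pinene at 25[deg]C'))
```

Because the pairs are applied sequentially, the output of an earlier substitution can be matched by a later one, so ordering matters when patterns overlap.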