.text

Useful tools for processing text

Tools for processing text.

chemdataextractor.text.CONTROLS = {'\x01', '\x02', '\x03', '\x04', '\x05', '\x06', '...

Control characters.

chemdataextractor.text.HYPHENS = {'-', '‐', '‑', '‒', '–', '—', '―', '⁃'}

Hyphen and dash characters.

chemdataextractor.text.MINUSES = {'-', '⁻', '−', '－'}

Minus characters.

chemdataextractor.text.PLUSES = {'+', '⁺', '＋'}

Plus characters.

chemdataextractor.text.SLASHES = {'/', '⁄', '∕'}

Slash characters.

chemdataextractor.text.TILDES = {'~', '˜', '⁓', '∼', '∽', '∿', '〜', '～'}

Tilde characters.

chemdataextractor.text.APOSTROPHES = {"'", '՚', '’', 'Ꞌ', 'ꞌ', '＇'}

Apostrophe characters.

chemdataextractor.text.SINGLE_QUOTES = {"'", '‘', '’', '‚', '‛'}

Single quote characters.

chemdataextractor.text.DOUBLE_QUOTES = {'"', '“', '”', '„', '‟'}

Double quote characters.

chemdataextractor.text.ACCENTS = {'`', '´'}

Accent characters.

chemdataextractor.text.PRIMES = {'′', '″', '‴', '‵', '‶', '‷', '⁗'}

Prime characters.

chemdataextractor.text.QUOTES = {'"', "'", '`', '´', '՚', '‘', '’', '‚', '‛', '“',...

Quote characters, including apostrophes, single quotes, double quotes, accents and primes.

chemdataextractor.text.GREEK = {'Α', 'Β', 'Γ', 'Δ', 'Ε', 'Ζ', 'Η', 'Θ', 'Ι', 'Κ',...

Uppercase and lowercase Greek letters.

chemdataextractor.text.GREEK_WORDS = {'Alpha', 'Beta', 'Chi', 'Delta', 'Epsilon', 'Eta'...

Names of Greek letters spelled out as words.

chemdataextractor.text.SMALL = {'a', 'an', 'and', 'as', 'at', 'but', 'by', 'en', ...

Words that should not be capitalized in titles.

chemdataextractor.text.NAME_SMALL = {'abu', 'bin', 'bon', 'da', 'dal', 'de', 'del', 'd...

Words that should not be capitalized in names.

chemdataextractor.text.NUMBERS = {'billion', 'eight', 'eighteen', 'eighty', 'eleven...

A variety of numbers, spelled out as words.

chemdataextractor.text.EMAIL_RE = re.compile('([\\w\\-\\.\\+%]+@(\\w[\\w\\-]+\\.)+[\...

Regular expression that matches email addresses.

chemdataextractor.text.DOI_RE = re.compile('^10\\.\\d{4,9}/[-\\._;()/:A-Z0-9]+$')

Regular expression that matches DOIs.

chemdataextractor.text.ISSN_RE = re.compile('^\\d{4}-\\d{3}[\\dX]$')

Regular expression that matches ISSNs.
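As a quick illustration, the DOI and ISSN patterns can be exercised with the standard re module. This sketch recompiles the patterns as printed above rather than importing them, so any flags the library may pass to re.compile are not reflected. The helper names and sample identifiers are hypothetical.

```python
import re

# Patterns copied from the reprs shown above. Assumption: no extra
# flags are applied, since the reprs do not show any.
DOI_RE = re.compile(r'^10\.\d{4,9}/[-\._;()/:A-Z0-9]+$')
ISSN_RE = re.compile(r'^\d{4}-\d{3}[\dX]$')


def looks_like_doi(text):
    """Return True if the whole string matches the DOI pattern."""
    return DOI_RE.match(text) is not None


def looks_like_issn(text):
    """Return True if the whole string matches the ISSN pattern."""
    return ISSN_RE.match(text) is not None


print(looks_like_doi('10.1000/XYZ123'))
print(looks_like_issn('1234-567X'))
```

Both patterns are anchored with ^ and $, so they validate a whole string and will not find an identifier embedded in a longer sentence. As printed, the DOI character class only covers uppercase letters.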

chemdataextractor.text.CONTROL_RE = re.compile('[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\u10000-\u10FFFF]+')

Regular expression that matches control characters not allowed in XML.
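To show what "not allowed in XML" means in practice, here is a minimal sketch that removes every character outside the XML 1.0 Char production (tab, line feed, carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The names XML_ILLEGAL_RE and strip_xml_illegal are hypothetical, and the pattern is written independently of the library's own constant.

```python
import re

# Complement of the XML 1.0 Char production. The eight-digit \U escapes
# are needed for code points above U+FFFF.
XML_ILLEGAL_RE = re.compile(
    '[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)


def strip_xml_illegal(text):
    """Remove runs of characters that XML 1.0 does not permit."""
    return XML_ILLEGAL_RE.sub('', text)


print(repr(strip_xml_illegal('a\x00b\x0bc\td\n')))
```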

chemdataextractor.text.get_encoding(input_string, guesses=None, is_html=False)

Return the encoding of a byte string. Uses bs4 UnicodeDammit.

Parameters:
  • input_string (string) – Encoded byte string.

  • guesses (list[string]) – (Optional) List of encoding guesses to prioritize. Default is [‘utf-8’]

  • is_html (bool) – Whether the input is HTML.
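The idea of prioritized guesses can be sketched with the standard library alone. The function below is not UnicodeDammit, which applies additional detection heuristics; it only tries each candidate in order and returns the first one that decodes cleanly. The name guess_encoding and the None fallback are assumptions made for illustration.

```python
def guess_encoding(data, guesses=None):
    """Return the first candidate encoding that decodes data without error.

    Returns None if no candidate works (an assumption of this sketch).
    """
    candidates = list(guesses) if guesses else ['utf-8']
    for encoding in candidates:
        try:
            data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding, or an encoding name Python does not know.
            continue
        return encoding
    return None


print(guess_encoding('café'.encode('latin-1'), ['utf-8', 'latin-1']))
```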

chemdataextractor.text.levenshtein(s1, s2, allow_substring=False)

Return the Levenshtein distance between two strings.

The Levenshtein distance (also known as “edit distance”) is the minimum number of single-character substitutions, insertions or deletions needed to transform s1 into s2.

Setting the allow_substring parameter to True allows s1 to be a substring of s2, so that, for example, “hello” and “hello there” would have a distance of zero.

Parameters:
  • s1 (string) – The first string

  • s2 (string) – The second string

  • allow_substring (bool) – Whether to allow s1 to be a substring of s2

Returns:

Levenshtein distance.

Return type:

int
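The behaviour described above can be sketched with the usual dynamic-programming table. This is an independent illustration, not the library's source. It assumes that allow_substring means s1 may align with any substring of s2 at no cost, implemented by zeroing the first row and taking the minimum of the last row.

```python
def levenshtein_sketch(s1, s2, allow_substring=False):
    """Edit distance between s1 and s2, or between s1 and the
    best-matching substring of s2 when allow_substring is True."""
    # Row for the empty prefix of s1. With allow_substring, skipping a
    # prefix of s2 is free, so the whole row is zero.
    previous = [0 if allow_substring else j for j in range(len(s2) + 1)]
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            current.append(min(
                previous[j] + 1,               # delete c1
                current[j - 1] + 1,            # insert c2
                previous[j - 1] + (c1 != c2),  # substitute, or match for free
            ))
        previous = current
    # With allow_substring, skipping a suffix of s2 is free as well.
    return min(previous) if allow_substring else previous[-1]


print(levenshtein_sketch('kitten', 'sitting'))
print(levenshtein_sketch('hello', 'hello there', allow_substring=True))
```

Only two rows are kept at a time, so memory use grows with len(s2) rather than with the product of both lengths.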

chemdataextractor.text.bracket_level(text, open={'(', '[', '{'}, close={')', ']', '}'})

Return 0 if string contains balanced brackets or no brackets.
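The docstring only states the balanced case. One plausible reading, offered here as an assumption, is that the function counts opening brackets minus closing brackets. The sketch below implements that reading with hypothetical names (bracket_level_sketch, open_chars, close_chars).

```python
def bracket_level_sketch(text, open_chars=frozenset('([{'),
                         close_chars=frozenset(')]}')):
    """Return the number of opening brackets minus closing brackets."""
    level = 0
    for char in text:
        if char in open_chars:
            level += 1
        elif char in close_chars:
            level -= 1
    return level


print(bracket_level_sketch('Fe(CN)6'))
print(bracket_level_sketch('poly(vinyl'))
```

A plain counter like this does not check ordering or bracket type, so a string such as ")(" also yields 0.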

chemdataextractor.text.is_punct(text)
chemdataextractor.text.is_ascii(text)
chemdataextractor.text.like_url(text)
chemdataextractor.text.like_number(text)
chemdataextractor.text.word_shape(text)

.text.chem

Chemistry text handling tools.

chemdataextractor.text.chem.extract_inchis(s)

Return a list of InChI identifiers extracted from the string.

chemdataextractor.text.chem.extract_inchikeys(s)

Return a list of InChIKey identifiers extracted from the string.

chemdataextractor.text.chem.extract_cas(s)

Return a list of CAS identifiers extracted from the string.

chemdataextractor.text.chem.extract_smiles(s)

Return a list of SMILES identifiers extracted from the string.
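To give a feel for identifier extraction, here is a rough standalone sketch for CAS Registry Numbers. It relies on the published CAS format (two to seven digits, two digits, one check digit) and the standard check-digit rule. The pattern and function names are hypothetical and are not taken from the library, whose matching rules may differ.

```python
import re

# Candidate CAS numbers: 2-7 digits, 2 digits, 1 check digit.
CAS_CANDIDATE_RE = re.compile(r'\b(\d{2,7})-(\d{2})-(\d)\b')


def cas_checksum_ok(first, second, check):
    """Validate the check digit: weight the digits 1, 2, 3, ... from the
    right, sum the products, and compare the sum modulo 10."""
    digits = (first + second)[::-1]
    total = sum(int(d) * weight for weight, d in enumerate(digits, start=1))
    return total % 10 == int(check)


def extract_cas_sketch(text):
    """Return CAS-like identifiers in text whose check digit is valid."""
    return [m.group(0) for m in CAS_CANDIDATE_RE.finditer(text)
            if cas_checksum_ok(*m.groups())]


print(extract_cas_sketch('Water (7732-18-5) and ethanol (64-17-5)'))
```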

.text.latex

Tools for converting LaTeX to unicode.

chemdataextractor.text.latex.latex_to_unicode(text, capitalize=False)

Replace LaTeX entities with the equivalent unicode and optionally capitalize.

Parameters:
  • text – The LaTeX string to be converted

  • capitalize – Capitalization style to apply: one of ‘sentence’, ‘name’, ‘title’, ‘upper’ or ‘lower’. The default, False, applies none.
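The general approach can be illustrated with a lookup table and a single regular expression. The table below is a tiny hypothetical subset, and only the ‘upper’ and ‘lower’ options are sketched, because the source does not spell out how the other styles behave. None of this is the library's implementation.

```python
import re

# A tiny, hypothetical subset of LaTeX commands and their replacements.
LATEX_COMMANDS = {'alpha': 'α', 'beta': 'β', 'gamma': 'γ', 'Delta': 'Δ', 'mu': 'μ'}


def latex_to_unicode_sketch(text, capitalize=False):
    """Replace known \\command sequences and optionally change case."""
    def replace(match):
        # Leave commands that are not in the table untouched.
        return LATEX_COMMANDS.get(match.group(1), match.group(0))

    text = re.sub(r'\\([A-Za-z]+)', replace, text)
    if capitalize == 'upper':
        text = text.upper()
    elif capitalize == 'lower':
        text = text.lower()
    return text


print(latex_to_unicode_sketch(r'\alpha-D-glucose'))
```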

.text.normalize

Tools for normalizing text.

class chemdataextractor.text.normalize.BaseNormalizer

Bases: chemdataextractor.text.processors.BaseProcessor

Abstract normalizer class from which all normalizers inherit.

Subclasses must implement a normalize() method.

normalize(text)

Normalize the text.

Parameters:

text (string) – The text to normalize.

Returns:

Normalized text.

Return type:

string

class chemdataextractor.text.normalize.Normalizer(form='NFKC', strip=True, collapse=True, hyphens=False, quotes=False, ellipsis=False, slashes=False, tildes=False)

Bases: chemdataextractor.text.normalize.BaseNormalizer

Main Normalizer class for generic English text.

Normalize unicode, hyphens, quotes, whitespace.

By default, the normal form NFKC is used for unicode normalization. This applies a compatibility decomposition, under which equivalent characters are unified, followed by a canonical composition. See the Python docs for information on normal forms: http://docs.python.org/2/library/unicodedata.html#unicodedata.normalize

__init__(form='NFKC', strip=True, collapse=True, hyphens=False, quotes=False, ellipsis=False, slashes=False, tildes=False)
Parameters:
  • form (string) – Normal form for unicode normalization.

  • strip (bool) – Whether to strip whitespace from start and end.

  • collapse (bool) – Whether to collapse all whitespace (tabs, newlines) down to single spaces.

  • hyphens (bool) – Whether to normalize all hyphens, minuses and dashes to the ASCII hyphen-minus character.

  • quotes (bool) – Whether to normalize all apostrophes, quotes and primes to the ASCII quote character.

  • ellipsis (bool) – Whether to normalize ellipses to three full stops.

  • slashes (bool) – Whether to normalize slash characters to the ASCII slash character.

  • tildes (bool) – Whether to normalize tilde characters to the ASCII tilde character.

normalize(text)

Run the Normalizer on a string.

Parameters:

text – The string to normalize.
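A few of these options can be sketched with unicodedata and re from the standard library. The function below is a simplified stand-in covering only form, strip, collapse and hyphens. The order of the steps and the short dash list are assumptions, not the library's implementation.

```python
import re
import unicodedata

# Illustrative subset of dash and minus characters (an assumption; the
# HYPHENS and MINUSES sets documented earlier are the real reference).
DASHES = '\u2010\u2011\u2012\u2013\u2014\u2015\u2212'
DASH_TABLE = {ord(char): '-' for char in DASHES}


def normalize_sketch(text, form='NFKC', strip=True, collapse=True, hyphens=False):
    """Apply unicode normalization, then optional dash and whitespace fixes."""
    text = unicodedata.normalize(form, text)
    if hyphens:
        text = text.translate(DASH_TABLE)  # map every listed dash to '-'
    if collapse:
        text = re.sub(r'\s+', ' ', text)   # tabs and newlines become one space
    if strip:
        text = text.strip()
    return text


print(normalize_sketch('  \ufb01ne\t text '))
print(normalize_sketch('a\u2013b', hyphens=True))
```

NFKC alone already folds compatibility characters such as the "ﬁ" ligature or superscript digits, but it leaves the en dash untouched, which is why dash handling is a separate switch.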

chemdataextractor.text.normalize.normalize = <chemdataextractor.text.normalize.Normalizer objec...

Default normalizer that canonicalizes unicode and fixes whitespace.

chemdataextractor.text.normalize.strict_normalize = <chemdataextractor.text.normalize.Normalizer objec...

More aggressive normalizer that also standardizes hyphens and quotes.

class chemdataextractor.text.normalize.ExcessNormalizer(form='NFKC', strip=True, collapse=True, hyphens=True, quotes=True, ellipsis=True, tildes=True)

Bases: chemdataextractor.text.normalize.Normalizer

Excessive string normalization.

This is useful when doing fuzzy string comparisons. A common use case is to run this before calculating the Levenshtein distance between two strings, so that only “important” differences are counted.

__init__(form='NFKC', strip=True, collapse=True, hyphens=True, quotes=True, ellipsis=True, tildes=True)
normalize(text)

Run the Normalizer on a string.

Parameters:

text – The string to normalize.

class chemdataextractor.text.normalize.ChemNormalizer(form='NFKC', strip=True, collapse=True, hyphens=True, quotes=True, ellipsis=True, tildes=True, chem_spell=True)

Bases: chemdataextractor.text.normalize.Normalizer

Normalizer that also unifies chemical spelling.

__init__(form='NFKC', strip=True, collapse=True, hyphens=True, quotes=True, ellipsis=True, tildes=True, chem_spell=True)
normalize(text)

Normalize unicode, hyphens, whitespace, and some chemistry terms and formatting.

.text.processors

Text processors.

class chemdataextractor.text.processors.BaseProcessor

Bases: object

Abstract processor class from which all processors inherit. Subclasses must implement a __call__() method.

class chemdataextractor.text.processors.Chain(*callables)

Bases: object

Apply a series of processors in turn. Stops if a processor returns None.

__init__(*callables)
Parameters:

callables – The processors to apply, in order.

class chemdataextractor.text.processors.Discard(*match)

Bases: object

Return None if value matches a string.

__init__(*match)
Parameters:

match – The strings to discard.

class chemdataextractor.text.processors.LAdd(substring)

Bases: object

Add a substring to the start of a value.

__init__(substring)
Parameters:

substring – The substring to add to the start of the value.

class chemdataextractor.text.processors.RAdd(substring)

Bases: object

Add a substring to the end of a value.

__init__(substring)
Parameters:

substring – The substring to add to the end of the value.

class chemdataextractor.text.processors.LStrip(*substrings)

Bases: object

Remove a substring from the start of a value.

__init__(*substrings)
Parameters:

substrings – The substrings to remove from the start of the value.

class chemdataextractor.text.processors.RStrip(*substrings)

Bases: object

Remove a substring from the end of a value.

__init__(*substrings)
Parameters:

substrings – The substrings to remove from the end of the value.
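To show how such processors compose, here is a minimal re-implementation sketch of four of them. The class names carry a Sketch suffix to make clear they are hypothetical stand-ins. Details such as exact-match discarding and stripping only the first matching prefix are assumptions, not confirmed library behaviour.

```python
class ChainSketch:
    """Apply callables in order; stop as soon as the value becomes None."""

    def __init__(self, *callables):
        self.callables = callables

    def __call__(self, value):
        for func in self.callables:
            if value is None:
                break
            value = func(value)
        return value


class DiscardSketch:
    """Return None if the value equals one of the given strings."""

    def __init__(self, *match):
        self.match = match

    def __call__(self, value):
        return None if value in self.match else value


class LStripSketch:
    """Remove the first matching substring from the start of the value."""

    def __init__(self, *substrings):
        self.substrings = substrings

    def __call__(self, value):
        for substring in self.substrings:
            if substring and value.startswith(substring):
                return value[len(substring):]
        return value


class RAddSketch:
    """Append a substring to the end of the value."""

    def __init__(self, substring):
        self.substring = substring

    def __call__(self, value):
        return value + self.substring


clean = ChainSketch(str.strip, DiscardSketch('', 'N/A'),
                    LStripSketch('mp: '), RAddSketch(' °C'))
print(clean('  mp: 120 '))
print(clean(' N/A '))
```

Because every step is a plain callable, ordinary functions such as str.strip can sit in the same chain as the processor objects.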

chemdataextractor.text.processors.floats(s)

Convert string to float. Handles more string formats than the standard Python conversion.
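The source does not list which extra formats are handled. As an illustration only, the sketch below assumes three common ones: thousands separators, the unicode minus sign, and "×10^n" scientific notation. The name tolerant_float is hypothetical.

```python
import re


def tolerant_float(s):
    """Parse a float, tolerating a few formats that float() rejects."""
    try:
        return float(s)
    except ValueError:
        pass
    s = s.replace('\u2212', '-')  # unicode minus sign to ASCII hyphen-minus
    s = s.replace(',', '')        # drop thousands separators
    s = ''.join(s.split())        # drop all whitespace
    # Rewrite "x10^3" or "×10^-3" as exponent notation ("e3", "e-3").
    s = re.sub(r'[×x]10\^?(-?\d+)', r'e\1', s)
    return float(s)


print(tolerant_float('1,234.5'))
print(tolerant_float('1.5 × 10^3'))
```

Trying float() first keeps the common case fast and leaves already-valid input untouched.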

chemdataextractor.text.processors.strip_querystring(url)

Remove the querystring from the end of a URL.
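An equivalent operation can be sketched with urllib.parse from the standard library. The library function may be implemented differently, for example with a simple split on "?". The name strip_querystring_sketch is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_querystring_sketch(url):
    """Return url with its query component removed."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=''))


print(strip_querystring_sketch('https://example.org/article?utm_source=x&id=3'))
```

This version keeps any fragment (the part after "#"), which is a design choice of the sketch rather than documented behaviour.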

class chemdataextractor.text.processors.Substitutor(substitutions)

Bases: object

Perform a list of substitutions defined by regex on text.

Useful to clean up text where placeholders are used in place of actual unicode characters.

__init__(substitutions)
Parameters:

substitutions – List of (regex, string) tuples that define the substitution.
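The placeholder clean-up use case might look like the following sketch. SubstitutorSketch is a hypothetical stand-in, the bracketed placeholders are invented examples, and applying the substitutions through __call__ is an assumption about how the real class is invoked.

```python
import re


class SubstitutorSketch:
    """Apply a list of (regex, replacement) substitutions in order."""

    def __init__(self, substitutions):
        # Compile once so repeated calls reuse the compiled patterns.
        self.substitutions = [(re.compile(pattern), replacement)
                              for pattern, replacement in substitutions]

    def __call__(self, text):
        for regex, replacement in self.substitutions:
            text = regex.sub(replacement, text)
        return text


fix_placeholders = SubstitutorSketch([
    (r'\[alpha\]', 'α'),
    (r'\[deg\]', '°'),
])
print(fix_placeholders('[alpha]-pinene boils at 155 [deg]C'))
```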

chemdataextractor.text.processors.extract_emails(text)

Return a list of email addresses extracted from the string.

chemdataextractor.text.processors.unapostrophe(text)

Strip apostrophe and ‘s’ from the end of a string.
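A one-line regular expression is enough to sketch this behaviour. The version below handles only the ASCII apostrophe and the right single quotation mark, which is an assumption; the library may recognize the full APOSTROPHES set. The names are hypothetical.

```python
import re

# A trailing apostrophe (ASCII or U+2019), optionally followed by "s".
TRAILING_POSSESSIVE_RE = re.compile('[\u0027\u2019]s?$')


def unapostrophe_sketch(text):
    """Strip a trailing apostrophe, or apostrophe plus 's', from text."""
    return TRAILING_POSSESSIVE_RE.sub('', text)


print(unapostrophe_sketch("Smith's"))
```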