.parse¶

Chemical property parsers. Parsers were refactored in 2.0, which introduced breaking changes to older code. Please refer to the examples and the 2.0 migration guide for an overview of the changes.

Parse text using rule-based grammars.

.parse.actions¶

Actions to perform during parsing.

chemdataextractor.parse.actions.flatten(tokens, start, result)[source]¶: Replace all child results with their text contents.

chemdataextractor.parse.actions.join(tokens, start, result)[source]¶: Join tokens into a single string with spaces between.

chemdataextractor.parse.actions.merge(tokens, start, result)[source]¶: Join tokens into a single string with no spaces.

chemdataextractor.parse.actions.strip_stop(tokens, start, result)[source]¶: Remove trailing full stop from tokens.

chemdataextractor.parse.actions.fix_whitespace(tokens, start, result)[source]¶: Fix whitespace around hyphens and commas. Can be used to remove whitespace tokenization artefacts.
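
The difference between join and merge is easiest to see on a small result. The following is only a minimal sketch of the behaviour described above, not the library's implementation: it uses the standard-library xml.etree.ElementTree as a stand-in for the elements found in real results, and make_result, join_sketch and merge_sketch are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET  # stand-in for the elements used in real results


def make_result(tag, words):
    """Build a toy parse result: one parent element holding one child per token."""
    parent = ET.Element(tag)
    for word in words:
        ET.SubElement(parent, "token").text = word
    return [parent]


def _collapse(result, separator):
    """Gather all text in the result and return it as a single element."""
    if not result:
        return result
    texts = [e.text for root in result for e in root.iter() if e.text is not None]
    collapsed = ET.Element(result[0].tag)
    collapsed.text = separator.join(texts)
    return [collapsed]


def join_sketch(tokens, start, result):
    """Like join: a single string with spaces between the pieces."""
    return _collapse(result, " ")


def merge_sketch(tokens, start, result):
    """Like merge: a single string with no spaces."""
    return _collapse(result, "")


joined = join_sketch([], 0, make_result("name", ["sodium", "chloride"]))
merged = merge_sketch([], 0, make_result("name", ["2", "-", "propanol"]))
print(joined[0].text)  # sodium chloride
print(merged[0].text)  # 2-propanol
```

Both sketches keep the (tokens, start, result) signature shared by all actions, so either could be attached to a parse expression in the same way.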

.parse.auto¶

Parser for automatic parsing, without user-written parsing rules. Mainly used for tables.

Models must be constructed in a certain way for them to work optimally with autoparsers. Namely, they should have:

A specifier field with an associated parse expression (Optional, only required if autoparsers are desired). These parse expressions will be updated automatically using forward-looking Interdependency Resolution if the updatable flag is set to True.
These specifiers should also have required set to True so that spurious matches are not found.
If applicable, a compound entity, named compound.

Any parse_expressions set in the model should have an added action to ensure that the results are a single word. An example would be to call add_action(join) on each parse expression.

chemdataextractor.parse.auto.construct_unit_element(dimensions)[source]¶

Construct an element for detecting units for the dimensions given. Any magnitude modifiers (e.g. kilo) will be automatically handled.

Parameters:: dimensions (Dimension) – The dimensions that the element produced will look for.
Returns:: An Element to look for units of given dimensions. If None or Dimensionless are passed in, returns None.
Return type:: BaseParserElement or None

chemdataextractor.parse.auto.construct_category_element(category_dict)[source]¶

Construct an element for detecting categories.

Parameters:: category (Category) – The Category to look for.
Return type:: BaseParserElement or None

chemdataextractor.parse.auto.match_dimensions_of(model)[source]¶

Produces a function that checks whether the given results of parsing match the dimensions of the model provided.

Parameters:: model (QuantityModel) – The model with which to check dimensions.
Returns:: A function which will return True if the results of parsing match the model’s dimensions, False if not.
Return type:: function(tuple(list(Element), int) -> bool)

chemdataextractor.parse.auto.create_entities_list(entities)[source]¶

For a list of BaseParserElement entities, creates a single element that chains them together with the | operator. For example, with 4 entities in the list, the output is:

(entities[0] | entities[1] | entities[2] | entities[3])

Parameters:: entities – BaseParserElement type objects
Returns:: BaseParserElement type object
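
Chaining a list with | amounts to a left fold over the list. The sketch below illustrates that idea with a hypothetical Alt class standing in for a parser element; it assumes nothing about how the library implements the | operator internally.

```python
from functools import reduce
import operator


class Alt:
    """Hypothetical stand-in for a parser element that supports `|`."""

    def __init__(self, names):
        self.names = tuple(names)

    def __or__(self, other):
        # Combining two alternatives gives one element that accepts either.
        return Alt(self.names + other.names)

    def matches(self, token):
        return token in self.names


def create_entities_list_sketch(entities):
    """Fold [e0, e1, e2, ...] into (e0 | e1 | e2 | ...)."""
    return reduce(operator.or_, entities)


combined = create_entities_list_sketch([Alt(["a"]), Alt(["b"]), Alt(["c"]), Alt(["d"])])
print(combined.names)         # ('a', 'b', 'c', 'd')
print(combined.matches("c"))  # True
```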

class chemdataextractor.parse.auto.BaseAutoParser[source]¶

Bases: chemdataextractor.parse.base.BaseParser

model = None¶

__init__()[source]¶: Initialize self. See help(type(self)) for accurate signature.

interpret(results, start, end)[source]¶

class chemdataextractor.parse.auto.AutoSentenceParser(lenient=False, chem_name=<chemdataextractor.parse.elements.First object>, activate_to_range=False)[source]¶

Bases: chemdataextractor.parse.auto.BaseAutoParser, chemdataextractor.parse.base.BaseSentenceParser

__init__(lenient=False, chem_name=<chemdataextractor.parse.elements.First object>, activate_to_range=False)[source]¶: Initialize self. See help(type(self)) for accurate signature.

trigger_phrase¶

root¶

class chemdataextractor.parse.auto.AutoTableParser(chem_name=<chemdataextractor.parse.elements.First object>)[source]¶

Bases: chemdataextractor.parse.auto.BaseAutoParser, chemdataextractor.parse.base.BaseTableParser

Additions for automated parsing of tables

__init__(chem_name=<chemdataextractor.parse.elements.First object>)[source]¶: Initialize self. See help(type(self)) for accurate signature.

root¶

.parse.base¶

Base classes for parsing sentences and tables.

class chemdataextractor.parse.base.BaseParser[source]¶

Bases: object

model = None¶

trigger_phrase = None¶

skip_section_phrase = None¶

allow_section_phrase = None¶

Optional BaseParserElement instance. All sentences are run through this phrase before the full root phrase is applied. If nothing is found for this phrase, the sentence is not passed through the full root phrase. This is a performance optimisation: if the phrase is not set, ChemDataExtractor behaves as it did in previous versions, and if it is set to an appropriate value, ChemDataExtractor can run at up to 2x its previous speed.

To ensure that this works as intended, the BaseParserElement should be a simple parse rule (substantially simpler than the root) that takes little time to process.
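
The pre-filtering described above can be pictured as a cheap test that guards an expensive one. This is a minimal sketch under stated assumptions: TriggeredParserSketch, its substring trigger and the ROOT_PATTERN regular expression are all hypothetical stand-ins for a real parser, its trigger phrase and its root phrase.

```python
import re

# Stand-in for an expensive root phrase (assumption: a regex instead of a grammar).
ROOT_PATTERN = re.compile(
    r"(?P<specifier>T[cC])\s*(?:=|of)\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<units>K)"
)


class TriggeredParserSketch:
    def __init__(self, trigger_word=None):
        self.trigger_word = trigger_word
        self.root_calls = 0  # counts how often the expensive step actually runs

    def parse_sentence(self, sentence):
        # Cheap check first: skip the sentence if the trigger is set and absent.
        if self.trigger_word is not None and self.trigger_word not in sentence:
            return []
        self.root_calls += 1
        return [m.groupdict() for m in ROOT_PATTERN.finditer(sentence)]


parser = TriggeredParserSketch(trigger_word="Tc")
sentences = ["The sample was annealed in air.", "BiFeO3 has Tc = 640 K."]
records = [r for s in sentences for r in parser.parse_sentence(s)]
print(parser.root_calls)  # 1
print(records)            # [{'specifier': 'Tc', 'value': '640', 'units': 'K'}]
```

With the trigger left as None, the root step runs on every sentence, which mirrors the unchanged behaviour described above for parsers that do not set the phrase.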

root¶

interpret(result, start, end)[source]¶

extract_error(string)[source]¶

Extract the error from a string

Usage:

from chemdataextractor.parse.base import BaseParser

bp = BaseParser()
test_string = '150±5'
end_value = bp.extract_error(test_string)
print(end_value) # 5

Parameters:: string (str) – A representation of the value and error as a string
Returns:: The error expressed as a float.
Return type:: float

extract_value(string)[source]¶

Takes a string and returns a list of floats representing the string given.

Usage:

from chemdataextractor.parse.base import BaseParser

bp = BaseParser()
test_string = '150 to 160'
end_value = bp.extract_value(test_string)
print(end_value) # [150., 160.]

Parameters:: string (str) – A representation of the values as a string
Returns:: The value expressed as a list of floats of length 1 if the value had no range, and as a list of floats of length 2 if it was a range.
Return type:: list(float)
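
To make the input and output shapes of extract_value and extract_error concrete, here is a standalone sketch that reproduces them with regular expressions. It is not the library's implementation: the function names, the accepted range separators and the choice to return None when no error is present are all assumptions made for illustration.

```python
import re

NUMBER = r"[-+]?\d+(?:\.\d+)?"
# A range is two numbers separated by 'to' or a dash (assumed separators).
RANGE = re.compile(rf"\s*({NUMBER})\s*(?:to|-|–)\s*({NUMBER})\s*")


def extract_value_sketch(string):
    """Return [value] or [low, high] as floats; anything after ± is ignored."""
    value_part = string.split("±")[0]
    range_match = RANGE.fullmatch(value_part)
    if range_match:
        return [float(range_match.group(1)), float(range_match.group(2))]
    return [float(value_part)]


def extract_error_sketch(string):
    """Return the number after ± as a float, or None if there is no ±."""
    if "±" not in string:
        return None
    return float(string.split("±")[1])


print(extract_value_sketch("150 to 160"))  # [150.0, 160.0]
print(extract_value_sketch("150±5"))       # [150.0]
print(extract_error_sketch("150±5"))       # 5.0
```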

extract_units(string, strict=False)[source]¶

Takes a string and returns a Unit. Raises TypeError if strict is True and either the dimensions do not match the expected dimensions or the string has extraneous characters. For example, if the string Fe was given and we were looking for a temperature, strict=False would return Fahrenheit, whereas strict=True would raise a TypeError.

Usage:

bp = QuantityParser()
bp.model = QuantityModel()
bp.model.dimensions = Temperature() * Length()**0.5 * Time()**(1.5)
test_string = 'Kh2/(km/s)-1/2'
end_units = bp.extract_units(test_string, strict=True)
print(end_units) # Units of: (10^1.5) * Hour^(2.0)  Meter^(0.5)  Second^(-0.5)  Kelvin^(1.0)

Parameters:

string (str) – A representation of the units as a string
strict (bool) – Whether to raise a TypeError if the dimensions of the parsed units do not have the expected dimensions.

Returns:

The string expressed as a Unit

Return type:

chemdataextractor.quantities.Unit

class chemdataextractor.parse.base.BaseSentenceParser[source]¶

Bases: chemdataextractor.parse.base.BaseParser

Base class for parsing sentences. To implement a parser for a new property, implement the interpret function.

parse_full_sentence = False¶

should_read_section(heading)[source]¶

parse_sentence(sentence)[source]¶

Parse a sentence. This function is primarily called by the records property of Sentence.

Parameters:: tokens (list[(token,tag)]) – List of tokens for parsing. When this method is called by chemdataextractor.doc.text.Sentence.records, the tokens passed in are chemdataextractor.doc.text.Sentence.tagged_tokens.
Returns:: All the models found in the sentence.
Return type:: Iterator[chemdataextractor.model.base.BaseModel]

class chemdataextractor.parse.base.BaseTableParser[source]¶

Bases: chemdataextractor.parse.base.BaseParser

Base class for parsing new-style tables. To implement a parser for a new property, implement the interpret function.

parse_cell(cell)[source]¶

Parse a cell. This function is primarily called by the records property of Table.

Parameters:: tokens (list[(token,tag)]) – List of tokens for parsing. When this method is called by chemdataextractor.doc.text.table.Table, the tokens passed in are in the same form as chemdataextractor.doc.text.Sentence.tagged_tokens, after the category table has been flattened into a sentence.
Returns:: All the models found in the table.
Return type:: Iterator[chemdataextractor.model.base.BaseModel]

.parse.cem¶

Chemical entity mention parser elements.

Code authors: Matt Swain (mcs07@cam.ac.uk), Callum Court (cc889@cam.ac.uk)

chemdataextractor.parse.cem.standardize_role(role)[source]¶: Convert role text into standardized form.

class chemdataextractor.parse.cem.CompoundParser[source]¶

Bases: chemdataextractor.parse.base.BaseSentenceParser

Chemical name possibly with an associated label.

root¶

interpret(result, start, end)[source]¶

class chemdataextractor.parse.cem.ChemicalLabelParser[source]¶

Bases: chemdataextractor.parse.base.BaseSentenceParser

Chemical label occurrences with no associated name.

root¶

interpret(result, start, end)[source]¶

class chemdataextractor.parse.cem.CompoundHeadingParser[source]¶

Bases: chemdataextractor.parse.base.BaseSentenceParser

Better matching of abbreviated names in dedicated compound headings.

root = <chemdataextractor.parse.elements.Group object>¶

parse_full_sentence = True¶

interpret(result, start, end)[source]¶

class chemdataextractor.parse.cem.CompoundTableParser[source]¶

Bases: chemdataextractor.parse.base.BaseTableParser

entities = <chemdataextractor.parse.elements.First object>¶

root¶

interpret(result, start, end)[source]¶

.parse.common¶

Common parser elements.

.parse.context¶

.parse.elements¶

Parser elements.

exception chemdataextractor.parse.elements.ParseException(tokens, i=0, msg=None, element=None)[source]¶

Bases: Exception

Exception thrown by a ParserElement when it doesn’t match input.

__init__(tokens, i=0, msg=None, element=None)[source]¶: Initialize self. See help(type(self)) for accurate signature.

classmethod wrap(parse_exception)[source]¶

chemdataextractor.parse.elements.safe_name(name)[source]¶: Make name safe for use in XML output.

class chemdataextractor.parse.elements.BaseParserElement[source]¶

Bases: object

Abstract base parser element class.

__init__()[source]¶: Initialize self. See help(type(self)) for accurate signature.

name = None¶

Name for BaseParserElement. This is used to set the name of the Element when a result is found.

Type:: str or None

actions = None¶

List of actions that will be applied to the results after parsing. Actions are functions with arguments of (tokens, start, result).

Type:: list(chemdataextractor.parse.actions)

streamlined = None¶

set_action(*fns)[source]¶

add_action(*fns)[source]¶

with_condition(condition)[source]¶: Add a condition to the parser element. The condition must be a function that takes a match and return True or False, i.e. a function which takes tuple(list(Element), int) and returns bool. If the function evaluates True, the match is kept, while if the function evaluates False, the match is discarded. The condition is executed after any other actions.
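
A condition is just a predicate over a match, so a factory function such as match_dimensions_of can build one from some configuration. The sketch below shows that pattern in isolation. It is an illustration only: units_in, the quantity/raw_value/raw_units tag names and the use of xml.etree.ElementTree in place of real result elements are assumptions.

```python
import xml.etree.ElementTree as ET  # stand-in for the elements used in real results


def units_in(allowed_units):
    """Build a condition: keep a match only if all of its units are allowed."""
    def condition(match):
        elements, _end_index = match  # a match is tuple(list(Element), int)
        units = [e.text for root in elements for e in root.iter("raw_units")]
        return bool(units) and all(u in allowed_units for u in units)
    return condition


def quantity(value, units):
    """Build a toy result element with a value and a unit child."""
    root = ET.Element("quantity")
    ET.SubElement(root, "raw_value").text = value
    ET.SubElement(root, "raw_units").text = units
    return root


matches = [([quantity("640", "K")], 5), ([quantity("3", "cm")], 9)]
is_temperature = units_in({"K", "°C"})

# Matches for which the condition is True are kept; the rest are discarded.
kept = [m for m in matches if is_temperature(m)]
print(len(kept), kept[0][1])  # 1 5
```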

copy()[source]¶

set_name(name)[source]¶

scan(tokens, max_matches=9223372036854775807, overlap=False)[source]¶

Scans for matches in given tokens.

Parameters:

tokens (list(tuple(string, string))) – A tokenized representation of the text to scan. The first string in the tuple is the content, typically a word, and the second string is the part of speech tag.
max_matches (int) – The maximum number of matches to look for. Default is the maximum size possible for a list.
overlap (bool) – Whether the found results are allowed to overlap. Default False.

Returns:

A generator of the results found. Each result is a tuple with the first element being a list of elements found, and the second and third elements are the start and end indices representing the span of the result.

Return type:

generator(tuple(list(lxml.etree.Element), int, int))
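
The roles of max_matches and overlap can be illustrated with a small generator over tagged tokens. This is a sketch of the scanning idea only, assuming a hypothetical match_at(tokens, i) callable that returns (result, end) on success and None on failure; the library's own elements signal failure differently and return elements rather than plain strings.

```python
def scan_sketch(match_at, tokens, max_matches=None, overlap=False):
    """Yield (result, start, end) for each match found while walking the tokens."""
    i = 0
    found = 0
    while i < len(tokens) and (max_matches is None or found < max_matches):
        outcome = match_at(tokens, i)
        if outcome is None:
            i += 1  # no match here, try the next token
            continue
        result, end = outcome
        yield result, i, end
        found += 1
        # Without overlap, resume after the match; otherwise step one token.
        i = i + 1 if overlap or end <= i else end


def number_then_kelvin(tokens, i):
    """Toy rule: a cardinal number (tag CD) followed by the token 'K'."""
    if i + 1 < len(tokens) and tokens[i][1] == "CD" and tokens[i + 1][0] == "K":
        return [tokens[i][0], tokens[i + 1][0]], i + 2
    return None


tokens = [("Tc", "NN"), ("=", "SYM"), ("640", "CD"), ("K", "NN"),
          ("and", "CC"), ("750", "CD"), ("K", "NN")]
matches = list(scan_sketch(number_then_kelvin, tokens))
print(matches)  # [(['640', 'K'], 2, 4), (['750', 'K'], 5, 7)]
```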

parse(tokens, i, actions=True)[source]¶

Parse given tokens and return results

Parameters:

tokens (list(tuple(string, string))) – A tokenized representation of the text to scan. The first string in the tuple is the content, typically a word, and the second string is the part of speech tag.
i (int) – The index at which to start scanning from
actions (bool) – Whether the actions attached to this element will be executed. Default True.

Returns:

A tuple where the first element is a list of elements found (can be None if no results were found), and the second element is the last index investigated.

Return type:

tuple(list(Element) or None, int)

try_parse(tokens, i)[source]¶

streamline()[source]¶: Streamlines internal representations. For example, something like And(And(And(And(a), b), c), d) is streamlined to And(a, b, c, d).
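
Streamlining of this kind is a recursive flattening of nested expressions of the same type. The following sketch shows the idea on a hypothetical AndSketch class that only stores its sub-expressions; it is not the library's And class.

```python
class AndSketch:
    """Hypothetical container that holds sub-expressions in order."""

    def __init__(self, exprs):
        self.exprs = list(exprs)

    def streamline(self):
        """Flatten nested AndSketch instances into a single level."""
        flat = []
        for expr in self.exprs:
            if isinstance(expr, AndSketch):
                # Pull the children of a nested AndSketch up into this one.
                flat.extend(expr.streamline().exprs)
            else:
                flat.append(expr)
        self.exprs = flat
        return self


nested = AndSketch([AndSketch([AndSketch([AndSketch(["a"]), "b"]), "c"]), "d"])
print(nested.streamline().exprs)  # ['a', 'b', 'c', 'd']
```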

hide()[source]¶

class chemdataextractor.parse.elements.Any[source]¶

Bases: chemdataextractor.parse.elements.BaseParserElement

Always match a single token.

class chemdataextractor.parse.elements.NoMatch[source]¶: Bases: chemdataextractor.parse.elements.BaseParserElement

class chemdataextractor.parse.elements.Word(match)[source]¶

Bases: chemdataextractor.parse.elements.BaseParserElement

Match token text exactly. Case-sensitive.

__init__(match)[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Tag(match, tag_type=None)[source]¶

Bases: chemdataextractor.parse.elements.BaseParserElement

Match tag exactly.

__init__(match, tag_type=None)[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.IWord(match)[source]¶

Bases: chemdataextractor.parse.elements.Word

Case-insensitive match token text.

__init__(match)[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Regex(pattern, flags=0, group=None)[source]¶

Bases: chemdataextractor.parse.elements.BaseParserElement

Match token text with regular expression.

__init__(pattern, flags=0, group=None)[source]¶: Initialize self. See help(type(self)) for accurate signature.
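
The four single-token matchers differ only in which part of a (text, tag) token they inspect and how they compare it. The sketch below mirrors the descriptions of Word, IWord, Tag and Regex with plain predicates. The function names are hypothetical, and the regular expression is anchored explicitly because the description above does not say whether the pattern must match the whole token.

```python
import re


def word(match):
    """Case-sensitive comparison against the token text."""
    return lambda token: token[0] == match


def iword(match):
    """Case-insensitive comparison against the token text."""
    return lambda token: token[0].lower() == match.lower()


def tag(match):
    """Comparison against the part-of-speech tag."""
    return lambda token: token[1] == match


def regex(pattern, flags=0):
    """Regular-expression test of the token text."""
    compiled = re.compile(pattern, flags)
    return lambda token: compiled.search(token[0]) is not None


token = ("Tc", "NN")
print(word("Tc")(token), word("tc")(token))  # True False
print(iword("tc")(token))                    # True
print(tag("NN")(token))                      # True
print(regex(r"^T[cCN]$")(token))             # True
```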

class chemdataextractor.parse.elements.Start[source]¶

Bases: chemdataextractor.parse.elements.BaseParserElement

Match at start of tokens.

__init__()[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.End[source]¶

Bases: chemdataextractor.parse.elements.BaseParserElement

Match at end of tokens.

__init__()[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.ParseExpression(exprs)[source]¶

Bases: chemdataextractor.parse.elements.BaseParserElement

Abstract class for combining and post-processing parsed tokens.

__init__(exprs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

append(other)[source]¶

copy()[source]¶

streamline()[source]¶: Streamlines internal representations. For example, something like And(And(And(And(a), b), c), d) is streamlined to And(a, b, c, d).

class chemdataextractor.parse.elements.And(exprs)[source]¶

Bases: chemdataextractor.parse.elements.ParseExpression

Match all in the given order. Can probably be replaced by the plus operator ‘+’.

__init__(exprs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Or(exprs)[source]¶

Bases: chemdataextractor.parse.elements.ParseExpression

Match the longest. Can probably be replaced by the pipe operator ‘|’.

class chemdataextractor.parse.elements.Every(exprs)[source]¶

Bases: chemdataextractor.parse.elements.ParseExpression

Match all of the contained parse expressions, and return the longest.

class chemdataextractor.parse.elements.First(exprs)[source]¶

Bases: chemdataextractor.parse.elements.ParseExpression

Match the first.

__init__(exprs)[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.ParseElementEnhance(expr)[source]¶

Bases: chemdataextractor.parse.elements.BaseParserElement

Abstract class for combining and post-processing parsed tokens.

__init__(expr)[source]¶: Initialize self. See help(type(self)) for accurate signature.

streamline()[source]¶: Streamlines internal representations. For example, something like And(And(And(And(a), b), c), d) is streamlined to And(a, b, c, d).

class chemdataextractor.parse.elements.FollowedBy(expr)[source]¶

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Check ahead if matches.

Example:

Tn + FollowedBy('Neel temperature')
Tn will match only if followed by 'Neel temperature', but 'Neel temperature' will not be part of the output/tree

class chemdataextractor.parse.elements.Not(expr)[source]¶

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Check ahead to disallow a match with the given parse expression.

Example:

Tn + Not('some_string')
Tn will match if not followed by 'some_string'
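
FollowedBy and Not are both lookaheads: they test what comes next without consuming it or adding it to the output. The sketch below models that with hypothetical combinator functions over plain string tokens, where each matcher returns (results, next_index) or None. It illustrates the semantics described in the two examples, not the library's implementation.

```python
def word(text):
    """Match one token exactly and consume it."""
    def match(tokens, i):
        if i < len(tokens) and tokens[i] == text:
            return [text], i + 1
        return None
    return match


def followed_by(expr):
    """Positive lookahead: succeed if expr matches, but consume and output nothing."""
    def match(tokens, i):
        return ([], i) if expr(tokens, i) is not None else None
    return match


def not_(expr):
    """Negative lookahead: succeed only if expr does not match here."""
    def match(tokens, i):
        return ([], i) if expr(tokens, i) is None else None
    return match


def sequence(*exprs):
    """Match each expression in order, like chaining with +."""
    def match(tokens, i):
        results = []
        for expr in exprs:
            outcome = expr(tokens, i)
            if outcome is None:
                return None
            found, i = outcome
            results.extend(found)
        return results, i
    return match


tn_then_temperature = sequence(word("Tn"), followed_by(word("temperature")))
tn_not_temperature = sequence(word("Tn"), not_(word("temperature")))

print(tn_then_temperature(["Tn", "temperature"], 0))  # (['Tn'], 1)
print(tn_then_temperature(["Tn", "value"], 0))        # None
print(tn_not_temperature(["Tn", "value"], 0))         # (['Tn'], 1)
```

In both successful cases the returned index is 1, which shows that the lookahead token was checked but left unconsumed.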

class chemdataextractor.parse.elements.ZeroOrMore(expr)[source]¶

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Optional repetition of zero or more of the given expression.

class chemdataextractor.parse.elements.OneOrMore(expr)[source]¶

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Repetition of one or more of the given expression.

class chemdataextractor.parse.elements.Optional(expr)[source]¶

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Can be present but doesn’t need to be. If present, will be added to the result/tree.

__init__(expr)[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Group(expr)[source]¶

Bases: chemdataextractor.parse.elements.ParseElementEnhance

For nested tags; will group the argument and give it a label, preserving the original sub-tags. Otherwise, the default behaviour would be to rename the outermost tag in the argument. Usage: Group(some_text)(‘new_tag’) where ‘some_text’ is a previously tagged expression.

class chemdataextractor.parse.elements.SkipTo(expr, include=False)[source]¶

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Skips to the next occurrence of expression. Does not add the next occurrence of expression to the parse tree. For example:

entities + SkipTo(entities)

will output entities only once. Whereas:

entities + SkipTo(entities) + entities

will output entities as well as the second occurrence of entities after an arbitrary number of tokens in between.
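
The two cases above can be reproduced with a small model of skipping. This is a sketch under stated assumptions: one_of, skip_to and sequence are hypothetical helpers over plain string tokens, each returning (results, next_index) or None, and only the default include=False behaviour is modelled.

```python
def one_of(words):
    """Match a single token that is in the given set of words."""
    def match(tokens, i):
        if i < len(tokens) and tokens[i] in words:
            return [tokens[i]], i + 1
        return None
    return match


def skip_to(expr):
    """Advance to the next position where expr would match, without consuming it."""
    def match(tokens, i):
        for j in range(i, len(tokens)):
            if expr(tokens, j) is not None:
                return [], j  # nothing is added to the results
        return None
    return match


def sequence(*exprs):
    """Match each expression in order, like chaining with +."""
    def match(tokens, i):
        results = []
        for expr in exprs:
            outcome = expr(tokens, i)
            if outcome is None:
                return None
            found, i = outcome
            results.extend(found)
        return results, i
    return match


entities = one_of({"BiFeO3", "LaFeO3"})
tokens = ["BiFeO3", "and", "also", "LaFeO3"]

print(sequence(entities, skip_to(entities))(tokens, 0))            # (['BiFeO3'], 3)
print(sequence(entities, skip_to(entities), entities)(tokens, 0))  # (['BiFeO3', 'LaFeO3'], 4)
```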

__init__(expr, include=False)[source]¶: Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Hide(expr)[source]¶

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Converter for ignoring the results of a parsed expression. The results do not appear in the generated XML element tree, but the expression is still part of the rule.

hide()[source]¶

chemdataextractor.parse.elements.W¶: alias of chemdataextractor.parse.elements.Word

chemdataextractor.parse.elements.I¶: alias of chemdataextractor.parse.elements.IWord

chemdataextractor.parse.elements.R¶: alias of chemdataextractor.parse.elements.Regex

chemdataextractor.parse.elements.T¶: alias of chemdataextractor.parse.elements.Tag

chemdataextractor.parse.elements.H¶: alias of chemdataextractor.parse.elements.Hide

.parse.ir¶

IR spectrum text parser.

chemdataextractor.parse.ir.extract_units(tokens, start, result)[source]¶: Extract units from the bracketed text after nu.

class chemdataextractor.parse.ir.IrParser[source]¶

Bases: chemdataextractor.parse.base.BaseSentenceParser

root = <chemdataextractor.parse.elements.And object>¶

parse_full_sentence = True¶

interpret(result, start, end)[source]¶

.parse.mp¶

Melting point text parser.

class chemdataextractor.parse.mp.MpParser[source]¶

Bases: chemdataextractor.parse.base.BaseParser

root = <chemdataextractor.parse.elements.First object>¶

interpret(result, start, end)[source]¶

.parse.nmr¶

NMR text parser.

chemdataextractor.parse.nmr.fix_nmr_peak_whitespace_error(tokens, start, result)[source]¶

chemdataextractor.parse.nmr.strip_delta(tokens, start, result)[source]¶

class chemdataextractor.parse.nmr.NmrParser[source]¶

Bases: chemdataextractor.parse.base.BaseParser

root = <chemdataextractor.parse.elements.And object>¶

parse_full_sentence = True¶

__init__()[source]¶: Initialize self. See help(type(self)) for accurate signature.

interpret(result, start, end)[source]¶

.parse.template¶

Basic property parser template for Quantity Models

class chemdataextractor.parse.template.QuantityModelTemplateParser[source]¶

Bases: chemdataextractor.parse.auto.BaseAutoParser, chemdataextractor.parse.base.BaseSentenceParser

Template parser for QuantityModel-type structures

Finds Cem, Specifier, Value and Units from single sentences

Other entities are merged contextually

specifier_phrase¶: The model specifier

value_phrase¶: Value and units

cem_phrase¶: CEM phrases

prefix¶: Specifier prefix phrase e.g. Tc equal to

specifier_and_value¶: Specifier and value + units

cem_before_specifier_and_value_phrase¶: Phrases ordered CEM, Specifier, Value, Unit

specifier_before_cem_and_value_phrase¶

cem_after_specifier_and_value_phrase¶: Phrases ordered specifier, value, unit, CEM

value_specifier_cem_phrase¶: Phrases ordered value unit specifier cem

root¶: Root Phrases

class chemdataextractor.parse.template.MultiQuantityModelTemplateParser[source]¶

Bases: chemdataextractor.parse.auto.BaseAutoParser, chemdataextractor.parse.base.BaseSentenceParser

Template for parsing sentences that contain nested or chained entities

MULTIPLE ENTITY PHRASES

Single compound, multiple specifiers, multiple phase transitions e.g. BiFeO3 has TC = 1093 K and TN = 640 K

single compound, single specifier, multiple transitions e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K

multiple compounds, single specifier, multiple transitions e.g. TC in BiFeO3 and LaFeO3 of 640 and 750 K

multiple compounds, single specifier, single transition e.g. TC of 640 K in BiFeO3, LaFeO3 and MnO

multiple compounds, multiple specifiers, multiple transitions e.g. BiFeO3 and LaFeO3 have Tc = 640 K and TN = 750 K respectively

parse_full_sentence = True¶

specifier_phrase¶: Specifier Phrase

prefix_only¶: prefix

prefix¶: specifier and prefix

single_cem¶: Any cem

unit¶: Unit element

value_with_optional_unit¶: Value possibly followed by a unit

value_phrase¶: Value with unit

list_of_values¶: List of values with either multiple units or one at the end

list_of_cems¶: List of cems e.g. cem1, cem2, cem3 and cem4

single_specifier_and_value_with_optional_unit¶: Specifier plus value and possible unit

single_specifier_and_value¶: Specifier value and unit

list_of_properties¶: List of specifiers and units

multi_entity_phrase_1¶: Single compound, multiple specifiers, values e.g. BiFeO3 has TC1 = 1093 K and Tc2 = 640 K

multi_entity_phrase_2¶: single compound, single specifier, multiple transitions e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K

multi_entity_phrase_3a¶: multiple compounds, single specifier, multiple transitions cems first e.g. In BiFeO3 and LaFeO3 Tc are found to be 640 and 750 K

multi_entity_phrase_3b¶: multiple compounds, single specifier, multiple transitions cems last e.g. Tc = 750 and 640 K in LaFeO3 and BiFeO3, respectively

multi_entity_phrase_3c¶: multiple compounds, single specifier, multiple transitions cems first e.g. Tc of BiFeO3 and LaFeO3 are found to be 640 and 750 K

multi_entity_phrase_3¶: Combined phrases of type 3

multi_entity_phrase_4a¶: multiple compounds, single specifier, single transition e.g. TC of 640 K in BiFeO3, LaFeO3 and MnO

multi_entity_phrase_4b¶: Cems first

multi_entity_phrase_4¶

multi_entity_phrase_5¶: multiple compounds, single specifier, multiple transitions cems last e.g. curie temperatures from 100 K in MnO to 300 K in NiO

root¶

interpret(result, start, end)[source]¶

interpret_multi_entity_1(result, start, end)[source]¶: Interpret phrases that have a single CEM and multiple values with multiple specifiers

interpret_multi_entity_2(result, start, end)[source]¶: single compound, single specifier, multiple transitions e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K

interpret_multi_entity_3(result, start, end)[source]¶: interpret multiple compounds, single specifier, multiple transitions

interpret_multi_entity_4(result, start, end)[source]¶: interpret multiple compounds, single specifier, single transition

interpret_multi_entity_5(result, start, end)[source]¶: interpret multiple compounds, single specifier, multiple transitions

.parse.tg¶

Glass transition temperature parser.

class chemdataextractor.parse.tg.TgParser[source]¶

Bases: chemdataextractor.parse.base.BaseParser

root = <chemdataextractor.parse.elements.First object>¶

interpret(result, start, end)[source]¶

.parse.uvvis¶

UV-vis text parser.

class chemdataextractor.parse.uvvis.UvvisParser[source]¶

Bases: chemdataextractor.parse.base.BaseSentenceParser

root = <chemdataextractor.parse.elements.And object>¶

interpret(result, start, end)[source]¶
