
Chemical property parsers. Parsers were refactored in 2.0, which introduced breaking changes for older code. Refer to the examples and the 2.0 migration guide for an overview of the changes.

Parse text using rule-based grammars.


Actions to perform during parsing.

chemdataextractor.parse.actions.flatten(tokens, start, result)[source]

Replace all child results with their text contents.

chemdataextractor.parse.actions.join(tokens, start, result)[source]

Join tokens into a single string with spaces between.

chemdataextractor.parse.actions.merge(tokens, start, result)[source]

Join tokens into a single string with no spaces.

chemdataextractor.parse.actions.strip_stop(tokens, start, result)[source]

Remove trailing full stop from tokens.

chemdataextractor.parse.actions.fix_whitespace(tokens, start, result)[source]

Fix whitespace around hyphens and commas. Can be used to remove whitespace tokenization artefacts.
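
As a rough illustration of how these actions reshape results, the sketch below imitates join, merge and strip_stop on plain xml.etree.ElementTree elements. The names make, join_sketch, merge_sketch and strip_stop_sketch are hypothetical stand-ins written for this example, not the library's implementations, and the assumption that each result element carries its token text in .text is made for illustration only.

```python
import xml.etree.ElementTree as ET


def make(tag, text):
    """Build a one-token result element (helper for this example only)."""
    el = ET.Element(tag)
    el.text = text
    return el


def _texts(result):
    """Collect the text of every element and its descendants, in order."""
    return [node.text for el in result for node in el.iter() if node.text]


def join_sketch(tokens, start, result):
    """Stand-in for join: a single element, texts separated by spaces."""
    if not result:
        return result
    return [make(result[0].tag, " ".join(_texts(result)))]


def merge_sketch(tokens, start, result):
    """Stand-in for merge: a single element, texts concatenated directly."""
    if not result:
        return result
    return [make(result[0].tag, "".join(_texts(result)))]


def strip_stop_sketch(tokens, start, result):
    """Stand-in for strip_stop: drop a trailing full stop from each text."""
    for el in result:
        if el.text and el.text.endswith("."):
            el.text = el.text[:-1]
    return result


name = join_sketch([], 0, [make("name", "sodium"), make("name", "chloride")])
value = merge_sketch([], 0, [make("value", "1"), make("value", "."), make("value", "5")])
label = strip_stop_sketch([], 0, [make("label", "3a.")])
print(name[0].text)   # sodium chloride
print(value[0].text)  # 1.5
print(label[0].text)  # 3a
```

Every action shares the (tokens, start, result) signature shown above and returns the list of result elements, which is why actions can be chained on one parse expression.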


Parser for automatic parsing, without user-written parsing rules. Mainly used for tables.

Models must be constructed in a certain way for them to work optimally with autoparsers. Namely, they should have:

  • A specifier field with an associated parse expression (Optional, only required if autoparsers are desired). These parse expressions will be updated automatically using forward-looking Interdependency Resolution if the updatable flag is set to True.
  • These specifiers should also have required set to True so that spurious matches are not found.
  • If applicable, a compound entity, named compound.

Any parse_expressions set in the model should have an added action to ensure that the results are a single word. An example would be to call add_action(join) on each parse expression.


Construct an element for detecting units for the dimensions given. Any magnitude modifiers (e.g. kilo) will be automatically handled.

Parameters:dimensions (Dimension) – The dimensions that the element produced will look for.
Returns:An Element to look for units of given dimensions. If None or Dimensionless are passed in, returns None.
Return type:BaseParserElement or None

Construct an element for detecting categories.

Parameters:category (Category) – The Category to look for.
Return type:BaseParserElement or None

Produces a function that checks whether the given results of parsing match the dimensions of the model provided.

Parameters:model (QuantityModel) – The model with which to check dimensions.
Returns:A function which will return True if the results of parsing match the model’s dimensions, False if not.
Return type:function(tuple(list(Element), int) -> bool)

For a list of Base parser entities, creates an entity of structure. For example, with 4 entities in the list, the output is:

(entities[0] | entities[1] | entities[2] | entities[3])
Parameters:entities – BaseParserElement type objects
Returns:BaseParserElement type object
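
The folding of a list into entities[0] | entities[1] | ... can be sketched with functools.reduce. The Alt class and any_of function below are hypothetical stand-ins that only mimic the | operator of parser elements; they are not part of ChemDataExtractor.

```python
from functools import reduce
import operator


class Alt:
    """Hypothetical stand-in for a parser element that supports `|`."""

    def __init__(self, *names):
        self.names = list(names)

    def __or__(self, other):
        # Combining two alternatives keeps the left-to-right order.
        return Alt(*(self.names + other.names))

    def match(self, token):
        """Return the first alternative equal to the token, else None."""
        for name in self.names:
            if name == token:
                return name
        return None


def any_of(entities):
    """Fold a non-empty list into entities[0] | entities[1] | ..."""
    return reduce(operator.or_, entities)


combined = any_of([Alt("Tc"), Alt("TN"), Alt("Tg"), Alt("mp")])
print(combined.names)        # ['Tc', 'TN', 'Tg', 'mp']
print(combined.match("Tg"))  # Tg
```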
class chemdataextractor.parse.auto.BaseAutoParser[source]

Bases: chemdataextractor.parse.base.BaseParser

model = None

Initialize self. See help(type(self)) for accurate signature.

interpret(result, start, end)[source]
class chemdataextractor.parse.auto.AutoSentenceParser(lenient=False, chem_name=<chemdataextractor.parse.elements.First object>)[source]

Bases: chemdataextractor.parse.auto.BaseAutoParser, chemdataextractor.parse.base.BaseSentenceParser

__init__(lenient=False, chem_name=<chemdataextractor.parse.elements.First object>)[source]

Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.auto.AutoTableParser(chem_name=<chemdataextractor.parse.elements.First object>)[source]

Bases: chemdataextractor.parse.auto.BaseAutoParser, chemdataextractor.parse.base.BaseTableParser

Additions for automated parsing of tables

__init__(chem_name=<chemdataextractor.parse.elements.First object>)[source]

Initialize self. See help(type(self)) for accurate signature.



Base classes for parsing sentences and tables.

class chemdataextractor.parse.base.BaseParser[source]

Bases: object

model = None
trigger_phrase = None

Optional BaseParserElement instance. All sentences are run through this before the full root phrase is applied to the sentence. If nothing is found for this phrase, the sentence will not go through the full root phrase. This is done for performance reasons, and if not set, ChemDataExtractor will perform as it did in previous versions. If this phrase is set to an appropriate value, it can help ChemDataExtractor perform at up to 2x its previous speed.

To ensure that this works as intended, the BaseParserElement should be a simple parse rule (substantially simpler than the root) that takes little time to process.
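
The idea behind trigger_phrase can be sketched without the library: a cheap check runs first, and the expensive rule runs only on sentences that pass it. In the sketch below, regular expressions stand in for parser elements, and make_parser, root and trigger are names invented for this example.

```python
import re


def make_parser(root, trigger=None):
    """Return a parse function that applies `root` only if `trigger` matches."""
    stats = {"root_runs": 0}

    def parse(sentence):
        if trigger is not None and trigger.search(sentence) is None:
            return []  # cheap rejection: the full rule is never applied
        stats["root_runs"] += 1
        return root.findall(sentence)

    parse.stats = stats
    return parse


# The "expensive" full rule and a much simpler trigger for it.
root = re.compile(r"(?:melting point|m\.p\.)\s+(?:of|is|=)?\s*(\d+(?:\.\d+)?)\s*°C")
trigger = re.compile(r"°C")

parse = make_parser(root, trigger)
sentences = [
    "The mixture was stirred overnight.",
    "The product had a melting point of 120 °C.",
]
results = [parse(s) for s in sentences]
print(results)                   # [[], ['120']]
print(parse.stats["root_runs"])  # 1
```

The saving depends on how many sentences the trigger rejects and on how much cheaper it is than the root, which is why the trigger should be kept simple.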

interpret(result, start, end)[source]

Extract the error from a string


from chemdataextractor.parse.base import BaseParser

bp = BaseParser()
test_string = '150±5'
end_value = bp.extract_error(test_string)
print(end_value) # 5
Parameters:string (str) – A representation of the value and error as a string
Returns:The error expressed as a float.
Return type:float

Takes a string and returns a list of floats representing the string given.


from chemdataextractor.parse.base import BaseParser

bp = BaseParser()
test_string = '150 to 160'
end_value = bp.extract_value(test_string)
print(end_value) # [150., 160.]
Parameters:string (str) – A representation of the values as a string
Returns:The value expressed as a list of floats of length 1 if the value had no range, and as a list of floats of length 2 if it was a range.
Return type:list(float)
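
For readers without the library at hand, the behaviour documented for value and error extraction can be approximated with regular expressions. This is a standalone sketch: extract_value_sketch and extract_error_sketch are hypothetical names, and the accepted formats (a plain number, a range written with "to" or "-", and an optional "±" error) are assumptions made for the example, not a description of the library's implementation.

```python
import re

_NUMBER = r"[-+]?\d+(?:\.\d+)?"


def extract_error_sketch(string):
    """Return the number following '±' as a float, or None if absent."""
    found = re.search(r"±\s*(" + _NUMBER + r")", string)
    return float(found.group(1)) if found else None


def extract_value_sketch(string):
    """Return [value] for a single number, or [low, high] for a range."""
    main = string.split("±")[0]  # ignore any error part
    found = re.fullmatch(
        r"\s*(" + _NUMBER + r")\s*(?:to|-)\s*(" + _NUMBER + r")\s*", main
    )
    if found:
        return [float(found.group(1)), float(found.group(2))]
    return [float(main)]


print(extract_value_sketch("150 to 160"))  # [150.0, 160.0]
print(extract_value_sketch("150±5"))       # [150.0]
print(extract_error_sketch("150±5"))       # 5.0
```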
extract_units(string, strict=False)[source]

Takes a string and returns a Unit. If strict is True, raises TypeError when the dimensions do not match the expected dimensions or the string contains extraneous characters. For example, given the string Fe when looking for a temperature, strict=False would return Fahrenheit, whereas strict=True would raise a TypeError.


bp = QuantityParser()
bp.model = QuantityModel()
bp.model.dimensions = Temperature() * Length()**0.5 * Time()**(1.5)
test_string = 'Kh2/(km/s)-1/2'
end_units = bp.extract_units(test_string, strict=True)
print(end_units) # Units of: (10^1.5) * Hour^(2.0)  Meter^(0.5)  Second^(-0.5)  Kelvin^(1.0)
Parameters:
  • string (str) – A representation of the units as a string
  • strict (bool) – Whether to raise a TypeError if the dimensions of the parsed units do not have the expected dimensions.
Returns:The string expressed as a Unit

Return type:


class chemdataextractor.parse.base.BaseSentenceParser[source]

Bases: chemdataextractor.parse.base.BaseParser

Base class for parsing sentences. To implement a parser for a new property, implement the interpret function.


Parse a sentence. This function is primarily called by the records property of Sentence.

Parameters:tokens (list[(token,tag)]) – List of tokens for parsing. When this method is called by chemdataextractor.doc.text.Sentence.records, the tokens passed in are chemdataextractor.doc.text.Sentence.tagged_tokens.
Returns:All the models found in the sentence.
Return type:Iterator[chemdataextractor.model.base.BaseModel]
class chemdataextractor.parse.base.BaseTableParser[source]

Bases: chemdataextractor.parse.base.BaseParser

Base class for parsing new-style tables. To implement a parser for a new property, implement the interpret function.


Parse a cell. This function is primarily called by the records property of Table.

Parameters:tokens (list[(token,tag)]) – List of tokens for parsing. When this method is called by chemdataextractor.doc.text.table.Table, the tokens passed in are in the same form as chemdataextractor.doc.text.Sentence.tagged_tokens, after the category table has been flattened into a sentence.
Returns:All the models found in the table.
Return type:Iterator[chemdataextractor.model.base.BaseModel]


Chemical entity mention parser elements. Code authors: Matt Swain (mcs07@cam.ac.uk), Callum Court (cc889@cam.ac.uk)

chemdataextractor.parse.cem.strict_chemical_label = <chemdataextractor.parse.elements.And object>

Chemical label. Very permissive - must be used in context to avoid false positives.

chemdataextractor.parse.cem.chemical_label_phrase1 = <chemdataextractor.parse.elements.And object>

Chemical label with a label type before

chemdataextractor.parse.cem.chemical_label_phrase2 = <chemdataextractor.parse.elements.And object>

Chemical label with synthesis of before

chemdataextractor.parse.cem.element_symbol = <chemdataextractor.parse.elements.Regex object>

Mostly unambiguous element symbols

chemdataextractor.parse.cem.registry_number = <chemdataextractor.parse.elements.First object>

Registry number patterns

chemdataextractor.parse.cem.amino_acid = <chemdataextractor.parse.elements.Regex object>

Amino acid abbreviations. His removed, too ambiguous

chemdataextractor.parse.cem.formula = <chemdataextractor.parse.elements.First object>

Chemical formula patterns, updated to include Inorganic compound formulae

chemdataextractor.parse.cem.other_solvent = <chemdataextractor.parse.elements.First object>

Solvent names.


Convert role text into standardized form.

class chemdataextractor.parse.cem.CompoundParser[source]

Bases: chemdataextractor.parse.base.BaseSentenceParser

Chemical name possibly with an associated label.

interpret(result, start, end)[source]
class chemdataextractor.parse.cem.ChemicalLabelParser[source]

Bases: chemdataextractor.parse.base.BaseSentenceParser

Chemical label occurrences with no associated name.

interpret(result, start, end)[source]
class chemdataextractor.parse.cem.CompoundHeadingParser[source]

Bases: chemdataextractor.parse.base.BaseSentenceParser

Better matching of abbreviated names in dedicated compound headings.

root = <chemdataextractor.parse.elements.Group object>
interpret(result, start, end)[source]
class chemdataextractor.parse.cem.CompoundTableParser[source]

Bases: chemdataextractor.parse.base.BaseTableParser

entities = <chemdataextractor.parse.elements.First object>
interpret(result, start, end)[source]


Common parser elements.



Parser elements.

exception chemdataextractor.parse.elements.ParseException(tokens, i=0, msg=None, element=None)[source]

Bases: Exception

Exception thrown by a ParserElement when it doesn’t match input.

__init__(tokens, i=0, msg=None, element=None)[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod wrap(parse_exception)[source]

Make name safe for use in XML output.

class chemdataextractor.parse.elements.BaseParserElement[source]

Bases: object

Abstract base parser element class.


Initialize self. See help(type(self)) for accurate signature.

actions = None

name for BaseParserElement. This is used to set the name of the Element when a result is found

Type:str or None
streamlined = None

list of actions that will be applied to the results after parsing. Actions are functions with arguments of (tokens, start, result)


Add a condition to the parser element. The condition must be a function that takes a match and return True or False, i.e. a function which takes tuple(list(Element), int) and returns bool. If the function evaluates True, the match is kept, while if the function evaluates False, the match is discarded. The condition is executed after any other actions.
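
A minimal sketch of the condition mechanism follows. DigitElement is a hypothetical miniature element written for this example (it matches digit-only tokens and yields plain strings rather than XML elements); only the shape of the condition, a function from (list of results, index) to bool, follows the description above.

```python
class DigitElement:
    """Hypothetical miniature parser element: matches digit-only tokens."""

    def __init__(self):
        self.condition = None

    def add_condition(self, condition):
        self.condition = condition
        return self

    def scan(self, tokens):
        for i, token in enumerate(tokens):
            if not token.isdigit():
                continue
            match = ([token], i)
            # The condition runs last and decides whether the match is kept.
            if self.condition is None or self.condition(match):
                yield match


def below_1000(match):
    """Condition: keep only matches whose value is under 1000."""
    elements, _index = match
    return int(elements[0]) < 1000


tokens = ["heated", "to", "350", "K", "in", "2019"]
unfiltered = list(DigitElement().scan(tokens))
filtered = list(DigitElement().add_condition(below_1000).scan(tokens))
print(unfiltered)  # [(['350'], 2), (['2019'], 5)]
print(filtered)    # [(['350'], 2)]
```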

scan(tokens, max_matches=9223372036854775807, overlap=False)[source]

Scans for matches in given tokens.

Parameters:
  • tokens (list(tuple(string, string))) – A tokenized representation of the text to scan. The first string in the tuple is the content, typically a word, and the second string is the part of speech tag.
  • max_matches (int) – The maximum number of matches to look for. Default is the maximum size possible for a list.
  • overlap (bool) – Whether the found results are allowed to overlap. Default False.
Returns:A generator of the results found. Each result is a tuple with the first element being a list of elements found, and the second and third elements are the start and end indices representing the span of the result.
Return type:generator(tuple(list(lxml.etree.Element), int, int))
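
The effect of max_matches and overlap can be sketched with a toy scanner. scan_sketch and two_numbers are hypothetical names; the loop is an assumption about how such a scan might advance (resume after the match by default, or one token later when overlap is allowed), not a copy of the library code.

```python
import sys


def scan_sketch(match_at, tokens, max_matches=sys.maxsize, overlap=False):
    """Yield (matched_tokens, start, end) for each match of `match_at`."""
    matches = 0
    i = 0
    while i < len(tokens) and matches < max_matches:
        end = match_at(tokens, i)
        if end is None or end <= i:
            i += 1  # no match here: try the next token
            continue
        matches += 1
        yield tokens[i:end], i, end
        # Overlapping scans restart one token later, others after the match.
        i = i + 1 if overlap else end


def two_numbers(tokens, i):
    """Toy rule: two consecutive digit tokens starting at position i."""
    if i + 1 < len(tokens) and tokens[i].isdigit() and tokens[i + 1].isdigit():
        return i + 2
    return None


tokens = ["1", "2", "3", "K"]
plain = list(scan_sketch(two_numbers, tokens))
overlapping = list(scan_sketch(two_numbers, tokens, overlap=True))
limited = list(scan_sketch(two_numbers, tokens, max_matches=1, overlap=True))
print(plain)        # [(['1', '2'], 0, 2)]
print(overlapping)  # [(['1', '2'], 0, 2), (['2', '3'], 1, 3)]
print(limited)      # [(['1', '2'], 0, 2)]
```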

parse(tokens, i, actions=True)[source]

Parse given tokens and return results

Parameters:
  • tokens (list(tuple(string, string))) – A tokenized representation of the text to scan. The first string in the tuple is the content, typically a word, and the second string is the part of speech tag.
  • i (int) – The index at which to start scanning from
  • actions (bool) – Whether the actions attached to this element will be executed. Default True.
Returns:A tuple whose first element is a list of the elements found (None if no results were found) and whose second element is the last index investigated.
Return type:tuple(list(Element) or None, int)

try_parse(tokens, i)[source]

Streamlines internal representations. e.g., if we have something like And(And(And(And(a), b), c), d), streamline this to And(a, b, c, d)

class chemdataextractor.parse.elements.Any[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

Always match a single token.

class chemdataextractor.parse.elements.NoMatch[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

class chemdataextractor.parse.elements.Word(match)[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

Match token text exactly. Case-sensitive.


Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Tag(match)[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

Match tag exactly.


Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.IWord(match)[source]

Bases: chemdataextractor.parse.elements.Word

Case-insensitive match token text.


Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Regex(pattern, flags=0, group=None)[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

Match token text with regular expression.

__init__(pattern, flags=0, group=None)[source]

Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Start[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

Match at start of tokens.


Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.End[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

Match at end of tokens.


Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.ParseExpression(exprs)[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

Abstract class for combining and post-processing parsed tokens.


Initialize self. See help(type(self)) for accurate signature.


Streamlines internal representations. e.g., if we have something like And(And(And(And(a), b), c), d), streamline this to And(a, b, c, d)
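
The flattening described here can be sketched as a recursive merge of nested sequences. AndSketch is a hypothetical stand-in class holding plain strings, written only to illustrate the transformation.

```python
class AndSketch:
    """Hypothetical stand-in for a sequence expression."""

    def __init__(self, *exprs):
        self.exprs = list(exprs)

    def streamline(self):
        """Merge nested AndSketch children into this expression, in order."""
        flat = []
        for expr in self.exprs:
            if isinstance(expr, AndSketch):
                flat.extend(expr.streamline().exprs)
            else:
                flat.append(expr)
        self.exprs = flat
        return self


nested = AndSketch(AndSketch(AndSketch(AndSketch("a"), "b"), "c"), "d")
print(nested.streamline().exprs)  # ['a', 'b', 'c', 'd']
```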

class chemdataextractor.parse.elements.And(exprs)[source]

Bases: chemdataextractor.parse.elements.ParseExpression

Match all in the given order. Can probably be replaced by the plus operator ‘+’?


Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Or(exprs)[source]

Bases: chemdataextractor.parse.elements.ParseExpression

Match the longest. Can probably be replaced by the pipe operator ‘|’.

class chemdataextractor.parse.elements.First(exprs)[source]

Bases: chemdataextractor.parse.elements.ParseExpression

Match the first.


Initialize self. See help(type(self)) for accurate signature.
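
The difference between Or (match the longest) and First (match the first) can be sketched with toy matchers that return the end index of a match or None. match_literal, first_sketch and or_sketch are hypothetical helpers for this example, not library functions.

```python
def match_literal(words):
    """Build a matcher that returns the end index if `words` occur at i."""
    def matcher(tokens, i):
        end = i + len(words)
        return end if tokens[i:end] == words else None
    return matcher


def first_sketch(matchers):
    """Return the result of the first alternative that matches."""
    def matcher(tokens, i):
        for m in matchers:
            end = m(tokens, i)
            if end is not None:
                return end
        return None
    return matcher


def or_sketch(matchers):
    """Try every alternative and return the longest match."""
    def matcher(tokens, i):
        ends = [e for e in (m(tokens, i) for m in matchers) if e is not None]
        return max(ends) if ends else None
    return matcher


short = match_literal(["melting"])
longer = match_literal(["melting", "point"])
tokens = ["melting", "point", "of", "ice"]
print(first_sketch([short, longer])(tokens, 0))  # 1
print(or_sketch([short, longer])(tokens, 0))     # 2
```

With First, the order of the alternatives matters: placing the longer alternative first gives the same result as Or here without trying every branch.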

class chemdataextractor.parse.elements.ParseElementEnhance(expr)[source]

Bases: chemdataextractor.parse.elements.BaseParserElement

Abstract class for combining and post-processing parsed tokens.


Initialize self. See help(type(self)) for accurate signature.


Streamlines internal representations. e.g., if we have something like And(And(And(And(a), b), c), d), streamline this to And(a, b, c, d)

class chemdataextractor.parse.elements.FollowedBy(expr)[source]

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Check ahead if matches.


Tn + FollowedBy('Neel temperature')
Tn will match only if followed by 'Neel temperature', but 'Neel temperature' will not be part of the output/tree
class chemdataextractor.parse.elements.Not(expr)[source]

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Check ahead to disallow a match with the given parse expression.


Tn + Not('some_string')
Tn will match if not followed by 'some_string'
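
Both lookahead elements can be pictured as zero-width checks: they inspect the tokens ahead but consume nothing. In the sketch below, matchers return the index after the match or None; literal, seq, followed_by and not_ are hypothetical helpers for illustration.

```python
def literal(word):
    """Match one token equal to `word`."""
    def matcher(tokens, i):
        return i + 1 if i < len(tokens) and tokens[i] == word else None
    return matcher


def seq(*matchers):
    """Match each matcher in turn, threading the position through."""
    def matcher(tokens, i):
        for m in matchers:
            i = m(tokens, i)
            if i is None:
                return None
        return i
    return matcher


def followed_by(inner):
    """Succeed without consuming anything if `inner` matches ahead."""
    def matcher(tokens, i):
        return i if inner(tokens, i) is not None else None
    return matcher


def not_(inner):
    """Succeed without consuming anything if `inner` does NOT match ahead."""
    def matcher(tokens, i):
        return i if inner(tokens, i) is None else None
    return matcher


positive = seq(literal("Tn"), followed_by(literal("Neel")))
negative = seq(literal("Tn"), not_(literal("Neel")))
print(positive(["Tn", "Neel", "temperature"], 0))  # 1
print(positive(["Tn", "is"], 0))                   # None
print(negative(["Tn", "is"], 0))                   # 1
print(negative(["Tn", "Neel"], 0))                 # None
```

In both successful cases the returned position is 1, showing that only Tn was consumed and the lookahead target is left for later rules.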
class chemdataextractor.parse.elements.ZeroOrMore(expr)[source]

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Optional repetition of zero or more of the given expression.

class chemdataextractor.parse.elements.OneOrMore(expr)[source]

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Repetition of one or more of the given expression.

class chemdataextractor.parse.elements.Optional(expr)[source]

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Can be present but doesn’t need to be. If present, will be added to the result/tree.


Initialize self. See help(type(self)) for accurate signature.

class chemdataextractor.parse.elements.Group(expr)[source]

Bases: chemdataextractor.parse.elements.ParseElementEnhance

For nested tags; will group argument and give it a label, preserving the original sub-tags. Otherwise, the default behaviour would be to rename the outermost tag in the argument. Usage: Group(some_text)('new_tag'), where some_text is a previously tagged expression.

class chemdataextractor.parse.elements.SkipTo(expr, include=False)[source]

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Skips to the next occurrence of expression. Does not add the next occurrence of expression to the parse tree. For example:

entities + SkipTo(entities)

will output entities only once. Whereas:

entities + SkipTo(entities) + entities

will output entities as well as the second occurrence of entities after an arbitrary number of tokens in between.

__init__(expr, include=False)[source]

Initialize self. See help(type(self)) for accurate signature.
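
The two SkipTo examples above can be reproduced with toy matchers that return (end index, captured tokens) or None. one_of, skip_to and seq are hypothetical helpers, and the sketch ignores the include option.

```python
def one_of(words):
    """Match one token that is in `words` and capture it."""
    def matcher(tokens, i):
        if i < len(tokens) and tokens[i] in words:
            return i + 1, [tokens[i]]
        return None
    return matcher


def skip_to(inner):
    """Advance to the next position where `inner` matches, capturing nothing."""
    def matcher(tokens, i):
        for j in range(i, len(tokens)):
            if inner(tokens, j) is not None:
                return j, []  # stop just before the match
        return None
    return matcher


def seq(*matchers):
    """Match each matcher in turn and concatenate what they capture."""
    def matcher(tokens, i):
        captured = []
        for m in matchers:
            outcome = m(tokens, i)
            if outcome is None:
                return None
            i, found = outcome
            captured.extend(found)
        return i, captured
    return matcher


entities = one_of({"BiFeO3", "LaFeO3"})
tokens = ["BiFeO3", "is", "compared", "with", "LaFeO3"]
once = seq(entities, skip_to(entities))(tokens, 0)
twice = seq(entities, skip_to(entities), entities)(tokens, 0)
print(once)   # (4, ['BiFeO3'])
print(twice)  # (5, ['BiFeO3', 'LaFeO3'])
```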

class chemdataextractor.parse.elements.Hide(expr)[source]

Bases: chemdataextractor.parse.elements.ParseElementEnhance

Converter for ignoring the results of a parsed expression. The result does not appear in the generated XML element tree, but the expression is still part of the rule.


alias of chemdataextractor.parse.elements.Word


alias of chemdataextractor.parse.elements.IWord


alias of chemdataextractor.parse.elements.Regex


alias of chemdataextractor.parse.elements.Tag


alias of chemdataextractor.parse.elements.Hide


IR spectrum text parser.

chemdataextractor.parse.ir.extract_units(tokens, start, result)[source]

Extract units from the bracketed text after nu.

class chemdataextractor.parse.ir.IrParser[source]

Bases: chemdataextractor.parse.base.BaseSentenceParser

root = <chemdataextractor.parse.elements.And object>
interpret(result, start, end)[source]


Melting point parser.

class chemdataextractor.parse.mp.MpParser[source]

Bases: chemdataextractor.parse.base.BaseParser

root = <chemdataextractor.parse.elements.First object>
interpret(result, start, end)[source]


NMR text parser.

chemdataextractor.parse.nmr.fix_nmr_peak_whitespace_error(tokens, start, result)[source]
chemdataextractor.parse.nmr.strip_delta(tokens, start, result)[source]
class chemdataextractor.parse.nmr.NmrParser[source]

Bases: chemdataextractor.parse.base.BaseParser

root = <chemdataextractor.parse.elements.And object>

Initialize self. See help(type(self)) for accurate signature.

interpret(result, start, end)[source]


Basic property parser template for Quantity Models

class chemdataextractor.parse.template.QuantityModelTemplateParser[source]

Bases: chemdataextractor.parse.auto.BaseAutoParser, chemdataextractor.parse.base.BaseSentenceParser

Template parser for QuantityModel-type structures

Finds Cem, Specifier, Value and Units from single sentences

Other entities are merged contextually


The model specifier


Value and units


CEM phrases


Specifier prefix phrase e.g. Tc equal to


Specifier and value + units


Phrases ordered CEM, Specifier, Value, Unit


Phrases ordered specifier, value, unit, CEM


Phrases ordered value unit specifier cem


Root Phrases

class chemdataextractor.parse.template.MultiQuantityModelTemplateParser[source]

Bases: chemdataextractor.parse.auto.BaseAutoParser, chemdataextractor.parse.base.BaseSentenceParser

Template for parsing sentences that contain nested or chained entities


  1. Single compound, multiple specifiers, multiple phase transitions, e.g. BiFeO3 has TC = 1093 K and TN = 640 K
  2. Single compound, single specifier, multiple transitions, e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K
  3. Multiple compounds, single specifier, multiple transitions, e.g. TC in BiFeO3 and LaFeO3 of 640 and 750 K
  4. Multiple compounds, single specifier, single transition, e.g. TC of 640 K in BiFeO3, LaFeO3 and MnO
  5. Multiple compounds, multiple specifiers, multiple transitions, e.g. BiFeO3 and LaFeO3 have Tc = 640 K and TN = 750 K respectively
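
How compounds and values might be paired once such a sentence has been matched can be sketched as follows. pair_values is a hypothetical helper, and its rules (share a single value across compounds, pair one-to-one when the counts agree, give up otherwise) are assumptions made for illustration; they are not taken from the library's interpret methods.

```python
def pair_values(compounds, values):
    """Pair compounds with values, or return [] if the pairing is ambiguous."""
    if len(values) == 1:
        # One value shared by every compound (as in case 4).
        return [(c, values[0]) for c in compounds]
    if len(compounds) == 1:
        # One compound with several values (as in case 2).
        return [(compounds[0], v) for v in values]
    if len(compounds) == len(values):
        # "Respectively"-style one-to-one pairing (as in case 3).
        return list(zip(compounds, values))
    return []


print(pair_values(["BiFeO3"], ["1093 K", "640 K"]))
# [('BiFeO3', '1093 K'), ('BiFeO3', '640 K')]
print(pair_values(["BiFeO3", "LaFeO3"], ["640 K", "750 K"]))
# [('BiFeO3', '640 K'), ('LaFeO3', '750 K')]
print(pair_values(["BiFeO3", "LaFeO3", "MnO"], ["640 K"]))
# [('BiFeO3', '640 K'), ('LaFeO3', '640 K'), ('MnO', '640 K')]
```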

Specifier Phrase


Specifier and prefix


Any cem


Unit element


Value possibly followed by a unit


Value with unit


List of values with either multiple units or one at the end


List of cems e.g. cem1, cem2, cem3 and cem4


Specifier plus value and possible unit


Specifier value and unit


List of specifiers and units


Single compound, multiple specifiers, values e.g. BiFeO3 has TC1 = 1093 K and Tc2 = 640 K


single compound, single specifier, multiple transitions e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K


multiple compounds, single specifier, multiple transitions cems first e.g. TC in BiFeO3 and LaFeO3 of 640 and 750 K


multiple compounds, single specifier, multiple transitions cems last e.g. Tc = 750 and 640 K in LaFeO3 and BiFeO3, respectively


multiple compounds, single specifier, multiple transitions cems last e.g. curie temperatures from 100 K in MnO to 300 K in NiO


Combined phrases of type 3


multiple compounds, single specifier, single transition e.g. TC of 640 K in BiFeO3, LaFeO3 and MnO


Cems first

interpret(result, start, end)[source]
interpret_multi_entity_1(result, start, end)[source]

Interpret phrases that have a single CEM and multiple values with multiple specifiers

interpret_multi_entity_2(result, start, end)[source]

single compound, single specifier, multiple transitions e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K

interpret_multi_entity_3(result, start, end)[source]

Interpret phrases that have multiple compounds, a single specifier and multiple transitions

interpret_multi_entity_4(result, start, end)[source]

Interpret phrases that have multiple compounds, a single specifier and a single transition


Glass transition temperature parser.

class chemdataextractor.parse.tg.TgParser[source]

Bases: chemdataextractor.parse.base.BaseParser

root = <chemdataextractor.parse.elements.First object>
interpret(result, start, end)[source]


UV-vis text parser.

class chemdataextractor.parse.uvvis.UvvisParser[source]

Bases: chemdataextractor.parse.base.BaseSentenceParser

root = <chemdataextractor.parse.elements.And object>
interpret(result, start, end)[source]