.parse¶
Chemical property parsers. Parsers were refactored in 2.0, which introduced breaking changes to older code. Please refer to the examples and the 2.0 migration guide for an overview of the changes.
Parse text using rule-based grammars.
.parse.actions¶
Actions to perform during parsing.
-
chemdataextractor.parse.actions.
flatten
(tokens, start, result)[source]¶ Replace all child results with their text contents.
-
chemdataextractor.parse.actions.
join
(tokens, start, result)[source]¶ Join tokens into a single string with spaces between.
-
chemdataextractor.parse.actions.
merge
(tokens, start, result)[source]¶ Join tokens into a single string with no spaces.
.parse.auto¶
Parser for automatic parsing, without user-written parsing rules. Mainly used for tables.
Models must be constructed in a certain way for them to work optimally with autoparsers. Namely, they should have:
A specifier field with an associated parse expression (Optional, only required if autoparsers are desired). These parse expressions will be updated automatically using forward-looking Interdependency Resolution if the updatable flag is set to True.
These specifiers should also have required set to True so that spurious matches are not found.
If applicable, a compound entity, named compound.
Any parse_expressions set in the model should have an added action to ensure that the results are a single word. An example would be to call add_action(join) on each parse expression.
-
chemdataextractor.parse.auto.
construct_unit_element
(dimensions)[source]¶ Construct an element for detecting units for the dimensions given. Any magnitude modifiers (e.g. kilo) will be automatically handled.
- Parameters:
dimensions (Dimension) – The dimensions that the element produced will look for.
- Returns:
An Element to look for units of given dimensions. If None or Dimensionless are passed in, returns None.
- Return type:
-
chemdataextractor.parse.auto.
construct_category_element
(category_dict)[source]¶ Construct an element for detecting categories.
- Parameters:
category (Category) – The Category to look for.
- Return type:
-
chemdataextractor.parse.auto.
match_dimensions_of
(model)[source]¶ Produces a function that checks whether the given results of parsing match the dimensions of the model provided.
- Parameters:
model (QuantityModel) – The model with which to check dimensions.
- Returns:
A function which will return True if the results of parsing match the model’s dimensions, False if not.
- Return type:
-
chemdataextractor.parse.auto.
create_entities_list
(entities)[source]¶ For a list of Base parser entities, creates an entity of structure. For example, with 4 entities in the list, the output is:
(entities[0] | entities[1] | entities[2] | entities[3])
- Parameters:
entities – BaseParserElement type objects
- Returns:
BaseParserElement type object
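The chained alternative shown above can be built for a list of any length by folding the | operator over the list. A minimal sketch, assuming only that the elements support |; the Alt class is a hypothetical stand-in for a parser element and create_entities_list_sketch is not the library function.

```python
from functools import reduce
import operator


class Alt:
    """Hypothetical stand-in for a parser element that supports `|`."""

    def __init__(self, *names):
        self.names = list(names)

    def __or__(self, other):
        # Combining two alternatives keeps the left-to-right order.
        return Alt(*(self.names + other.names))

    def matches(self, token):
        return token in self.names


def create_entities_list_sketch(entities):
    """Fold `|` over the list: e[0] | e[1] | ... | e[n-1]."""
    return reduce(operator.or_, entities)


combined = create_entities_list_sketch([Alt("a"), Alt("b"), Alt("c"), Alt("d")])
print(combined.names)  # ['a', 'b', 'c', 'd']
```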
-
class
chemdataextractor.parse.auto.
BaseAutoParser
[source]¶ Bases:
chemdataextractor.parse.base.BaseParser
-
model
= None¶
-
-
class
chemdataextractor.parse.auto.
AutoSentenceParser
(lenient=False, chem_name=<chemdataextractor.parse.elements.First object>, activate_to_range=False)[source]¶ Bases:
chemdataextractor.parse.auto.BaseAutoParser
,chemdataextractor.parse.base.BaseSentenceParser
-
__init__
(lenient=False, chem_name=<chemdataextractor.parse.elements.First object>, activate_to_range=False)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
trigger_phrase
¶
-
root
¶
-
-
class
chemdataextractor.parse.auto.
AutoTableParser
(chem_name=<chemdataextractor.parse.elements.First object>)[source]¶ Bases:
chemdataextractor.parse.auto.BaseAutoParser
,chemdataextractor.parse.base.BaseTableParser
Additions for automated parsing of tables
-
__init__
(chem_name=<chemdataextractor.parse.elements.First object>)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
root
¶
-
.parse.base¶
Base classes for parsing sentences and tables.
-
class
chemdataextractor.parse.base.
BaseParser
[source]¶ Bases:
object
-
model
= None¶
-
trigger_phrase
= None¶
-
skip_section_phrase
= None¶
-
allow_section_phrase
= None¶ Optional
BaseParserElement
instance. All sentences are run through this before the full root phrase is applied to the sentence. If nothing is found for this phrase, the sentence will not go through the full root phrase. This is done for performance reasons, and if not set, ChemDataExtractor will perform as it did in previous versions. If this phrase is set to an appropriate value, it can help ChemDataExtractor perform at up to 2x its previous speed. To ensure that this works as intended, the
BaseParserElement
should be a simple parse rule (substantially simpler than the root
) that takes little time to process.
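The pre-check described above is a common filtering pattern: run a cheap test on every sentence and only hand the sentence to the expensive rule when the test passes. The sketch below illustrates the idea with plain functions over token strings. All names in it (make_parser, cheap_pre_check, expensive_root) are hypothetical and it does not use the library's classes.

```python
def make_parser(pre_check, root):
    """Wrap an expensive root rule with an optional cheap pre-check."""
    stats = {"root_calls": 0}

    def parse(tokens):
        # Skip the expensive rule entirely when the pre-check finds nothing.
        if pre_check is not None and not pre_check(tokens):
            return []
        stats["root_calls"] += 1
        return root(tokens)

    return parse, stats


def cheap_pre_check(tokens):
    """Cheap test: does the sentence mention the specifier at all?"""
    return "Tc" in tokens


def expensive_root(tokens):
    """Stand-in for a complex grammar: find 'Tc = <number>' triples."""
    found = []
    for i in range(len(tokens) - 2):
        if tokens[i] == "Tc" and tokens[i + 1] == "=":
            found.append(("Tc", float(tokens[i + 2])))
    return found


parse, stats = make_parser(cheap_pre_check, expensive_root)
print(parse("The sample was annealed overnight".split()))  # []
print(parse("We measured Tc = 92 K".split()))              # [('Tc', 92.0)]
print(stats["root_calls"])                                 # 1
```

Passing None as the pre-check makes every sentence go through the root rule, which mirrors the "if not set" behaviour described above.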
-
root
¶
-
extract_error
(string)[source]¶ Extract the error from a string
Usage:
bp = BaseParser()
test_string = '150±5'
end_value = bp.extract_error(test_string)
print(end_value) # 5
-
extract_value
(string)[source]¶ Takes a string and returns a list of floats representing the string given.
Usage:
bp = BaseParser()
test_string = '150 to 160'
end_value = bp.extract_value(test_string)
print(end_value) # [150., 160.]
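To make the input and output shapes of extract_value and extract_error concrete, here is a rough standalone approximation that handles only the two string forms shown in the usage examples ('150±5' and '150 to 160'). It is an illustrative sketch, not the library's implementation, and the function names are hypothetical.

```python
import re


def extract_value_sketch(string):
    """Return the value(s) in the string as a list of floats.

    Handles a single number or an 'A to B' range; any '±error' suffix
    is ignored here.
    """
    value_part = string.split("±")[0].strip()
    parts = re.split(r"\s+to\s+", value_part)
    return [float(p) for p in parts]


def extract_error_sketch(string):
    """Return the number after '±' as a float, or None if there is none."""
    if "±" not in string:
        return None
    return float(string.split("±")[1])


print(extract_value_sketch("150 to 160"))  # [150.0, 160.0]
print(extract_value_sketch("150±5"))       # [150.0]
print(extract_error_sketch("150±5"))       # 5.0
```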
-
extract_units
(string, strict=False)[source]¶ Takes a string and returns a Unit. Raises TypeError if strict is set and either the dimensions do not match the expected dimensions or the string has extraneous characters. For example, if the string Fe was given and we were looking for a temperature, strict=False would return Fahrenheit, while strict=True would raise a TypeError.
Usage:
bp = QuantityParser()
bp.model = QuantityModel()
bp.model.dimensions = Temperature() * Length()**0.5 * Time()**(1.5)
test_string = 'Kh2/(km/s)-1/2'
end_units = bp.extract_units(test_string, strict=True)
print(end_units) # Units of: (10^1.5) * Hour^(2.0) Meter^(0.5) Second^(-0.5) Kelvin^(1.0)
-
-
class
chemdataextractor.parse.base.
BaseSentenceParser
[source]¶ Bases:
chemdataextractor.parse.base.BaseParser
Base class for parsing sentences. To implement a parser for a new property, implement the interpret function.
-
parse_full_sentence
= False¶
-
parse_sentence
(sentence)[source]¶ Parse a sentence. This function is primarily called by the
records
property of Sentence.
- Parameters:
tokens (list[(token,tag)]) – List of tokens for parsing. When this method is called by
chemdataextractor.doc.text.Sentence.records
, the tokens passed in are chemdataextractor.doc.text.Sentence.tagged_tokens.
- Returns:
All the models found in the sentence.
- Return type:
Iterator[
chemdataextractor.model.base.BaseModel
]
-
-
class
chemdataextractor.parse.base.
BaseTableParser
[source]¶ Bases:
chemdataextractor.parse.base.BaseParser
Base class for parsing new-style tables. To implement a parser for a new property, implement the interpret function.
-
parse_cell
(cell)[source]¶ Parse a cell. This function is primarily called by the
records
property of Table.
- Parameters:
tokens (list[(token,tag)]) – List of tokens for parsing. When this method is called by
chemdataextractor.doc.text.table.Table
, the tokens passed in are in the same form as chemdataextractor.doc.text.Sentence.tagged_tokens
, after the category table has been flattened into a sentence.
- Returns:
All the models found in the table.
- Return type:
Iterator[
chemdataextractor.model.base.BaseModel
]
-
.parse.cem¶
Chemical entity mention parser elements. Code authors: Matt Swain (mcs07@cam.ac.uk), Callum Court (cc889@cam.ac.uk)
-
chemdataextractor.parse.cem.
standardize_role
(role)[source]¶ Convert role text into standardized form.
-
class
chemdataextractor.parse.cem.
CompoundParser
[source]¶ Bases:
chemdataextractor.parse.base.BaseSentenceParser
Chemical name possibly with an associated label.
-
root
¶
-
-
class
chemdataextractor.parse.cem.
ChemicalLabelParser
[source]¶ Bases:
chemdataextractor.parse.base.BaseSentenceParser
Chemical label occurrences with no associated name.
-
root
¶
-
-
class
chemdataextractor.parse.cem.
CompoundHeadingParser
[source]¶ Bases:
chemdataextractor.parse.base.BaseSentenceParser
Better matching of abbreviated names in dedicated compound headings.
-
root
= <chemdataextractor.parse.elements.Group object>¶
-
parse_full_sentence
= True¶
-
.parse.common¶
Common parser elements.
.parse.context¶
.parse.elements¶
Parser elements.
-
exception
chemdataextractor.parse.elements.
ParseException
(tokens, i=0, msg=None, element=None)[source]¶ Bases:
Exception
Exception thrown by a ParserElement when it doesn’t match input.
-
class
chemdataextractor.parse.elements.
BaseParserElement
[source]¶ Bases:
object
Abstract base parser element class.
-
name
¶ Name for the BaseParserElement. This is used to set the name of the Element when a result is found.
-
actions
= None¶ List of actions that will be applied to the results after parsing. Actions are functions with arguments of (tokens, start, result).
- Type:
list(chemdataextractor.parse.actions)
-
streamlined
= None¶
-
with_condition
(condition)[source]¶ Add a condition to the parser element. The condition must be a function that takes a match and returns True or False, i.e. a function which takes tuple(list(Element), int) and returns bool. If the function evaluates to True, the match is kept; if it evaluates to False, the match is discarded. The condition is executed after any other actions.
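The effect of a condition can be pictured as a wrapper that runs the underlying matcher and then discards the match if the predicate rejects it. The sketch below uses plain functions and strings in place of parser elements and XML elements; number, with_condition_sketch and below_1000 are hypothetical names used only for illustration.

```python
def number(tokens, i):
    """Toy matcher: match one all-digit token at position i."""
    if i < len(tokens) and tokens[i].isdigit():
        return [tokens[i]], i + 1
    return None


def with_condition_sketch(matcher, condition):
    """Return a matcher that keeps a match only if condition(match) is True."""

    def conditioned(tokens, i):
        match = matcher(tokens, i)
        if match is None or not condition(match):
            return None
        return match

    return conditioned


def below_1000(match):
    """Condition on a (found, end_index) tuple: the number must be < 1000."""
    found, _end = match
    return int(found[0]) < 1000


plausible_temperature = with_condition_sketch(number, below_1000)
print(plausible_temperature(["300", "K"], 0))   # (['300'], 1)
print(plausible_temperature(["2019", "."], 0))  # None
```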
-
scan
(tokens, max_matches=9223372036854775807, overlap=False)[source]¶ Scans for matches in given tokens.
- Parameters:
tokens (list(tuple(string, string))) – A tokenized representation of the text to scan. The first string in the tuple is the content, typically a word, and the second string is the part of speech tag.
max_matches (int) – The maximum number of matches to look for. Default is the maximum size possible for a list.
overlap (bool) – Whether the found results are allowed to overlap. Default False.
- Returns:
A generator of the results found. Each result is a tuple with the first element being a list of elements found, and the second and third elements are the start and end indices representing the span of the result.
- Return type:
-
parse
(tokens, i, actions=True)[source]¶ Parse given tokens and return results
- Parameters:
tokens (list(tuple(string, string))) – A tokenized representation of the text to scan. The first string in the tuple is the content, typically a word, and the second string is the part of speech tag.
i (int) – The index at which to start scanning from
actions (bool) – Whether the actions attached to this element will be executed. Default True.
- Returns:
A tuple where the first element is a list of elements found (can be None if no results were found), and the last index investigated.
- Return type:
-
-
class
chemdataextractor.parse.elements.
Any
[source]¶ Bases:
chemdataextractor.parse.elements.BaseParserElement
Always match a single token.
-
class
chemdataextractor.parse.elements.
Word
(match)[source]¶ Bases:
chemdataextractor.parse.elements.BaseParserElement
Match token text exactly. Case-sensitive.
-
class
chemdataextractor.parse.elements.
Tag
(match, tag_type=None)[source]¶ Bases:
chemdataextractor.parse.elements.BaseParserElement
Match tag exactly.
-
class
chemdataextractor.parse.elements.
IWord
(match)[source]¶ Bases:
chemdataextractor.parse.elements.Word
Case-insensitive match token text.
-
class
chemdataextractor.parse.elements.
Regex
(pattern, flags=0, group=None)[source]¶ Bases:
chemdataextractor.parse.elements.BaseParserElement
Match token text with regular expression.
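The single-token elements above differ only in which part of a (text, tag) token they test and how. A minimal sketch with plain predicate functions follows. The lowercase names are hypothetical stand-ins, and the use of re.search for the regular expression case is an assumption of this sketch rather than something stated above.

```python
import re


def word(match):
    """Exact, case-sensitive comparison against the token text."""
    return lambda token: token[0] == match


def iword(match):
    """Case-insensitive comparison against the token text."""
    return lambda token: token[0].lower() == match.lower()


def tag(match):
    """Exact comparison against the token's tag."""
    return lambda token: token[1] == match


def regex(pattern, flags=0):
    """Regular expression test on the token text (re.search is assumed)."""
    compiled = re.compile(pattern, flags)
    return lambda token: compiled.search(token[0]) is not None


token = ("Tc", "NN")
print(word("tc")(token))   # False
print(iword("tc")(token))  # True
print(tag("NN")(token))    # True
print(regex(r"^T[cCN]$")(token))  # True
```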
-
class
chemdataextractor.parse.elements.
Start
[source]¶ Bases:
chemdataextractor.parse.elements.BaseParserElement
Match at start of tokens.
-
class
chemdataextractor.parse.elements.
End
[source]¶ Bases:
chemdataextractor.parse.elements.BaseParserElement
Match at end of tokens.
-
class
chemdataextractor.parse.elements.
ParseExpression
(exprs)[source]¶ Bases:
chemdataextractor.parse.elements.BaseParserElement
Abstract class for combining and post-processing parsed tokens.
-
class
chemdataextractor.parse.elements.
And
(exprs)[source]¶ Bases:
chemdataextractor.parse.elements.ParseExpression
Match all in the given order. Can probably be replaced by the plus operator ‘+’?
-
class
chemdataextractor.parse.elements.
Or
(exprs)[source]¶ Bases:
chemdataextractor.parse.elements.ParseExpression
Match the longest. Can probably be replaced by the pipe operator ‘|’.
-
class
chemdataextractor.parse.elements.
Every
(exprs)[source]¶ Bases:
chemdataextractor.parse.elements.ParseExpression
Match all of the contained parse expressions, and return the longest.
-
class
chemdataextractor.parse.elements.
First
(exprs)[source]¶ Bases:
chemdataextractor.parse.elements.ParseExpression
Match the first.
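The difference between matching the longest alternative and matching the first one shows up when one alternative is a prefix of another. The sketch below demonstrates this with plain functions that take (tokens, i) and return (found, next_index) or None; literal, first and longest are hypothetical names, not the library's classes.

```python
def literal(*words):
    """Match the given words in sequence, starting at index i."""
    expected = list(words)

    def match(tokens, i):
        if tokens[i:i + len(expected)] == expected:
            return expected, i + len(expected)
        return None

    return match


def first(*exprs):
    """Return the result of the first alternative that matches."""

    def match(tokens, i):
        for expr in exprs:
            result = expr(tokens, i)
            if result is not None:
                return result
        return None

    return match


def longest(*exprs):
    """Try every alternative and return the one that consumes most tokens."""

    def match(tokens, i):
        results = [r for r in (e(tokens, i) for e in exprs) if r is not None]
        return max(results, key=lambda r: r[1]) if results else None

    return match


tokens = ["Curie", "temperature", "of"]
short, long_ = literal("Curie"), literal("Curie", "temperature")
print(first(short, long_)(tokens, 0))    # (['Curie'], 1)
print(longest(short, long_)(tokens, 0))  # (['Curie', 'temperature'], 2)
```

With first-match semantics the order of alternatives matters, so longer alternatives are usually listed before their prefixes.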
-
class
chemdataextractor.parse.elements.
ParseElementEnhance
(expr)[source]¶ Bases:
chemdataextractor.parse.elements.BaseParserElement
Abstract class for combining and post-processing parsed tokens.
-
class
chemdataextractor.parse.elements.
FollowedBy
(expr)[source]¶ Bases:
chemdataextractor.parse.elements.ParseElementEnhance
Check ahead if matches.
Example:
Tn + FollowedBy('Neel temperature')
Tn will match only if followed by 'Neel temperature', but 'Neel temperature' will not be part of the output/tree.
-
class
chemdataextractor.parse.elements.
Not
(expr)[source]¶ Bases:
chemdataextractor.parse.elements.ParseElementEnhance
Check ahead to disallow a match with the given parse expression.
Example:
Tn + Not('some_string')
Tn will match if not followed by 'some_string'.
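Both lookahead elements test what comes next without consuming it, which is why the looked-ahead text never appears in the output. A minimal sketch with plain functions over token strings; word, seq, followed_by and not_ are hypothetical stand-ins for the elements and the + operator.

```python
def word(w):
    """Match a single token equal to w."""

    def match(tokens, i):
        if i < len(tokens) and tokens[i] == w:
            return [w], i + 1
        return None

    return match


def seq(*exprs):
    """Match all expressions in order, concatenating what they find."""

    def match(tokens, i):
        found = []
        for expr in exprs:
            result = expr(tokens, i)
            if result is None:
                return None
            part, i = result
            found.extend(part)
        return found, i

    return match


def followed_by(expr):
    """Succeed, consuming nothing, only if expr matches at i."""
    return lambda tokens, i: ([], i) if expr(tokens, i) is not None else None


def not_(expr):
    """Succeed, consuming nothing, only if expr does NOT match at i."""
    return lambda tokens, i: ([], i) if expr(tokens, i) is None else None


rule = seq(word("Tn"), followed_by(word("Neel")))
print(rule(["Tn", "Neel"], 0))  # (['Tn'], 1)
print(rule(["Tn", "of"], 0))    # None
```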
-
class
chemdataextractor.parse.elements.
ZeroOrMore
(expr)[source]¶ Bases:
chemdataextractor.parse.elements.ParseElementEnhance
Optional repetition of zero or more of the given expression.
-
class
chemdataextractor.parse.elements.
OneOrMore
(expr)[source]¶ Bases:
chemdataextractor.parse.elements.ParseElementEnhance
Repetition of one or more of the given expression.
-
class
chemdataextractor.parse.elements.
Optional
(expr)[source]¶ Bases:
chemdataextractor.parse.elements.ParseElementEnhance
Can be present but doesn’t need to be. If present, will be added to the result/tree.
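The three repetition elements can be sketched in a few lines each once a matcher is modelled as a function from (tokens, i) to (found, next_index) or None. The names below are hypothetical and the code is an illustration of the semantics described above, not the library's implementation.

```python
def word(w):
    """Match a single token equal to w."""

    def match(tokens, i):
        if i < len(tokens) and tokens[i] == w:
            return [w], i + 1
        return None

    return match


def zero_or_more(expr):
    """Repeat expr as often as it matches; never fails."""

    def match(tokens, i):
        found = []
        while True:
            result = expr(tokens, i)
            # Stop on failure, or if expr matched without consuming anything.
            if result is None or result[1] == i:
                return found, i
            found.extend(result[0])
            i = result[1]

    return match


def one_or_more(expr):
    """Like zero_or_more, but fails unless expr matches at least once."""
    rest = zero_or_more(expr)

    def match(tokens, i):
        result = expr(tokens, i)
        if result is None:
            return None
        more, end = rest(tokens, result[1])
        return result[0] + more, end

    return match


def optional(expr):
    """Match expr if present; otherwise succeed with an empty result."""

    def match(tokens, i):
        result = expr(tokens, i)
        return result if result is not None else ([], i)

    return match


tokens = ["very", "very", "cold"]
print(zero_or_more(word("very"))(tokens, 0))  # (['very', 'very'], 2)
print(one_or_more(word("hot"))(tokens, 0))    # None
print(optional(word("hot"))(tokens, 0))       # ([], 0)
```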
-
class
chemdataextractor.parse.elements.
Group
(expr)[source]¶ Bases:
chemdataextractor.parse.elements.ParseElementEnhance
For nested tags; will group argument and give it a label, preserving the original sub-tags. Otherwise, the default behaviour would be to rename the outermost tag in the argument. Usage: Group(some_text)(‘new_tag’), where ‘some_text’ is a previously tagged expression.
-
class
chemdataextractor.parse.elements.
SkipTo
(expr, include=False)[source]¶ Bases:
chemdataextractor.parse.elements.ParseElementEnhance
Skips to the next occurrence of expression. Does not add the next occurrence of expression to the parse tree. For example:
entities + SkipTo(entities)
will output
entities
only once. Whereas:
entities + SkipTo(entities) + entities
will output
entities
as well as the second occurrence of entities
after an arbitrary number of tokens in between.
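The skipping behaviour can be sketched as a scan that advances the position until the target expression would match, then stops just before it without adding anything to the result. The code below covers only the default case and leaves out the include parameter; word, seq and skip_to are hypothetical stand-ins.

```python
def word(w):
    """Match a single token equal to w."""

    def match(tokens, i):
        if i < len(tokens) and tokens[i] == w:
            return [w], i + 1
        return None

    return match


def seq(*exprs):
    """Match all expressions in order, concatenating what they find."""

    def match(tokens, i):
        found = []
        for expr in exprs:
            result = expr(tokens, i)
            if result is None:
                return None
            part, i = result
            found.extend(part)
        return found, i

    return match


def skip_to(expr):
    """Advance to the next position where expr matches; add nothing."""

    def match(tokens, i):
        for j in range(i, len(tokens)):
            if expr(tokens, j) is not None:
                return [], j
        return None

    return match


tokens = ["Tc", "is", "about", "300"]
print(seq(word("Tc"), skip_to(word("300")))(tokens, 0))
# (['Tc'], 3)
print(seq(word("Tc"), skip_to(word("300")), word("300"))(tokens, 0))
# (['Tc', '300'], 4)
```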
-
class
chemdataextractor.parse.elements.
Hide
(expr)[source]¶ Bases:
chemdataextractor.parse.elements.ParseElementEnhance
Converter for ignoring the results of a parsed expression. It wouldn’t appear in the generated xml element tree, but it would still be part of the rule.
-
chemdataextractor.parse.elements.
W
¶
-
chemdataextractor.parse.elements.
I
¶
-
chemdataextractor.parse.elements.
R
¶
-
chemdataextractor.parse.elements.
T
¶ alias of
chemdataextractor.parse.elements.Tag
-
chemdataextractor.parse.elements.
H
¶
.parse.ir¶
IR spectrum text parser.
.parse.mp¶
Melting point text parser.
.parse.nmr¶
NMR text parser.
-
class
chemdataextractor.parse.nmr.
NmrParser
[source]¶ Bases:
chemdataextractor.parse.base.BaseParser
-
root
= <chemdataextractor.parse.elements.And object>¶
-
parse_full_sentence
= True¶
-
.parse.template¶
Basic property parser template for Quantity Models
-
class
chemdataextractor.parse.template.
QuantityModelTemplateParser
[source]¶ Bases:
chemdataextractor.parse.auto.BaseAutoParser
,chemdataextractor.parse.base.BaseSentenceParser
Template parser for QuantityModel-type structures
Finds Cem, Specifier, Value and Units from single sentences
Other entities are merged contextually
-
specifier_phrase
¶ The model specifier
-
value_phrase
¶ Value and units
-
cem_phrase
¶ CEM phrases
-
prefix
¶ Specifier prefix phrase e.g. Tc equal to
-
specifier_and_value
¶ Specifier and value + units
-
cem_before_specifier_and_value_phrase
¶ Phrases ordered CEM, Specifier, Value, Unit
-
specifier_before_cem_and_value_phrase
¶
-
cem_after_specifier_and_value_phrase
¶ Phrases ordered specifier, value, unit, CEM
-
value_specifier_cem_phrase
¶ Phrases ordered value unit specifier cem
-
root
¶ Root Phrases
-
-
class
chemdataextractor.parse.template.
MultiQuantityModelTemplateParser
[source]¶ Bases:
chemdataextractor.parse.auto.BaseAutoParser
,chemdataextractor.parse.base.BaseSentenceParser
Template for parsing sentences that contain nested or chained entities
MULTIPLE ENTITY PHRASES
Single compound, multiple specifiers, multiple phase transitions e.g. BiFeO3 has TC = 1093 K and TN = 640 K
single compound, single specifier, multiple transitions e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K
multiple compounds, single specifier, multiple transitions e.g. TC in BiFeO3 and LaFeO3 of 640 and 750 K
multiple compounds, single specifier, single transition e.g. TC of 640 K in BiFeO3, LaFeO3 and MnO
multiple compounds, multiple specifiers, multiple transitions e.g. BiFeO3 and LaFeO3 have Tc = 640 K and TN = 750 K respectively
-
parse_full_sentence
= True¶
-
specifier_phrase
¶ Specifier Phrase
-
prefix_only
¶ prefix
-
prefix
¶ specifier and prefix
-
single_cem
¶ Any cem
-
unit
¶ Unit element
-
value_with_optional_unit
¶ Value possibly followed by a unit
-
value_phrase
¶ Value with unit
-
list_of_values
¶ List of values with either multiple units or one at the end
-
list_of_cems
¶ List of cems e.g. cem1, cem2, cem3 and cem4
-
single_specifier_and_value_with_optional_unit
¶ Specifier plus value and possible unit
-
single_specifier_and_value
¶ Specifier value and unit
-
list_of_properties
¶ List of specifiers and units
-
multi_entity_phrase_1
¶ Single compound, multiple specifiers, values e.g. BiFeO3 has TC1 = 1093 K and Tc2 = 640 K
-
multi_entity_phrase_2
¶ single compound, single specifier, multiple transitions e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K
-
multi_entity_phrase_3a
¶ multiple compounds, single specifier, multiple transitions cems first e.g. In BiFeO3 and LaFeO3 Tc are found to be 640 and 750 K
-
multi_entity_phrase_3b
¶ multiple compounds, single specifier, multiple transitions cems last e.g. Tc = 750 and 640 K in LaFeO3 and BiFeO3, respectively
-
multi_entity_phrase_3c
¶ multiple compounds, single specifier, multiple transitions cems first e.g. Tc of BiFeO3 and LaFeO3 are found to be 640 and 750 K
-
multi_entity_phrase_3
¶ Combined phrases of type 3
-
multi_entity_phrase_4a
¶ multiple compounds, single specifier, single transition e.g. TC of 640 K in BiFeO3, LaFeO3 and MnO
-
multi_entity_phrase_4b
¶ Cems first
-
multi_entity_phrase_4
¶
-
multi_entity_phrase_5
¶ multiple compounds, single specifier, multiple transitions cems last e.g. curie temperatures from 100 K in MnO to 300 K in NiO
-
root
¶
-
interpret_multi_entity_1
(result, start, end)[source]¶ Interpret phrases that have a single CEM and multiple values with multiple specifiers
-
interpret_multi_entity_2
(result, start, end)[source]¶ single compound, single specifier, multiple transitions e.g. BiFeO3 shows magnetic transitions at 1093 and 640 K
-
interpret_multi_entity_3
(result, start, end)[source]¶ interpret multiple compounds, single specifier, multiple transitions
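For the multi-compound phrase types, interpretation comes down to deciding which value belongs to which compound. One plausible pairing rule, shown here purely as an illustration, is to pair lists of equal length position by position ("respectively") and to share a single value across all compounds. The function pair_respectively and the dictionary keys are hypothetical; how the library's interpret methods actually resolve these cases is not described above.

```python
def pair_respectively(specifier, compounds, values, units):
    """Pair compounds with values in order; a lone value is shared by all."""
    if len(values) == 1:
        values = values * len(compounds)
    if len(compounds) != len(values):
        raise ValueError(
            "cannot pair %d compounds with %d values"
            % (len(compounds), len(values))
        )
    return [
        {"compound": c, "specifier": specifier, "value": v, "units": units}
        for c, v in zip(compounds, values)
    ]


# "Tc of BiFeO3 and LaFeO3 are found to be 640 and 750 K"
for record in pair_respectively("Tc", ["BiFeO3", "LaFeO3"], [640.0, 750.0], "K"):
    print(record["compound"], record["value"], record["units"])
# BiFeO3 640.0 K
# LaFeO3 750.0 K
```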
.parse.tg¶
Glass transition temperature parser.
.parse.uvvis¶
UV-vis text parser.