Extracting a Custom Property
In [1]:
from chemdataextractor import Document
from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading
Example Document
Let’s create a simple example document with a single heading followed by a single paragraph:
In [2]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)
What does this look like?
In [3]:
d
Out[3]:
Synthesis of 2,4,6-trinitrotoluene (3a)
The procedure was followed to yield a pale yellow solid (b.p. 240 °C)
Default Models
ChemDataExtractor looks for many properties out of the box, but boiling point is not one of them, so the default models return only the compound record:
In [4]:
d.records.serialize()
Out[4]:
[{'Compound': {'names': ['2,4,6-trinitrotoluene'],
'labels': ['3a'],
'roles': ['product']}}]
Defining a New Property Model
The first task is to define the schema of the new property. ChemDataExtractor already provides a TemperatureModel, which handles fields such as value and units. Because that base model covers them, all we need to add is a specifier for boiling point, and the automatic parsers defined in ChemDataExtractor should be able to handle the rest.
In [5]:
from chemdataextractor.model.units import TemperatureModel, Temperature, Kelvin
from chemdataextractor.model import ListType, ModelType, StringType, Compound
from chemdataextractor.parse import I, AutoSentenceParser
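
class BoilingPoint(TemperatureModel):
    # Phrase that signals this property in the text (I matches case-insensitively)
    specifier = StringType(parse_expression=I('Boiling') + I('Point'))
    # Compound the value belongs to; contextual=True lets it be filled in from surrounding context
    compound = ModelType(Compound, required=True, contextual=True)
    parsers = [AutoSentenceParser()]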
Writing a New Parser
Sometimes we want to define our own parsers in addition to the built-in ones. Let’s write parsing rules that describe how to interpret the text and convert it into the model:
In [6]:
import re
from chemdataextractor.parse import R, I, W, Optional, merge
prefix = (R(r'^b\.?p\.?$', re.I) | I(u'boiling') + I(u'point')).hide()
units = (W(u'°') + Optional(R(r'^[CFK]\.?$')))(u'units').add_action(merge)
value = R(r'^\d+(\.\d+)?$')(u'value')
bp = (prefix + value + units)(u'bp')
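To get a feel for which tokens these rules accept, the three regular expressions can be tried on their own with the standard `re` module. This is only a sketch: it assumes each pattern is tested against a single token at a time, the candidate tokens are made up for illustration, and it says nothing about how ChemDataExtractor actually tokenizes a sentence.

```python
import re

# Same patterns as in the rules above, written as raw strings.
PREFIX = re.compile(r'^b\.?p\.?$', re.I)   # "bp", "b.p.", "B.P.", ...
VALUE = re.compile(r'^\d+(\.\d+)?$')       # integer or decimal number
UNIT_LETTER = re.compile(r'^[CFK]\.?$')    # unit letter after the degree sign


def matches(pattern, token):
    """Return True if the anchored pattern matches the whole token."""
    return pattern.search(token) is not None


# Hypothetical candidate tokens, chosen to show accepted and rejected cases.
prefix_hits = [t for t in ['b.p.', 'bp', 'B.P.', 'm.p.', 'bpt'] if matches(PREFIX, t)]
value_hits = [t for t in ['240', '240.5', '240.', '-5', '2e3'] if matches(VALUE, t)]
unit_hits = [t for t in ['C', 'F.', 'K', 'c', 'CK'] if matches(UNIT_LETTER, t)]

print(prefix_hits)
print(value_hits)
print(unit_hits)
```

Note that the value pattern rejects negative numbers and scientific notation, and that the unit letter is case-sensitive while the prefix is not.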
In [7]:
from chemdataextractor.parse.base import BaseSentenceParser
from chemdataextractor.utils import first

class BpParser(BaseSentenceParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound()
        raw_value = first(result.xpath('./value/text()'))
        raw_units = first(result.xpath('./units/text()'))
        boiling_point = self.model(raw_value=raw_value,
                                   raw_units=raw_units,
                                   value=self.extract_value(raw_value),
                                   error=self.extract_error(raw_value),
                                   units=self.extract_units(raw_units, strict=True),
                                   compound=compound)
        # Attach the chemical entity mention, if the match contains one
        cem_el = first(result.xpath('./cem'))
        if cem_el is not None:
            boiling_point.compound.names = cem_el.xpath('./name/text()')
            boiling_point.compound.labels = cem_el.xpath('./label/text()')
        yield boiling_point
In [8]:
BoilingPoint.parsers.append(BpParser())
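The interpret step reads named child elements out of the parse result. The sketch below imitates that lookup with the standard library's `xml.etree.ElementTree` in place of the element type ChemDataExtractor passes in. The tree is a hand-built stand-in, not real parser output, and `first_text` is a hypothetical helper that plays the role of `first(result.xpath(...))`.

```python
import xml.etree.ElementTree as ET

# Hand-built stand-in for a parse result of "b.p. 240 °C" (an assumption,
# not actual ChemDataExtractor output). The hidden prefix leaves no element,
# so only the named value and units children sit under <bp>.
result = ET.fromstring('<bp><value>240</value><units>\u00b0C</units></bp>')


def first_text(element, tag):
    """Return the text of the first child named tag, or None if absent."""
    child = element.find(tag)
    return child.text if child is not None else None


raw_value = first_text(result, 'value')
raw_units = first_text(result, 'units')
# No <cem> child here, so a parser like the one above would leave the
# compound empty and rely on context to fill it in.
has_cem = result.find('cem') is not None

print(raw_value, raw_units, has_cem)
```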
Running the New Parser
In [9]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)'),
    models=[BoilingPoint]
)
d.records.serialize()
Out[9]:
[{'BoilingPoint': {'raw_value': '240',
'raw_units': '°C)',
'value': [240.0],
'units': 'Celsius^(1.0)',
'compound': {'Compound': {'names': ['2,4,6-trinitrotoluene'],
'labels': ['3a'],
'roles': ['product', 'Synthesis of']}}}}]
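The serialized result is made of plain lists and dicts, so it can be post-processed without any ChemDataExtractor imports. As a minimal sketch, the hypothetical helper `boiling_points` below walks the output copied from the run above and pulls out the compound name, value and units of each boiling point record.

```python
# Serialized output copied from the run above (\u00b0 is the degree sign).
records = [{'BoilingPoint': {'raw_value': '240',
                             'raw_units': '\u00b0C)',
                             'value': [240.0],
                             'units': 'Celsius^(1.0)',
                             'compound': {'Compound': {'names': ['2,4,6-trinitrotoluene'],
                                                       'labels': ['3a'],
                                                       'roles': ['product', 'Synthesis of']}}}}]


def boiling_points(serialized):
    """Yield (compound name, value, units) for each BoilingPoint record."""
    for record in serialized:
        bp = record.get('BoilingPoint')
        if bp is None:
            continue  # skip records of other model types
        names = bp.get('compound', {}).get('Compound', {}).get('names', [])
        name = names[0] if names else None
        for value in bp.get('value', []):
            yield name, value, bp.get('units')


rows = list(boiling_points(records))
print(rows)
```

Using `.get` with defaults keeps the helper tolerant of records that lack a compound or a parsed value.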