Extracting a Custom Property

In [1]:
from chemdataextractor import Document
from chemdataextractor.model import Compound
from chemdataextractor.doc import Paragraph, Heading

Example Document

Let’s create a simple example document with a single heading followed by a single paragraph:

In [2]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)')
)

What does this look like?

In [3]:
d
Out[3]:

Synthesis of 2,4,6-trinitrotoluene (3a)

The procedure was followed to yield a pale yellow solid (b.p. 240 °C)

Default Models

ChemDataExtractor looks for many properties out of the box, but boiling point is not one of them. With the default models, only the compound is extracted:

In [4]:
d.records.serialize()
Out[4]:
[{'Compound': {'names': ['2,4,6-trinitrotoluene'],
   'labels': ['3a'],
   'roles': ['product']}}]

Defining a New Property Model

The first task is to define the schema of the new property. ChemDataExtractor already provides a TemperatureModel, which handles the value and units. All we need to add is a specifier for boiling point, and the automatic parsers defined in ChemDataExtractor should be able to handle the rest.

In [5]:
from chemdataextractor.model.units import TemperatureModel, Temperature, Kelvin
from chemdataextractor.model import ListType, ModelType, StringType, Compound
from chemdataextractor.parse import I, AutoSentenceParser

class BoilingPoint(TemperatureModel):
    specifier = StringType(parse_expression=I('Boiling') + I('Point'))
    compound = ModelType(Compound, required=True, contextual=True)
    parsers = [AutoSentenceParser()]
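
The specifier above is a sequence of two case-insensitive word matches. As a rough illustration of what that implies, the following sketch mimics the matching with plain Python and no ChemDataExtractor imports. The helper name `has_specifier` and the token lists are hypothetical, and the real tokenizer may split text differently.

```python
def has_specifier(tokens, specifier=('boiling', 'point')):
    """Return True if the tokens contain the specifier words in order,
    compared case-insensitively (a stand-in for I('Boiling') + I('Point'))."""
    lowered = [t.lower() for t in tokens]
    n = len(specifier)
    return any(tuple(lowered[i:i + n]) == specifier
               for i in range(len(lowered) - n + 1))

# Hypothetical token lists, written by hand for illustration
print(has_specifier(['Boiling', 'point', ':', '240', '°', 'C']))  # spelled out
print(has_specifier(['(', 'b.p.', '240', '°', 'C', ')']))         # abbreviated
```

Under these assumptions the spelled-out form is recognized and the abbreviation "b.p." is not, which is one reason to supply an additional parser for the example paragraph.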

Writing a New Parser

Sometimes we want to define our own parsers in addition to the built-in ones. Let’s write parsing rules that describe how to interpret the text and convert it into the model:

In [6]:
import re
from chemdataextractor.parse import R, I, W, Optional, merge

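# Specifier: "b.p.", "bp" or "boiling point" (hidden from the parse result)
prefix = (R(r'^b\.?p\.?$', re.I) | I(u'boiling') + I(u'point')).hide()
# Units: a degree sign, optionally followed by C, F or K, merged into one string
units = (W(u'°') + Optional(R(r'^[CFK]\.?$')))(u'units').add_action(merge)
# Value: an integer or decimal number
value = R(r'^\d+(\.\d+)?$')(u'value')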
bp = (prefix + value + units)(u'bp')
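
To see what the individual regular expressions accept, here is a small sketch that applies the same patterns to single tokens using only the standard `re` module. The `classify` helper and the token list are hypothetical; ChemDataExtractor's own tokenizer decides how the sentence is actually split, so treat this as a way to test the patterns rather than a reproduction of the parser.

```python
import re

# The same patterns used in the parsing rules, compiled directly
PREFIX = re.compile(r'^b\.?p\.?$', re.I)
VALUE = re.compile(r'^\d+(\.\d+)?$')
UNIT_LETTER = re.compile(r'^[CFK]\.?$')

def classify(token):
    """Label a single token according to the pattern it matches, if any."""
    if PREFIX.match(token):
        return 'prefix'
    if VALUE.match(token):
        return 'value'
    if token == '°':
        return 'degree'
    if UNIT_LETTER.match(token):
        return 'unit_letter'
    return None

# Hypothetical tokenization of "(b.p. 240 °C)"
tokens = ['(', 'b.p.', '240', '°', 'C', ')']
labels = [classify(t) for t in tokens]
print(list(zip(tokens, labels)))
```

Checking patterns token by token like this can help when a rule silently fails to match, because each element in the grammar is tested against one token at a time.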

In [7]:
from chemdataextractor.parse.base import BaseSentenceParser
from chemdataextractor.utils import first

class BpParser(BaseSentenceParser):
    root = bp

    def interpret(self, result, start, end):
        compound = Compound()
        raw_value = first(result.xpath('./value/text()'))
        raw_units = first(result.xpath('./units/text()'))
        boiling_point = self.model(raw_value=raw_value,
                    raw_units=raw_units,
                    value=self.extract_value(raw_value),
                    error=self.extract_error(raw_value),
                    units=self.extract_units(raw_units, strict=True),
                    compound=compound)
        cem_el = first(result.xpath('./cem'))
        if cem_el is not None:
            boiling_point.compound.names = cem_el.xpath('./name/text()')
            boiling_point.compound.labels = cem_el.xpath('./label/text()')
        # Yield the record even when the match contains no compound element,
        # so that the contextual compound field can be filled in from context.
        yield boiling_point

In [8]:
BoilingPoint.parsers.append(BpParser())

Running the New Parser

In [9]:
d = Document(
    Heading(u'Synthesis of 2,4,6-trinitrotoluene (3a)'),
    Paragraph(u'The procedure was followed to yield a pale yellow solid (b.p. 240 °C)'),
    models=[BoilingPoint]
)

d.records.serialize()
Out[9]:
[{'BoilingPoint': {'raw_value': '240',
   'raw_units': '°C)',
   'value': [240.0],
   'units': 'Celsius^(1.0)',
   'compound': {'Compound': {'names': ['2,4,6-trinitrotoluene'],
     'labels': ['3a'],
     'roles': ['product', 'Synthesis of']}}}}]
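
The serialized records are plain Python lists and dictionaries, so they can be post-processed without any ChemDataExtractor calls. The sketch below assumes the structure shown in the output above, copied in as a literal; the helper name `boiling_points` is hypothetical.

```python
# Literal copy of the serialized output shown above
records = [{'BoilingPoint': {'raw_value': '240',
   'raw_units': '°C)',
   'value': [240.0],
   'units': 'Celsius^(1.0)',
   'compound': {'Compound': {'names': ['2,4,6-trinitrotoluene'],
     'labels': ['3a'],
     'roles': ['product', 'Synthesis of']}}}}]

def boiling_points(serialized):
    """Collect (compound name, value, units) tuples from BoilingPoint records."""
    rows = []
    for record in serialized:
        bp = record.get('BoilingPoint')
        if bp is None:
            continue  # skip records of other model types
        compound = bp.get('compound', {}).get('Compound', {})
        names = compound.get('names') or [None]
        rows.append((names[0], bp['value'][0], bp['units']))
    return rows

print(boiling_points(records))
```

In a live session, the same helper could be applied to the result of `d.records.serialize()` instead of the copied literal.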