Automated parsing for tables with TableDataExtractor

First, we look at the table we want to parse. A table can be passed into the ChemDataExtractor (CDE) framework manually, or it is processed automatically when a document is passed into CDE. More information about TableDataExtractor (TDE) can be found in the TDE documentation.

At the moment, no records are found because we haven’t defined a model yet.

In [1]:
from chemdataextractor.doc.table import Table
from chemdataextractor.doc import Caption

path = "./example_tables/table_example_tkt_2.csv"
table = Table(caption=Caption(""), table_data=path)

print(table.tde_table)
table.records
+--------+--------------------------+------------------------------+
|  Data  |      Row Categories      |      Column Categories       |
+--------+--------------------------+------------------------------+
|  1100  | ['Inorganic', 'BiFeO3']  |   ['Temperatures', 'Tc/K']   |
|  643   | ['Inorganic', 'BiFeO3']  |   ['Temperatures', 'Tn/K']   |
|        | ['Inorganic', 'BiFeO3']  | ['Magnetic moment', 'B [T]'] |
|  257   | ['Inorganic', ' LaCrO3'] |   ['Temperatures', 'Tc/K']   |
|  150   | ['Inorganic', ' LaCrO3'] |   ['Temperatures', 'Tn/K']   |
| 0.1 mT | ['Inorganic', ' LaCrO3'] | ['Magnetic moment', 'B [T]'] |
|        |  ['Organic', 'LaCrO2']   |   ['Temperatures', 'Tc/K']   |
|   10   |  ['Organic', 'LaCrO2']   |   ['Temperatures', 'Tn/K']   |
|  500   |  ['Organic', 'LaCrO2']   | ['Magnetic moment', 'B [T]'] |
|        |   ['Inorganic', 'Gd']    |   ['Temperatures', 'Tc/K']   |
|  294   |   ['Inorganic', 'Gd']    |   ['Temperatures', 'Tn/K']   |
| 659 T  |   ['Inorganic', 'Gd']    | ['Magnetic moment', 'B [T]'] |
+--------+--------------------------+------------------------------+
Out[1]:
[]
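
To make the category table above concrete, the following sketch pairs each data cell with its row and column header path. It is plain Python and not TDE's algorithm: the grid layout, `N_HEADER_ROWS`, `N_STUB_COLS`, and `category_table` are assumptions for illustration, reconstructed from the printed output, and the header sizes are hard-coded rather than detected.

```python
# Hypothetical grid layout reconstructed from the category table above;
# the real CSV file may be arranged differently.
grid = [
    ["", "", "Temperatures", "Temperatures", "Magnetic moment"],
    ["", "", "Tc/K", "Tn/K", "B [T]"],
    ["Inorganic", "BiFeO3", "1100", "643", ""],
    ["Inorganic", "LaCrO3", "257", "150", "0.1 mT"],
    ["Organic", "LaCrO2", "", "10", "500"],
    ["Inorganic", "Gd", "", "294", "659 T"],
]

N_HEADER_ROWS = 2  # assumed: two rows of column headers
N_STUB_COLS = 2    # assumed: two columns of row headers


def category_table(grid, n_header_rows, n_stub_cols):
    """Pair every data cell with its row and column header path."""
    entries = []
    for row in grid[n_header_rows:]:
        row_categories = row[:n_stub_cols]
        for col in range(n_stub_cols, len(row)):
            # Collect the header found in this column on every header row.
            col_categories = [header[col] for header in grid[:n_header_rows]]
            entries.append((row[col], row_categories, col_categories))
    return entries


for data, row_cats, col_cats in category_table(grid, N_HEADER_ROWS, N_STUB_COLS):
    print(repr(data), row_cats, col_cats)
```

Each printed line corresponds to one row of the category table: the data cell first, then its row categories, then its column categories.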

Model Creation

We want to retrieve the Curie temperatures, Tc, from the table. To define a suitable model, we can import one of the base model types. In our case, TemperatureModel is the right choice, since it automatically assumes units of temperature. Alternatively, the more general BaseModel can be used for any kind of property. We can also import parsing elements from CDE, such as I, W, R, and Optional, along with any other elements we need to build parse expressions.

A specifier is the only mandatory element for the new model. We also want to add a compound (reserved name) that is a model of the type Compound.

In [2]:
from chemdataextractor.model.units.temperature import TemperatureModel
from chemdataextractor.parse.elements import I
from chemdataextractor.model.model import Compound
from chemdataextractor.model.base import ModelType, StringType

class CurieTemperature(TemperatureModel):
    specifier = StringType(parse_expression=I('TC'), required=True, contextual=True, updatable=True)
    compound = ModelType(Compound, required=True, contextual=True)

We then parse the table by assigning the model to the table:

In [3]:
table.models = [CurieTemperature]
for record in table.records:
    print(record.serialize())
{'CurieTemperature': {'raw_value': '1100', 'raw_units': 'K', 'value': [1100.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['BiFeO3']}}}}
{'CurieTemperature': {'raw_value': '257', 'raw_units': 'K', 'value': [257.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['LaCrO3']}}}}
{'CurieTemperature': {'raw_value': '10', 'raw_units': 'K', 'value': [10.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['LaCrO2']}}}}
{'CurieTemperature': {'raw_value': '294', 'raw_units': 'K', 'value': [294.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['Gd']}}}}
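
The serialized records are nested dictionaries, so they can be flattened for further processing. The sketch below works on two records copied from the output above; the `flatten` helper and the chosen output keys are hypothetical and not part of CDE.

```python
# Two serialized records copied verbatim from the output above.
serialized = [
    {'CurieTemperature': {'raw_value': '1100', 'raw_units': 'K', 'value': [1100.0],
                          'units': 'Kelvin^(1.0)', 'specifier': 'Tc',
                          'compound': {'Compound': {'names': ['BiFeO3']}}}},
    {'CurieTemperature': {'raw_value': '257', 'raw_units': 'K', 'value': [257.0],
                          'units': 'Kelvin^(1.0)', 'specifier': 'Tc',
                          'compound': {'Compound': {'names': ['LaCrO3']}}}},
]


def flatten(record):
    """Reduce one serialized record to a flat dict (hypothetical helper)."""
    # Each serialized record is a dict with a single key, the model name.
    model_name, fields = next(iter(record.items()))
    names = fields["compound"]["Compound"]["names"]
    return {
        "model": model_name,
        "compound": names[0],        # first listed compound name
        "value": fields["value"][0],  # first parsed numeric value
        "units": fields["units"],
    }


rows = [flatten(r) for r in serialized]
for row in rows:
    print(row)
```

In a live session, the same helper could be applied to `record.serialize()` for each record in `table.records`.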

Advanced Features

We can add custom fields to the model, which will be parsed automatically. To do so, we specify the data type of each field (StringType, FloatType, …) and provide a parse expression composed of parse elements, as for all other parse expressions in ChemDataExtractor.

These fields can be made required (required=True) if needed; by default they are optional (required=False).

In [4]:
class CurieTemperature(TemperatureModel):
    specifier = StringType(parse_expression=I('TC'), required=True, contextual=True, updatable=True)
    compound = ModelType(Compound, required=True, contextual=True)
    label = StringType(parse_expression=I('inorganic'))

table.models = [CurieTemperature]
for record in table.records:
    print(record.serialize())
{'CurieTemperature': {'raw_value': '1100', 'raw_units': 'K', 'value': [1100.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['BiFeO3']}}, 'label': 'Inorganic'}}
{'CurieTemperature': {'raw_value': '257', 'raw_units': 'K', 'value': [257.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['LaCrO3']}}, 'label': 'Inorganic'}}
{'CurieTemperature': {'raw_value': '10', 'raw_units': 'K', 'value': [10.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['LaCrO2']}}}}
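
When a required field cannot be found, no record is returned. In the next cell, label is required, but its parse expression matches nothing in the table, so no records are printed:

In [6]: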
class CurieTemperature(TemperatureModel):
    specifier = StringType(parse_expression=I('TC'), required=True, contextual=True, updatable=True)
    compound = ModelType(Compound, required=True, contextual=True)
    label = StringType(parse_expression=I('something else'), required=True)

table.models = [CurieTemperature]
for record in table.records:
    print(record.serialize())
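
The difference between an optional and a required field can be mimicked in plain Python. This is an analogy under the assumption that a candidate record is kept only when all of its required fields have a value; it is not CDE's implementation, and `keep_record` and the candidate dictionaries are hypothetical.

```python
def keep_record(fields, required):
    """Return True if every required field has a value (assumed rule)."""
    return all(fields.get(name) is not None for name in required)


# Simplified stand-ins for two candidate records: one with a label, one without.
candidates = [
    {"specifier": "Tc", "compound": "BiFeO3", "raw_value": "1100", "label": "Inorganic"},
    {"specifier": "Tc", "compound": "LaCrO2", "raw_value": "10", "label": None},
]

# Optional label: both candidates survive.
optional_label = [c for c in candidates if keep_record(c, ["specifier", "compound"])]

# Required label: the candidate without a label is dropped.
required_label = [c for c in candidates if keep_record(c, ["specifier", "compound", "label"])]

print(len(optional_label))
print(len(required_label))
```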