Automated parsing for tables with TableDataExtractor
First, we will check out a particular table we want to parse. The table can be passed into the ChemDataExtractor (CDE) framework manually, or it will be processed automatically when a document is passed into CDE. More information about TableDataExtractor can be found in the TDE documentation.
At the moment, no records will be found since we haven’t defined a model yet.
In [1]:
from chemdataextractor.doc.table import Table
from chemdataextractor.doc import Caption
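# Path to the example CSV file that holds the table
path = "./example_tables/table_example_tkt_2.csv"
# Build the table with an empty caption
table = Table(caption=Caption(""), table_data=path)
print(table.tde_table)
# No model has been set yet, so this is an empty list
table.records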
+--------+--------------------------+------------------------------+
| Data   | Row Categories           | Column Categories            |
+--------+--------------------------+------------------------------+
| 1100   | ['Inorganic', 'BiFeO3']  | ['Temperatures', 'Tc/K']     |
| 643    | ['Inorganic', 'BiFeO3']  | ['Temperatures', 'Tn/K']     |
|        | ['Inorganic', 'BiFeO3']  | ['Magnetic moment', 'B [T]'] |
| 257    | ['Inorganic', ' LaCrO3'] | ['Temperatures', 'Tc/K']     |
| 150    | ['Inorganic', ' LaCrO3'] | ['Temperatures', 'Tn/K']     |
| 0.1 mT | ['Inorganic', ' LaCrO3'] | ['Magnetic moment', 'B [T]'] |
|        | ['Organic', 'LaCrO2']    | ['Temperatures', 'Tc/K']     |
| 10     | ['Organic', 'LaCrO2']    | ['Temperatures', 'Tn/K']     |
| 500    | ['Organic', 'LaCrO2']    | ['Magnetic moment', 'B [T]'] |
|        | ['Inorganic', 'Gd']      | ['Temperatures', 'Tc/K']     |
| 294    | ['Inorganic', 'Gd']      | ['Temperatures', 'Tn/K']     |
| 659 T  | ['Inorganic', 'Gd']      | ['Magnetic moment', 'B [T]'] |
+--------+--------------------------+------------------------------+
Out[1]:
[]
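To make the category table above easier to reason about, here is a minimal plain-Python sketch that mirrors a few of its rows as (data, row categories, column categories) tuples and selects the non-empty cells filed under the 'Tc/K' column. It does not use TableDataExtractor; the CELLS list and the cells_for_column helper are hypothetical names introduced only for illustration.

```python
# Illustrative stand-in for the category table printed above.
# Each entry is (data, row_categories, column_categories), copied from the output.
CELLS = [
    ("1100", ["Inorganic", "BiFeO3"], ["Temperatures", "Tc/K"]),
    ("643", ["Inorganic", "BiFeO3"], ["Temperatures", "Tn/K"]),
    ("", ["Inorganic", "BiFeO3"], ["Magnetic moment", "B [T]"]),
    ("257", ["Inorganic", " LaCrO3"], ["Temperatures", "Tc/K"]),
    ("150", ["Inorganic", " LaCrO3"], ["Temperatures", "Tn/K"]),
]


def cells_for_column(cells, header):
    """Return (data, row_categories) for non-empty cells under `header`."""
    return [
        (data, [cat.strip() for cat in rows])  # strip stray spaces such as ' LaCrO3'
        for data, rows, cols in cells
        if header in cols and data.strip()
    ]


tc_cells = cells_for_column(CELLS, "Tc/K")
print(tc_cells)
```

Each data cell carries its full row and column context, which is what allows a model to pick out a value together with the compound it belongs to.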
Model Creation
We want to retrieve the Curie temperatures, Tc, from the table. To define a suitable model, we can build on one of the base model types. In our case, TemperatureModel is the right choice, since it assumes units of temperature automatically. Alternatively, BaseModel can be used for anything. We can also import parsing elements from CDE, such as I, W, R, Optional, and any other elements we need to create parse expressions.
A specifier is the only mandatory element for the new model. We also want to add a compound (a reserved name), which is a model of the type Compound.
In [2]:
from chemdataextractor.model.units.temperature import TemperatureModel
from chemdataextractor.parse.elements import I
from chemdataextractor.model.model import Compound
from chemdataextractor.model.base import ListType, ModelType, StringType

class CurieTemperature(TemperatureModel):
    # I('TC') matches the token case-insensitively, so a header written as 'Tc' qualifies
    specifier = StringType(parse_expression=I('TC'), required=True, contextual=True, updatable=True)
    # 'compound' is a reserved name that links the value to a Compound model
    compound = ModelType(Compound, required=True, contextual=True)
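The specifier above relies on I('TC'), which matches a token regardless of case, so a header written as Tc still qualifies. The plain-Python sketch below approximates that idea. The splitting on non-alphanumeric characters and the find_specifier helper are assumptions made for illustration; this is not how ChemDataExtractor tokenizes or parses internally.

```python
import re


def find_specifier(header, specifier):
    """Return the first token of `header` equal to `specifier`, ignoring case."""
    # Assumption: split on anything that is not a letter or digit,
    # so 'Tc/K' becomes ['Tc', 'K'].
    tokens = [tok for tok in re.split(r"[^A-Za-z0-9]+", header) if tok]
    for tok in tokens:
        if tok.lower() == specifier.lower():
            return tok
    return None


print(find_specifier("Tc/K", "TC"))  # the token 'Tc' matches
print(find_specifier("Tn/K", "TC"))  # no token matches
```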
We then parse the table by setting its models:
In [3]:
table.models = [CurieTemperature]
for record in table.records:
    print(record.serialize())
{'CurieTemperature': {'raw_value': '1100', 'raw_units': 'K', 'value': [1100.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['BiFeO3']}}}}
{'CurieTemperature': {'raw_value': '257', 'raw_units': 'K', 'value': [257.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['LaCrO3']}}}}
{'CurieTemperature': {'raw_value': '10', 'raw_units': 'K', 'value': [10.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['LaCrO2']}}}}
{'CurieTemperature': {'raw_value': '294', 'raw_units': 'K', 'value': [294.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['Gd']}}}}
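The serialized records are nested dictionaries, so they are straightforward to post-process. As a minimal sketch, the snippet below copies two of the records printed above and flattens each into a (compound, value, units) tuple. The flatten helper is a hypothetical name and not part of ChemDataExtractor.

```python
# Two of the serialized records shown above, copied verbatim.
records = [
    {'CurieTemperature': {'raw_value': '1100', 'raw_units': 'K', 'value': [1100.0],
                          'units': 'Kelvin^(1.0)', 'specifier': 'Tc',
                          'compound': {'Compound': {'names': ['BiFeO3']}}}},
    {'CurieTemperature': {'raw_value': '257', 'raw_units': 'K', 'value': [257.0],
                          'units': 'Kelvin^(1.0)', 'specifier': 'Tc',
                          'compound': {'Compound': {'names': ['LaCrO3']}}}},
]


def flatten(record, model_name="CurieTemperature"):
    """Return (compound name, value, units) from one serialized record."""
    body = record[model_name]
    name = body["compound"]["Compound"]["names"][0]
    return name, body["value"][0], body["units"]


rows = [flatten(r) for r in records]
for row in rows:
    print(row)
```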
Advanced Features
We can add custom fields to the model that will be parsed automatically. For that, we have to specify the data model of each field (StringType, FloatType, …) and provide a parse expression composed of parse elements, like all other parse expressions in ChemDataExtractor.
These fields can be made required (required = True) if needed, or optional (required = False, the default).
In [4]:
class CurieTemperature(TemperatureModel):
    specifier = StringType(parse_expression=I('TC'), required=True, contextual=True, updatable=True)
    compound = ModelType(Compound, required=True, contextual=True)
    # Optional custom field: filled in when 'inorganic' is matched (case-insensitive)
    label = StringType(parse_expression=I('inorganic'))

table.models = [CurieTemperature]
for record in table.records:
    print(record.serialize())
{'CurieTemperature': {'raw_value': '1100', 'raw_units': 'K', 'value': [1100.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['BiFeO3']}}, 'label': 'Inorganic'}}
{'CurieTemperature': {'raw_value': '257', 'raw_units': 'K', 'value': [257.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['LaCrO3']}}, 'label': 'Inorganic'}}
{'CurieTemperature': {'raw_value': '10', 'raw_units': 'K', 'value': [10.0], 'units': 'Kelvin^(1.0)', 'specifier': 'Tc', 'compound': {'Compound': {'names': ['LaCrO2']}}}}
In [6]:
class CurieTemperature(TemperatureModel):
    specifier = StringType(parse_expression=I('TC'), required=True, contextual=True, updatable=True)
    compound = ModelType(Compound, required=True, contextual=True)
    # Required custom field: 'something else' does not occur in the table
    label = StringType(parse_expression=I('something else'), required=True)

table.models = [CurieTemperature]
for record in table.records:
    print(record.serialize())
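The last cell prints nothing, which is consistent with label being required while its parse expression never matches anything in the table. The plain-Python sketch below illustrates that required versus optional distinction. The build_record helper and the found dictionary are hypothetical, and the sketch does not reproduce ChemDataExtractor's actual record-merging logic.

```python
def build_record(found, required):
    """Return a copy of `found` if every required field is present, else None."""
    missing = [name for name in required if name not in found]
    return None if missing else dict(found)


# Fields hypothetically found for one table cell (no label was matched).
found = {"specifier": "Tc", "compound": "BiFeO3", "raw_value": "1100"}

# Optional label: the record survives without it.
optional_case = build_record(found, required=["specifier", "compound"])
# Required label that was never matched: the record is dropped.
required_case = build_record(found, required=["specifier", "compound", "label"])

print(optional_case)
print(required_case)
```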