Snowball Relationship Extraction¶
The new Relex package is a toolkit for performing probabilistic chemical relationship extraction based on semi-supervised online learning. The aim is to train parse expressions probabilistically, removing the need for creating parsers with trial and error.
This overview is based on how to use the code, for a detailed explanation of the algorithm please see the associated paper: https://www.nature.com/articles/sdata2018111
In general, chemical relationships can consist of any number of entities, that is, the elements of a relationship that are linked together to uniquely define it. Here we will focus on a simple Curie Temperature relationship that consists of the following entities: - A compound - A specifier - A value - A unit
Thus this forms a quaternary relationship. Note the algorithm is general and so any number of entities can be specified. You can even make some entities more important than others.
First define a new model, as usual:
In [20]:
from chemdataextractor.relex import Snowball
from chemdataextractor.model import BaseModel, StringType, ListType, ModelType, Compound, TemperatureModel
import re
from chemdataextractor.parse import R, I, W, Optional, merge, join, OneOrMore, Any, ZeroOrMore, Start
from chemdataextractor.doc import Sentence
In [21]:
class CurieTemperature(TemperatureModel):
specifier_expression =((I('Curie') + I('temperature')) | I('Tc')).add_action(join)
specifier = StringType(parse_expression=specifier_expression, required=True, contextual=True, updatable=True)
compound = ModelType(Compound, required=True, contextual=True, updatable=False)
We are now ready to start Snowballing
Training¶
Create a Snowball
object to use on our relationship and point to a
path for training.
Here will we use the default parameters: - Tc = 0.95, the minimum Confidence required for a new relationship to be accepted - Tsim=0.95, The minimum similarity between phrases for them to be clustered together - learning_rate = 0.5, How quickly the system updates the confidences based on new information - prefix_length=1, number of tokens in phrase prefix - suffix_length = 1, number of tokens in phrase suffix - prefix_weight = 0.1, the weight of the prefix in determining similarity - middles_weight = 0.8, the weight of the middles in determining similarity - suffix_weight = 0.1, weight of suffix in determining similarity - max_candidate_combinations = 400, maximum number of candidate,name combinations to retrieve (memory constrained)
Note increasing TC and Tsim yields more extraction patterns but stricter rules on new relations.
Increasing the learning rate influences how much we trust new information compared to our training.
Increasing the prefix/suffix length increases the likelihood of getting overlapping relationships.
The training process in online. This means that the user can train the system on as many papers as they like, and it will continue to update the knowledge base. At each paper, the sentences are scanned for any matches to the parse phrase, and if the sentence matches, candidate relationships are formed. There can be many candidate relationships in a single sentence, so the output provides the user will all available candidates.
The user can specify to accept a relationship by typing in the number (or numbers) of the candidates they wish to accept. I.e. If you want candidate 0 only, type ‘0’ then press enter. If you want 0 and 3 type ‘0,3’ and press enter. If you dont want any, then press any other key. e.g. ‘n’ or ‘no’.
In [3]:
snowball = Snowball(CurieTemperature)
In [4]:
snowball.save_file_name = 'curie_temperatures'
In [5]:
snowball.train(corpus='../tests/data/relex/curie_training_set/')
1/6: c2jm33712f.html
Cobalt is a ferromagnetic transition metal exhibiting a high Curie temperature of 1388 K (ferromagnetic–paramagnetic transition) and a high saturation magnetization (1422 emu cm−3) at room temperature.2 The technological applications of Co nanoparticles in the field of ultrahigh-density data recording and data storage are well documented in the literature.1 Recently, Co has also been used in MRI agents,3 a field which has primarily been dominated by iron oxides (Fe3O4) because of their stability and biocompatibility albeit the oxides show much less saturation magnetization (84 emu cm−3)4 in comparison to cobalt.
Candidate 0 <(Cobalt,compound_names,0,1), (Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14)>
Candidate 1 <(transition metal,compound_names,4,6), (Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14)>
Candidate 2 <(Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14), (Co,compound_names,37,38)>
Candidate 3 <(Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14), (Co,compound_names,59,60)>
Candidate 4 <(Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14), (iron oxides,compound_names,75,77)>
Candidate 5 <(Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14), (oxides,compound_names,76,77)>
Candidate 6 <(Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14), (Fe3O4,compound_names,78,79)>
Candidate 7 <(Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14), (oxides,compound_names,88,89)>
Candidate 8 <(Curie temperature,specifier,9,11), (1388,raw_value,12,13), (K,raw_units,13,14), (cobalt,compound_names,101,102)>
...: 0
2/6: b806499g.html
Europium mono-chalcogenides (EuE, E = O, S, Se, Te) with the rock-salt structure have been investigated at least since the 1960s.8 The europium chalcogenides form an important class of magnetic semiconductors, revealing an assortment of magnetic ordering,8 and have frequently served as model Heisenberg magnets.9,10 Research on the effect of particle size and surface nature on the magnetic properties of the europium chalcogenides has become very appealing; for example, pressure studies have shown that the Curie temperature (TC) is very sensitive to both the Eu–Eu distance and the band gap energy.11 Among the europium mono-chalcogenides, EuS has been especially investigated due to its properties as a ferromagnetic semiconductor with a Curie temperature of 16.8 K, an energy gap of 1.6 eV, and a spin-splitting of the gap of 0.36 eV.
Candidate 0 <(Europium mono-chalcogenides,compound_names,0,2), (Curie temperature,specifier,86,88), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 1 <(Europium mono-chalcogenides,compound_names,0,2), (TC,specifier,89,90), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 2 <(Europium mono-chalcogenides,compound_names,0,2), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 3 <(EuE,compound_names,3,4), (Curie temperature,specifier,86,88), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 4 <(EuE,compound_names,3,4), (TC,specifier,89,90), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 5 <(EuE,compound_names,3,4), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 6 <(O,compound_names,7,8), (Curie temperature,specifier,86,88), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 7 <(O,compound_names,7,8), (TC,specifier,89,90), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 8 <(O,compound_names,7,8), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 9 <(S,compound_names,9,10), (Curie temperature,specifier,86,88), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 10 <(S,compound_names,9,10), (TC,specifier,89,90), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 11 <(S,compound_names,9,10), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 12 <(Se,compound_names,11,12), (Curie temperature,specifier,86,88), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 13 <(Se,compound_names,11,12), (TC,specifier,89,90), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 14 <(Se,compound_names,11,12), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 15 <(Te,compound_names,13,14), (Curie temperature,specifier,86,88), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 16 <(Te,compound_names,13,14), (TC,specifier,89,90), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 17 <(Te,compound_names,13,14), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 18 <(europium chalcogenides,compound_names,30,32), (Curie temperature,specifier,86,88), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 19 <(europium chalcogenides,compound_names,30,32), (TC,specifier,89,90), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 20 <(europium chalcogenides,compound_names,30,32), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 21 <(europium chalcogenides,compound_names,70,72), (Curie temperature,specifier,86,88), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 22 <(europium chalcogenides,compound_names,70,72), (TC,specifier,89,90), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 23 <(europium chalcogenides,compound_names,70,72), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 24 <(Curie temperature,specifier,86,88), (Eu,compound_names,97,98), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 25 <(TC,specifier,89,90), (Eu,compound_names,97,98), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 26 <(Eu,compound_names,97,98), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 27 <(Curie temperature,specifier,86,88), (Eu,compound_names,99,100), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 28 <(TC,specifier,89,90), (Eu,compound_names,99,100), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 29 <(Eu,compound_names,99,100), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 30 <(Curie temperature,specifier,86,88), (europium mono-chalcogenides,compound_names,108,110), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 31 <(TC,specifier,89,90), (europium mono-chalcogenides,compound_names,108,110), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 32 <(europium mono-chalcogenides,compound_names,108,110), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 33 <(Curie temperature,specifier,86,88), (EuS,compound_names,111,112), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 34 <(TC,specifier,89,90), (EuS,compound_names,111,112), (16.8,raw_value,129,130), (K,raw_units,130,131)>
Candidate 35 <(EuS,compound_names,111,112), (Curie temperature,specifier,126,128), (16.8,raw_value,129,130), (K,raw_units,130,131)>
...: 35
The DC susceptibility curves of the EuS nanocrystals show a transition temperature TC at around 15 K (estimated from the inflection point of the curves) and a thermal irreversibility below Tir = 14 K.
Candidate 0 <(EuS,compound_names,6,7), (TC,specifier,12,13), (15,raw_value,15,16), (K,raw_units,16,17)>
Candidate 1 <(EuS,compound_names,6,7), (TC,specifier,12,13), (15,raw_value,15,16), (K,raw_units,35,36)>
Candidate 2 <(EuS,compound_names,6,7), (TC,specifier,12,13), (K,raw_units,16,17), (14,raw_value,34,35)>
Candidate 3 <(EuS,compound_names,6,7), (TC,specifier,12,13), (14,raw_value,34,35), (K,raw_units,35,36)>
...: 0
The field-cooled (FC) curve tends to saturate at low temperatures, as is usually found in bulk and interacting-nanoparticle systems.23 An inspection of the temperature dependence of the inverse of the magnetic susceptibility shows that above TC the system has two Curie–Weiss regimes: one above ∼50 K with an extrapolated paramagnetic Curie temperature θp of 10.8 K, and other below ∼50 K with an extrapolated θp of 15.7 K.
Candidate 0 <(TC,specifier,42,43), (K,raw_units,56,57), (10.8,raw_value,65,66)>
Candidate 1 <(TC,specifier,42,43), (10.8,raw_value,65,66), (K,raw_units,66,67)>
Candidate 2 <(TC,specifier,42,43), (10.8,raw_value,65,66), (K,raw_units,73,74)>
Candidate 3 <(TC,specifier,42,43), (10.8,raw_value,65,66), (K,raw_units,80,81)>
Candidate 4 <(K,raw_units,56,57), (Curie temperature,specifier,61,63), (10.8,raw_value,65,66)>
Candidate 5 <(Curie temperature,specifier,61,63), (10.8,raw_value,65,66), (K,raw_units,66,67)>
Candidate 6 <(Curie temperature,specifier,61,63), (10.8,raw_value,65,66), (K,raw_units,73,74)>
Candidate 7 <(Curie temperature,specifier,61,63), (10.8,raw_value,65,66), (K,raw_units,80,81)>
Candidate 8 <(TC,specifier,42,43), (K,raw_units,56,57), (15.7,raw_value,79,80)>
Candidate 9 <(TC,specifier,42,43), (K,raw_units,66,67), (15.7,raw_value,79,80)>
Candidate 10 <(TC,specifier,42,43), (K,raw_units,73,74), (15.7,raw_value,79,80)>
Candidate 11 <(TC,specifier,42,43), (15.7,raw_value,79,80), (K,raw_units,80,81)>
Candidate 12 <(K,raw_units,56,57), (Curie temperature,specifier,61,63), (15.7,raw_value,79,80)>
Candidate 13 <(Curie temperature,specifier,61,63), (K,raw_units,66,67), (15.7,raw_value,79,80)>
Candidate 14 <(Curie temperature,specifier,61,63), (K,raw_units,73,74), (15.7,raw_value,79,80)>
Candidate 15 <(Curie temperature,specifier,61,63), (15.7,raw_value,79,80), (K,raw_units,80,81)>
...: n
The EuS nanocrystals have low coercive fields HC (at 2 K, HC = 90 Oe) and low remanence Mr, this rapidly approaches zero as temperature increases, being zero above TC (Fig. 7).
Candidate 0 <(EuS,compound_names,1,2), (2,raw_value,10,11), (K,raw_units,11,12), (TC,specifier,34,35)>
...: n
The Arrot plots shown in Fig. 6a, right, and obtained from the M(H) curves, confirm that the Curie temperature is around 15 K.
Candidate 0 <(Curie temperature,specifier,20,22), (15,raw_value,24,25), (K,raw_units,25,26)>
...: n
In what concerns the other range (>40 K), we observe a maximum at 60 K (Fig. 6b), probably related to a EuO-like environment located at the surface of the nanocrystals, since EuO has a Tc at 69 K.8 The AC susceptibility curves show a high temperature region Tir > 11 K for which the in-phase susceptibility χ′ is not frequency dependent and the out-of-phase susceptibility χ″ is close to zero.
Candidate 0 <(K,raw_units,9,10), (60,raw_value,17,18), (EuO,compound_names,28,29), (Tc,specifier,44,45)>
Candidate 1 <(K,raw_units,9,10), (60,raw_value,17,18), (EuO,compound_names,41,42), (Tc,specifier,44,45)>
Candidate 2 <(60,raw_value,17,18), (K,raw_units,18,19), (EuO,compound_names,28,29), (Tc,specifier,44,45)>
Candidate 3 <(60,raw_value,17,18), (K,raw_units,18,19), (EuO,compound_names,41,42), (Tc,specifier,44,45)>
Candidate 4 <(60,raw_value,17,18), (EuO,compound_names,28,29), (Tc,specifier,44,45), (K,raw_units,60,61)>
Candidate 5 <(60,raw_value,17,18), (EuO,compound_names,41,42), (Tc,specifier,44,45), (K,raw_units,60,61)>
...: n
3/6: c6cp00375c.html
The much smaller value of TB compared to the Curie temperature (627 K) of bulk Ni suggests that the thickness of the Ni shell is very small.
Candidate 0 <(Curie temperature,specifier,9,11), (627,raw_value,12,13), (K,raw_units,13,14), (Ni,compound_names,17,18)>
Candidate 1 <(Curie temperature,specifier,9,11), (627,raw_value,12,13), (K,raw_units,13,14), (Ni,compound_names,24,25)>
...: 0
4/6: c1jm13879k.html
CoS2 is ferromagnetic with a Curie temperature of 116 K and Co9S8 is antiferromagnetic with a Néel temperature above the decomposition temperature.28 The magnetic susceptibility of Ni3S2 was found to be temperature-independent, which is consistent with Pauli paramagnetism.
Candidate 0 <(CoS2,compound_names,0,1), (Curie temperature,specifier,5,7), (116,raw_value,8,9), (K,raw_units,9,10)>
Candidate 1 <(Curie temperature,specifier,5,7), (116,raw_value,8,9), (K,raw_units,9,10), (Co9S8,compound_names,11,12)>
Candidate 2 <(Curie temperature,specifier,5,7), (116,raw_value,8,9), (K,raw_units,9,10), (Ni3S2,compound_names,26,27)>
...: 0
5/6: c3nr33950e.html
6/6: c2nr11767c.html
As another notable spintronic material, CrO2 has a high spin polarization (>95%)118 and a Curie temperature of 395 K.
Candidate 0 <(CrO2,compound_names,6,7), (Curie temperature,specifier,19,21), (395,raw_value,22,23), (K,raw_units,23,24)>
...: 0
This training process automatically clusters the sentences you accept and updates the knowlede base. You can check what has been learned by searching in the relex/data folder.
You can always stop training and start again, or come back to the same
training process if you wish, simply load in an existing snowball system
using: Snowball.load()
Finding New Relationships¶
We can now use the Snowball object just like any other parser. Simply add the trained Snowball parser to your models.
In [22]:
CurieTemperature.parsers += [snowball]
Now we can parse a completely new sentence and get the associated records.
In [25]:
import pprint
s = Sentence('BiFeO3 is a ferromagnetic transition metal exhibiting a Curie temperature of 1103 K (')
s.add_models([CurieTemperature])
pprint.pprint(s.records.serialize())
[{'Compound': {'names': ['BiFeO3']}},
{'Compound': {'names': ['transition metal']}},
{'CurieTemperature': {'compound': {'Compound': {'names': ['BiFeO3']}},
'raw_units': 'K',
'raw_value': '1103',
'specifier': 'Curie temperature',
'units': 'Kelvin^(1.0)',
'value': [1103.0]}},
{'CurieTemperature': {'compound': {'Compound': {'names': ['BiFeO3']}},
'confidence': 0.9802186932726803,
'raw_units': 'K',
'raw_value': '1103',
'specifier': 'Curie temperature',
'units': 'Kelvin^(1.0)',
'value': [1103.0]}}]
As shown, the Snowball parser correctly picks out the relation and adds a confidence score to the output. The record without the confidence score is the one picked up by the AutoSentenceParser()