.relex¶

For performing semi-supervised chemical Relationship Extraction using the Snowball Algorithm.

.relex.cluster¶

Cluster of phrase objects and associated cluster dictionaries

class chemdataextractor.relex.cluster.Cluster(label=None, learning_rate=0.5)[source]¶

Bases: object

Base Snowball Cluster, used to combine similar phrases

__init__(label=None, learning_rate=0.5)[source]¶

Create a new cluster

Keyword Arguments:

label (str) – The label of this cluster (default: None)
order (list) – The order of entities that all phrases in this cluster must share (default: None)
learning_rate (float) – How quickly to update confidences based on new information (default: 0.5)
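To give a feel for what a learning rate of this kind does, here is a minimal sketch that blends a newly measured confidence into a stored one. The function name `update_confidence` and the weighted-average form are assumptions made for illustration, in the spirit of the Snowball algorithm; this is not taken from the library's source.

```python
def update_confidence(old_confidence, new_confidence, learning_rate=0.5):
    """Blend a newly measured confidence into the stored one.

    Hypothetical helper: a learning_rate of 0 ignores new information,
    a learning_rate of 1 discards the old value entirely.
    """
    if not 0.0 <= learning_rate <= 1.0:
        raise ValueError("learning_rate must lie in [0, 1]")
    return learning_rate * new_confidence + (1.0 - learning_rate) * old_confidence


# Repeatedly observing a confidence of 0.5 pulls a stored value of 0.9
# towards 0.5, halving the remaining gap at every step.
confidence = 0.9
for observed in (0.5, 0.5, 0.5):
    confidence = update_confidence(confidence, observed)
    print(round(confidence, 3))
```

A higher learning rate makes the cluster react faster to new phrases, at the cost of noisier confidence values.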

add_phrase(phrase)[source]¶

Add phrase to this cluster, update the word dictionary and token weights

Parameters: phrase (chemdataextractor.relex.phrase.Phrase) – The phrase to add to the cluster

update_dictionaries(phrase)[source]¶

Update all dictionaries in this cluster

Parameters: phrase (chemdataextractor.relex.phrase.Phrase) – The phrase to update

static add_tokens(dictionary, tokens)[source]¶

Add specified tokens to the specified dictionary

Parameters:

dictionary (OrderedDict) – The dictionary to add tokens to
tokens (list of str) – Tokens to add

update_weights()[source]¶

Update the weights on each token in the phrases
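The relationship between a token dictionary and its weights can be sketched as follows. This is a simplified stand-in, assuming that the dictionary maps each token to an occurrence count and that a weight is the token's share of all counts; the helper names `add_token_counts` and `normalised_weights` are hypothetical, and the library's internal dictionary layout may differ.

```python
from collections import OrderedDict


def add_token_counts(dictionary, tokens):
    """Increment the count of every token, preserving first-seen order."""
    for token in tokens:
        dictionary[token] = dictionary.get(token, 0) + 1


def normalised_weights(dictionary):
    """Turn raw counts into weights that sum to 1 (assumed weighting scheme)."""
    total = sum(dictionary.values())
    if total == 0:
        return OrderedDict()
    return OrderedDict((token, count / total) for token, count in dictionary.items())


middles = OrderedDict()
add_token_counts(middles, ["has", "a", "Curie"])
add_token_counts(middles, ["has", "the"])
weights = normalised_weights(middles)
print(list(weights.items()))
```

Tokens that recur across the phrases of a cluster end up with larger weights, so they dominate any later comparison against a new phrase.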

update_pattern()[source]¶

Use the cluster phrases to generate a new centroid extraction Pattern object

Parameters:

relations (list of Relation objects) – List of known relations to look for
sentences (list of str) – List of sentences known to contain relations

update_pattern_confidence()[source]¶

Determine the confidence of this centroid pattern

get_relations(tokens)[source]¶

Retrieve relations from a set of tokens using this cluster's extraction pattern

Parameters: tokens (list) – Tokens to extract from
Returns: Relations – The found Relations

.relex.entity¶

Entity object

class chemdataextractor.relex.entity.Entity(text, tag, parse_expression, start, end)[source]¶

Bases: object

A base entity, the fundamental unit of a Relation

__init__(text, tag, parse_expression, start, end)[source]¶

Create a new Entity

Parameters:

text (str) – The text of the entity
tag (str or list) – The name of the entity
parse_expression – How the entity is identified in text
start (int) – The start index of the entity in tokens
end (int) – The end index of the entity in tokens

serialize()[source]¶

.relex.pattern¶

Extraction pattern object

class chemdataextractor.relex.pattern.Pattern(entities=None, elements=None, label=None, sentences=None, order=None, relations=None, confidence=0)[source]¶

Bases: object

Pattern object, fundamentally the same as a phrase except that it is assigned a confidence

__init__(entities=None, elements=None, label=None, sentences=None, order=None, relations=None, confidence=0)[source]¶

to_string()[source]¶

generate_cde_parse_expression()[source]¶

Create a CDE parse expression for this extraction pattern

.relex.phrase¶

Phrase object

class chemdataextractor.relex.phrase.Phrase(sentence_tokens, relations, prefix_length, suffix_length)[source]¶

Bases: object

__init__(sentence_tokens, relations, prefix_length, suffix_length)[source]¶

Phrase Object

Class for handling which relations and entities appear in a sentence, the base type used for clustering and generating extraction patterns

Parameters:

sentence_tokens (list) – The sentence tokens from which to generate the Phrase
relations (list) – List of Relation objects to be tagged in the sentence
prefix_length (int) – Number of tokens to assign to the prefix
suffix_length (int) – Number of tokens to assign to the suffix
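The roles of prefix_length and suffix_length can be illustrated with a small self-contained sketch that splits a tokenised sentence around its entities. The function `split_phrase`, the use of `(start, end)` spans with an exclusive end index, and the example sentence are all assumptions made for illustration; they are not the library's implementation.

```python
def split_phrase(tokens, entity_spans, prefix_length=1, suffix_length=1):
    """Split tokens into a prefix, the middles between entities, and a suffix.

    entity_spans is a non-empty list of non-overlapping (start, end) token
    indices with an exclusive end (an assumed convention for this sketch).
    """
    spans = sorted(entity_spans)
    first_start = spans[0][0]
    last_end = spans[-1][1]
    # Tokens immediately before the first entity, at most prefix_length of them.
    prefix = tokens[max(0, first_start - prefix_length):first_start]
    # Tokens immediately after the last entity, at most suffix_length of them.
    suffix = tokens[last_end:last_end + suffix_length]
    # One middle per pair of neighbouring entities (possibly empty).
    middles = [
        tokens[left_end:right_start]
        for (_, left_end), (right_start, _) in zip(spans, spans[1:])
    ]
    return prefix, middles, suffix


# Illustrative sentence; spans mark specifier, compound, value and unit.
tokens = ["the", "Curie", "temperature", "of", "BiFeO3", "is", "1103", "K", "in", "bulk"]
spans = [(1, 3), (4, 5), (6, 7), (7, 8)]
prefix, middles, suffix = split_phrase(tokens, spans)
print(prefix, middles, suffix)
```

Keeping the prefix and suffix short limits the pattern to the words closest to the entities, which tend to carry most of the signal about the relationship.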

to_string()[source]¶

create()[source]¶

Create a phrase from known relations

reset_vectors()[source]¶

Set all element vectors to None

.relex.relationship¶

Classes for defining new chemical relationships

class chemdataextractor.relex.relationship.Relation(entities, confidence)[source]¶

Bases: object

Relation class

Essentially a placeholder for a set of related entities

__init__(entities, confidence)[source]¶

Init

Parameters:

entities (list) – List of Entity objects that are present in this relationship
confidence (float) – The confidence of the relation

serialize()[source]¶

is_valid()[source]¶

.relex.snowball¶

.relex.utils¶

Various utility functions

chemdataextractor.relex.utils.match_score(pi, pj, prefix_weight=0.1, middle_weight=0.8, suffix_weight=0.1)[source]¶

Compute the match between two phrases using a dot product of their vectors. The dot products are weighted to put more emphasis on matching the middles.

Parameters:

pi – Phrase or pattern
pj – Phrase or pattern
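The idea of a weighted dot product over prefix, middle and suffix vectors can be sketched as below. This is a simplified illustration, assuming each phrase is reduced to one vector per part and stored in a plain dict; the names `dot` and `weighted_match` are hypothetical, and the library's phrases may carry several middle vectors.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))


def weighted_match(p, q, prefix_weight=0.1, middle_weight=0.8, suffix_weight=0.1):
    """Weighted sum of per-part dot products (assumed single middle per phrase)."""
    return (
        prefix_weight * dot(p["prefix"], q["prefix"])
        + middle_weight * dot(p["middle"], q["middle"])
        + suffix_weight * dot(p["suffix"], q["suffix"])
    )


p = {"prefix": [1.0, 0.0], "middle": [0.5, 0.5], "suffix": [0.0, 1.0]}
q = {"prefix": [1.0, 0.0], "middle": [1.0, 0.0], "suffix": [1.0, 0.0]}
score = weighted_match(p, q)
print(round(score, 3))
```

With the default weights, agreement in the middle contributes eight times as much as agreement in the prefix or suffix.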

chemdataextractor.relex.utils.vectorise(phrase, cluster)[source]¶

Vectorise a phrase object against a given cluster

Parameters:

phrase – The phrase object to vectorise
cluster – The cluster to vectorise the phrase against
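One plausible way to vectorise a list of tokens against a cluster's weighted token dictionary is sketched below. It assumes that the vector has one component per dictionary token, set to that token's weight when the token occurs in the phrase and to zero otherwise; the function name `vectorise_tokens` and this scheme are illustrative assumptions, not the library's documented behaviour.

```python
def vectorise_tokens(tokens, weight_dictionary):
    """Build a vector with one component per dictionary token.

    A component holds the token's weight if the token occurs in tokens,
    and 0.0 otherwise. Component order follows the dictionary's order.
    """
    present = set(tokens)
    return [
        weight if token in present else 0.0
        for token, weight in weight_dictionary.items()
    ]


# Hypothetical weights for the tokens seen so far in one part of a cluster.
weights = {"has": 0.4, "a": 0.2, "of": 0.4}
vector = vectorise_tokens(["of", "the"], weights)
print(vector)
```

Tokens that the cluster has never seen, such as "the" here, have no component and so cannot raise the match score.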

chemdataextractor.relex.utils.match(phrase, cluster, prefix_weight, middles_weight, suffix_weight)[source]¶

Vectorise the phrase against this cluster to determine the match score

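Parameters:

phrase – The phrase object to vectorise
cluster – The cluster to match the phrase against
prefix_weight, middles_weight, suffix_weight – Weights applied to the prefix, middle and suffix parts of the match score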

chemdataextractor.relex.utils.mode_rows(a)[source]¶

Find the modal row of a 2D array

Parameters: a (np.array) – The 2D array to process
Returns: The most frequent row
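Finding the most frequent row can be sketched without NumPy by counting rows as tuples. This is a pure-Python stand-in that works on a list of lists rather than an np.array; the name `modal_row` and the tie-breaking rule (the row seen first wins) are assumptions of the sketch.

```python
from collections import Counter


def modal_row(rows):
    """Return the row that occurs most often in a list of rows."""
    if not rows:
        raise ValueError("rows must not be empty")
    # Rows are converted to tuples so that they can be used as Counter keys.
    counts = Counter(tuple(row) for row in rows)
    row, _ = counts.most_common(1)[0]
    return list(row)


orders = [[1, 2], [3, 4], [1, 2]]
print(modal_row(orders))
```

A function of this kind is useful when many phrases vote on, for example, the order in which entities appear, and the cluster should adopt the majority choice.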

chemdataextractor.relex.utils.KnuthMorrisPratt(text, pattern)[source]¶

Yields all starting positions of copies of the pattern in the text.

Calling conventions are similar to string.find, but its arguments can be lists or iterators, not just strings; it returns all matches, not just the first one; and it does not need the whole text in memory at once. Whenever it yields, it will have read the text exactly up to and including the match that caused the yield.

Source: http://code.activestate.com/recipes/117214/
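The behaviour described above can be illustrated with a compact generator written in the standard failure-table formulation of Knuth-Morris-Pratt. This is an independent sketch, not the recipe linked above or the library's code, and the name `kmp_positions` is hypothetical.

```python
def kmp_positions(text, pattern):
    """Yield the start index of every (possibly overlapping) match.

    text may be any iterable; pattern is copied into a list. An empty
    pattern yields nothing in this sketch.
    """
    pattern = list(pattern)
    if not pattern:
        return
    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    matched = 0
    for index, item in enumerate(text):
        # On a mismatch, fall back to the longest prefix that still fits.
        while matched > 0 and item != pattern[matched]:
            matched = failure[matched - 1]
        if item == pattern[matched]:
            matched += 1
        if matched == len(pattern):
            yield index - len(pattern) + 1
            matched = failure[matched - 1]


print(list(kmp_positions("abababa", "aba")))
sentence = ["the", "Curie", "temperature", "of", "the", "Curie"]
print(list(kmp_positions(iter(sentence), ["the", "Curie"])))
```

Because the text is consumed one item at a time and never rewound, the same function works on token lists and on iterators, which is what makes it convenient for locating entity token sequences inside a sentence.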

chemdataextractor.relex.utils.subfinder(mylist, pattern)[source]¶
