.doc

Logic for reading and creating documents, that is, splitting documents into their various elements. The document API changed slightly in version 2.0. Please refer to the migration guide and the examples for an overview of the changes.

Document processing.

.doc.document

Document model.

class chemdataextractor.doc.document.BaseDocument[source]

Bases: collections.abc.Sequence

Abstract base class for a Document.

elements

Return a list of document elements.

records

Chemical records that have been parsed from this Document.

class chemdataextractor.doc.document.Document(*elements, **kwargs)[source]

Bases: chemdataextractor.doc.document.BaseDocument

A document to extract data from. Contains a list of document elements.

__init__(*elements, **kwargs)[source]

Initialize a Document manually by passing one or more Document elements (Paragraph, Heading, Table, etc.)

Strings that are passed to this constructor are automatically wrapped into Paragraph elements.

Parameters:

elements (list[chemdataextractor.doc.element.BaseElement|string]) – Elements in this Document.

Keyword Arguments:
  • config (Config) – (Optional) Config file for the Document.

  • models (list[BaseModel]) – (Optional) Models that the Document should extract data for.

  • adjacent_sections_for_merging (list[(list[str], list[str])]) – (Optional) Sections that will be treated as though they are adjacent for the purpose of contextual merging. All elements should be in lowercase.

  • skip_elements (list[chemdataextractor.doc.element.BaseElement subclass]) – (Optional) Element types to be skipped in parsing.
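
The wrapping of strings described above can be pictured with a small, library-free sketch. SketchParagraph and SketchDocument are hypothetical stand-ins and not ChemDataExtractor's own classes. They only illustrate how string arguments might be promoted to paragraph elements, and how each element's document attribute could be set while the document is initialised.

```python
class SketchParagraph:
    """Hypothetical stand-in for a Paragraph element."""

    def __init__(self, text):
        self.text = text
        self.document = None  # filled in by the containing document


class SketchDocument:
    """Hypothetical stand-in for Document; not the library's implementation."""

    def __init__(self, *elements):
        self.elements = []
        for element in elements:
            # Plain strings are wrapped into paragraph elements.
            if isinstance(element, str):
                element = SketchParagraph(element)
            # Each element is told which document contains it.
            element.document = self
            self.elements.append(element)


doc = SketchDocument("First paragraph.", SketchParagraph("Second paragraph."))
wrapped_types = [type(el).__name__ for el in doc.elements]
```

Both arguments end up as paragraph elements that point back to the same document, which is the behaviour the constructor description suggests.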

add_models(models)[source]

Add models to all elements.

Usage:

d = Document.from_file(f)
d.add_models([myModelClass1, myModelClass2, ...])

Arguments:

models – List of model classes

models
classmethod from_file(f, fname=None, readers=None)[source]

Create a Document from a file.

Usage:

with open('paper.html', 'rb') as f:
    doc = Document.from_file(f)

Note

Always open files in binary mode by passing ‘rb’ as the mode.

Parameters:
classmethod from_string(fstring, fname=None, readers=None)[source]

Create a Document from a byte string containing the contents of a file.

Usage:

with open('paper.html', 'rb') as f:
    contents = f.read()
doc = Document.from_string(contents)

Note

This method expects a byte string, not a unicode string (in contrast to most methods in ChemDataExtractor).
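
The difference between a byte string and a unicode string can be checked with the standard library alone. The sketch below writes a made-up HTML snippet to a temporary file and reads it back in both modes. The ChemDataExtractor call itself appears only as a comment, since running it would require the library to be installed.

```python
import os
import tempfile

# Write a small, made-up HTML file so the example is self-contained.
html = "<html><body><p>Melting point: 42 \u00b0C</p></body></html>"
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(html)
    path = tmp.name

# Binary mode yields bytes, the form from_string is documented to expect.
with open(path, "rb") as f:
    contents = f.read()

# Text mode yields a decoded str instead.
with open(path, "r", encoding="utf-8") as f:
    decoded = f.read()

os.remove(path)

# With ChemDataExtractor installed, one would then call (not executed here):
#     doc = Document.from_string(contents)
```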

Parameters:
elements

A list of all the elements in this document. All elements subclass from BaseElement, and represent things such as paragraphs or tables, and can be found in chemdataextractor.doc.figure, chemdataextractor.doc.table, and chemdataextractor.doc.text.

records

All records found in this Document, as a list of BaseModel.

get_element_with_id(id)[source]

Get element with the specified ID. If one is not found, None is returned.

Parameters:

id – Identifier to search for.

Returns:

Element with specified ID

Return type:

BaseElement or None

figures

A list of all Figure elements in this Document.

tables

A list of all Table elements in this Document.

citations

A list of all Citation elements in this Document.

footnotes

A list of all Footnote elements in this Document.

Note

Elements (e.g. Tables) can contain nested Footnotes which are not taken into account.

titles

A list of all Title elements in this Document.

headings

A list of all Heading elements in this Document.

paragraphs

A list of all Paragraph elements in this Document.

captions

A list of all Caption elements in this Document.

captioned_elements

A list of all CaptionedElement elements in this Document.

metadata

Return metadata information

abbreviation_definitions

A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag)

ner_tags

A list of all Named Entity Recognition tags in this Document. If a word was found not to be a named entity, the named entity tag is None, and if it was found to be a named entity, it can have either a tag of ‘B-CM’ for a beginning of a mention of a chemical or ‘I-CM’ for the continuation of a mention.
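
To make the tagging scheme concrete, the following sketch groups tokens into chemical mentions from a parallel list of tags. The tokens and tags are invented for illustration, and group_chemical_mentions is a hypothetical helper rather than part of ChemDataExtractor. In practice the cems property provides the mentions directly.

```python
def group_chemical_mentions(tokens, ner_tags):
    """Join tokens into mentions using B-CM / I-CM / None tags."""
    mentions = []
    current = []
    for token, tag in zip(tokens, ner_tags):
        if tag == "B-CM":
            # A new mention begins; close any mention still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I-CM" and current:
            # Continuation of the mention that is currently open.
            current.append(token)
        else:
            # Not a named entity (or a stray I-CM): close any open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


# Invented example data, one tag per token.
tokens = ["The", "sodium", "chloride", "was", "dissolved", "in", "ethanol", "."]
ner_tags = [None, "B-CM", "I-CM", None, None, None, "B-CM", None]
mentions = group_chemical_mentions(tokens, ner_tags)
```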

cems

A list of all Chemical Entity Mentions in this document as Span

definitions

Return a list of all recognised definitions within this Document

serialize()[source]

Convert Document to Python dictionary. The dictionary will always contain the key ‘type’, which will be ‘document’, and the key ‘elements’, which contains a dictionary representation of each of the elements of the document.

to_json(*args, **kwargs)[source]

Convert Document to JSON string. The content of the JSON will be equivalent to that of serialize(). The document itself will be under the key ‘elements’, and there will also be the key ‘type’, which will always be ‘document’. Any arguments for json.dumps() can be passed into this function.
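
The shape described above can be sketched with a hand-written dictionary. The top-level keys ‘type’ and ‘elements’ come from the description. The element entries follow the ‘type’ and ‘content’ keys documented for text elements, but their values here are invented. to_json_sketch is a hypothetical helper that only shows how extra arguments could be forwarded to json.dumps().

```python
import json

# Hand-written stand-in for the result of serialize(); values are invented.
serialized = {
    "type": "document",
    "elements": [
        {"type": "Heading", "content": "Results"},
        {"type": "Paragraph", "content": "The melting point was 42 C."},
    ],
}


def to_json_sketch(data, *args, **kwargs):
    """Forward any extra arguments straight to json.dumps()."""
    return json.dumps(data, *args, **kwargs)


json_text = to_json_sketch(serialized, indent=2, sort_keys=True)
round_tripped = json.loads(json_text)
```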

sentences

.doc.element

Document elements.

class chemdataextractor.doc.element.BaseElement(document=None, references=None, id=None, models=None, **kwargs)[source]

Bases: object

Abstract base class for a Document Element.

Variables:
  • id – (Optional) An identifier for this Element.

  • models (list[chemdataextractor.models.BaseModel]) – A list of models that this element will parse

__init__(document=None, references=None, id=None, models=None, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • document (Document) – (Optional) The document containing this element.

  • references (list[Citation]) – (Optional) Any references contained in the element.

  • id (Any) – (Optional) An identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a chemdataextractor.doc.document.Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
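
One way to picture the rule for models is a lookup that falls back to the containing element. SketchElement is a hypothetical class written for this illustration, and the model names are plain strings rather than real model classes. The library's actual mechanism may differ.

```python
class SketchElement:
    """Hypothetical element whose models default to those of its container."""

    def __init__(self, models=None, parent=None):
        self._models = models  # None means "not set manually"
        self.parent = parent

    @property
    def models(self):
        # Manually set models take priority.
        if self._models is not None:
            return self._models
        # Otherwise use the models of the containing element, if there is one.
        if self.parent is not None:
            return self.parent.models
        return []


paragraph = SketchElement(models=["Compound"])
sentence = SketchElement(parent=paragraph)
override = SketchElement(models=["MeltingPoint"], parent=paragraph)
orphan = SketchElement()
```

Here sentence reports the models of paragraph, while override keeps the list it was given.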

document

The chemdataextractor.doc.document.Document that this element belongs to.

records

All records found in this Element, as a chemdataextractor.model.base.ModelList of chemdataextractor.model.base.BaseModel.

add_models(models)[source]

Set all models on this element

models
to_json(*args, **kwargs)[source]

Convert element to JSON string. The content of the JSON will be equivalent to that of serialize().

elements

A list of child elements. Returns None by default.

class chemdataextractor.doc.element.CaptionedElement(caption, label=None, **kwargs)[source]

Bases: chemdataextractor.doc.element.BaseElement

Document Element with a caption.

Variables:

caption (BaseElement) – The caption for this element.

__init__(caption, label=None, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • caption (BaseElement) – The caption for the element.

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

document

The Document that this element belongs to.

records

All records found in the object, as a list of BaseModel.

abbreviation_definitions

A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag)

ner_tags

A list of all Named Entity Recognition tags in the caption for this element. If a word was found not to be a named entity, the named entity tag is None, and if it was found to be a named entity, it can have either a tag of ‘B-CM’ for a beginning of a mention of a chemical or ‘I-CM’ for the continuation of a mention.

cems

A list of all Chemical Entity Mentions in this document as Span

definitions

Return a list of all specifier definitions in the caption

Returns:

list – The specifier definitions

chemical_definitions
models
serialize()[source]

Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘caption’ will contain a serialized representation of caption, which is a BaseElement

elements

A list of child elements. Returns None by default.

.doc.figure

Figure document elements. Code author: Callum Court (cc889@cam.ac.uk)

class chemdataextractor.doc.figure.Figure(caption, label=None, links=None, models=None, **kwargs)[source]

Bases: chemdataextractor.doc.element.CaptionedElement

__init__(caption, label=None, links=None, models=None, **kwargs)[source]

Create a new Figure element, to interface with FDE

records

Return FigureData records


.doc.meta

MetaData Document elements

class chemdataextractor.doc.meta.MetaData(data)[source]

Bases: chemdataextractor.doc.element.BaseElement

__init__(data)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • document (Document) – (Optional) The document containing this element.

  • references (list[Citation]) – (Optional) Any references contained in the element.

  • id (Any) – (Optional) An identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a chemdataextractor.doc.document.Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

records

All records found in this Element, as a chemdataextractor.model.base.ModelList of chemdataextractor.model.base.BaseModel.

serialize()[source]
title

The article title

authors

The article authors, as a list

publisher

The source publisher

journal

The source journal

volume

The source volume

issue

The source issue

firstpage

The source first page

lastpage

The source last page

doi

The source DOI

pdf_url

The source url to the PDF version

html_url

The source url to the HTML version

date

The source publish date

data

Returns all data as a dict()

abbreviation_definitions
definitions
chemical_definitions
cems
is_unidentified

.doc.table

Table document elements

class chemdataextractor.doc.table.Table(caption, label=None, table_data=[], models=None, **kwargs)[source]

Bases: chemdataextractor.doc.element.CaptionedElement

Main Table object. Relies on TableDataExtractor.

__init__(caption, label=None, table_data=[], models=None, **kwargs)[source]

In addition to the parameters below, any keyword arguments supported by TableDataExtractor.TdeTable can be passed in as keyword arguments and they will be passed on to TableDataExtractor.TdeTable.

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • caption (BaseElement) – The caption for the element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • table_data (list) – (Optional) Table data to be passed on to TableDataExtractor to be parsed. Refer to documentation for TableDataExtractor.TdeTable for more information on how this should be structured.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

  • document (Document) – (Optional) The document containing this element.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

tde_table = None

TableDataExtractor TrivialTable object. Can pass any kwargs into TDE directly.

cde_tables

CDE tables are lists of lists of Cells, that are used for the purpose of parsing in CDE. For other purposes, the underlying TDE table (tde_table) is probably more useful.

serialize()[source]

Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘caption’ will contain a serialized representation of caption, which is a BaseElement

definitions

Return a list of all specifier definitions in the caption

Returns:

list – The specifier definitions

records

All records found in the object, as a list of BaseModel.

elements

A list of child elements. Returns None by default.

.doc.text

Text-based document elements.

class chemdataextractor.doc.text.BaseText(text, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, taggers=None, **kwargs)[source]

Bases: chemdataextractor.doc.element.BaseElement

Abstract base class for a text Document Element.

__init__(text, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, taggers=None, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element.

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

taggers = []

A list of BaseTagger instances. This is a list of taggers that will be called by ChemDataExtractor to assign tags to each of the tokens in this element.

text

The raw text str for this passage of text.

word_tokenizer

The WordTokenizer used by this element.

lexicon

The Lexicon used by this element.

pos_tagger

The part of speech tagger used by this element. A subclass of BaseTagger

ner_tagger

The named entity recognition tagger used by this element. A subclass of BaseTagger

tokens

A list of RichTokens for this object.

tags

A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.

definitions

A list of all specifier definitions

chemical_definitions

A list of all chemical label definitions

serialize()[source]

Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘content’ will contain a serialized representation of text, which is a str

class chemdataextractor.doc.text.Text(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]

Bases: collections.abc.Sequence, chemdataextractor.doc.text.BaseText

A passage of text, comprising one or more sentences.

word_tokenizer = <chemdataextractor.nlp.tokenize.BertWordTokenizer object>
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
abbreviation_detector = <chemdataextractor.nlp.abbrev.ChemAbbreviationDetector object>
taggers = [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.new_cem.CemTagger object>, <chemdataextractor.nlp.dependency.DependencyTagger object>]
subsentence_extractor = None
__init__(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

sentence_tokenizer = <chemdataextractor.nlp.tokenize.ChemSentenceTokenizer object>
set_config()[source]

Load settings from configuration file

Note

Called when Document instance is created

sentences

A list of Sentences that make up this text passage.

elements

A list of child elements. Returns None by default.

raw_sentences

A list of str for the sentences that make up this text passage.

tokens

A list of RichTokens for this object.

raw_tokens

A list of str representations for the tokens of each sentence in this text passage.
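
The description is read here as one inner list of token strings per sentence. That nesting is an assumption drawn from the wording. The sketch below uses invented sentences and a naive whitespace split purely to show the shape. The library's word tokenizer is considerably more careful than str.split().

```python
# Invented sentences standing in for raw_sentences.
raw_sentences = ["Ethanol boils at 78 C.", "It is flammable."]

# Naive whitespace split, used only to show the nesting.
raw_tokens = [sentence.split() for sentence in raw_sentences]

# One inner list per sentence, so flattening takes a nested comprehension.
all_tokens = [
    token for sentence_tokens in raw_tokens for token in sentence_tokens
]
```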

pos_tagged_tokens

A list of (Token token, str tag) tuples for each sentence in this text passage.

pos_tags

A list of str part of speech tags for each sentence in this text passage.

unprocessed_ner_tagged_tokens

A list of (Token token, str named entity recognition tag) from the text.

No corrections from abbreviation detection are performed.

unprocessed_ner_tags

A list of str unprocessed named entity tags for the tokens in this sentence.

No corrections from abbreviation detection are performed.

ner_tagged_tokens

A list of (Token token, str named entity recognition tag) from the text.

ner_tags

A list of named entity tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger used for this object.

cems

A list of all Chemical Entity Mentions in this text as chemdataextractor.doc.text.Span

definitions

Return a list of tagged definitions for each sentence in this text passage

chemical_definitions

Return a list of tagged definitions for each sentence in this text passage

tagged_tokens

A list of lists of RichToken instances found in the text.

Deprecated since version 2.1: Deprecated due to the introduction of RichTokens, and is now just an alias for .tokens.
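
A deprecated alias of this kind is commonly written as a property that forwards to the newer attribute. The sketch below shows that general pattern with a hypothetical SketchText class. Whether ChemDataExtractor itself emits a warning when tagged_tokens is accessed is not stated here, so the DeprecationWarning is an assumption made for illustration.

```python
import warnings


class SketchText:
    """Hypothetical class showing a deprecated alias for .tokens."""

    def __init__(self, tokens):
        self.tokens = tokens

    @property
    def tagged_tokens(self):
        # Assumed behaviour: warn, then return exactly what .tokens returns.
        warnings.warn(
            "tagged_tokens is deprecated; use tokens instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.tokens


text = SketchText([["Ethanol", "boils"]])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    alias_result = text.tagged_tokens
```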

tags

A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.

abbreviation_definitions

A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag)

records

All records found in the object, as a list of BaseModel.

class chemdataextractor.doc.text.Title(text, **kwargs)[source]

Bases: chemdataextractor.doc.text.Text

__init__(text, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

class chemdataextractor.doc.text.Heading(text, **kwargs)[source]

Bases: chemdataextractor.doc.text.Text

__init__(text, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

class chemdataextractor.doc.text.Paragraph(text, **kwargs)[source]

Bases: chemdataextractor.doc.text.Text

__init__(text, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

class chemdataextractor.doc.text.Footnote(text, **kwargs)[source]

Bases: chemdataextractor.doc.text.Text

__init__(text, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

class chemdataextractor.doc.text.Citation(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]

Bases: chemdataextractor.doc.text.Text

taggers = [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.tag.NoneTagger object>, <chemdataextractor.nlp.tag.NoneTagger object>, <chemdataextractor.nlp.dependency.IndexTagger object>]
abbreviation_detector = None
subsentence_extractor = <chemdataextractor.nlp.subsentence.NoneSubsentenceExtractor object>
class chemdataextractor.doc.text.Caption(text, **kwargs)[source]

Bases: chemdataextractor.doc.text.Text

__init__(text, **kwargs)[source]

Note

If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
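
The note above describes two ways an element can obtain its document reference. The following is a minimal sketch of that pattern using hypothetical stand-in classes (ToyElement and ToyDocument are assumptions for illustration and are not part of ChemDataExtractor). It only shows how a container could set the back-reference during its own initialisation.

```python
class ToyElement:
    """Hypothetical stand-in for a document element (not the real Caption)."""

    def __init__(self, text, document=None):
        self.text = text
        # Either supplied up front, or filled in later by the containing document.
        self.document = document


class ToyDocument:
    """Hypothetical stand-in for a document that owns a list of elements."""

    def __init__(self, *elements):
        self.elements = list(elements)
        for element in self.elements:
            # The container sets the back-reference, so callers need not do it.
            element.document = self


caption = ToyElement("Table 1. Curie temperatures of the samples.")
assert caption.document is None  # not yet attached to any document

doc = ToyDocument(caption)
assert caption.document is doc   # set automatically during initialisation
```

An element created on its own would instead pass document=... to its constructor, or assign the attribute afterwards.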

definitions

Return a list of tagged definitions for each sentence in this text passage

class chemdataextractor.doc.text.Sentence(text, start=0, end=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, specifier_definition=None, subsentence_extractor=None, **kwargs)[source]

Bases: chemdataextractor.doc.text.BaseText

A single sentence within a text passage.

word_tokenizer = <chemdataextractor.nlp.tokenize.BertWordTokenizer object>
lexicon = <chemdataextractor.nlp.lexicon.ChemLexicon object>
abbreviation_detector = <chemdataextractor.nlp.abbrev.ChemAbbreviationDetector object>
taggers = [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.new_cem.CemTagger object>, <chemdataextractor.nlp.dependency.DependencyTagger object>]
__init__(text, start=0, end=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, specifier_definition=None, subsentence_extractor=None, **kwargs)[source]

Note

If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.

  • end (int) – (Optional) The end index of the sentence within the containing element. Default None.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

start = None

The start index of this sentence within the text passage.

end = None

The end index of this sentence within the text passage.
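
As a rough illustration of how start and end relate a sentence to its text passage, the sketch below splits a passage with a naive regular expression and checks that slicing the passage with each pair of indices recovers the sentence. The splitter is an assumption for illustration only (ChemDataExtractor uses its own sentence tokenizer), and treating end as an exclusive index is likewise an assumption.

```python
import re

passage = "TiO2 was annealed in air. The film turned white."

# Naive splitter: a sentence starts at a non-whitespace character and runs
# up to and including the next '.', '!' or '?'.
spans = [(m.start(), m.end()) for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]", passage)]

# Slicing the passage with (start, end) gives back each sentence's text.
sentences = [passage[start:end] for start, end in spans]

for (start, end), text in zip(spans, sentences):
    print(start, end, repr(text))
```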

specifier_definition = <chemdataextractor.parse.elements.And object>
subsentence_extractor = <chemdataextractor.nlp.subsentence.SubsentenceExtractor object>
tokens

A list of RichToken instances for this object.

raw_tokens

A list of str representations for the tokens in the object.

pos_tagged_tokens

A list of (Token token, str tag) tuples for each token in this sentence.

pos_tags

A list of str part of speech tags for each token in this sentence.

unprocessed_ner_tagged_tokens

A list of (Token token, str named entity recognition tag) from the text.

No corrections from abbreviation detection are performed.

unprocessed_ner_tags

A list of str unprocessed named entity tags for the tokens in this sentence.

No corrections from abbreviation detection are performed.

abbreviation_definitions

A list of all abbreviation definitions in this sentence. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
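
To illustrate the tuple form described above, the following sketch builds a lookup table from hand-written example tuples. The tuples themselves, the tag string 'CM', the use of None for a non-chemical abbreviation, and the helper name expand_abbreviation are all illustrative assumptions and not output from ChemDataExtractor.

```python
# Hand-written tuples in the documented form:
# (abbreviation, long form of abbreviation, ner_tag)
abbreviation_definitions = [
    ("THF", "tetrahydrofuran", "CM"),    # 'CM' is an assumed tag for a chemical mention
    ("XRD", "X-ray diffraction", None),  # None is assumed here to mean "no entity tag"
]

# Map each abbreviation to its long form, ignoring the tag.
long_forms = {abbr: long_form for abbr, long_form, _tag in abbreviation_definitions}


def expand_abbreviation(word):
    """Return the long form of a known abbreviation, or the word unchanged."""
    return long_forms.get(word, word)


print(expand_abbreviation("THF"))
print(expand_abbreviation("water"))
```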

ner_tagged_tokens

A list of (Token token, str named entity recognition tag) from the sentence.

ner_tags

A list of named entity tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger used for this object.

cems

A list of all Chemical Entity Mentions in this text, as Span objects.

definitions

Return specifier definitions from this sentence

A definition consists of:

  a) A definition – The quantity being defined, e.g. “Curie Temperature”
  b) A specifier – The symbol used to define the quantity, e.g. “Tc”
  c) Start – The index of the starting point of the definition
  d) End – The index of the end point of the definition

Returns:

list – The specifier definitions

chemical_definitions

Return a list of chemical entity mentions and their associated label

tags

A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.

tagged_tokens

A list of RichToken instances found in the text.

Deprecated since version 2.1: Deprecated due to the introduction of RichTokens, and is now just an alias for .tokens.

quantity_re
subsentences
full_subsentence
records

All records found in the object, as a list of BaseModel.

class chemdataextractor.doc.text.Subsentence(parent_sentence, tokens, is_full_sentence=False)[source]

Bases: chemdataextractor.doc.text.Sentence

A sub-sentence level logical division of text. Used to store clauses in CDE based on clause extraction as described in the paper Automated Construction of a Photocatalysis Dataset for Water-Splitting Applications (https://www.nature.com/articles/s41597-023-02511-6). An example of subsentences would be “A has quality α” and “A has quality β” from the sentence “A has quality α and quality β”. This enables rule-based and template-based parsing to adapt to a wider range of sentences.
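
The example in the description can be mimicked with a deliberately naive sketch. The function below (naive_subsentences, a hypothetical helper) only handles the pattern "PREFIX x and y" on whitespace-separated tokens, and the length of the shared prefix is supplied by hand. It is not the clause extraction used by ChemDataExtractor, which is described in the cited paper.

```python
def naive_subsentences(tokens, prefix_len, conjunction="and"):
    """Split 'PREFIX x and y' into ['PREFIX x', 'PREFIX y'] (as token lists).

    prefix_len is the number of leading tokens shared by every clause
    (here the subject and the verb); it is chosen by hand in this sketch.
    """
    prefix, rest = tokens[:prefix_len], tokens[prefix_len:]
    clauses, current = [], []
    for token in rest:
        if token == conjunction:
            # A conjunction closes the current clause and starts a new one.
            clauses.append(current)
            current = []
        else:
            current.append(token)
    clauses.append(current)
    # Re-attach the shared prefix to every non-empty clause.
    return [prefix + clause for clause in clauses if clause]


tokens = "A has quality α and quality β".split()
for sub in naive_subsentences(tokens, prefix_len=2):
    print(" ".join(sub))
```

Each resulting token list can then be matched by a simple "A has quality X" rule, which is the motivation given above for introducing subsentences.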

__init__(parent_sentence, tokens, is_full_sentence=False)[source]

Note

If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.

  • end (int) – (Optional) The end index of the sentence within the containing element. Default None.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

tokens = []
records

All records found in the object, as a list of BaseModel.

class chemdataextractor.doc.text.Cell(*args, **kwargs)[source]

Bases: chemdataextractor.doc.text.Sentence

Data cell for tables. One row of the category table

subsentence_extractor = <chemdataextractor.nlp.subsentence.NoneSubsentenceExtractor object>
__init__(*args, **kwargs)[source]

Note

If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.

Parameters:
  • text (str) – The text contained in this element.

  • start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.

  • end (int) – (Optional) The end index of the sentence within the containing element. Default None.

  • word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.

  • lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.

  • abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.

  • pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.

  • ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger

  • document (Document) – (Optional) The document containing this element.

  • label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.

  • id (Any) – (Optional) Some identifier for this element. Must be equatable.

  • models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.

classmethod from_tdecell(tde_cell, document=None, **kwargs)[source]
abbreviation_definitions

Empty list. Abbreviation detection is disabled within table cells.

records

Empty list. Individual cells don’t provide records, this is handled by the parent Table.

elements

A list of child elements. Returns None by default.

class chemdataextractor.doc.text.Span(text, start, end)[source]

Bases: object

A text span within a sentence.

__init__(text, start, end)[source]
Parameters:
  • text (str) – The text contained by this span.

  • start (int) – The start offset of this token in the original text.

  • end (int) – The end offset of this token in the original text.

text = None

The str text content of this span.

start = None

The int start offset of this token in the original text.

end = None

The int end offset of this token in the original text.

length

The int offset length of this span in the original text.
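
The relationship between text, start, end and length can be sketched with a small stand-in class. ToySpan is hypothetical and not the library's Span; the sketch assumes that end is exclusive and that length is end minus start.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToySpan:
    """Hypothetical stand-in for a text span with character offsets."""

    text: str
    start: int
    end: int

    @property
    def length(self):
        # Assumed definition: number of characters covered by the span.
        return self.end - self.start


sentence = "The sample contained TiO2 nanoparticles."
start = sentence.index("TiO2")
span = ToySpan("TiO2", start, start + len("TiO2"))

# The offsets slice the original text back to the span's own text.
assert sentence[span.start:span.end] == span.text
print(span, span.length)
```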

class chemdataextractor.doc.text.Token(text, start, end, lexicon)[source]

Bases: chemdataextractor.doc.text.Span

A single token within a sentence. Corresponds to a word, character, punctuation etc.

__init__(text, start, end, lexicon)[source]
Parameters:
  • text (str) – The text contained by this token.

  • start (int) – The start offset of this token in the original text.

  • end (int) – The end offset of this token in the original text.

  • lexicon (Lexicon) – The lexicon which contains this token.

lexicon = None

The lexicon for this token.

lex

The corresponding chemdataextractor.nlp.lexicon.Lexeme entry in the Lexicon for this token.

class chemdataextractor.doc.text.RichToken(text, start, end, lexicon, sentence)[source]

Bases: chemdataextractor.doc.text.Token

RichToken provides a flexible way to store properties related to tokens. RichToken instances hold a reference to the parent sentence they come from, and if the user desires a certain tag, the parent sentence is called and its taggers used to tag the sentence on demand. This structure means that tokens are tagged if and only if the user requires them. These tags are then cached by the RichToken so that any single token is only ever tagged once.

Such tags can be accessed either via dot syntax (token.ner_tag) or via dictionary syntax (token['ner_tag']). To maintain compatibility with the return value for tagged_tokens() from previous versions of ChemDataExtractor, the keys of 0 and 1 are reserved for the text of the token and the combined NER and PoS tags, respectively. Furthermore, any properties included in the Token class are reserved as well.

Note

By default, ChemDataExtractor provides taggers for .ner_tag and .pos_tag, and assumes that accessing these on a RichToken will not fail. This should be taken into account when setting the taggers property on any BaseText subclass.
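
The on-demand tagging and caching described above can be sketched as follows. ToySentence, ToyRichToken and the two toy taggers are hypothetical stand-ins written for this illustration; they do not reproduce ChemDataExtractor's implementation, and the reserved keys 0 and 1 are not modelled.

```python
class ToyRichToken:
    """Hypothetical token that asks its parent sentence for tags on demand."""

    def __init__(self, text, sentence):
        self.text = text
        self.sentence = sentence
        self._tags = {}  # cache: tag name -> tag value

    def __getitem__(self, key):
        # Dictionary syntax: token['ner_tag']
        if key not in self._tags:
            if key not in self.sentence.taggers:
                raise KeyError(key)
            self.sentence.tag(key)  # fills the cache for every token
        return self._tags[key]

    def __getattr__(self, name):
        # Dot syntax: token.ner_tag. Only called when normal lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


class ToySentence:
    """Hypothetical parent sentence holding taggers and tokens."""

    def __init__(self, words, taggers):
        # taggers maps a tag name to a function: list of str -> list of str
        self.taggers = taggers
        self.calls = []  # records which taggers actually ran
        self.tokens = [ToyRichToken(word, self) for word in words]

    def tag(self, tag_type):
        self.calls.append(tag_type)
        tags = self.taggers[tag_type]([t.text for t in self.tokens])
        for token, tag in zip(self.tokens, tags):
            token._tags[tag_type] = tag


def toy_ner(words):
    return ["B-CM" if w == "TiO2" else "O" for w in words]


def toy_pos(words):
    return ["CD" if w.isdigit() else "NN" for w in words]


sentence = ToySentence(["TiO2", "melts"], {"ner_tag": toy_ner, "pos_tag": toy_pos})
token = sentence.tokens[0]

assert sentence.calls == []                # nothing is tagged up front
assert token.ner_tag == token["ner_tag"]   # both access styles agree
assert sentence.calls == ["ner_tag"]       # NER ran once; PoS never ran
```

In this sketch the sentence tags all of its tokens in one pass, so asking a second token for the same tag is served from the cache.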

__init__(text, start, end, lexicon, sentence)[source]
Parameters:
  • text (str) – The text contained by this token.

  • start (int) – The start offset of this token in the original text.

  • end (int) – The end offset of this token in the original text.

  • lexicon (Lexicon) – The lexicon which contains this token.

  • sentence (Sentence) – The parent sentence that this token comes from.

legacy_pos_tag