.doc¶
Logic for reading and creating documents, that is, splitting a document into its constituent elements. The document API changed slightly in version 2.0; refer to the migration guide and the examples for an overview of the changes.
Document processing.
.doc.document¶
Document model.
-
class
chemdataextractor.doc.document.BaseDocument[source]¶ Bases:
collections.abc.Sequence
Abstract base class for a Document.
-
elements¶ Return a list of document elements.
-
records¶ Chemical records that have been parsed from this Document.
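Because BaseDocument derives from collections.abc.Sequence, a concrete subclass presumably only needs to supply __getitem__ and __len__ to gain iteration, membership tests and index lookups. The sketch below illustrates that standard-library mechanism with a hypothetical ToyDocument class; it is not ChemDataExtractor code, and the string elements are placeholders.

```python
from collections.abc import Sequence


class ToyDocument(Sequence):
    """Hypothetical stand-in for a document; not part of ChemDataExtractor."""

    def __init__(self, *elements):
        self._elements = list(elements)

    @property
    def elements(self):
        # Mirrors the documented ``elements`` attribute: a list of elements.
        return self._elements

    def __getitem__(self, index):
        return self._elements[index]

    def __len__(self):
        return len(self._elements)


doc = ToyDocument("Heading", "First paragraph.", "Second paragraph.")
# Sequence supplies __iter__, __contains__, index() and count() for free.
first = doc[0]
as_list = list(doc)
has_heading = "Heading" in doc
```

Only the two abstract methods are written by hand; everything else comes from the Sequence mixin.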
-
-
class
chemdataextractor.doc.document.Document(*elements, **kwargs)[source]¶ Bases:
chemdataextractor.doc.document.BaseDocument
A document to extract data from. Contains a list of document elements.
-
__init__(*elements, **kwargs)[source]¶ Initialize a Document manually by passing one or more Document elements (Paragraph, Heading, Table, etc.)
Strings that are passed to this constructor are automatically wrapped into Paragraph elements.
- Parameters:
elements (list[chemdataextractor.doc.element.BaseElement|string]) – Elements in this Document.
- Keyword Arguments:
config (Config) – (Optional) Config file for the Document.
models (list[BaseModel]) – (Optional) Models that the Document should extract data for.
adjacent_sections_for_merging (list[(list[str], list[str])]) – (Optional) Sections that will be treated as though they are adjacent for the purpose of contextual merging. All elements should be in lowercase.
skip_elements (list[chemdataextractor.doc.element.BaseElement subclass]) – (Optional) Element types to be skipped in parsing.
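The wrapping rule described for the constructor (plain strings become Paragraph elements, and each element learns which document contains it) can be pictured with a small stand-alone sketch. ToyParagraph and ToyDocument are hypothetical names invented for this illustration and make no claim about how the library implements the behaviour.

```python
class ToyParagraph:
    """Hypothetical stand-in for a Paragraph element."""

    def __init__(self, text):
        self.text = text
        self.document = None


class ToyDocument:
    """Hypothetical container that mimics the documented wrapping rule."""

    def __init__(self, *elements):
        self.elements = []
        for element in elements:
            # Plain strings are promoted to paragraph elements.
            if isinstance(element, str):
                element = ToyParagraph(element)
            # Each element is told which document contains it.
            element.document = self
            self.elements.append(element)


doc = ToyDocument("A plain string.", ToyParagraph("An explicit paragraph."))
kinds = [type(e).__name__ for e in doc.elements]
```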
-
add_models(models)[source]¶ Add models to all elements.
Usage:
    d = Document.from_file(f)
    d.add_models([myModelClass1, myModelClass2, ...])
- Arguments:
models – List of model classes
-
models¶
-
classmethod
from_file(f, fname=None, readers=None)[source]¶ Create a Document from a file.
Usage:
    with open('paper.html', 'rb') as f:
        doc = Document.from_file(f)
Note
Always open files in binary mode by passing ‘rb’ as the mode argument to open().
- Parameters:
f (file or str) – A file-like object or path to a file.
fname (str) – (Optional) The filename. Used to help determine file format.
readers (list[chemdataextractor.reader.base.BaseReader]) – (Optional) List of readers to use. If not set, Document will try all default readers, which are AcsHtmlReader, RscHtmlReader, NlmXmlReader, UsptoXmlReader, CsspHtmlReader, ElsevierXmlReader, XmlReader, HtmlReader, PdfReader, and PlainTextReader.
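Trying a list of readers can be pictured as a loop that hands the input to each reader in turn and keeps the first result that parses. The sketch below shows that general pattern only; it is an assumption about the shape of the logic, not the library's actual implementation, and ToyReaderError, ToyHtmlReader, ToyPlainTextReader and read_with_fallback are hypothetical names.

```python
class ToyReaderError(Exception):
    """Raised by a hypothetical reader that cannot handle the input."""


class ToyHtmlReader:
    def read(self, data):
        # Very rough markup check, for illustration only.
        if not data.lstrip().startswith(b"<"):
            raise ToyReaderError("not markup")
        return ("html", data)


class ToyPlainTextReader:
    def read(self, data):
        # Accepts anything, so it belongs at the end of the list.
        return ("text", data)


def read_with_fallback(data, readers):
    """Return the result of the first reader that accepts ``data``."""
    for reader in readers:
        try:
            return reader.read(data)
        except ToyReaderError:
            continue
    raise ToyReaderError("no reader could parse the input")


readers = [ToyHtmlReader(), ToyPlainTextReader()]
html_result = read_with_fallback(b"<p>NaCl</p>", readers)
text_result = read_with_fallback(b"NaCl dissolves in water.", readers)
```

Ordering matters in such a scheme: specific readers come first and the most permissive reader last, which matches the order of the default list above.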
-
classmethod
from_string(fstring, fname=None, readers=None)[source]¶ Create a Document from a byte string containing the contents of a file.
Usage:
    with open('paper.html', 'rb') as f:
        contents = f.read()
    doc = Document.from_string(contents)
Note
This method expects a byte string, not a unicode string (in contrast to most methods in ChemDataExtractor).
- Parameters:
fstring (bytes) – A byte string containing the contents of a file.
fname (str) – (Optional) The filename. Used to help determine file format.
readers (list[chemdataextractor.reader.base.BaseReader]) – (Optional) List of readers to use. If not set, Document will try all default readers, which are AcsHtmlReader, RscHtmlReader, NlmXmlReader, UsptoXmlReader, CsspHtmlReader, ElsevierXmlReader, XmlReader, HtmlReader, PdfReader, and PlainTextReader.
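The distinction between a byte string and a unicode string comes down to the mode used when opening the file. The following standard-library sketch writes a throwaway temporary file and reads it back both ways; the file contents are invented for illustration and no ChemDataExtractor call is made.

```python
import os
import tempfile

# Write a small UTF-8 file to read back; the path is a throwaway temp file.
with tempfile.NamedTemporaryFile("wb", suffix=".html", delete=False) as tmp:
    tmp.write("<p>5 °C</p>".encode("utf-8"))
    path = tmp.name

try:
    with open(path, "rb") as f:  # binary mode yields bytes
        raw = f.read()
    with open(path, "r", encoding="utf-8") as f:  # text mode yields str
        text = f.read()
finally:
    os.remove(path)

is_bytes = isinstance(raw, bytes)
```

Only raw, the bytes object, is the kind of value that from_string is documented to expect.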
-
elements¶ A list of all the elements in this document. All elements subclass from BaseElement and represent things such as paragraphs or tables. They can be found in chemdataextractor.doc.figure, chemdataextractor.doc.table, and chemdataextractor.doc.text.
-
get_element_with_id(id)[source]¶ Get element with the specified ID. If one is not found, None is returned.
- Parameters:
id – Identifier to search for.
- Returns:
Element with specified ID
- Return type:
BaseElement or None
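A lookup with these semantics (return the matching element, or None when nothing matches) can be sketched as a linear search that compares identifiers with ==, which is all that "equatable" requires. The function find_element_with_id and the SimpleNamespace elements are hypothetical stand-ins, not the library's implementation.

```python
from types import SimpleNamespace


def find_element_with_id(elements, wanted_id):
    """Return the first element whose ``id`` equals ``wanted_id``, else None."""
    for element in elements:
        # Identifiers only need to support ``==`` ("equatable").
        if getattr(element, "id", None) == wanted_id:
            return element
    return None


# Hypothetical elements; real ones would be BaseElement subclasses.
elements = [
    SimpleNamespace(id="abstract", text="We report ..."),
    SimpleNamespace(id=("table", 1), text="Melting points"),
]
hit = find_element_with_id(elements, ("table", 1))
miss = find_element_with_id(elements, "no-such-id")
```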
-
footnotes¶ A list of all Footnote elements in this Document.
Note
Elements (e.g. Tables) can contain nested Footnotes, which are not taken into account.
-
captioned_elements¶ A list of all CaptionedElement elements in this Document.
-
metadata¶ Return metadata information
-
abbreviation_definitions¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
-
ner_tags¶ A list of all Named Entity Recognition tags in this Document. A word that is not a named entity has the tag None; a word that is part of a named entity is tagged either ‘B-CM’ (the beginning of a chemical mention) or ‘I-CM’ (the continuation of a mention).
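The B-CM / I-CM / None scheme can be decoded into whole mentions by starting a new group at each ‘B-CM’ and extending it for every following ‘I-CM’. The sketch below is a generic decoder written for this page; group_mentions is a hypothetical helper, the tokens and tags are hand-written rather than tagger output, and joining tokens with a space is a simplification.

```python
def group_mentions(tokens, tags):
    """Group tokens into mentions using B-CM / I-CM / None tags."""
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CM":
            # A new mention begins; close any mention still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I-CM" and current:
            # Continuation of the mention that is currently open.
            current.append(token)
        else:
            # Not part of a mention: close the open one, if any.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


tokens = ["Sodium", "chloride", "dissolves", "in", "water", "."]
tags = ["B-CM", "I-CM", None, None, "B-CM", None]
mentions = group_mentions(tokens, tags)
```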
-
definitions¶ Return a list of all recognised definitions within this Document
-
serialize()[source]¶ Convert Document to Python dictionary. The dictionary will always contain the key ‘type’, which will be ‘document’, and the key ‘elements’, which contains a dictionary representation of each of the elements of the document.
-
to_json(*args, **kwargs)[source]¶ Convert Document to JSON string. The content of the JSON will be equivalent to that of serialize(). The document itself will be under the key ‘elements’, and there will also be the key ‘type’, which will always be ‘document’. Any arguments for json.dumps() can be passed into this function.
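The relationship between serialize() and to_json() can be sketched with the standard json module: build the dictionary with the two documented top-level keys, then forward any extra arguments to json.dumps(). The helpers toy_serialize and toy_to_json are hypothetical, and the element dictionary is invented for illustration.

```python
import json


def toy_serialize(elements):
    """Build a dictionary with the two documented top-level keys."""
    return {"type": "document", "elements": elements}


def toy_to_json(elements, *args, **kwargs):
    # Extra arguments are forwarded unchanged to json.dumps().
    return json.dumps(toy_serialize(elements), *args, **kwargs)


# The per-element dictionary here is invented for illustration.
elements = [{"type": "Paragraph", "content": "NaCl melts at 801 °C."}]
compact = toy_to_json(elements)
pretty = toy_to_json(elements, indent=2, sort_keys=True)
```

Options such as indent and sort_keys change only the layout of the string; parsing either result gives back the same dictionary.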
-
sentences¶
.doc.element¶
Document elements.
-
class
chemdataextractor.doc.element.BaseElement(document=None, references=None, id=None, models=None, **kwargs)[source]¶ Bases:
object
Abstract base class for a Document Element.
- Variables:
-
__init__(document=None, references=None, id=None, models=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
document (Document) – (Optional) The document containing this element.
references (list[Citation]) – (Optional) Any references contained in the element.
id (Any) – (Optional) An identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a chemdataextractor.doc.document.Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
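The fallback rule for models (use the container's models unless they were set manually) can be pictured with a small property-based sketch. ToyElement is a hypothetical class, and the strings standing in for model classes are placeholders; the library's own mechanism may differ.

```python
class ToyElement:
    """Hypothetical element that falls back to its container's models."""

    def __init__(self, models=None):
        self._models = models  # None means "not set manually"
        self.parent = None

    @property
    def models(self):
        if self._models is not None:
            return self._models
        if self.parent is not None:
            return self.parent.models
        return []

    def add_child(self, child):
        child.parent = self
        return child


# Strings stand in for model classes in this illustration.
paragraph = ToyElement(models=["MeltingPoint"])
inherited = paragraph.add_child(ToyElement())
overridden = paragraph.add_child(ToyElement(models=["BoilingPoint"]))
```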
-
document¶ The chemdataextractor.doc.document.Document that this element belongs to.
-
records¶ All records found in this Element, as a chemdataextractor.model.base.ModelList of chemdataextractor.model.base.BaseModel.
-
models¶
-
to_json(*args, **kwargs)[source]¶ Convert element to JSON string. The content of the JSON will be equivalent to that of
serialize().
-
elements¶ A list of child elements. Returns None by default.
-
class
chemdataextractor.doc.element.CaptionedElement(caption, label=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
Document Element with a caption.
- Variables:
caption (BaseElement) – The caption for this element.
-
__init__(caption, label=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
caption (BaseElement) – The caption for the element.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
abbreviation_definitions¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
-
ner_tags¶ A list of all Named Entity Recognition tags in the caption for this element. A word that is not a named entity has the tag None; a word that is part of a named entity is tagged either ‘B-CM’ (the beginning of a chemical mention) or ‘I-CM’ (the continuation of a mention).
-
definitions¶ Return a list of all specifier definitions in the caption
- Returns:
list – The specifier definitions
-
chemical_definitions¶
-
models¶
-
serialize()[source]¶ Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘caption’ will contain a serialized representation of caption, which is a BaseElement.
-
elements¶ A list of child elements. Returns None by default.
.doc.figure¶
Figure document elements.
Code author: Callum Court (cc889@cam.ac.uk)
-
class
chemdataextractor.doc.figure.Figure(caption, label=None, links=None, models=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.CaptionedElement
-
__init__(caption, label=None, links=None, models=None, **kwargs)[source]¶ Create a new Figure element, to interface with FDE
-
records¶ Return FigureData records
- Returns:
[type] – [description]
-
.doc.meta¶
MetaData Document elements
-
class
chemdataextractor.doc.meta.MetaData(data)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
-
__init__(data)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
document (Document) – (Optional) The document containing this element.
references (list[Citation]) – (Optional) Any references contained in the element.
id (Any) – (Optional) An identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a chemdataextractor.doc.document.Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
records¶ All records found in this Element, as a chemdataextractor.model.base.ModelList of chemdataextractor.model.base.BaseModel.
-
title¶ The article title
-
authors¶ The article authors, as a list
-
publisher¶ The source publisher
-
journal¶ The source journal
-
volume¶ The source volume
-
issue¶ The source issue
-
firstpage¶ The source first page
-
lastpage¶ The source last page
-
doi¶ The source DOI
-
pdf_url¶ The source url to the PDF version
-
html_url¶ The source url to the HTML version
-
date¶ The source publish date
-
data¶ Returns all data as a dict()
-
abbreviation_definitions¶
-
definitions¶
-
chemical_definitions¶
-
cems¶
-
is_unidentified¶
.doc.table¶
Table document elements
-
class
chemdataextractor.doc.table.Table(caption, label=None, table_data=[], models=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.CaptionedElement
Main Table object. Relies on TableDataExtractor.
-
__init__(caption, label=None, table_data=[], models=None, **kwargs)[source]¶ In addition to the parameters below, any keyword arguments supported by TableDataExtractor.TdeTable can be passed in as keyword arguments and they will be passed on to TableDataExtractor.TdeTable.
Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
caption (BaseElement) – The caption for the element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
table_data (list) – (Optional) Table data to be passed on to TableDataExtractor to be parsed. Refer to documentation for TableDataExtractor.TdeTable for more information on how this should be structured.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
document (Document) – (Optional) The document containing this element.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
-
tde_table= None¶ TableDataExtractor TrivialTable object. Can pass any kwargs into TDE directly.
-
cde_tables¶ CDE tables are lists of lists of Cells, that are used for the purpose of parsing in CDE. For other purposes, the underlying TDE table (tde_table) is probably more useful.
-
serialize()[source]¶ Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘caption’ will contain a serialized representation of caption, which is a BaseElement.
-
definitions¶ Return a list of all specifier definitions in the caption
- Returns:
list – The specifier definitions
-
elements¶ A list of child elements. Returns None by default.
.doc.text¶
Text-based document elements.
-
class
chemdataextractor.doc.text.BaseText(text, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, taggers=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
Abstract base class for a text Document Element.
-
__init__(text, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, taggers=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
taggers= []¶ A list of BaseTagger instances. This is a list of taggers that will be called by ChemDataExtractor to assign tags to each of the tokens in this element.
-
word_tokenizer¶ The WordTokenizer used by this element.
-
pos_tagger¶ The part of speech tagger used by this element. A subclass of
BaseTagger
-
ner_tagger¶ The named entity recognition tagger used by this element. A subclass of
BaseTagger
-
tags¶ A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
definitions¶ A list of all specifier definitions
-
chemical_definitions¶ A list of all chemical label definitions
-
serialize()[source]¶ Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘content’ will contain a serialized representation of text, which is a str.
-
class
chemdataextractor.doc.text.Text(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Bases:
collections.abc.Sequence, chemdataextractor.doc.text.BaseText
A passage of text, comprising one or more sentences.
-
word_tokenizer= <chemdataextractor.nlp.tokenize.BertWordTokenizer object>¶
-
lexicon= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
abbreviation_detector= <chemdataextractor.nlp.abbrev.ChemAbbreviationDetector object>¶
-
taggers= [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.new_cem.CemTagger object>, <chemdataextractor.nlp.dependency.DependencyTagger object>]¶
-
subsentence_extractor= None¶
-
__init__(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
sentence_tokenizer= <chemdataextractor.nlp.tokenize.ChemSentenceTokenizer object>¶
-
set_config()[source]¶ Load settings from configuration file
Note
Called when Document instance is created
-
elements¶ A list of child elements. Returns None by default.
-
pos_tags¶ A list of str part of speech tags for each sentence in this text passage.
-
unprocessed_ner_tagged_tokens¶ A list of (Token token, str named entity recognition tag) from the text. No corrections from abbreviation detection are performed.
-
unprocessed_ner_tags¶ A list of str unprocessed named entity tags for the tokens in this sentence. No corrections from abbreviation detection are performed.
-
ner_tags¶ A list of named entity tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger used for this object.
-
cems¶ A list of all Chemical Entity Mentions in this text as
chemdataextractor.doc.text.span
-
definitions¶ Return a list of tagged definitions for each sentence in this text passage
-
chemical_definitions¶ Return a list of tagged definitions for each sentence in this text passage
-
tagged_tokens¶ A list of lists of RichToken instances found in the text.
Deprecated since version 2.1: Deprecated due to the introduction of RichTokens, and is now just an alias for .tokens.
-
tags¶ A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
abbreviation_definitions¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
-
class
chemdataextractor.doc.text.Title(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
class
chemdataextractor.doc.text.Heading(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
class
chemdataextractor.doc.text.Paragraph(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
class
chemdataextractor.doc.text.Footnote(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
class
chemdataextractor.doc.text.Citation(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
taggers= [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.tag.NoneTagger object>, <chemdataextractor.nlp.tag.NoneTagger object>, <chemdataextractor.nlp.dependency.IndexTagger object>]¶
-
abbreviation_detector= None¶
-
subsentence_extractor= <chemdataextractor.nlp.subsentence.NoneSubsentenceExtractor object>¶
-
-
class
chemdataextractor.doc.text.Caption(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
definitions¶ Return a list of tagged definitions for each sentence in this text passage
-
class
chemdataextractor.doc.text.Sentence(text, start=0, end=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, specifier_definition=None, subsentence_extractor=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.BaseText
A single sentence within a text passage.
-
word_tokenizer= <chemdataextractor.nlp.tokenize.BertWordTokenizer object>¶
-
lexicon= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
abbreviation_detector= <chemdataextractor.nlp.abbrev.ChemAbbreviationDetector object>¶
-
taggers= [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.new_cem.CemTagger object>, <chemdataextractor.nlp.dependency.DependencyTagger object>]¶
-
__init__(text, start=0, end=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, specifier_definition=None, subsentence_extractor=None, **kwargs)[source]¶ Note
If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.
end (int) – (Optional) The end index of the sentence within the containing element. Default None.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
start= None¶ The start index of this sentence within the text passage.
-
end= None¶ The end index of this sentence within the text passage.
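As a rough illustration of how start and end relate a sentence to its containing passage, the sketch below uses plain Python rather than ChemDataExtractor itself. The sentence_offsets helper and the naive split on ". " are assumptions made for this example; the library uses its own sentence tokenizer. The sketch also assumes that end is exclusive, as in Python slicing, which this page does not state.

```python
def sentence_offsets(passage):
    """Yield (text, start, end) for each naively split sentence in passage."""
    position = 0
    for piece in passage.split(". "):
        start = passage.index(piece, position)
        end = start + len(piece)
        # Keep the trailing full stop if the split removed it.
        if end < len(passage) and passage[end] == ".":
            end += 1
        yield passage[start:end], start, end
        position = end


passage = "The sample was heated. It melted at 250 K."
spans = list(sentence_offsets(passage))
for text, start, end in spans:
    # Slicing the passage with the stored offsets recovers the sentence text.
    assert passage[start:end] == text
```

Storing offsets rather than copies of the text means a sentence can always be located again in its parent passage.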
-
specifier_definition= <chemdataextractor.parse.elements.And object>¶
-
subsentence_extractor= <chemdataextractor.nlp.subsentence.SubsentenceExtractor object>¶
A list of str part of speech tags for each token in this sentence.
-
unprocessed_ner_tagged_tokens¶ A list of (Token token, str named entity recognition tag) from the text. No corrections from abbreviation detection are performed.
A list of str unprocessed named entity tags for the tokens in this sentence. No corrections from abbreviation detection are performed.
-
abbreviation_definitions¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
A list of named entity tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger used for this object.
-
definitions¶ Return specifier definitions from this sentence
A definition consists of: a) A definition – the quantity being defined, e.g. “Curie Temperature”; b) A specifier – the symbol used to define the quantity, e.g. “Tc”; c) Start – the index of the starting point of the definition; d) End – the index of the end point of the definition.
- Returns:
list – The specifier definitions
-
chemical_definitions¶ Return a list of chemical entity mentions and their associated label
A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
tagged_tokens¶ A list of RichToken instances found in the text.
Deprecated since version 2.1: Deprecated due to the introduction of RichTokens, and is now just an alias for .tokens.
-
quantity_re¶
-
subsentences¶
-
full_subsentence¶
-
class
chemdataextractor.doc.text.Subsentence(parent_sentence, tokens, is_full_sentence=False)[source]¶ Bases:
chemdataextractor.doc.text.Sentence
A sub-sentence level logical division of text. Used to store clauses in CDE based on clause extraction as described in the paper Automated Construction of a Photocatalysis Dataset for Water-Splitting Applications (https://www.nature.com/articles/s41597-023-02511-6). An example of subsentences would be “A has quality α” and “A has quality β” from the sentence “A has quality α and quality β”. This enables rule-based and template-based parsing to adapt to a wider range of sentences.
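To make the “A has quality α and quality β” example concrete, here is a toy sketch in plain Python. It is not the clause extraction used by ChemDataExtractor or described in the cited paper; the split_coordination function and its heuristic (the second conjunct replaces the same number of trailing tokens in the first clause) are assumptions invented purely for illustration.

```python
def split_coordination(tokens, conjunction="and"):
    """Expand 'X c1 and c2' into ['X c1', 'X c2'] (toy heuristic only)."""
    if conjunction not in tokens:
        return [tokens]
    split_at = tokens.index(conjunction)
    first, second = tokens[:split_at], tokens[split_at + 1:]
    # Assume the second conjunct replaces the same number of trailing tokens
    # in the first clause; everything before that is the shared prefix.
    shared = first[:max(len(first) - len(second), 0)]
    return [first, shared + second]


sentence = "A has quality α and quality β".split()
for clause in split_coordination(sentence):
    print(" ".join(clause))
```

Each resulting clause has the simple shape “subject, verb, property”, which is the kind of input that rule-based and template-based parsers handle most easily.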
-
__init__(parent_sentence, tokens, is_full_sentence=False)[source]¶ Note
If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.
end (int) – (Optional) The end index of the sentence within the containing element. Default None.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
tokens= []¶
-
class
chemdataextractor.doc.text.Cell(*args, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Sentence
Data cell for tables. One row of the category table.
-
subsentence_extractor= <chemdataextractor.nlp.subsentence.NoneSubsentenceExtractor object>¶
-
__init__(*args, **kwargs)[source]¶ Note
If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.
end (int) – (Optional) The end index of the sentence within the containing element. Default None.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
abbreviation_definitions¶ Empty list. Abbreviation detection is disabled within table cells.
-
records¶ Empty list. Individual cells don’t provide records, this is handled by the parent Table.
-
elements¶ A list of child elements. Returns None by default.
-
class
chemdataextractor.doc.text.Span(text, start, end)[source]¶ Bases:
object
A text span within a sentence.
-
class
chemdataextractor.doc.text.Token(text, start, end, lexicon)[source]¶ Bases:
chemdataextractor.doc.text.Span
A single token within a sentence. Corresponds to a word, character, punctuation etc.
-
lexicon= None¶ The lexicon for this token.
-
lex¶ The corresponding chemdataextractor.nlp.lexicon.Lexeme entry in the Lexicon for this token.
-
-
class
chemdataextractor.doc.text.RichToken(text, start, end, lexicon, sentence)[source]¶ Bases:
chemdataextractor.doc.text.Token
RichToken provides a flexible way to store properties related to tokens. RichToken instances hold a reference to the parent sentence they come from, and if the user desires a certain tag, the parent sentence is called and its taggers used to tag the sentence on demand. This structure means that tokens are tagged if and only if the user requires them. These tags are then cached by the RichToken so that any single token is only ever tagged once.
Such tags can be accessed either via dot syntax (token.ner_tag) or via dictionary syntax (token['ner_tag']). To maintain compatibility with the return value for tagged_tokens() from previous versions of ChemDataExtractor, the keys of 0 and 1 are reserved for the text of the token and the combined NER and PoS tags, respectively. Furthermore, any properties included in the Token class are reserved as well.
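The on-demand tagging and caching described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: ToySentence, ToyRichToken, the tag method and the fake tagging rule are all hypothetical names and behaviour made up for the example, and the combined tag under key 1 is deliberately left out.

```python
class ToySentence:
    """Stand-in for a sentence that owns the taggers (hypothetical)."""

    def __init__(self, words):
        self.tag_calls = 0
        self.tokens = [ToyRichToken(w, i, self) for i, w in enumerate(words)]

    def tag(self, name):
        """Tag every token in the sentence once for the requested tag type."""
        self.tag_calls += 1
        for token in self.tokens:
            # Fake tagger: capitalised words count as chemical mentions.
            is_mention = name == "ner_tag" and token.text[0].isupper()
            token._tags[name] = "B-CM" if is_mention else "O"


class ToyRichToken:
    """Token that asks its parent sentence for tags only when needed."""

    def __init__(self, text, index, sentence):
        self.text = text
        self.index = index
        self.sentence = sentence
        self._tags = {}  # cache: tag name -> value

    def __getitem__(self, key):
        if key == 0:  # reserved key: the token text
            return self.text
        if key == 1:
            raise KeyError("combined tag (key 1) is not modelled in this sketch")
        if key not in self._tags:
            self.sentence.tag(key)  # tag on demand; the result is cached
        return self._tags[key]

    def __getattr__(self, name):
        # Only reached for attributes not set in __init__, i.e. tag names.
        if name.startswith("_"):
            raise AttributeError(name)
        return self[name]


sentence = ToySentence(["heat", "the", "NaCl"])
token = sentence.tokens[2]
assert sentence.tag_calls == 0      # nothing has been tagged yet
assert token.ner_tag == "B-CM"      # first access triggers tagging
assert token["ner_tag"] == "B-CM"   # dictionary syntax, served from the cache
assert sentence.tag_calls == 1      # the sentence was tagged only once
```

Because the parent sentence tags all of its tokens in one pass, a later request for the same tag on a sibling token is also served from the cache.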
-