.doc¶
Logic for reading and creating documents, that is, for splitting a document into its constituent elements. The document API changed slightly in version 2.0; refer to the migration guide and the examples for an overview of the changes.
Document processing.
.doc.document¶
Document model.
-
class
chemdataextractor.doc.document.
BaseDocument
[source]¶ Bases:
collections.abc.Sequence
Abstract base class for a Document.
-
elements
¶ Return a list of document elements.
-
records
¶ Chemical records that have been parsed from this Document.
-
-
class
chemdataextractor.doc.document.
Document
(*elements, **kwargs)[source]¶ Bases:
chemdataextractor.doc.document.BaseDocument
A document to extract data from. Contains a list of document elements.
-
__init__
(*elements, **kwargs)[source]¶ Initialize a Document manually by passing one or more Document elements (Paragraph, Heading, Table, etc.)
Strings that are passed to this constructor are automatically wrapped into Paragraph elements.
- Parameters:
elements (list[chemdataextractor.doc.element.BaseElement|string]) – Elements in this Document.
- Keyword Arguments:
config (Config) – (Optional) Config file for the Document.
models (list[BaseModel]) – (Optional) Models that the Document should extract data for.
adjacent_sections_for_merging (list[(list[str], list[str])]) – (Optional) Sections that will be treated as though they are adjacent for the purpose of contextual merging. All elements should be in lowercase.
skip_elements (list[chemdataextractor.doc.element.BaseElement subclass]) – (Optional) Element types to be skipped in parsing.
-
add_models
(models)[source]¶ Add models to all elements.
Usage:
d = Document.from_file(f)
d.add_models([myModelClass1, myModelClass2, ...])
- Arguments:
models – List of model classes.
-
models
¶
-
classmethod
from_file
(f, fname=None, readers=None)[source]¶ Create a Document from a file.
Usage:
with open('paper.html', 'rb') as f:
    doc = Document.from_file(f)
Note
Always open files in binary mode by passing ‘rb’ as the mode argument to open().
- Parameters:
f (file or str) – A file-like object or path to a file.
fname (str) – (Optional) The filename. Used to help determine file format.
readers (list[chemdataextractor.reader.base.BaseReader]) – (Optional) List of readers to use. If not set, Document will try all default readers, which are AcsHtmlReader, RscHtmlReader, NlmXmlReader, UsptoXmlReader, CsspHtmlReader, ElsevierXmlReader, XmlReader, HtmlReader, PdfReader, and PlainTextReader.
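As a minimal sketch of the pattern described above, the helper below opens a path in binary mode and hands the file object to from_file, passing the file name so that the format can be inferred. The helper name load_document and its document_cls parameter are illustrative assumptions and not part of ChemDataExtractor; document_cls exists only so that the helper can be exercised with a stand-in class.

```python
import os


def load_document(path, document_cls=None):
    """Open ``path`` in binary mode and build a document from it.

    ``document_cls`` is a hypothetical hook for this sketch; when it is
    omitted, the real Document class is imported lazily.
    """
    if document_cls is None:
        # Imported here so the sketch can be read without the library installed.
        from chemdataextractor.doc.document import Document as document_cls
    # Binary mode ('rb') is required, as the note above states.
    with open(path, 'rb') as f:
        # fname is optional and helps determine the file format.
        return document_cls.from_file(f, fname=os.path.basename(path))
```

With the library installed, load_document('paper.html') should behave like the usage shown above, with the file name supplied as an extra format hint.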
-
classmethod
from_string
(fstring, fname=None, readers=None)[source]¶ Create a Document from a byte string containing the contents of a file.
Usage:
contents = open('paper.html', 'rb').read()
doc = Document.from_string(contents)
Note
This method expects a byte string, not a unicode string (in contrast to most methods in ChemDataExtractor).
- Parameters:
fstring (bytes) – A byte string containing the contents of a file.
fname (str) – (Optional) The filename. Used to help determine file format.
readers (list[chemdataextractor.reader.base.BaseReader]) – (Optional) List of readers to use. If not set, Document will try all default readers, which are AcsHtmlReader, RscHtmlReader, NlmXmlReader, UsptoXmlReader, CsspHtmlReader, ElsevierXmlReader, XmlReader, HtmlReader, PdfReader, and PlainTextReader.
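Because from_string expects bytes, text that is already held as a unicode string has to be encoded first. The following is a small sketch of that step; the helper name document_from_text, the utf-8 default, and the document_cls hook are assumptions made for illustration only.

```python
def document_from_text(contents, fname=None, encoding='utf-8', document_cls=None):
    """Build a document from ``contents``, encoding unicode text to bytes first.

    ``document_cls`` is a hypothetical hook for this sketch; when it is
    omitted, the real Document class is imported lazily.
    """
    if document_cls is None:
        from chemdataextractor.doc.document import Document as document_cls
    if isinstance(contents, str):
        # from_string is documented as taking a byte string, not unicode.
        contents = contents.encode(encoding)
    return document_cls.from_string(contents, fname=fname)
```

Byte strings are passed through unchanged, so the helper can be called with either kind of input.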
-
elements
¶ A list of all the elements in this document. All elements subclass from BaseElement and represent things such as paragraphs or tables. They can be found in chemdataextractor.doc.figure, chemdataextractor.doc.table, and chemdataextractor.doc.text.
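Since elements is a plain list of BaseElement subclasses, selecting or counting elements by class needs nothing beyond isinstance. The sketch below shows one way to do this; both helper names are hypothetical and rely only on the elements attribute documented here.

```python
from collections import Counter


def elements_of_type(doc, element_type):
    """Return the elements of ``doc`` that are instances of ``element_type``."""
    return [el for el in doc.elements if isinstance(el, element_type)]


def count_element_types(doc):
    """Count the elements of ``doc`` by class name."""
    return Counter(type(el).__name__ for el in doc.elements)
```

For example, elements_of_type(doc, Paragraph), with Paragraph taken from chemdataextractor.doc.text, would collect the paragraphs of a document.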
-
get_element_with_id
(id)[source]¶ Get element with the specified ID. If one is not found, None is returned.
- Parameters:
id – Identifier to search for.
- Returns:
Element with specified ID
- Return type:
BaseElement or None
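Because a missing ID yields None rather than an exception, callers that need the element have to check the result explicitly. A minimal sketch of such a check follows; require_element is a hypothetical helper name, and raising KeyError is a design choice of this example, not library behaviour.

```python
def require_element(doc, element_id):
    """Return the element with ``element_id`` or raise KeyError if absent."""
    element = doc.get_element_with_id(element_id)
    if element is None:
        # get_element_with_id signals "not found" by returning None.
        raise KeyError(element_id)
    return element
```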
-
footnotes
¶ A list of all Footnote elements in this Document.
Note
Elements (e.g. Tables) can contain nested Footnotes which are not taken into account.
A list of all Caption elements in this Document.
A list of all CaptionedElement elements in this Document.
-
metadata
¶ Return metadata information
-
abbreviation_definitions
¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
A list of all Named Entity Recognition tags in this Document. If a word was found not to be a named entity, its tag is None; if it was found to be a named entity, its tag is either ‘B-CM’ for the beginning of a mention of a chemical or ‘I-CM’ for the continuation of a mention.
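The B-CM / I-CM / None scheme described above can be turned back into whole chemical mentions by walking tokens and tags in parallel. The sketch below illustrates that grouping under the assumption that tokens are plain strings and tags is a flat list of the same length; the function name group_mentions is hypothetical and this is not how the library itself builds mentions.

```python
def group_mentions(tokens, tags):
    """Group tokens into chemical mentions from 'B-CM' / 'I-CM' / None tags."""
    mentions = []
    current = []
    for token, tag in zip(tokens, tags):
        if tag == 'B-CM':
            # A new mention starts; close any mention that is still open.
            if current:
                mentions.append(' '.join(current))
            current = [token]
        elif tag == 'I-CM' and current:
            # Continuation of the mention that is currently open.
            current.append(token)
        else:
            # Not a named entity (or a stray 'I-CM'): close any open mention.
            if current:
                mentions.append(' '.join(current))
                current = []
    if current:
        mentions.append(' '.join(current))
    return mentions
```

An 'I-CM' tag with no preceding 'B-CM' is ignored here, which is a simplification chosen for this example.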
-
definitions
¶ Return a list of all recognised definitions within this Document
-
serialize
()[source]¶ Convert Document to Python dictionary. The dictionary will always contain the key ‘type’, which will be ‘document’, and the key ‘elements’, which contains a dictionary representation of each of the elements of the document.
-
to_json
(*args, **kwargs)[source]¶ Convert Document to JSON string. The content of the JSON will be equivalent to that of serialize(). The document itself will be under the key ‘elements’, and there will also be the key ‘type’, which will always be ‘document’. Any arguments for json.dumps() can be passed into this function.
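To illustrate the documented JSON layout, the sketch below parses a serialized document and counts its elements by type. It assumes that ‘elements’ is a list of per-element dictionaries and that each one carries a ‘type’ key, as described for the serialize() methods of text and captioned elements; the function name count_serialized_types is hypothetical.

```python
import json
from collections import Counter


def count_serialized_types(json_text):
    """Count element types in a JSON string shaped like Document.to_json() output."""
    data = json.loads(json_text)
    # The top-level 'type' key is documented as always being 'document'.
    if data.get('type') != 'document':
        raise ValueError('expected a serialized document')
    # Assumption: every serialized element is a dict with a 'type' key.
    return Counter(element['type'] for element in data['elements'])
```

It could be called as count_serialized_types(doc.to_json()). Keyword arguments such as indent=2 can be given to to_json because they are forwarded to json.dumps().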
-
sentences
¶
.doc.element¶
Document elements.
-
class
chemdataextractor.doc.element.
BaseElement
(document=None, references=None, id=None, models=None, **kwargs)[source]¶ Bases:
object
Abstract base class for a Document Element.
- Variables:
-
__init__
(document=None, references=None, id=None, models=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
document (Document) – (Optional) The document containing this element.
references (list[Citation]) – (Optional) Any references contained in the element.
id (Any) – (Optional) An identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a chemdataextractor.doc.document.Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
document
¶ The
chemdataextractor.doc.document.Document
that this element belongs to.
-
records
¶ All records found in this Element, as a chemdataextractor.model.base.ModelList of chemdataextractor.model.base.BaseModel.
-
models
¶
-
to_json
(*args, **kwargs)[source]¶ Convert element to JSON string. The content of the JSON will be equivalent to that of serialize().
-
elements
¶ A list of child elements. Returns None by default.
-
class
chemdataextractor.doc.element.
CaptionedElement
(caption, label=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
Document Element with a caption.
- Variables:
caption (BaseElement) – The caption for this element.
-
__init__
(caption, label=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
caption (BaseElement) – The caption for the element.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
abbreviation_definitions
¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
A list of all Named Entity Recognition tags in the caption for this element. If a word was found not to be a named entity, its tag is None; if it was found to be a named entity, its tag is either ‘B-CM’ for the beginning of a mention of a chemical or ‘I-CM’ for the continuation of a mention.
-
definitions
¶ Return a list of all specifier definitions in the caption
- Returns:
list – The specifier definitions
-
chemical_definitions
¶
-
models
¶
-
serialize
()[source]¶ Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘caption’ will contain a serialized representation of caption, which is a BaseElement.
-
elements
¶ A list of child elements. Returns None by default.
.doc.figure¶
Figure document elements. Code author: Callum Court (cc889@cam.ac.uk)
-
class
chemdataextractor.doc.figure.
Figure
(caption, label=None, links=None, models=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.CaptionedElement
-
__init__
(caption, label=None, links=None, models=None, **kwargs)[source]¶ Create a new Figure element, to interface with FDE
-
records
¶ Return FigureData records
-
.doc.meta¶
MetaData Document elements
-
class
chemdataextractor.doc.meta.
MetaData
(data)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
-
__init__
(data)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
document (Document) – (Optional) The document containing this element.
references (list[Citation]) – (Optional) Any references contained in the element.
id (Any) – (Optional) An identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a chemdataextractor.doc.document.Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
records
¶ All records found in this Element, as a chemdataextractor.model.base.ModelList of chemdataextractor.model.base.BaseModel.
-
title
¶ The article title
The article authors, as a list()
-
publisher
¶ The source publisher
-
journal
¶ The source journal
-
volume
¶ The source volume
-
issue
¶ The source issue
-
firstpage
¶ The source first page
-
lastpage
¶ The source last page
-
doi
¶ The source DOI
-
pdf_url
¶ The source url to the PDF version
-
html_url
¶ The source url to the HTML version
-
date
¶ The source publish date
-
data
¶ Returns all data as a dict()
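The metadata attributes listed above can be combined into a short reference string. The sketch below does so using only the documented attribute names; the function name format_reference, the output layout, and the assumption that unset fields are None or empty are all illustrative choices of this example.

```python
def format_reference(meta):
    """Join the available bibliographic fields of ``meta`` into one string."""
    parts = []
    for name in ('title', 'journal', 'volume', 'issue'):
        value = getattr(meta, name, None)
        if value:
            parts.append(str(value))
    first = getattr(meta, 'firstpage', None)
    last = getattr(meta, 'lastpage', None)
    if first and last:
        # Both page numbers known: render them as a range.
        parts.append('{}-{}'.format(first, last))
    elif first:
        parts.append(str(first))
    doi = getattr(meta, 'doi', None)
    if doi:
        parts.append('doi:{}'.format(doi))
    return ', '.join(parts)
```

Fields that are missing or empty are skipped, so a partially populated MetaData object still yields a usable string.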
-
abbreviation_definitions
¶
-
definitions
¶
-
chemical_definitions
¶
-
cems
¶
-
is_unidentified
¶
.doc.table¶
Table document elements
-
class
chemdataextractor.doc.table.
Table
(caption, label=None, table_data=[], models=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.CaptionedElement
Main Table object. Relies on TableDataExtractor.
-
__init__
(caption, label=None, table_data=[], models=None, **kwargs)[source]¶ In addition to the parameters below, any keyword arguments supported by TableDataExtractor.TdeTable can be passed in as keyword arguments and they will be passed on to TableDataExtractor.TdeTable.
Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
caption (BaseElement) – The caption for the element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
table_data (list) – (Optional) Table data to be passed on to TableDataExtractor to be parsed. Refer to documentation for TableDataExtractor.TdeTable for more information on how this should be structured.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
document (Document) – (Optional) The document containing this element.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
-
tde_table
= None¶ TableDataExtractor TrivialTable object. Can pass any kwargs into TDE directly.
-
cde_tables
¶ CDE tables are lists of lists of Cells, that are used for the purpose of parsing in CDE. For other purposes, the underlying TDE table (tde_table) is probably more useful.
-
serialize
()[source]¶ Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘caption’ will contain a serialized representation of caption, which is a BaseElement.
-
definitions
¶ Return a list of all specifier definitions in the caption
- Returns:
list – The specifier definitions
-
elements
¶ A list of child elements. Returns None by default.
.doc.text¶
Text-based document elements.
-
class
chemdataextractor.doc.text.
BaseText
(text, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, taggers=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
Abstract base class for a text Document Element.
-
__init__
(text, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, taggers=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
taggers
= []¶ A list of
BaseTagger
instances. This is a list of taggers that will be called by ChemDataExtractor to assign tags to each of the tokens in this element.
-
word_tokenizer
¶ The
WordTokenizer
used by this element.
-
pos_tagger
¶ The part of speech tagger used by this element. A subclass of
BaseTagger
-
ner_tagger
¶ The named entity recognition tagger used by this element. A subclass of BaseTagger.
A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
definitions
¶ A list of all specifier definitions
-
chemical_definitions
¶ A list of all chemical label definitions
-
serialize
()[source]¶ Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘content’ will contain a serialized representation of text, which is a str.
-
class
chemdataextractor.doc.text.
Text
(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Bases:
collections.abc.Sequence
,chemdataextractor.doc.text.BaseText
A passage of text, comprising one or more sentences.
-
word_tokenizer
= <chemdataextractor.nlp.tokenize.BertWordTokenizer object>¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
abbreviation_detector
= <chemdataextractor.nlp.abbrev.ChemAbbreviationDetector object>¶
-
taggers
= [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.new_cem.CemTagger object>, <chemdataextractor.nlp.dependency.DependencyTagger object>]¶
-
subsentence_extractor
= None¶
-
__init__
(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
sentence_tokenizer
= <chemdataextractor.nlp.tokenize.ChemSentenceTokenizer object>¶
-
set_config
()[source]¶ Load settings from configuration file
Note
Called when Document instance is created
-
elements
¶ A list of child elements. Returns None by default.
A list of
str
part of speech tags for each sentence in this text passage.
-
unprocessed_ner_tagged_tokens
¶ A list of (Token token, str named entity recognition tag) from the text. No corrections from abbreviation detection are performed.
A list of str unprocessed named entity tags for the tokens in this sentence. No corrections from abbreviation detection are performed.
A list of named entity tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger used for this object.
-
cems
¶ A list of all Chemical Entity Mentions in this text as
chemdataextractor.doc.text.span
-
definitions
¶ Return a list of tagged definitions for each sentence in this text passage
-
chemical_definitions
¶ Return a list of tagged definitions for each sentence in this text passage
-
tagged_tokens
¶ A list of lists of RichToken instances found in the text.
Deprecated since version 2.1: Deprecated due to the introduction of RichTokens, and is now just an alias for .tokens.
A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
abbreviation_definitions
¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
-
class
chemdataextractor.doc.text.
Title
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
class
chemdataextractor.doc.text.
Heading
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
class
chemdataextractor.doc.text.
Paragraph
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
class
chemdataextractor.doc.text.
Footnote
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
class
chemdataextractor.doc.text.
Citation
(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
taggers
= [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.tag.NoneTagger object>, <chemdataextractor.nlp.tag.NoneTagger object>, <chemdataextractor.nlp.dependency.IndexTagger object>]¶
-
abbreviation_detector
= None¶
-
subsentence_extractor
= <chemdataextractor.nlp.subsentence.NoneSubsentenceExtractor object>¶
-
-
class
chemdataextractor.doc.text.
Caption
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
definitions
¶ Return a list of tagged definitions for each sentence in this text passage
-
class
chemdataextractor.doc.text.
Sentence
(text, start=0, end=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, specifier_definition=None, subsentence_extractor=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.BaseText
A single sentence within a text passage.
-
word_tokenizer
= <chemdataextractor.nlp.tokenize.BertWordTokenizer object>¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
abbreviation_detector
= <chemdataextractor.nlp.abbrev.ChemAbbreviationDetector object>¶
-
taggers
= [<chemdataextractor.nlp.pos.ChemCrfPosTagger object>, <chemdataextractor.nlp.new_cem.CemTagger object>, <chemdataextractor.nlp.dependency.DependencyTagger object>]¶
-
__init__
(text, start=0, end=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, specifier_definition=None, subsentence_extractor=None, **kwargs)[source]¶ Note
If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.
end (int) – (Optional) The end index of the sentence within the containing element. Default None.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
start
= None¶ The start index of this sentence within the text passage.
-
end
= None¶ The end index of this sentence within the text passage.
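The relationship between a sentence's text and its start and end indices can be sketched in plain Python. This is an illustration only, not ChemDataExtractor code: the helper name `locate_sentences` is hypothetical, and it assumes that `end` is exclusive, as in Python slicing.

```python
# Illustrative sketch only; locate_sentences is a hypothetical helper,
# not part of ChemDataExtractor.
def locate_sentences(passage, sentence_texts):
    """Return (start, end) character offsets of each sentence in the passage."""
    spans = []
    cursor = 0
    for text in sentence_texts:
        start = passage.index(text, cursor)  # search from the previous sentence's end
        end = start + len(text)              # assumed exclusive, like a slice bound
        spans.append((start, end))
        cursor = end
    return spans


passage = "Benzene melts at 5.5 C. Toluene melts at -95 C."
sentence_texts = ["Benzene melts at 5.5 C.", "Toluene melts at -95 C."]
spans = locate_sentences(passage, sentence_texts)

# Under the slicing assumption, each span recovers the sentence text.
recovered = [passage[start:end] for start, end in spans]
```

Under this assumption, slicing the containing passage with a sentence's indices gives back that sentence's text.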
-
specifier_definition
= <chemdataextractor.parse.elements.And object>¶
-
subsentence_extractor
= <chemdataextractor.nlp.subsentence.SubsentenceExtractor object>¶
A list of str part of speech tags for each token in this sentence.
-
unprocessed_ner_tagged_tokens
¶ A list of (Token token, str named entity recognition tag) from the text. No corrections from abbreviation detection are performed.
A list of str unprocessed named entity tags for the tokens in this sentence. No corrections from abbreviation detection are performed.
-
abbreviation_definitions
¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
A list of named entity tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger used for this object.
-
definitions
¶ Return specifier definitions from this sentence
A definition consists of:
a) A definition – The quantity being defined, e.g. “Curie Temperature”
b) A specifier – The symbol used to define the quantity, e.g. “Tc”
c) Start – The index of the starting point of the definition
d) End – The index of the end point of the definition
- Returns:
list – The specifier definitions
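The four parts of a specifier definition can be pictured as a small record. The sketch below is an assumption-laden illustration, not the library's return type: the class `SpecifierDefinition` and the helper `build_definition` are hypothetical, and it assumes that start and end are token indices with an exclusive end covering the definition through its specifier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecifierDefinition:
    """Hypothetical container mirroring the four parts listed above."""
    definition: str  # the quantity being defined
    specifier: str   # the symbol used for the quantity
    start: int       # assumed: index of the first token of the definition
    end: int         # assumed: index one past the specifier token


def build_definition(tokens, definition_span, specifier_index):
    """Assemble a record from token positions (hypothetical helper)."""
    def_start, def_end = definition_span
    return SpecifierDefinition(
        definition=" ".join(tokens[def_start:def_end]),
        specifier=tokens[specifier_index],
        start=def_start,
        end=specifier_index + 1,
    )


tokens = ["The", "Curie", "Temperature", ",", "Tc", ",", "was", "measured", "."]
record = build_definition(tokens, definition_span=(1, 3), specifier_index=4)
```

Check the objects returned by `definitions` in your installed version for the actual field names and index conventions.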
-
chemical_definitions
¶ Return a list of chemical entity mentions and their associated label
A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
tagged_tokens
¶ A list of RichToken instances found in the text.
Deprecated since version 2.1: Deprecated due to the introduction of RichTokens, and is now just an alias for .tokens.
-
quantity_re
¶
-
subsentences
¶
-
full_subsentence
¶
-
class
chemdataextractor.doc.text.
Subsentence
(parent_sentence, tokens, is_full_sentence=False)[source]¶ Bases:
chemdataextractor.doc.text.Sentence
A sub-sentence level logical division of text. Used to store clauses in CDE based on clause extraction as described in the paper Automated Construction of a Photocatalysis Dataset for Water-Splitting Applications (https://www.nature.com/articles/s41597-023-02511-6). An example of subsentences would be “A has quality α” and “A has quality β” from the sentence “A has quality α and quality β”. This enables rule-based and template-based parsing to adapt to a wider range of sentences.
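The example in the description above can be reproduced with a deliberately naive sketch. This is not the clause extraction algorithm used by ChemDataExtractor or described in the cited paper: the function `expand_coordination` is hypothetical, and it assumes the caller already knows how many leading tokens are shared by all clauses.

```python
# Toy illustration only; not the SubsentenceExtractor algorithm.
def expand_coordination(tokens, shared_prefix_len, conjunction="and"):
    """Split the tail of a sentence on a conjunction and re-attach the shared prefix."""
    prefix = tokens[:shared_prefix_len]
    clauses = []
    current = []
    for token in tokens[shared_prefix_len:]:
        if token == conjunction:
            if current:
                clauses.append(prefix + current)
            current = []
        else:
            current.append(token)
    if current:
        clauses.append(prefix + current)
    return clauses


tokens = "A has quality α and quality β".split()
# "A has" (two tokens) is assumed to be shared by both clauses.
subsentences = [" ".join(clause) for clause in expand_coordination(tokens, 2)]
```

Each resulting clause is a complete statement, so a simple rule or template can match it without having to handle the coordination itself.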
-
__init__
(parent_sentence, tokens, is_full_sentence=False)[source]¶ Note
If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.
end (int) – (Optional) The end index of the sentence within the containing element. Default None.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
tokens
= []¶
-
class
chemdataextractor.doc.text.
Cell
(*args, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Sentence
Data cell for tables. One row of the category table
-
subsentence_extractor
= <chemdataextractor.nlp.subsentence.NoneSubsentenceExtractor object>¶
-
__init__
(*args, **kwargs)[source]¶ Note
If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
- Parameters:
text (str) – The text contained in this element.
start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.
end (int) – (Optional) The end index of the sentence within the containing element. Default None.
word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
document (Document) – (Optional) The document containing this element.
label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
id (Any) – (Optional) Some identifier for this element. Must be equatable.
models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
abbreviation_definitions
¶ Empty list. Abbreviation detection is disabled within table cells.
-
records
¶ Empty list. Individual cells don’t provide records, this is handled by the parent Table.
-
elements
¶ A list of child elements. Returns None by default.
-
class
chemdataextractor.doc.text.
Span
(text, start, end)[source]¶ Bases:
object
A text span within a sentence.
-
class
chemdataextractor.doc.text.
Token
(text, start, end, lexicon)[source]¶ Bases:
chemdataextractor.doc.text.Span
A single token within a sentence. Corresponds to a word, character, punctuation etc.
-
lexicon
= None¶ The lexicon for this token.
-
lex
¶ The corresponding
chemdataextractor.nlp.lexicon.Lexeme
entry in the Lexicon for this token.
-
-
class
chemdataextractor.doc.text.
RichToken
(text, start, end, lexicon, sentence)[source]¶ Bases:
chemdataextractor.doc.text.Token
RichToken provides a flexible way to store properties related to tokens. RichToken instances hold a reference to the parent sentence they come from, and if the user desires a certain tag, the parent sentence is called and its taggers used to tag the sentence on demand. This structure means that tokens are tagged if and only if the user requires them. These tags are then cached by the RichToken so that any single token is only ever tagged once.
Such tags can be accessed either via dot syntax (token.ner_tag) or via dictionary syntax (token['ner_tag']). To maintain compatibility with the return value for tagged_tokens() from previous versions of ChemDataExtractor, the keys of 0 and 1 are reserved for the text of the token and the combined NER and PoS tags, respectively. Furthermore, any properties included in the Token class are reserved as well.
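The on-demand tagging and caching described above can be sketched with two small stand-in classes. This is a simplified illustration, not the RichToken implementation: `LazySentence`, `LazyToken`, and `toy_ner` are hypothetical names, the tagger is a toy function, and only the reserved key 0 is modelled (key 1 is omitted).

```python
class LazySentence:
    """Hypothetical parent sentence that tags all of its tokens on request."""

    def __init__(self, words, taggers):
        self.taggers = taggers                      # tag name -> function(words) -> tags
        self.calls = {name: 0 for name in taggers}  # counts how often each tagger ran
        self.tokens = [LazyToken(word, self) for word in words]

    def tag(self, name):
        self.calls[name] += 1
        tags = self.taggers[name]([token.text for token in self.tokens])
        for token, tag in zip(self.tokens, tags):
            token._cache[name] = tag                # cache the tag on every token


class LazyToken:
    """Hypothetical token that asks its parent sentence for tags when needed."""

    def __init__(self, text, sentence):
        self.text = text
        self.sentence = sentence
        self._cache = {}

    def __getitem__(self, key):
        if key == 0:                                # reserved key: the token text
            return self.text
        if key not in self._cache:
            if key not in self.sentence.taggers:
                raise KeyError(key)
            self.sentence.tag(key)                  # tag the whole sentence once
        return self._cache[key]

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


def toy_ner(words):
    """Toy tagger: marks capitalised words; not a real NER model."""
    return ["B-CM" if word[0].isupper() else "O" for word in words]


sentence = LazySentence(["Benzene", "melts"], {"ner_tag": toy_ner})
calls_before = sentence.calls["ner_tag"]      # nothing has been tagged yet
tag_dot = sentence.tokens[0].ner_tag          # dot syntax triggers tagging
tag_key = sentence.tokens[1]["ner_tag"]       # dictionary syntax hits the cache
calls_after = sentence.calls["ner_tag"]
text_by_key = sentence.tokens[0][0]
```

In this sketch the tagger runs once for the whole sentence, on the first request for a tag, and later lookups on any token are served from the cache.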
-