.doc¶
Logic for reading and creating documents, that is, splitting a document into its constituent elements. The document API changed slightly in version 1.5.0. Please refer to the migration guide and the examples for an overview of the changes.
Document processing.
.doc.document¶
Document model.
-
class
chemdataextractor.doc.document.
BaseDocument
[source]¶ Bases:
collections.abc.Sequence
Abstract base class for a Document.
-
elements
¶ Return a list of document elements.
-
records
¶ Chemical records that have been parsed from this Document.
-
-
class
chemdataextractor.doc.document.
Document
(*elements, **kwargs)[source]¶ Bases:
chemdataextractor.doc.document.BaseDocument
A document to extract data from. Contains a list of document elements.
-
__init__
(*elements, **kwargs)[source]¶ Initialize a Document manually by passing one or more Document elements (Paragraph, Heading, Table, etc.)
Strings that are passed to this constructor are automatically wrapped into Paragraph elements.
Parameters: elements (list[chemdataextractor.doc.element.BaseElement|string]) – Elements in this Document.
Keyword Arguments:
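The automatic wrapping of bare strings described for this constructor can be pictured with a small stand-alone sketch. The `Paragraph` and `Heading` classes and the `wrap_elements` helper below are simplified stand-ins written for illustration; they are assumptions and not ChemDataExtractor's own code.

```python
# Stand-in classes for illustration only; NOT the ChemDataExtractor classes.
class Paragraph:
    def __init__(self, text):
        self.text = text


class Heading(Paragraph):
    pass


def wrap_elements(*elements):
    """Return a list in which every bare string has been wrapped in a Paragraph."""
    wrapped = []
    for element in elements:
        if isinstance(element, str):
            # Strings become Paragraph elements; existing elements pass through.
            element = Paragraph(element)
        wrapped.append(element)
    return wrapped


elements = wrap_elements(
    Heading("Synthesis"),
    "The product was recrystallised from ethanol.",
)
```

With the real library the equivalent call would presumably be `Document(Heading(...), "some text")`, with the string ending up as a Paragraph in `elements`.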
-
add_models
(models)[source]¶ Add models to all elements.
Usage:
d = Document.from_file(f)
d.add_models([myModelClass1, myModelClass2, ...])
Parameters: models – List of model classes.
-
models
¶
-
classmethod
from_file
(f, fname=None, readers=None)[source]¶ Create a Document from a file.
Usage:
with open('paper.html', 'rb') as f:
    doc = Document.from_file(f)
Note
Always open files in binary mode by passing the 'rb' mode to open().
Parameters: - f (file or str) – A file-like object or path to a file.
- fname (str) – (Optional) The filename. Used to help determine file format.
- readers (list[chemdataextractor.reader.base.BaseReader]) – (Optional) List of readers to use. If not set, Document will try all default readers, which are AcsHtmlReader, RscHtmlReader, NlmXmlReader, UsptoXmlReader, CsspHtmlReader, ElsevierXmlReader, XmlReader, HtmlReader, PdfReader, and PlainTextReader.
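The note about binary mode matters because a file opened with 'rb' yields bytes, while text mode yields decoded str. The following stdlib-only sketch shows the difference on a temporary file; the file contents are made up for the example, and no ChemDataExtractor call is made.

```python
import os
import tempfile

# Write a small HTML file to a temporary location so the example is self-contained.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><body><p>Melting point: 120 °C</p></body></html>")
    path = tmp.name

try:
    # Binary mode: read() returns bytes, the form the note above asks for.
    with open(path, "rb") as f:
        raw = f.read()
    # Text mode: read() returns an already decoded str.
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
finally:
    os.remove(path)

is_bytes = isinstance(raw, bytes)
is_str = isinstance(text, str)
```

In real use, the file object from `open(path, "rb")` is what would be handed to `Document.from_file`.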
-
classmethod
from_string
(fstring, fname=None, readers=None)[source]¶ Create a Document from a byte string containing the contents of a file.
Usage:
with open('paper.html', 'rb') as f:
    contents = f.read()
doc = Document.from_string(contents)
Note
This method expects a byte string, not a unicode string (in contrast to most methods in ChemDataExtractor).
Parameters: - fstring (bytes) – A byte string containing the contents of a file.
- fname (str) – (Optional) The filename. Used to help determine file format.
- readers (list[chemdataextractor.reader.base.BaseReader]) – (Optional) List of readers to use. If not set, Document will try all default readers, which are AcsHtmlReader, RscHtmlReader, NlmXmlReader, UsptoXmlReader, CsspHtmlReader, ElsevierXmlReader, XmlReader, HtmlReader, PdfReader, and PlainTextReader.
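One way to picture "try all default readers" is a loop that asks each reader in turn and falls through to the next one on failure. The sketch below uses two hypothetical stand-in readers and a hypothetical `read_with_fallback` helper. The detect-then-parse protocol is an assumption made for illustration, not a description of the library's internals.

```python
class ReaderError(Exception):
    """Raised by a stand-in reader that cannot parse the input."""


class HtmlLikeReader:
    """Hypothetical reader that only accepts HTML input."""

    def detect(self, data, fname=None):
        return (fname or "").endswith(".html") or data.lstrip().startswith(b"<")

    def parse(self, data):
        if b"<html" not in data.lower():
            raise ReaderError("not HTML")
        return {"reader": "html", "size": len(data)}


class PlainTextLikeReader:
    """Hypothetical catch-all reader, tried last."""

    def detect(self, data, fname=None):
        return True

    def parse(self, data):
        return {"reader": "text", "size": len(data)}


def read_with_fallback(data, fname=None, readers=None):
    """Try each reader in order and return the first successful result."""
    if not isinstance(data, bytes):
        raise TypeError("expected a byte string")
    readers = readers or [HtmlLikeReader(), PlainTextLikeReader()]
    for reader in readers:
        if not reader.detect(data, fname=fname):
            continue  # this reader does not claim the input
        try:
            return reader.parse(data)
        except ReaderError:
            continue  # claimed but failed: move on to the next reader
    raise ReaderError("no reader could parse the input")


html_result = read_with_fallback(b"<html><p>NaCl</p></html>", fname="paper.html")
text_result = read_with_fallback(b"NaCl dissolves in water.", fname="notes.txt")
xmlish_result = read_with_fallback(b"<record>NaCl</record>")
```

Ordering the list from most specific to most general, as the default reader list above appears to do, lets a generic reader act as the final fallback.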
-
elements
¶ A list of all the elements in this document. All elements subclass BaseElement and represent things such as paragraphs or tables. They can be found in chemdataextractor.doc.figure, chemdataextractor.doc.table, and chemdataextractor.doc.text.
-
get_element_with_id
(id)[source]¶ Get element with the specified ID. If one is not found, None is returned.
Parameters: id – Identifier to search for. Returns: Element with specified ID Return type: BaseElement or None
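The "return None if not found" contract can be sketched with a plain linear search. The `Element` class and the free function below are stand-ins for illustration, and how the library actually performs the lookup is not stated here.

```python
class Element:
    """Stand-in element with an optional identifier (not a ChemDataExtractor class)."""

    def __init__(self, text, id=None):
        self.text = text
        self.id = id


def get_element_with_id(elements, id):
    """Return the first element whose id equals `id`, or None if there is none."""
    return next((el for el in elements if el.id == id), None)


elements = [
    Element("Introduction", id="sec1"),
    Element("Methods", id="sec2"),
    Element("untitled"),
]
found = get_element_with_id(elements, "sec2")
missing = get_element_with_id(elements, "sec9")
```

Because a missing ID yields None rather than an exception, callers should check the result before using it.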
-
footnotes
¶ A list of all Footnote elements in this Document.
Note
Elements (e.g. Tables) can contain nested Footnotes which are not taken into account.
A list of all Caption elements in this Document.
A list of all CaptionedElement elements in this Document.
-
metadata
¶ Return metadata information
-
abbreviation_definitions
¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
A list of all Named Entity Recognition tags in this Document. If a word was found not to be a named entity, the named entity tag is None, and if it was found to be a named entity, it can have either a tag of ‘B-CM’ for a beginning of a mention of a chemical or ‘I-CM’ for the continuation of a mention.
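The tag scheme described above (None, 'B-CM', 'I-CM') is a begin/inside encoding, so consecutive tags can be folded back into whole chemical mentions. The `group_mentions` helper and the example tokens and tags below are illustrative assumptions. In the library, the `cems` property documented further down is the usual way to obtain mentions.

```python
def group_mentions(tokens, tags):
    """Group tokens into mentions using 'B-CM' / 'I-CM' / None tags."""
    mentions, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-CM":
            # A new mention starts; close any mention that was still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif tag == "I-CM" and current:
            # Continuation of the mention that is currently open.
            current.append(token)
        else:
            # Not a named entity (or a stray 'I-CM'): close any open mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


tokens = ["Sodium", "chloride", "and", "ethanol", "were", "mixed", "."]
tags = ["B-CM", "I-CM", None, "B-CM", None, None, None]
mentions = group_mentions(tokens, tags)
```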
-
definitions
¶ Return a list of all recognised definitions within this Document
-
serialize
()[source]¶ Convert Document to Python dictionary. The dictionary will always contain the key ‘type’, which will be ‘document’, and the key ‘elements’, which contains a dictionary representation of each of the elements of the document.
-
to_json
(*args, **kwargs)[source]¶ Convert Document to JSON string. The content of the JSON will be equivalent to that of serialize(). The document itself will be under the key ‘elements’, and there will also be the key ‘type’, which will always be ‘document’. Any arguments for json.dumps() can be passed into this function.
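The relationship between `serialize()` and `to_json()` can be sketched with plain dictionaries. Only the top-level keys 'type' and 'elements' come from the description above. The per-element keys ('type', 'content') and both helper functions are assumptions made for the example.

```python
import json


def serialize_document(elements):
    """Build a dictionary with the two top-level keys described above."""
    return {"type": "document", "elements": [dict(el) for el in elements]}


def to_json(elements, *args, **kwargs):
    """Serialize, then forward any extra arguments to json.dumps()."""
    return json.dumps(serialize_document(elements), *args, **kwargs)


# Hypothetical element dictionaries; the real per-element layout may differ.
elements = [
    {"type": "Heading", "content": "Results"},
    {"type": "Paragraph", "content": "The yield was 85%."},
]

compact = to_json(elements)
pretty = to_json(elements, indent=2, sort_keys=True)
```

Forwarding `*args` and `**kwargs` unchanged is what makes `json.dumps()` options such as `indent` or `sort_keys` usable here.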
-
.doc.element¶
Document elements.
-
class
chemdataextractor.doc.element.
BaseElement
(document=None, references=None, id=None, models=None, **kwargs)[source]¶ Bases:
object
Abstract base class for a Document Element.
Variables: -
__init__
(document=None, references=None, id=None, models=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - document (Document) – (Optional) The document containing this element.
- references (list[Citation]) – (Optional) Any references contained in the element.
- id (Any) – (Optional) An identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a chemdataextractor.doc.document.Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
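The rule that an element takes its models from its containing element unless they are set manually can be illustrated with a property that falls back to the parent. The `Node` class and the two model classes below are hypothetical. The library may copy models at assignment time rather than look them up dynamically as this sketch does.

```python
class Node:
    """Stand-in for an element whose models default to those of its container."""

    def __init__(self, parent=None, models=None):
        self.parent = parent
        self._models = models  # None means "not set manually"

    @property
    def models(self):
        if self._models is not None:
            return self._models  # a manual setting wins
        if self.parent is not None:
            return self.parent.models  # otherwise defer to the container
        return []

    @models.setter
    def models(self, value):
        self._models = value


class MeltingPoint:
    """Hypothetical model class."""


class BoilingPoint:
    """Hypothetical model class."""


document = Node(models=[MeltingPoint])
paragraph = Node(parent=document)
sentence = Node(parent=paragraph)
override = Node(parent=paragraph, models=[BoilingPoint])
```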
-
document
¶ The
chemdataextractor.doc.document.Document
that this element belongs to.
-
records
¶ All records found in this Document, as a list of chemdataextractor.model.base.BaseModel.
-
models
¶
-
-
class
chemdataextractor.doc.element.
CaptionedElement
(caption, label=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
Document Element with a caption.
Variables: caption (BaseElement) – The caption for this element. -
__init__
(caption, label=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - caption (BaseElement) – The caption for the element.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
abbreviation_definitions
¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag).
A list of all Named Entity Recognition tags in the caption for this element. If a word was found not to be a named entity, the named entity tag is None, and if it was found to be a named entity, it can have either a tag of ‘B-CM’ for a beginning of a mention of a chemical or ‘I-CM’ for the continuation of a mention.
-
definitions
¶ Return a list of all specifier definitions in the caption
Returns: list– The specifier definitions
-
chemical_definitions
¶
-
models
¶
-
serialize
()[source]¶ Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘caption’ will contain a serialized representation of caption, which is a BaseElement.
-
.doc.figure¶
Figure document elements. Code author: Callum Court (cc889@cam.ac.uk)
-
class
chemdataextractor.doc.figure.
Figure
(caption, label=None, links=None, models=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.CaptionedElement
-
__init__
(caption, label=None, links=None, models=None, **kwargs)[source]¶ Create a new Figure element, to interface with FDE
-
records
¶ Return FigureData records
-
.doc.meta¶
MetaData Document elements
-
class
chemdataextractor.doc.meta.
MetaData
(data)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
-
__init__
(data)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - document (Document) – (Optional) The document containing this element.
- references (list[Citation]) – (Optional) Any references contained in the element.
- id (Any) – (Optional) An identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a chemdataextractor.doc.document.Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
records
¶ All records found in this Document, as a list of chemdataextractor.model.base.BaseModel.
-
title
¶ The article title
The article authors, as a list.
-
publisher
¶ The source publisher
-
journal
¶ The source journal
-
volume
¶ The source volume
-
issue
¶ The source issue
-
firstpage
¶ The source first page
-
lastpage
¶ The source last page
-
doi
¶ The source DOI
-
pdf_url
¶ The source url to the PDF version
-
html_url
¶ The source url to the HTML version
-
date
¶ The source publish date
-
data
¶ Returns all data as a dict()
-
abbreviation_definitions
¶
-
definitions
¶
-
chemical_definitions
¶
-
cems
¶
-
is_unidentified
¶
-
.doc.table¶
Table document elements
-
class
chemdataextractor.doc.table.
Table
(caption, label=None, table_data=[], models=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.CaptionedElement
Main Table object. Relies on TableDataExtractor.
-
__init__
(caption, label=None, table_data=[], models=None, **kwargs)[source]¶ In addition to the parameters below, any keyword arguments supported by TableDataExtractor.TdeTable can be passed in as keyword arguments and they will be passed on to TableDataExtractor.TdeTable.
Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - caption (BaseElement) – The caption for the element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- table_data (list) – (Optional) Table data to be passed on to TableDataExtractor to be parsed. Refer to documentation for TableDataExtractor.TdeTable for more information on how this should be structured.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
- document (Document) – (Optional) The document containing this element.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
-
tde_table
= None¶ TableDataExtractor TrivialTable object. Can pass any kwargs into TDE directly.
-
serialize
()[source]¶ Convert self to a dictionary. The key ‘type’ will contain the name of the class being serialized, and the key ‘caption’ will contain a serialized representation of caption, which is a BaseElement.
-
definitions
¶ Return a list of all specifier definitions in the caption
Returns: list– The specifier definitions
-
.doc.text¶
Text-based document elements.
-
class
chemdataextractor.doc.text.
BaseText
(text, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.element.BaseElement
Abstract base class for a text Document Element.
-
__init__
(text, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - text (str) – The text contained in this element.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
word_tokenizer
¶ The WordTokenizer used by this element.
-
pos_tagger
¶ The part of speech tagger used by this element. A subclass of BaseTagger.
-
ner_tagger
¶ The named entity recognition tagger used by this element. A subclass of BaseTagger.
A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
definitions
¶ A list of all specifier definitions
-
chemical_definitions
¶ A list of all chemical label definitions
-
-
class
chemdataextractor.doc.text.
Text
(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Bases:
collections.abc.Sequence
,chemdataextractor.doc.text.BaseText
A passage of text, comprising one or more sentences.
-
word_tokenizer
= <chemdataextractor.nlp.tokenize.ChemWordTokenizer object>¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
abbreviation_detector
= <chemdataextractor.nlp.abbrev.ChemAbbreviationDetector object>¶
-
pos_tagger
= <chemdataextractor.nlp.pos.ChemCrfPosTagger object>¶
-
ner_tagger
= <chemdataextractor.nlp.cem.CemTagger object>¶
-
__init__
(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - text (str) – The text contained in this element.
- sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
sentence_tokenizer
= <chemdataextractor.nlp.tokenize.ChemSentenceTokenizer object>¶
-
set_config
()[source]¶ Load settings from configuration file
Note
Called when Document instance is created
A list of str part of speech tags for each sentence in this text passage.
-
unprocessed_ner_tagged_tokens
¶ A list of (Token token, str named entity recognition tag) from the text. No corrections from abbreviation detection are performed.
A list of str unprocessed named entity tags for the tokens in this sentence. No corrections from abbreviation detection are performed.
A list of named entity tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger used for this object.
-
cems
¶ A list of all Chemical Entity Mentions in this text as chemdataextractor.doc.text.span
-
definitions
¶ Return a list of tagged definitions for each sentence in this text passage
-
chemical_definitions
¶ Return a list of tagged definitions for each sentence in this text passage
A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
-
class
chemdataextractor.doc.text.
Title
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - text (str) – The text contained in this element.
- sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
-
class
chemdataextractor.doc.text.
Heading
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - text (str) – The text contained in this element.
- sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
-
class
chemdataextractor.doc.text.
Paragraph
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - text (str) – The text contained in this element.
- sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
-
class
chemdataextractor.doc.text.
Footnote
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - text (str) – The text contained in this element.
- sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
-
class
chemdataextractor.doc.text.
Citation
(text, sentence_tokenizer=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, parsers=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
ner_tagger
= <chemdataextractor.nlp.tag.NoneTagger object>¶ No tagging is done for citations
-
abbreviation_detector
= None¶
-
-
class
chemdataextractor.doc.text.
Caption
(text, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Text
-
__init__
(text, **kwargs)[source]¶ Note
If intended as part of a Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - text (str) – The text contained in this element.
- sentence_tokenizer (SentenceTokenizer) – (Optional) Sentence tokenizer for this element. Default ChemSentenceTokenizer.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
definitions
¶ Return a list of tagged definitions for each sentence in this text passage
-
-
class
chemdataextractor.doc.text.
Sentence
(text, start=0, end=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.BaseText
A single sentence within a text passage.
-
word_tokenizer
= <chemdataextractor.nlp.tokenize.ChemWordTokenizer object>¶
-
lexicon
= <chemdataextractor.nlp.lexicon.ChemLexicon object>¶
-
abbreviation_detector
= <chemdataextractor.nlp.abbrev.ChemAbbreviationDetector object>¶
-
pos_tagger
= <chemdataextractor.nlp.pos.ChemCrfPosTagger object>¶
-
ner_tagger
= <chemdataextractor.nlp.cem.CemTagger object>¶
-
__init__
(text, start=0, end=None, word_tokenizer=None, lexicon=None, abbreviation_detector=None, pos_tagger=None, ner_tagger=None, **kwargs)[source]¶ Note
If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters: - text (str) – The text contained in this element.
- start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.
- end (int) – (Optional) The end index of the sentence within the containing element. Default None.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
start
= None¶ The start index of this sentence within the text passage.
-
end
= None¶ The end index of this sentence within the text passage.
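The relationship between start, end and the containing text can be sketched in plain Python. This is an illustration only, not the library's code: it assumes the indices are character offsets with an exclusive end, and locate_sentences is a hypothetical helper that uses plain strings in place of Sentence objects.

```python
def locate_sentences(passage, sentence_texts):
    """Return (start, end) character offsets of each sentence within passage.

    Hypothetical helper: assumes sentences appear in order and that `end`
    is exclusive, so passage[start:end] reproduces the sentence text.
    """
    offsets = []
    cursor = 0
    for text in sentence_texts:
        start = passage.index(text, cursor)  # search from the previous end
        end = start + len(text)
        offsets.append((start, end))
        cursor = end
    return offsets


passage = "The melting point was 45 C. The yield was 90%."
sentence_texts = ["The melting point was 45 C.", "The yield was 90%."]
offsets = locate_sentences(passage, sentence_texts)

# Slicing the passage with each (start, end) pair recovers the sentence.
recovered = [passage[start:end] for start, end in offsets]
```

Under these assumptions, an end of None would correspond to slicing through to the end of the passage.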
A list of str part of speech tags for each token in this sentence.
-
unprocessed_ner_tagged_tokens
¶ A list of (Token token, str named entity recognition tag) from the text. No corrections from abbreviation detection are performed.
A list of str unprocessed named entity tags for the tokens in this sentence. No corrections from abbreviation detection are performed.
-
abbreviation_definitions
¶ A list of all abbreviation definitions in this Document. Each abbreviation is in the form (str abbreviation, str long form of abbreviation, str ner_tag)
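One way such triples might be consumed is to build a lookup from abbreviation to long form. The sketch below uses made-up data shaped like the (abbreviation, long form, ner_tag) form described above; the tag values and the expand_abbreviations helper are assumptions for illustration, not output from the library.

```python
# Hypothetical data in the (abbreviation, long form, ner_tag) shape.
# The tag values here are placeholders, not confirmed tagger output.
abbreviation_definitions = [
    ("THF", "tetrahydrofuran", "CM"),
    ("NMR", "nuclear magnetic resonance", None),
]


def expand_abbreviations(words, definitions):
    """Replace each known abbreviation in `words` with its long form."""
    long_forms = {abbr: long_form for abbr, long_form, _tag in definitions}
    return [long_forms.get(word, word) for word in words]


expanded = expand_abbreviations(["dissolved", "in", "THF"], abbreviation_definitions)
```

Words that have no matching definition pass through unchanged.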
A list of named entity tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger used for this object.
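Per-token tags are often collapsed into whole entity mentions. The following is a minimal sketch, assuming BIO-style tags ("B-" opens an entity, "I-" continues it, None marks a token outside any entity). The actual tag set depends on the ner_tagger in use, and group_entities is a hypothetical helper operating on plain (str, tag) pairs rather than Token objects.

```python
def group_entities(tagged_tokens):
    """Collapse (token_text, tag) pairs into a list of entity strings.

    Assumes BIO-style tags; untagged tokens carry None.
    """
    entities = []
    current = []
    for token, tag in tagged_tokens:
        if tag is not None and tag.startswith("B-"):
            if current:  # close any entity that was still open
                entities.append(" ".join(current))
            current = [token]
        elif tag is not None and tag.startswith("I-") and current:
            current.append(token)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:  # flush an entity that runs to the end of the sentence
        entities.append(" ".join(current))
    return entities


tagged = [
    ("Dissolved", None), ("in", None),
    ("acetic", "B-CM"), ("acid", "I-CM"),
    ("and", None), ("THF", "B-CM"),
]
entities = group_entities(tagged)
```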
-
definitions
¶ Return specifier definitions from this sentence.
A definition consists of:
a) A definition – The quantity being defined, e.g. “Curie Temperature”
b) A specifier – The symbol used to define the quantity, e.g. “Tc”
c) Start – The index of the starting point of the definition
d) End – The index of the end point of the definition
Returns: list – The specifier definitions
-
chemical_definitions
¶ Return a list of chemical entity mentions and their associated label
A list of tags corresponding to each of the tokens in the object. For information on what each of the tags can be, check the documentation on the specific ner_tagger and pos_tagger used for this class.
-
quantity_re
¶
-
-
class
chemdataextractor.doc.text.
Cell
(*args, **kwargs)[source]¶ Bases:
chemdataextractor.doc.text.Sentence
Data cell for tables. One row of the category table
-
__init__
(*args, **kwargs)[source]¶ Note
If intended as part of a chemdataextractor.doc.document.Document, an element should either be initialized with a reference to its containing document, or its document attribute should be set as soon as possible. If the element is being passed in to a chemdataextractor.doc.document.Document to initialise it, the document attribute is automatically set during the initialisation of the document, so the user does not need to worry about this.
Parameters:
- text (str) – The text contained in this element.
- start (int) – (Optional) The starting index of the sentence within the containing element. Default 0.
- end (int) – (Optional) The end index of the sentence within the containing element. Default None.
- word_tokenizer (WordTokenizer) – (Optional) Word tokenizer for this element. Default ChemWordTokenizer.
- lexicon (Lexicon) – (Optional) Lexicon for this element. The lexicon stores all the occurrences of unique words and can provide Brown clusters for the words. Default ChemLexicon.
- abbreviation_detector (AbbreviationDetector) – (Optional) The abbreviation detector for this element. Default ChemAbbreviationDetector.
- pos_tagger (BaseTagger) – (Optional) The part of speech tagger for this element. Default ChemCrfPosTagger.
- ner_tagger (BaseTagger) – (Optional) The named entity recognition tagger for this element. Default CemTagger.
- document (Document) – (Optional) The document containing this element.
- label (str) – (Optional) The label for the captioned element, e.g. Table 1 would have a label of 1.
- id (Any) – (Optional) Some identifier for this element. Must be equatable.
- models (list[chemdataextractor.models.BaseModel]) – (Optional) A list of models for this element to parse. If the element is part of another element (e.g. a Sentence inside a Paragraph), or is part of a Document, this is set automatically to be the same as that of the containing element, unless manually set otherwise.
-
abbreviation_definitions
¶ Empty list. Abbreviation detection is disabled within table cells.
-
records
¶ Empty list. Individual cells don’t provide records; this is handled by the parent Table.
-
-
class
chemdataextractor.doc.text.
Span
(text, start, end)[source]¶ Bases:
object
A text span within a sentence.
-
class
chemdataextractor.doc.text.
Token
(text, start, end, lexicon)[source]¶ Bases:
chemdataextractor.doc.text.Span
A single token within a sentence. Corresponds to a word, character, punctuation mark, etc.
-
lexicon
= None¶ The lexicon for this token.
-
lex
¶ The corresponding
chemdataextractor.nlp.lexicon.Lexeme
entry in the Lexicon for this token.
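The idea of a span carrying its text together with start and end offsets can be sketched with a small stand-in class. SimpleSpan and whitespace_spans below are hypothetical names for illustration: they are not the library's Span or Token, and the whitespace splitting is far simpler than a chemistry-aware word tokenizer.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleSpan:
    """Illustrative stand-in for a text span: the text plus its offsets."""
    text: str
    start: int
    end: int  # exclusive, so sentence[start:end] == text


def whitespace_spans(sentence):
    """Split on whitespace and record each piece's character offsets."""
    return [
        SimpleSpan(match.group(), match.start(), match.end())
        for match in re.finditer(r"\S+", sentence)
    ]


sentence = "NaCl melts at 801 C"
spans = whitespace_spans(sentence)
```

Keeping the offsets on every span means the original text can always be recovered by slicing the sentence, which is what makes later annotation of the source text possible.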
-