.reader¶

Document readers

Reader classes that read a file and produce a ChemDataExtractor Document object.

.reader.acs¶

Readers for documents from the ACS.

chemdataextractor.reader.acs.clean_acs_html = <chemdataextractor.scrape.clean.Cleaner object>¶

Additional cleaner for ACS HTML. TODO: Move to ignore_css?

class chemdataextractor.reader.acs.AcsHtmlReader[source]¶

Bases: chemdataextractor.reader.markup.HtmlReader

Reader for HTML documents from the ACS.

cleaners = [<chemdataextractor.scrape.clean.Cleaner object>, <chemdataextractor.scrape.clean.Cleaner object>]¶

root_css = '#articleMain, article'¶

title_css = 'h1.articleTitle'¶

heading_css = 'h2, h3, h4, h5, h6, .title1, span.title2, span.title3'¶

table_css = '.NLM_table-wrap'¶

table_caption_css = '.NLM_caption'¶

table_footnote_css = '.footnote'¶

figure_css = '.figure'¶

figure_caption_css = '.caption'¶

citation_css = '.reference'¶

ignore_css = 'a[href="JavaScript:void(0);"], a.ref sup'¶

detect(fstring, fname=None)[source]¶

.reader.base¶

Abstract base classes for document readers.

class chemdataextractor.reader.base.BaseReader[source]¶

Bases: object

All Document Readers should implement a parse method.

__init__()[source]¶: Initialize self. See help(type(self)) for accurate signature.

detect(fstring, fname=None)[source]¶

Quickly check if this reader can parse the input. Reader subclasses should override this.

Used to quickly skip attempting to parse when trying different readers. If in doubt, return True, and then raise ReaderError in the parse method if it fails.

parse(fstring)[source]¶: Parse the input and return a Document. Raises ReaderError if the parse fails.

read(f)[source]¶: Read a file-like object and return a Document.

readstring(fstring)[source]¶: Read a file string and return a Document.
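
The detect/parse contract described above can be sketched without the library. Everything in this block is a hypothetical stand-in (`SketchReader`, `JsonishReader`, `StrictTextReader`, `read_with_first_match`, and a local `ReaderError`); it is not ChemDataExtractor code, and it assumes the file string is passed as bytes. It only illustrates the pattern of skipping readers whose `detect` returns False and moving on when `parse` raises `ReaderError`.

```python
class ReaderError(Exception):
    """Local stand-in for the library's ReaderError so the sketch runs alone."""


class SketchReader:
    """Hypothetical base class mirroring the detect/parse/readstring contract."""

    def detect(self, fstring, fname=None):
        # "If in doubt, return True", then let parse() raise ReaderError.
        return True

    def parse(self, fstring):
        raise NotImplementedError

    def readstring(self, fstring):
        return self.parse(fstring)


class JsonishReader(SketchReader):
    """Hypothetical reader that cheaply rejects input not starting with '{'."""

    def detect(self, fstring, fname=None):
        if fname and not fname.endswith(".json"):
            return False
        return fstring.lstrip().startswith(b"{")

    def parse(self, fstring):
        return {"kind": "jsonish", "size": len(fstring)}


class StrictTextReader(SketchReader):
    """Hypothetical reader that accepts anything in detect() but may fail in parse()."""

    def parse(self, fstring):
        try:
            text = fstring.decode("ascii")
        except UnicodeDecodeError as exc:
            raise ReaderError("not ASCII text") from exc
        return {"kind": "text", "paragraphs": text.split("\n\n")}


def read_with_first_match(fstring, readers, fname=None):
    """Try each reader in order and return the first successful result."""
    for reader in readers:
        if not reader.detect(fstring, fname=fname):
            continue  # quick skip, no parse attempted
        try:
            return reader.readstring(fstring)
        except ReaderError:
            continue  # detect() was optimistic; try the next reader
    raise ReaderError("no reader could parse the input")
```

Ordering the list from most specific to most general keeps a permissive reader from claiming input that a stricter one would handle better.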

.reader.cssp¶

Readers for ChemSpider SyntheticPages.

class chemdataextractor.reader.cssp.CsspHtmlReader[source]¶

Bases: chemdataextractor.reader.markup.HtmlReader

Reader for ChemSpider SyntheticPages HTML documents.

root_css = '.article-container'¶

title_css = '.article-container > h2'¶

heading_css = 'h3, h4, h5, h6'¶

citation_css = '#csm-article-part-lead_ref > p, #csm-article-part-other_refs > p'¶

detect(fstring, fname=None)[source]¶

.reader.markup¶

XML and HTML readers based on lxml.

class chemdataextractor.reader.markup.LxmlReader[source]¶

Bases: chemdataextractor.reader.base.BaseReader

Abstract base class for lxml-based readers.

cleaners = [<chemdataextractor.scrape.clean.Cleaner object>]¶: A Cleaner instance to

root_css = 'html'¶

title_css = 'h1'¶

heading_css = 'h2, h3, h4, h5, h6'¶

table_css = 'table'¶

table_caption_css = 'caption'¶

table_head_row_css = 'thead tr'¶

table_body_row_css = 'tbody tr'¶

table_cell_css = 'th, td'¶

table_footnote_css = 'tfoot tr th'¶

reference_css = 'a.ref'¶

figure_css = 'figure'¶

figure_caption_css = 'figcaption'¶

figure_label_css = 'figcaption span[class^="CaptionNumber"]'¶

figure_download_link_css = 'a::attr(href), img::attr(src)'¶

citation_css = 'cite'¶

metadata_css = 'head'¶

metadata_publisher_css = 'meta[name="DC.publisher"]::attr("content"), meta[name="citation_publisher"]::attr("content")'¶

metadata_author_css = 'meta[name="DC.Creator"]::attr("content"), meta[name="citation_author"]::attr("content")'¶

metadata_title_css = 'meta[name="DC.title"]::attr("content"), meta[name="citation_title"]::attr("content")'¶

metadata_date_css = 'meta[name="DC.Date"]::attr("content"), meta[name="citation_date"]::attr("content"), meta[name="citation_online_date"]::attr("content")'¶

metadata_doi_css = 'meta[name="DC.Identifier"]::attr("content"), meta[name="citation_doi"]::attr("content")'¶

metadata_language_css = 'meta[name="DC.Language"]::attr("content"), meta[name="citation_language"]::attr("content")'¶

metadata_journal_css = 'meta[name="citation_journal_title"]::attr("content")'¶

metadata_volume_css = 'meta[name="citation_volume"]::attr("content")'¶

metadata_issue_css = 'meta[name="citation_issue"]::attr("content")'¶

metadata_firstpage_css = 'meta[name="citation_firstpage"]::attr("content")'¶

metadata_lastpage_css = 'meta[name="citation_lastpage"]::attr("content")'¶

metadata_pdf_url_css = 'meta[name="citation_pdf_url"]::attr("content")'¶

metadata_html_url_css = 'meta[name="citation_fulltext_html_url"]::attr("content"), meta[name="citation_abstract_html_url"]::attr("content")'¶

ignore_css = 'a.ref sup'¶

inline_elements = {'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'blink', 'br', 'button', 'cite', 'code', 'dfn', 'em', 'font', 'i', 'img', 'input', 'kbd', 'label', 'map', 'marquee', 'nobr', 'object', 'q', 's', 'samp', 'script', 'select', 'small', 'span', 'strike', 'strong', 'sub', 'sup', 'textarea', 'tt', 'u', 'var', 'wbr'}¶: Inline elements

parse(fstring)[source]¶: Parse the input and return a Document. Raises ReaderError if the parse fails.
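
To make the metadata selectors above more concrete, the following sketch shows what selectors such as `meta[name="citation_doi"]::attr("content")` are aimed at: the `content` attribute of named `<meta>` tags. It uses only the standard library's `html.parser` rather than lxml, and `MetaCollector` and `first_match` are hypothetical helpers, not part of the library. How the library chooses among several matching tags is not stated here; this sketch assumes the first name listed wins.

```python
from html.parser import HTMLParser

# Names copied from metadata_doi_css and metadata_title_css above.
DOI_NAMES = ("DC.Identifier", "citation_doi")
TITLE_NAMES = ("DC.title", "citation_title")


class MetaCollector(HTMLParser):
    """Collect name -> content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name is not None and content is not None:
            # Keep the first value seen for each name.
            self.meta.setdefault(name, content)


def first_match(meta, names):
    """Return the content of the first listed name that is present, else None."""
    for name in names:
        if name in meta:
            return meta[name]
    return None
```

A caller would feed the page source to `MetaCollector().feed(...)` and then look up each field with `first_match`.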

class chemdataextractor.reader.markup.XmlReader[source]¶

Bases: chemdataextractor.reader.markup.LxmlReader

Reader for generic XML documents.

detect(fstring, fname=None)[source]¶

class chemdataextractor.reader.markup.HtmlReader[source]¶

Bases: chemdataextractor.reader.markup.LxmlReader

Reader for generic HTML documents.

detect(fstring, fname=None)[source]¶

.reader.nlm¶

Readers for NLM Journal Archiving and Interchange DTD XML files (i.e. from PubMed Central).

class chemdataextractor.reader.nlm.NlmXmlReader[source]¶

Bases: chemdataextractor.reader.markup.XmlReader

Reader for NLM XML documents.

cleaners = [<chemdataextractor.scrape.clean.Cleaner object>, <function tidy_nlm_references>, <function space_labels>]¶

root_css = 'article'¶

title_css = 'front article-meta article-title'¶

heading_css = 'title'¶

table_css = 'table-wrap'¶

table_caption_css = 'caption p'¶

table_head_row_css = 'table thead tr'¶

table_body_row_css = 'table tbody tr'¶

table_footnote_css = 'table-wrap-foot p'¶

figure_css = 'fig'¶

figure_caption_css = 'caption p'¶

reference_css = 'xref'¶

citation_css = 'ref-list ref'¶

ignore_css = 'xref[ref-type="bibr"], tex-math'¶

inline_elements = {'a', 'abbr', 'acronym', 'alternatives', 'b', 'bdo', 'big', 'blink', 'bold', 'br', 'button', 'cite', 'code', 'dfn', 'em', 'font', 'i', 'img', 'inline-formula', 'input', 'italic', 'kbd', 'label', 'map', 'marquee', 'nobr', 'object', 'q', 's', 'samp', 'script', 'select', 'small', 'span', 'strike', 'strong', 'sub', 'sup', 'tex-math', 'textarea', 'tt', 'u', 'underline', 'var', 'wbr', 'xref', '{http://www.w3.org/1998/math/mathml}math', '{http://www.w3.org/1998/math/mathml}mi', '{http://www.w3.org/1998/math/mathml}mn', '{http://www.w3.org/1998/math/mathml}mo', '{http://www.w3.org/1998/math/mathml}mrow', '{http://www.w3.org/1998/math/mathml}msubsup'}¶

detect(fstring, fname=None)[source]¶

.reader.pdf¶

PDF document reader.

class chemdataextractor.reader.pdf.PdfReader[source]¶

Bases: chemdataextractor.reader.base.BaseReader

detect(fstring, fname=None)[source]¶

parse(fstring)[source]¶: Parse the input and return a Document. Raises ReaderError if the parse fails.

.reader.plaintext¶

Plain text document reader.

class chemdataextractor.reader.plaintext.PlainTextReader[source]¶

Bases: chemdataextractor.reader.base.BaseReader

Read plain text and split into Paragraphs based on newline patterns.

detect(fstring, fname=None)[source]¶: Have a stab at most files.

parse(fstring)[source]¶: Parse the input and return a Document. Raises ReaderError if the parse fails.
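
Splitting on newline patterns can be sketched as follows. The exact pattern the reader uses is not given here, so this block assumes that one or more blank lines separate paragraphs; `split_paragraphs` is a hypothetical helper returning plain strings rather than Paragraph objects.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs on blank lines (assumed pattern)."""
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A newline, optional whitespace, then another newline ends a paragraph.
    blocks = re.split(r"\n\s*\n", text)
    # Collapse internal line wraps and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]
```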

.reader.rsc¶

Readers for documents from the RSC.

chemdataextractor.reader.rsc.rsc_html_whitespace(document)[source]¶: Remove the text or tail of every element if it consists only of whitespace.
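
The whitespace-only cleanup described above might look like the sketch below. It uses the standard library's `xml.etree.ElementTree` as a stand-in for the lxml tree the readers work on, and `drop_whitespace_only` is a hypothetical name, not the library function.

```python
import xml.etree.ElementTree as ET


def drop_whitespace_only(root):
    """Clear .text and .tail wherever they contain nothing but whitespace."""
    for el in root.iter():
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    return root
```

Text that contains any non-whitespace character is left untouched, including its surrounding spaces.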

chemdataextractor.reader.rsc.join_rsc_table_captions(document)[source]¶

Add a wrapper tag around tables and their respective captions.

Parameters: document
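
One way to picture wrapping a table together with its caption is sketched below. This is an assumption-heavy illustration: it uses `xml.etree.ElementTree` instead of lxml, treats an element with class `table_caption` immediately followed by a `table` sibling as a pair, and the wrapper class name `table-with-caption` is invented for the example. The library's own function may work differently.

```python
import xml.etree.ElementTree as ET


def wrap_caption_and_table(root, wrapper_class="table-with-caption"):
    """Wrap each caption/table sibling pair in a new <div> (hypothetical helper)."""
    # Snapshot the elements first, because the tree is modified while walking.
    for parent in list(root.iter()):
        children = list(parent)
        i = 0
        while i < len(children) - 1:
            cap, tab = children[i], children[i + 1]
            if cap.get("class") == "table_caption" and tab.tag == "table":
                wrapper = ET.Element("div", {"class": wrapper_class})
                # Text after the table now belongs after the wrapper.
                wrapper.tail = tab.tail
                tab.tail = None
                index = list(parent).index(cap)
                parent.remove(cap)
                parent.remove(tab)
                wrapper.extend([cap, tab])
                parent.insert(index, wrapper)
                i += 2
            else:
                i += 1
    return root
```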

class chemdataextractor.reader.rsc.RscHtmlReader[source]¶

Bases: chemdataextractor.reader.markup.HtmlReader

Reader for HTML documents from the RSC.

cleaners = [<chemdataextractor.scrape.clean.Cleaner object>, <function rsc_html_whitespace>, <function replace_rsc_img_chars>, <function join_rsc_table_captions>, <chemdataextractor.scrape.clean.Cleaner object>]¶

root_css = 'html'¶

title_css = 'h1, .title_heading'¶

heading_css = 'h2, h3, h4, h5, h6, .a_heading, .b_heading, .c_heading, .c_heading_indent, .d_heading, .d_heading_indent'¶

citation_css = 'span[id^="cit"]'¶

table_css = 'div[class^="rtable__wrapper"]'¶

table_caption_css = '.table_caption'¶

table_head_row_css = 'table thead tr'¶

table_body_row_css = 'table tbody tr'¶

table_footnote_css = 'table tfoot tr th .sup_inf'¶

reference_css = 'small sup a, a[href^="#cit"], a[href^="#fn"], a[href^="#tab"]'¶

figure_css = '.image_table'¶

figure_caption_css = '.graphic_title'¶

figure_label_css = 'td.image_title b'¶

figure_download_link_css = 'img::attr(src)'¶

ignore_css = '.table_caption + table, .left_head, sup span.sup_ref, small sup a, a[href^="#fn"], .PMedLink'¶

detect(fstring, fname=None)[source]¶

.reader.uspto¶

Readers for USPTO patents.

class chemdataextractor.reader.uspto.UsptoXmlReader[source]¶

Bases: chemdataextractor.reader.markup.XmlReader

Reader for USPTO XML documents.

cleaners = [<chemdataextractor.scrape.clean.Cleaner object>]¶

root_css = 'us-patent-grant'¶

title_css = 'invention-title'¶

heading_css = 'heading, p[id^="h-"]'¶

table_css = 'table'¶

table_body_row_css = 'table row'¶

table_cell_css = 'entry'¶

reference_css = 'claim-ref'¶

ignore_css = 'us-bibliographic-data-grant *:not(invention-title)'¶

inline_elements = {'a', 'abbr', 'acronym', 'alternatives', 'b', 'bdo', 'big', 'blink', 'bold', 'br', 'button', 'cite', 'claim-ref', 'code', 'dfn', 'em', 'figref', 'font', 'i', 'img', 'inline-formula', 'input', 'italic', 'kbd', 'label', 'map', 'marquee', 'nobr', 'object', 'q', 's', 'samp', 'script', 'select', 'small', 'span', 'strike', 'strong', 'sub', 'sup', 'tex-math', 'textarea', 'tt', 'u', 'underline', 'var', 'wbr', 'xref', '{http://www.w3.org/1998/math/mathml}math', '{http://www.w3.org/1998/math/mathml}mi', '{http://www.w3.org/1998/math/mathml}mn', '{http://www.w3.org/1998/math/mathml}mo', '{http://www.w3.org/1998/math/mathml}mrow', '{http://www.w3.org/1998/math/mathml}msubsup'}¶

detect(fstring, fname=None)[source]¶

.reader.elsevier¶

Elsevier XML reader

Readers for Elsevier XML files.

chemdataextractor.reader.elsevier.remove_if_reference(el)[source]¶

chemdataextractor.reader.elsevier.fix_elsevier_xml_whitespace(document)[source]¶: Fix tricky xml tags

chemdataextractor.reader.elsevier.els_xml_whitespace(document)[source]¶: Remove the text or tail of every element if it consists only of whitespace.

class chemdataextractor.reader.elsevier.ElsevierXmlReader[source]¶

Bases: chemdataextractor.reader.markup.XmlReader

Reader for Elsevier XML documents.

cleaners = [<chemdataextractor.scrape.clean.Cleaner object>, <function fix_elsevier_xml_whitespace>, <function els_xml_whitespace>, <chemdataextractor.scrape.clean.Cleaner object>]¶

root_css = 'default|full-text-retrieval-response'¶

title_css = 'dc|title'¶

heading_css = 'ce|section-title'¶

table_css = 'ce|table'¶

table_caption_css = 'ce|table ce|caption'¶

table_head_row_css = 'cals|thead cals|row'¶

table_body_row_css = 'cals|tbody cals|row'¶

table_cell_css = 'ce|entry'¶

table_footnote_css = 'table-wrap-foot p'¶

figure_css = 'ce|figure'¶

figure_caption_css = 'ce|figure ce|caption'¶

figure_label_css = 'ce|figure ce|label'¶

figure_download_link_css = ''¶

reference_css = 'ce|cross-ref, ce|cross-refs'¶

citation_css = 'ce|bib-reference'¶

metadata_css = 'xocs|meta'¶

metadata_title_css = 'xocs|normalized-article-title'¶

metadata_author_css = 'xocs|normalized-first-auth-surname'¶

metadata_journal_css = 'xocs|srctitle'¶

metadata_volume_css = 'xocs|vol-first, xocs|volume-list xocs|volume'¶

metadata_issue_css = 'xocs|issns xocs|issn-primary-formatted'¶

metadata_publisher_css = 'xocs|copyright-line'¶

metadata_date_css = 'xocs|available-online-date, xocs|orig-load-date'¶

metadata_firstpage_css = 'xocs|first-fp'¶

metadata_lastpage_css = 'xocs|last-lp'¶

metadata_doi_css = 'xocs|doi, xocs|eii'¶

metadata_pii_css = 'xocs|pii-unformatted'¶

ignore_css = 'ce|bibliography, ce|acknowledgment, ce|correspondence, ce|author, ce|doi, ja|jid, ja|aid, ce|pii, xocs|oa-sponsor-type, xocs|open-access, default|openaccess,default|openaccessArticle, dc|format, dc|creator, dc|identifier,default|eid, default|pii, xocs|meta, xocs|ref-info, default|scopus-eid,xocs|normalized-srctitle,xocs|eid, xocs|hub-eid, xocs|normalized-first-auth-surname,xocs|normalized-first-auth-initial, xocs|refkeys,xocs|attachment-eid, xocs|attachment-type,ja|jid, ce|given-name, ce|surname, ce|affiliation,ce|grant-sponsor, ce|grant-number, prism|copyright,xocs|pii-unformatted, xocs|ucs-locator, ce|copyright,prism|publisher, prism|*, xocs|copyright-line, xocs|cp-notice,dc|description, xocs|document-subtype, ce|keywords, default|openaccessType,default|openArchiveArticle, default|openaccessSponsorName, default|openaccessSponsorType, default|openaccessUserLicense, dcterms|subject,ce|dochead, ce|label, default|pubType'¶

url_prefix = 'https://sciencedirect.com/science/article/pii/'¶

detect(fstring, fname=None)[source]¶: Elsevier document detection based on string found in xml
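
The selectors in this class use the CSS namespace form `prefix|local-name`, for example `ce|section-title`. The sketch below shows how such a selector maps onto namespaced XML using the standard library only. The namespace URIs are placeholders invented for the example, not Elsevier's real ones, and `select_text` is a hypothetical helper that handles a single `prefix|name` selector, not the full selector lists shown above.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, assumed for illustration only.
NS = {"ce": "http://example.com/ns/ce", "dc": "http://example.com/ns/dc"}

SAMPLE = """<root xmlns:ce="http://example.com/ns/ce" xmlns:dc="http://example.com/ns/dc">
  <dc:title>Sample article</dc:title>
  <ce:section-title>Introduction</ce:section-title>
  <ce:section-title>Methods</ce:section-title>
</root>"""


def select_text(root, css_like):
    """Return the text of every element matching one 'prefix|name' selector."""
    prefix, local = css_like.split("|")
    # ElementTree spells a namespaced tag as {uri}local-name.
    qualified = "{%s}%s" % (NS[prefix], local)
    return [el.text for el in root.iter(qualified)]
```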

.reader.springer¶

Readers for documents from Springer.

class chemdataextractor.reader.springer.SpringerMaterialsHtmlReader[source]¶

Bases: chemdataextractor.reader.markup.HtmlReader

Reader for HTML documents from SpringerMaterials.

cleaners = [<chemdataextractor.scrape.clean.Cleaner object>, <chemdataextractor.scrape.clean.Cleaner object>]¶

root_css = 'html'¶

citation_css = 'span[class="CitationRef"]'¶

title_css = 'title'¶

heading_css = 'h2, h3, h4, h5, h6, .title1, span.title2, span.title3'¶

table_css = 'div[class="Table"]'¶

table_caption_css = 'div[class="Table"] p'¶

table_head_row_css = 'thead'¶

table_body_row_css = 'tbody'¶

table_cell_css = 'th, td'¶

ignore_css = 'sub, sup, em[class^="EmphasisTypeItalic "], li[class="article-metrics__item"], div[class="CitationContent"]'¶

detect(fstring, fname=None)[source]¶

chemdataextractor.reader.springer.springer_html_whitespace(document)[source]¶: Remove the text or tail of every element if it consists only of whitespace.

chemdataextractor.reader.springer.fix_springer_table_whitespace(document)[source]¶

Remove leading and trailing whitespace from table cells.

Parameters: document
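
Trimming table cells can be sketched as below, again with `xml.etree.ElementTree` standing in for lxml. `strip_cell_whitespace` is a hypothetical helper; it assumes cells are `td` or `th` elements and only trims the leading text and the final trailing text of each cell, leaving inner spacing alone.

```python
import xml.etree.ElementTree as ET


def strip_cell_whitespace(root):
    """Trim whitespace at the start and end of each table cell's content."""
    for cell in root.iter():
        if cell.tag not in ("td", "th"):
            continue
        if cell.text:
            cell.text = cell.text.lstrip()
        children = list(cell)
        if children:
            # The cell ends with the tail of its last child element.
            last = children[-1]
            if last.tail:
                last.tail = last.tail.rstrip()
        elif cell.text:
            cell.text = cell.text.rstrip()
    return root
```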

class chemdataextractor.reader.springer.SpringerHtmlReader[source]¶

Bases: chemdataextractor.reader.markup.HtmlReader

cleaners = [<chemdataextractor.scrape.clean.Cleaner object>, <function springer_html_whitespace>, <chemdataextractor.scrape.clean.Cleaner object>, <bound method Cleaner.clean_html of <chemdataextractor.scrape.clean.Cleaner object>>, <function tidy_springer_references>, <function fix_springer_table_whitespace>]¶

root_css = 'html'¶

title_css = 'h1[class^="ArticleTitle"]'¶

heading_css = 'h2, h3, h4'¶

table_css = 'div[class="Table"]'¶

table_caption_css = 'div[class^="Caption"] p'¶

table_head_row_css = 'thead tr'¶

table_body_row_css = 'tbody tr'¶

table_cell_css = 'td, th'¶

figure_css = 'figure'¶

figure_caption_css = 'figcaption'¶

figure_label_css = 'figcaption span[class^="CaptionNumber"]'¶

ignore_css = 'a[class="skip-to__link pseudo-focus"], div[class="nojs-banner u-interface"], a[class="skip-to__link skip-to__link--contents pseudo-focus"], p[class="leaderboard__label"], div[class="u-screenreader-only"], label[for="search-springerlink"], span[class="search-button__title"], span[class="u-overflow-ellipsis"], span[class="u-overflow-ellipsis"], a[class="c-button c-button--blue c-button__icon-right gtm-pdf-link"], div[class="leaderboard u-hide"], title, li[class="article-metrics__item"], aside[class="section section--collapsible"], a[class="gtm-cite-link"], span[class="u-screenreader-only"], div[class="authors__list"], a[class="gtm-tab-authorsandaffiliations"], ol[class="BibliographyWrapper"], h2[id="copyrightInformation"], div[class="content authors-affiliations u-interface"], p[class="footer__copyright"], p[class="footer__user-access-info"], span[class="u-screenreader-only"], a[href="/contactus"], a[class="gtm-footer-accessibility"], ul[class="footer__nav"], div[class="footer__aside-wrapper"], aside[class="main-sidebar-right u-interface"], a[class="c-button share-this gtm-shareby-sharelink-link test-shareby-sharelink-link"], a[class="gtm-export-citation"], ul[class="citations__content"], h3[data-role="button-dropdown__title"], div[class="section section--collapsible uptodate-recommendations gtm-recommendations"], span[class="InlineEquation"], div[class="EquationContent"], div[class="EquationNumber"], footer'¶

detect(fstring, fname=None)[source]¶
