.reader¶
Document readers
Reader classes that read a file and produce a ChemDataExtractor Document object.
.reader.acs¶
Readers for documents from the ACS.
-
chemdataextractor.reader.acs.
clean_acs_html
= <chemdataextractor.scrape.clean.Cleaner object>¶
Additional cleaner for ACS HTML. TODO: Move to ignore_css?
-
class
chemdataextractor.reader.acs.
AcsHtmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.HtmlReader
Reader for HTML documents from the ACS.
-
cleaners
= [<chemdataextractor.scrape.clean.Cleaner object>, <chemdataextractor.scrape.clean.Cleaner object>]¶
-
root_css
= '#articleMain, article'¶
-
title_css
= 'h1.articleTitle'¶
-
heading_css
= 'h2, h3, h4, h5, h6, .title1, span.title2, span.title3'¶
-
table_css
= '.NLM_table-wrap'¶
-
table_footnote_css
= '.footnote'¶
-
figure_css
= '.figure'¶
-
citation_css
= '.reference'¶
-
ignore_css
= 'a[href="JavaScript:void(0);"], a.ref sup'¶
-
.reader.base¶
Abstract base classes for document readers.
-
class
chemdataextractor.reader.base.
BaseReader
[source]¶ Bases:
object
All Document Readers should implement a parse method.
-
detect
(fstring, fname=None)[source]¶ Quickly check if this reader can parse the input. Reader subclasses should override this.
Used to quickly skip attempting to parse when trying different readers. If in doubt, return True, and then raise ReaderError in the parse method if it fails.
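The detect-then-parse contract described above can be sketched as follows. This is a minimal, self-contained illustration and not the library's code: SketchReader, XmlOnlyReader, FallbackTextReader, read_with_first_match and the local ReaderError class are all hypothetical stand-ins, and the input is assumed to be a byte string.

```python
class ReaderError(Exception):
    """Stand-in for the error a reader raises when parsing fails."""


class SketchReader:
    """Hypothetical stand-in for a reader base class."""

    def detect(self, fstring, fname=None):
        # If in doubt, accept the input and let parse() raise on failure.
        return True

    def parse(self, fstring):
        raise NotImplementedError


class XmlOnlyReader(SketchReader):
    def detect(self, fstring, fname=None):
        # Cheap checks only: the file extension, then the first character.
        if fname and not fname.lower().endswith(".xml"):
            return False
        return fstring.lstrip().startswith(b"<")

    def parse(self, fstring):
        if b"</" not in fstring:
            raise ReaderError("no closing tag found")
        return ("xml", fstring)


class FallbackTextReader(SketchReader):
    def parse(self, fstring):
        return ("text", fstring)


def read_with_first_match(readers, fstring, fname=None):
    """Try each reader in turn and return the first successful parse."""
    for reader in readers:
        if not reader.detect(fstring, fname=fname):
            continue  # skipped without paying the cost of a parse
        try:
            return reader.parse(fstring)
        except ReaderError:
            continue  # detect() was optimistic; try the next reader
    raise ReaderError("no reader could parse the input")
```

In this sketch an input that passes detect but fails in parse simply falls through to the next reader, which is why returning True when in doubt is a safe default.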
-
.reader.cssp¶
Readers for ChemSpider SyntheticPages.
-
class
chemdataextractor.reader.cssp.
CsspHtmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.HtmlReader
Reader for ChemSpider SyntheticPages HTML documents.
-
root_css
= '.article-container'¶
-
title_css
= '.article-container > h2'¶
-
heading_css
= 'h3, h4, h5, h6'¶
-
citation_css
= '#csm-article-part-lead_ref > p, #csm-article-part-other_refs > p'¶
-
.reader.markup¶
XML and HTML readers based on lxml.
-
class
chemdataextractor.reader.markup.
LxmlReader
[source]¶ Bases:
chemdataextractor.reader.base.BaseReader
Abstract base class for lxml-based readers.
-
cleaners
= [<chemdataextractor.scrape.clean.Cleaner object>]¶ A Cleaner instance to
-
root_css
= 'html'¶
-
title_css
= 'h1'¶
-
heading_css
= 'h2, h3, h4, h5, h6'¶
-
table_css
= 'table'¶
-
table_head_row_css
= 'thead tr'¶
-
table_body_row_css
= 'tbody tr'¶
-
table_cell_css
= 'th, td'¶
-
table_footnote_css
= 'tfoot tr th'¶
-
reference_css
= 'a.ref'¶
-
figure_css
= 'figure'¶
-
figure_label_css
= 'figcaption span[class^="CaptionNumber"]'¶
-
figure_download_link_css
= 'a::attr(href), img::attr(src)'¶
-
citation_css
= 'cite'¶
-
metadata_css
= 'head'¶
-
metadata_publisher_css
= 'meta[name="DC.publisher"]::attr("content"), meta[name="citation_publisher"]::attr("content")'¶
-
metadata_title_css
= 'meta[name="DC.title"]::attr("content"), meta[name="citation_title"]::attr("content")'¶
-
metadata_date_css
= 'meta[name="DC.Date"]::attr("content"), meta[name="citation_date"]::attr("content"), meta[name="citation_online_date"]::attr("content")'¶
-
metadata_doi_css
= 'meta[name="DC.Identifier"]::attr("content"), meta[name="citation_doi"]::attr("content")'¶
-
metadata_language_css
= 'meta[name="DC.Language"]::attr("content"), meta[name="citation_language"]::attr("content")'¶
-
metadata_journal_css
= 'meta[name="citation_journal_title"]::attr("content")'¶
-
metadata_volume_css
= 'meta[name="citation_volume"]::attr("content")'¶
-
metadata_issue_css
= 'meta[name="citation_issue"]::attr("content")'¶
-
metadata_firstpage_css
= 'meta[name="citation_firstpage"]::attr("content")'¶
-
metadata_lastpage_css
= 'meta[name="citation_lastpage"]::attr("content")'¶
-
metadata_pdf_url_css
= 'meta[name="citation_pdf_url"]::attr("content")'¶
-
metadata_html_url_css
= 'meta[name="citation_fulltext_html_url"]::attr("content"), meta[name="citation_abstract_html_url"]::attr("content")'¶
-
ignore_css
= 'a.ref sup'¶
-
inline_elements
= {'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'blink', 'br', 'button', 'cite', 'code', 'dfn', 'em', 'font', 'i', 'img', 'input', 'kbd', 'label', 'map', 'marquee', 'nobr', 'object', 'q', 's', 'samp', 'script', 'select', 'small', 'span', 'strike', 'strong', 'sub', 'sup', 'textarea', 'tt', 'u', 'var', 'wbr'}¶ Inline elements
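One plausible use of a set like inline_elements is to decide where running text is split: text inside inline tags stays with the surrounding text, while any other tag starts a new block. The sketch below illustrates that idea under stated assumptions. It uses the standard library's xml.etree.ElementTree rather than lxml, a small illustrative subset of tags, and a hypothetical function name (split_on_block_elements); it is not the reader's actual parsing logic.

```python
import xml.etree.ElementTree as ET

# Small illustrative subset; the documented set is much larger.
INLINE = {"b", "i", "sub", "sup", "span"}


def split_on_block_elements(root, inline=INLINE):
    """Return text chunks, starting a new chunk at each non-inline element."""
    chunks = [[]]

    def walk(el):
        is_inline = el.tag in inline
        if not is_inline:
            chunks.append([])  # entering a block element opens a new chunk
        if el.text:
            chunks[-1].append(el.text)
        for child in el:
            walk(child)
        if not is_inline:
            chunks.append([])  # leaving a block element closes its chunk
        if el.tail:
            chunks[-1].append(el.tail)  # tail text belongs to the parent

    walk(root)
    joined = ("".join(parts).strip() for parts in chunks)
    return [text for text in joined if text]
```

With this rule, a subscript inside a paragraph is merged into the paragraph text instead of producing three separate fragments.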
-
-
class
chemdataextractor.reader.markup.
XmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.LxmlReader
Reader for generic XML documents.
-
class
chemdataextractor.reader.markup.
HtmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.LxmlReader
Reader for generic HTML documents.
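The metadata_*_css selectors listed for LxmlReader each name one or more meta tags and take their content attribute. The sketch below shows the same lookup using only the standard library's html.parser. MetaCollector and first_meta are hypothetical names, the DOI in the usage note is a placeholder, and trying the names in the order given is an assumption of this sketch, not documented behaviour.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect name -> content for every <meta> tag in a document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content is not None:
            self.meta.setdefault(name, content)  # keep the first occurrence


def first_meta(html, names):
    """Return the content of the first listed meta name that is present."""
    collector = MetaCollector()
    collector.feed(html)
    for name in names:
        if name in collector.meta:
            return collector.meta[name]
    return None
```

For example, first_meta(page, ["DC.Identifier", "citation_doi"]) mirrors the two names in metadata_doi_css and returns None when neither tag is present.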
.reader.nlm¶
Readers for NLM Journal Archiving and Interchange DTD XML files (i.e. from PubMed Central).
-
class
chemdataextractor.reader.nlm.
NlmXmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.XmlReader
Reader for NLM XML documents.
-
cleaners
= [<chemdataextractor.scrape.clean.Cleaner object>, <function tidy_nlm_references>, <function space_labels>]¶
-
root_css
= 'article'¶
-
title_css
= 'front article-meta article-title'¶
-
heading_css
= 'title'¶
-
table_css
= 'table-wrap'¶
-
table_head_row_css
= 'table thead tr'¶
-
table_body_row_css
= 'table tbody tr'¶
-
table_footnote_css
= 'table-wrap-foot p'¶
-
figure_css
= 'fig'¶
-
reference_css
= 'xref'¶
-
citation_css
= 'ref-list ref'¶
-
ignore_css
= 'xref[ref-type="bibr"], tex-math'¶
-
inline_elements
= {'a', 'abbr', 'acronym', 'alternatives', 'b', 'bdo', 'big', 'blink', 'bold', 'br', 'button', 'cite', 'code', 'dfn', 'em', 'font', 'i', 'img', 'inline-formula', 'input', 'italic', 'kbd', 'label', 'map', 'marquee', 'nobr', 'object', 'q', 's', 'samp', 'script', 'select', 'small', 'span', 'strike', 'strong', 'sub', 'sup', 'tex-math', 'textarea', 'tt', 'u', 'underline', 'var', 'wbr', 'xref', '{http://www.w3.org/1998/math/mathml}math', '{http://www.w3.org/1998/math/mathml}mi', '{http://www.w3.org/1998/math/mathml}mn', '{http://www.w3.org/1998/math/mathml}mo', '{http://www.w3.org/1998/math/mathml}mrow', '{http://www.w3.org/1998/math/mathml}msubsup'}¶
-
.reader.pdf¶
PDF document reader.
.reader.plaintext¶
Plain text document reader.
-
class
chemdataextractor.reader.plaintext.
PlainTextReader
[source]¶ Bases:
chemdataextractor.reader.base.BaseReader
Read plain text and split into Paragraphs based on newline patterns.
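The exact newline patterns are not given here. As a rough sketch, assuming that one or more blank lines separate paragraphs and that single newlines are hard wraps, the split could look like this (split_plain_text is a hypothetical helper, not part of the library):

```python
import re


def split_plain_text(text):
    """Split text into paragraphs on blank lines (an assumed convention)."""
    # Normalise Windows line endings so one pattern covers both styles.
    normalised = text.replace("\r\n", "\n")
    # A newline, optional whitespace, then another newline marks a break.
    blocks = re.split(r"\n\s*\n", normalised)
    # Collapse the remaining single newlines and runs of spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]
```

Each returned string would then correspond to one paragraph of the resulting document.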
.reader.rsc¶
Readers for documents from the RSC.
-
chemdataextractor.reader.rsc.
rsc_html_whitespace
(document)[source]¶ Remove the text or tail of each element if it consists only of whitespace.
Add a wrapper tag around tables and their respective captions.
- Parameters:
document – the document to clean
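The whitespace clean-up described above can be sketched with the standard library's xml.etree.ElementTree, whose elements expose the same text and tail attributes as lxml elements. This is an illustration under that assumption, with a hypothetical function name (strip_whitespace_only), and it does not cover the table-wrapping step.

```python
import xml.etree.ElementTree as ET


def strip_whitespace_only(root):
    """Drop text and tail strings that contain nothing but whitespace."""
    for el in root.iter():
        if el.text is not None and not el.text.strip():
            el.text = None
        if el.tail is not None and not el.tail.strip():
            el.tail = None
    return root
```

Strings that contain any non-whitespace character are left untouched, including their surrounding spaces, so only indentation and line breaks between tags are removed.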
-
class
chemdataextractor.reader.rsc.
RscHtmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.HtmlReader
Reader for HTML documents from the RSC.
-
cleaners
= [<chemdataextractor.scrape.clean.Cleaner object>, <function rsc_html_whitespace>, <function replace_rsc_img_chars>, <function join_rsc_table_captions>, <chemdataextractor.scrape.clean.Cleaner object>]¶
-
root_css
= 'html'¶
-
title_css
= 'h1, .title_heading'¶
-
heading_css
= 'h2, h3, h4, h5, h6, .a_heading, .b_heading, .c_heading, .c_heading_indent, .d_heading, .d_heading_indent'¶
-
citation_css
= 'span[id^="cit"]'¶
-
table_css
= 'div[class^="rtable__wrapper"]'¶
-
table_head_row_css
= 'table thead tr'¶
-
table_body_row_css
= 'table tbody tr'¶
-
table_footnote_css
= 'table tfoot tr th .sup_inf'¶
-
reference_css
= 'small sup a, a[href^="#cit"], a[href^="#fn"], a[href^="#tab"]'¶
-
figure_css
= '.image_table'¶
-
figure_label_css
= 'td.image_title b'¶
-
figure_download_link_css
= 'img::attr(src)'¶
-
ignore_css
= '.table_caption + table, .left_head, sup span.sup_ref, small sup a, a[href^="#fn"], .PMedLink'¶
-
.reader.uspto¶
Readers for USPTO patents.
-
class
chemdataextractor.reader.uspto.
UsptoXmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.XmlReader
Reader for USPTO XML documents.
-
cleaners
= [<chemdataextractor.scrape.clean.Cleaner object>]¶
-
root_css
= 'us-patent-grant'¶
-
title_css
= 'invention-title'¶
-
heading_css
= 'heading, p[id^="h-"]'¶
-
table_css
= 'table'¶
-
table_body_row_css
= 'table row'¶
-
table_cell_css
= 'entry'¶
-
reference_css
= 'claim-ref'¶
-
ignore_css
= 'us-bibliographic-data-grant *:not(invention-title)'¶
-
inline_elements
= {'a', 'abbr', 'acronym', 'alternatives', 'b', 'bdo', 'big', 'blink', 'bold', 'br', 'button', 'cite', 'claim-ref', 'code', 'dfn', 'em', 'figref', 'font', 'i', 'img', 'inline-formula', 'input', 'italic', 'kbd', 'label', 'map', 'marquee', 'nobr', 'object', 'q', 's', 'samp', 'script', 'select', 'small', 'span', 'strike', 'strong', 'sub', 'sup', 'tex-math', 'textarea', 'tt', 'u', 'underline', 'var', 'wbr', 'xref', '{http://www.w3.org/1998/math/mathml}math', '{http://www.w3.org/1998/math/mathml}mi', '{http://www.w3.org/1998/math/mathml}mn', '{http://www.w3.org/1998/math/mathml}mo', '{http://www.w3.org/1998/math/mathml}mrow', '{http://www.w3.org/1998/math/mathml}msubsup'}¶
-
.reader.elsevier¶
Elsevier XML reader
Readers for Elsevier XML files.
-
chemdataextractor.reader.elsevier.
fix_elsevier_xml_whitespace
(document)[source]¶ Fix tricky xml tags
-
chemdataextractor.reader.elsevier.
els_xml_whitespace
(document)[source]¶ Remove the text or tail of each element if it consists only of whitespace.
-
class
chemdataextractor.reader.elsevier.
ElsevierXmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.XmlReader
Reader for Elsevier XML documents.
-
cleaners
= [<chemdataextractor.scrape.clean.Cleaner object>, <function fix_elsevier_xml_whitespace>, <function els_xml_whitespace>, <chemdataextractor.scrape.clean.Cleaner object>]¶
-
root_css
= 'default|full-text-retrieval-response'¶
-
title_css
= 'dc|title'¶
-
heading_css
= 'ce|section-title'¶
-
table_css
= 'ce|table'¶
-
table_head_row_css
= 'cals|thead cals|row'¶
-
table_body_row_css
= 'cals|tbody cals|row'¶
-
table_cell_css
= 'ce|entry'¶
-
table_footnote_css
= 'table-wrap-foot p'¶
-
figure_css
= 'ce|figure'¶
-
figure_label_css
= 'ce|figure ce|label'¶
-
figure_download_link_css
= ''¶
-
reference_css
= 'ce|cross-ref, ce|cross-refs'¶
-
citation_css
= 'ce|bib-reference'¶
-
metadata_css
= 'xocs|meta'¶
-
metadata_title_css
= 'xocs|normalized-article-title'¶
-
metadata_journal_css
= 'xocs|srctitle'¶
-
metadata_volume_css
= 'xocs|vol-first, xocs|volume-list xocs|volume'¶
-
metadata_issue_css
= 'xocs|issns xocs|issn-primary-formatted'¶
-
metadata_publisher_css
= 'xocs|copyright-line'¶
-
metadata_date_css
= 'xocs|available-online-date, xocs|orig-load-date'¶
-
metadata_firstpage_css
= 'xocs|first-fp'¶
-
metadata_lastpage_css
= 'xocs|last-lp'¶
-
metadata_doi_css
= 'xocs|doi, xocs|eii'¶
-
metadata_pii_css
= 'xocs|pii-unformatted'¶
-
ignore_css
= 'ce|bibliography, ce|acknowledgment, ce|correspondence, ce|author, ce|doi, ja|jid, ja|aid, ce|pii, xocs|oa-sponsor-type, xocs|open-access, default|openaccess,default|openaccessArticle, dc|format, dc|creator, dc|identifier,default|eid, default|pii, xocs|meta, xocs|ref-info, default|scopus-eid,xocs|normalized-srctitle,xocs|eid, xocs|hub-eid, xocs|normalized-first-auth-surname,xocs|normalized-first-auth-initial, xocs|refkeys,xocs|attachment-eid, xocs|attachment-type,ja|jid, ce|given-name, ce|surname, ce|affiliation,ce|grant-sponsor, ce|grant-number, prism|copyright,xocs|pii-unformatted, xocs|ucs-locator, ce|copyright,prism|publisher, prism|*, xocs|copyright-line, xocs|cp-notice,dc|description, xocs|document-subtype, ce|keywords, default|openaccessType,default|openArchiveArticle, default|openaccessSponsorName, default|openaccessSponsorType, default|openaccessUserLicense, dcterms|subject,ce|dochead, ce|label, default|pubType'¶
-
url_prefix
= 'https://sciencedirect.com/science/article/pii/'¶
-
.reader.springer¶
Readers for documents from Springer.
-
class
chemdataextractor.reader.springer.
SpringerMaterialsHtmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.HtmlReader
Reader for HTML documents from SpringerMaterials.
-
cleaners
= [<chemdataextractor.scrape.clean.Cleaner object>, <chemdataextractor.scrape.clean.Cleaner object>]¶
-
root_css
= 'html'¶
-
citation_css
= 'span[class="CitationRef"]'¶
-
title_css
= 'title'¶
-
heading_css
= 'h2, h3, h4, h5, h6, .title1, span.title2, span.title3'¶
-
table_css
= 'div[class="Table"]'¶
-
table_head_row_css
= 'thead'¶
-
table_body_row_css
= 'tbody'¶
-
table_cell_css
= 'th, td'¶
-
ignore_css
= 'sub, sup, em[class^="EmphasisTypeItalic "], li[class="article-metrics__item"], div[class="CitationContent"]'¶
-
-
chemdataextractor.reader.springer.
springer_html_whitespace
(document)[source]¶ Remove the text or tail of each element if it consists only of whitespace.
-
chemdataextractor.reader.springer.
fix_springer_table_whitespace
(document)[source]¶ Remove leading and trailing whitespace from table cells.
- Parameters:
document – the document to clean
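Trimming table cells can be sketched as below. The sketch uses the standard library's xml.etree.ElementTree instead of lxml and assumes cells are td and th elements; strip_cell_whitespace is a hypothetical name, and the handling of cells with child elements is one possible choice, not the library's documented behaviour.

```python
import xml.etree.ElementTree as ET


def strip_cell_whitespace(root, cell_tags=("td", "th")):
    """Trim whitespace at the start and end of each table cell."""
    for cell in root.iter():
        if cell.tag not in cell_tags:
            continue
        children = list(cell)
        if not children:
            # Plain cell: all of its content is in cell.text.
            if cell.text:
                cell.text = cell.text.strip()
            continue
        # Cell with child elements: the leading text is in cell.text and
        # the trailing text is in the tail of the last child.
        if cell.text:
            cell.text = cell.text.lstrip()
        if children[-1].tail:
            children[-1].tail = children[-1].tail.rstrip()
    return root
```

Whitespace between words inside a cell is preserved, so a value such as "1.5 eV" keeps its internal space.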
-
class
chemdataextractor.reader.springer.
SpringerHtmlReader
[source]¶ Bases:
chemdataextractor.reader.markup.HtmlReader
-
cleaners
= [<chemdataextractor.scrape.clean.Cleaner object>, <function springer_html_whitespace>, <chemdataextractor.scrape.clean.Cleaner object>, <bound method Cleaner.clean_html of <chemdataextractor.scrape.clean.Cleaner object>>, <function tidy_springer_references>, <function fix_springer_table_whitespace>]¶
-
root_css
= 'html'¶
-
title_css
= 'h1[class^="ArticleTitle"]'¶
-
heading_css
= 'h2, h3, h4'¶
-
table_css
= 'div[class="Table"]'¶
-
table_head_row_css
= 'thead tr'¶
-
table_body_row_css
= 'tbody tr'¶
-
table_cell_css
= 'td, th'¶
-
figure_css
= 'figure'¶
-
figure_label_css
= 'figcaption span[class^="CaptionNumber"]'¶
-
ignore_css
= 'a[class="skip-to__link pseudo-focus"], div[class="nojs-banner u-interface"], a[class="skip-to__link skip-to__link--contents pseudo-focus"], p[class="leaderboard__label"], div[class="u-screenreader-only"], label[for="search-springerlink"], span[class="search-button__title"], span[class="u-overflow-ellipsis"], span[class="u-overflow-ellipsis"], a[class="c-button c-button--blue c-button__icon-right gtm-pdf-link"], div[class="leaderboard u-hide"], title, li[class="article-metrics__item"], aside[class="section section--collapsible"], a[class="gtm-cite-link"], span[class="u-screenreader-only"], div[class="authors__list"], a[class="gtm-tab-authorsandaffiliations"], ol[class="BibliographyWrapper"], h2[id="copyrightInformation"], div[class="content authors-affiliations u-interface"], p[class="footer__copyright"], p[class="footer__user-access-info"], span[class="u-screenreader-only"], a[href="/contactus"], a[class="gtm-footer-accessibility"], ul[class="footer__nav"], div[class="footer__aside-wrapper"], aside[class="main-sidebar-right u-interface"], a[class="c-button share-this gtm-shareby-sharelink-link test-shareby-sharelink-link"], a[class="gtm-export-citation"], ul[class="citations__content"], h3[data-role="button-dropdown__title"], div[class="section section--collapsible uptodate-recommendations gtm-recommendations"], span[class="InlineEquation"], div[class="EquationContent"], div[class="EquationNumber"], footer'¶
-