.scrape¶
Scrapers for the various data sources
Declarative scraping framework for extracting structured data from HTML and XML documents.
-
chemdataextractor.scrape.BLOCK_ELEMENTS= {'address', 'article', 'aside', 'audio', 'blockquote', 'body', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'head', 'header', 'hgroup', 'hr', 'li', 'noscript', 'ol', 'output', 'p', 'pre', 'section', 'table', 'tbody', 'td', 'tfoot', 'th', 'thead', 'title', 'tr', 'ul'}¶ Block level HTML elements
-
chemdataextractor.scrape.INLINE_ELEMENTS= {'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'blink', 'br', 'button', 'cite', 'code', 'dfn', 'em', 'font', 'i', 'img', 'input', 'kbd', 'label', 'map', 'marquee', 'nobr', 'object', 'q', 's', 'samp', 'script', 'select', 'small', 'span', 'strike', 'strong', 'sub', 'sup', 'textarea', 'tt', 'u', 'var', 'wbr'}¶ Inline level HTML elements
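As an illustration of why the two sets are kept apart, the following minimal sketch breaks text at block-level boundaries while letting inline tags pass through unbroken. It uses only the standard library's `html.parser`. The `BLOCK_ELEMENTS` set here is a small subset copied from the listing above, and `TextExtractor` and `html_to_lines` are hypothetical names, not part of ChemDataExtractor.

```python
from html.parser import HTMLParser

# Small subset of the documented set, copied so the sketch is self-contained.
BLOCK_ELEMENTS = {'div', 'p', 'h1', 'h2', 'li', 'ul', 'ol', 'table', 'tr', 'td'}


class TextExtractor(HTMLParser):
    """Collect text, starting a new line at every block-level boundary."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Inline tags (sub, b, span, ...) are not in the set, so they add no break.
        if tag in BLOCK_ELEMENTS:
            self.parts.append('\n')

    def handle_endtag(self, tag):
        if tag in BLOCK_ELEMENTS:
            self.parts.append('\n')

    def handle_data(self, data):
        self.parts.append(data)

    def get_lines(self):
        text = ''.join(self.parts)
        return [line.strip() for line in text.split('\n') if line.strip()]


def html_to_lines(html):
    """Return the non-empty text lines of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_lines()


print(html_to_lines('<div><p>H<sub>2</sub>O is <b>water</b>.</p><p>Next paragraph.</p></div>'))
# → ['H2O is water.', 'Next paragraph.']
```

The subscript and bold tags leave "H2O is water." on one line, while each paragraph boundary starts a new one.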
.scrape.pub¶
Scraping tools for specific publishers.
.scrape.pub.nlm¶
Tools for scraping documents from NLM Journal Archiving and Interchange DTD XML files.
-
chemdataextractor.scrape.pub.nlm.strip_pmc_xml= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that kills reference links, footnote links, equations, footnotes
-
chemdataextractor.scrape.pub.nlm.strip_pmc_abstract_xml= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that also kills headings
-
chemdataextractor.scrape.pub.nlm.strip_pmc_paragraph_xml= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that also kills tables and figures
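The strippers above remove whole elements, yet the text that follows a removed element usually has to survive. A minimal sketch of that idea with the standard library's `xml.etree.ElementTree` might look like the following. `kill_elements` is a hypothetical helper, the tag names are only examples of NLM markup, and this is not the `Cleaner` implementation itself.

```python
import xml.etree.ElementTree as ET


def kill_elements(root, kill_tags):
    """Remove every element whose tag is in kill_tags, keeping its tail text."""
    for parent in list(root.iter()):
        previous = None  # last surviving sibling seen so far
        for child in list(parent):
            if child.tag in kill_tags:
                tail = child.tail or ''
                # Reattach the tail so the text after the element is not lost.
                if previous is not None:
                    previous.tail = (previous.tail or '') + tail
                else:
                    parent.text = (parent.text or '') + tail
                parent.remove(child)
            else:
                previous = child
    return root


xml = '<p>Yield was high<xref>1</xref> at 80 C<disp-formula>E=mc2</disp-formula>.</p>'
root = kill_elements(ET.fromstring(xml), {'xref', 'disp-formula'})
print(''.join(root.itertext()))
# → Yield was high at 80 C.
```

In ElementTree the text after an element is stored on that element as `tail`, so removing the element without moving the tail would silently drop " at 80 C" and the final full stop.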
-
chemdataextractor.scrape.pub.nlm.space_labels(document)[source]¶ Ensure space around bold compound labels.
-
chemdataextractor.scrape.pub.nlm.tidy_nlm_references(document)[source]¶ Remove punctuation around references like brackets, commas, hyphens.
-
class chemdataextractor.scrape.pub.nlm.NlmXmlAuthor(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Author information from NLM XML file.
-
givennames¶ A string field.
-
lastname¶ A string field.
-
email¶ A string field.
-
process_givennames= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_lastname= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'email': <chemdataextractor.scrape.fields.StringField object>, 'givennames': <chemdataextractor.scrape.fields.StringField object>, 'lastname': <chemdataextractor.scrape.fields.StringField object>}¶
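The pattern visible in this listing (class-level field declarations, a collected `fields` mapping, and optional `process_<field>` hooks) can be sketched with the standard library alone. Everything below is a toy under stated assumptions: `StringField`, `EntityMeta`, `Entity`, `Author`, and the element paths are hypothetical stand-ins, not ChemDataExtractor's own classes or selectors.

```python
import xml.etree.ElementTree as ET


class StringField:
    """Toy field: holds the ElementTree path used to locate the value."""

    def __init__(self, path):
        self.path = path


class EntityMeta(type):
    """Gather the StringField attributes of a class into a `fields` dict."""

    def __new__(mcs, name, bases, attrs):
        cls = super().__new__(mcs, name, bases, attrs)
        cls.fields = {key: value for key, value in attrs.items()
                      if isinstance(value, StringField)}
        return cls


class Entity(metaclass=EntityMeta):
    def __init__(self, element):
        self.element = element

    def serialize(self):
        data = {}
        for name, field in self.fields.items():
            found = self.element.find(field.path)
            if found is None:
                continue
            value = found.text or ''
            # Optional per-field hook, looked up by naming convention.
            process = getattr(self, 'process_' + name, None)
            if process is not None:
                value = process(value)
            data[name] = value
        return data


def normalize_space(text):
    """Collapse runs of whitespace and trim both ends."""
    return ' '.join(text.split())


class Author(Entity):
    # Illustrative paths only; real documents may nest these differently.
    givennames = StringField('./name/given-names')
    lastname = StringField('./name/surname')
    email = StringField('./email')
    process_lastname = staticmethod(normalize_space)


AUTHOR_XML = ('<contrib><name><surname>  Doe </surname>'
              '<given-names>Jane</given-names></name>'
              '<email>jane@example.org</email></contrib>')

print(Author(ET.fromstring(AUTHOR_XML)).serialize())
# → {'givennames': 'Jane', 'lastname': 'Doe', 'email': 'jane@example.org'}
```

Keeping the hooks as named class attributes means a subclass can change how one field is post-processed without touching the extraction logic.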
-
-
class chemdataextractor.scrape.pub.nlm.NlmXmlImage(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Figure information from NLM XML file.
-
label¶ A string field.
-
caption¶ A string field.
-
reference¶ A string field.
-
clean_caption= <chemdataextractor.text.processors.Chain object>¶
-
process_caption= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.nlm.NlmXmlTable(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Table information from NLM XML file.
-
label¶ A string field.
-
caption¶ A string field.
-
reference¶ A string field.
-
src¶ A string field.
-
clean_caption= <chemdataextractor.text.processors.Chain object>¶
-
process_caption= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.nlm.NlmXmlDocument(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Document information from an NLM XML file.
-
doi¶ A string field.
-
pmid¶ An integer number field.
-
pmcid¶ An integer number field.
-
title¶ A string field.
-
authors¶ A field that contains another Entity.
-
journal_title¶ A string field.
-
journal_abbreviation¶ A string field.
-
publisher¶ A string field.
-
volume¶ A string field.
-
firstpage¶ A string field.
-
lastpage¶ A string field.
-
issue¶ A string field.
-
issn¶ A string field.
-
coden¶ A string field.
-
abstract¶ A string field.
-
online_year¶ An integer number field.
-
online_month¶ An integer number field.
-
online_day¶ An integer number field.
-
published_year¶ An integer number field.
-
published_month¶ An integer number field.
-
published_day¶ An integer number field.
-
accepted_year¶ An integer number field.
-
accepted_month¶ An integer number field.
-
accepted_day¶ An integer number field.
-
received_year¶ An integer number field.
-
received_month¶ An integer number field.
-
received_day¶ An integer number field.
-
license¶ A field with optional URL processing.
-
clean_title= <chemdataextractor.scrape.clean.Cleaner object>¶
-
clean_abstract= <chemdataextractor.scrape.clean.Cleaner object>¶
-
process_title= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_publisher= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_abstract= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'accepted_day': <chemdataextractor.scrape.fields.IntField object>, 'accepted_month': <chemdataextractor.scrape.fields.IntField object>, 'accepted_year': <chemdataextractor.scrape.fields.IntField object>, 'authors': <chemdataextractor.scrape.fields.EntityField object>, 'coden': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal_abbreviation': <chemdataextractor.scrape.fields.StringField object>, 'journal_title': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_day': <chemdataextractor.scrape.fields.IntField object>, 'online_month': <chemdataextractor.scrape.fields.IntField object>, 'online_year': <chemdataextractor.scrape.fields.IntField object>, 'pmcid': <chemdataextractor.scrape.fields.IntField object>, 'pmid': <chemdataextractor.scrape.fields.IntField object>, 'published_day': <chemdataextractor.scrape.fields.IntField object>, 'published_month': <chemdataextractor.scrape.fields.IntField object>, 'published_year': <chemdataextractor.scrape.fields.IntField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'received_day': <chemdataextractor.scrape.fields.IntField object>, 'received_month': <chemdataextractor.scrape.fields.IntField object>, 'received_year': <chemdataextractor.scrape.fields.IntField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
.scrape.pub.rsc¶
Tools for scraping documents from The Royal Society of Chemistry.
-
chemdataextractor.scrape.pub.rsc.CHAR_REPLACEMENTS= [('\\[?\\[1 with combining macron\\]\\]?', '1̄'), ('\\[?\\[2 with combining macron\\]\\]?', '2̄'), ('\\[?\\[3 with combining macron\\]\\]?', '3̄'), ('\\[?\\[4 with combining macron\\]\\]?', '4̄'), ('\\[?\\[approximate\\]\\]?', '≈'), ('\\[?\\[bottom\\]\\]?', '⊥'), ('\\[?\\[c with combining tilde\\]\\]?', 'C̃'), ('\\[?\\[capital delta\\]\\]?', 'Δ'), ('\\[?\\[capital lambda\\]\\]?', 'Λ'), ('\\[?\\[capital omega\\]\\]?', 'Ω'), ('\\[?\\[capital phi\\]\\]?', 'Φ'), ('\\[?\\[capital pi\\]\\]?', 'Π'), ('\\[?\\[capital psi\\]\\]?', 'Ψ'), ('\\[?\\[capital sigma\\]\\]?', 'Σ'), ('\\[?\\[caret\\]\\]?', '^'), ('\\[?\\[congruent with\\]\\]?', '≅'), ('\\[?\\[curly or open phi\\]\\]?', 'ϕ'), ('\\[?\\[dagger\\]\\]?', '†'), ('\\[?\\[dbl greater-than\\]\\]?', '≫'), ('\\[?\\[dbl vertical bar\\]\\]?', '‖'), ('\\[?\\[degree\\]\\]?', '°'), ('\\[?\\[double bond, length as m-dash\\]\\]?', '='), ('\\[?\\[double bond, length half m-dash\\]\\]?', '='), ('\\[?\\[double dagger\\]\\]?', '‡'), ('\\[?\\[double equals\\]\\]?', '≧'), ('\\[?\\[double less-than\\]\\]?', '≪'), ('\\[?\\[double prime\\]\\]?', '″'), ('\\[?\\[downward arrow\\]\\]?', '↓'), ('\\[?\\[fraction five-over-two\\]\\]?', '5/2'), ('\\[?\\[fraction three-over-two\\]\\]?', '3/2'), ('\\[?\\[gamma\\]\\]?', 'γ'), ('\\[?\\[greater-than-or-equal\\]\\]?', '≥'), ('\\[?\\[greater, similar\\]\\]?', '≳'), ('\\[?\\[gt-or-equal\\]\\]?', '≥'), ('\\[?\\[i without dot\\]\\]?', 'ı'), ('\\[?\\[identical with\\]\\]?', '≡'), ('\\[?\\[infinity\\]\\]?', '∞'), ('\\[?\\[intersection\\]\\]?', '∩'), ('\\[?\\[iota\\]\\]?', 'ι'), ('\\[?\\[is proportional to\\]\\]?', '∝'), ('\\[?\\[leftrightarrow\\]\\]?', '↔'), ('\\[?\\[leftrightarrows\\]\\]?', '⇄'), ('\\[?\\[less-than-or-equal\\]\\]?', '≤'), ('\\[?\\[less, similar\\]\\]?', '≲'), ('\\[?\\[logical and\\]\\]?', '∧'), ('\\[?\\[middle dot\\]\\]?', '·'), ('\\[?\\[not equal\\]\\]?', '≠'), ('\\[?\\[parallel\\]\\]?', '∥'), ('\\[?\\[per thousand\\]\\]?', '‰'), 
('\\[?\\[prime or minute\\]\\]?', '′'), ('\\[?\\[quadruple bond, length as m-dash\\]\\]?', '≣'), ('\\[?\\[radical dot\\]\\]?', ' ̇'), ('\\[?\\[ratio\\]\\]?', '∶'), ('\\[?\\[registered sign\\]\\]?', '®'), ('\\[?\\[reverse similar\\]\\]?', '∽'), ('\\[?\\[right left arrows\\]\\]?', '⇄'), ('\\[?\\[right left harpoons\\]\\]?', '⇌'), ('\\[?\\[rightward arrow\\]\\]?', '→'), ('\\[?\\[round bullet, filled\\]\\]?', '•'), ('\\[?\\[sigma\\]\\]?', 'σ'), ('\\[?\\[similar\\]\\]?', '∼'), ('\\[?\\[small alpha\\]\\]?', 'α'), ('\\[?\\[small beta\\]\\]?', 'β'), ('\\[?\\[small chi\\]\\]?', 'χ'), ('\\[?\\[small delta\\]\\]?', 'δ'), ('\\[?\\[small eta\\]\\]?', 'η'), ('\\[?\\[small gamma, Greek, dot above\\]\\]?', 'γ̇'), ('\\[?\\[small kappa\\]\\]?', 'κ'), ('\\[?\\[small lambda\\]\\]?', 'λ'), ('\\[?\\[small micro\\]\\]?', 'µ'), ('\\[?\\[small mu \\]\\]?', 'μ'), ('\\[?\\[small nu\\]\\]?', 'ν'), ('\\[?\\[small omega\\]\\]?', 'ω'), ('\\[?\\[small phi\\]\\]?', 'φ'), ('\\[?\\[small pi\\]\\]?', 'π'), ('\\[?\\[small psi\\]\\]?', 'ψ'), ('\\[?\\[small tau\\]\\]?', 'τ'), ('\\[?\\[small theta\\]\\]?', 'θ'), ('\\[?\\[small upsilon\\]\\]?', 'υ'), ('\\[?\\[small xi\\]\\]?', 'ξ'), ('\\[?\\[small zeta\\]\\]?', 'ζ'), ('\\[?\\[space\\]\\]?', ' '), ('\\[?\\[square\\]\\]?', '□'), ('\\[?\\[subset or is implied by\\]\\]?', '⊂'), ('\\[?\\[summation operator\\]\\]?', '∑'), ('\\[?\\[times\\]\\]?', '×'), ('\\[?\\[trade mark sign\\]\\]?', '™'), ('\\[?\\[triple bond, length as m-dash\\]\\]?', '≡'), ('\\[?\\[triple bond, length half m-dash\\]\\]?', '≡'), ('\\[?\\[triple prime\\]\\]?', '‴'), ('\\[?\\[upper bond 1 end\\]\\]?', ''), ('\\[?\\[upper bond 1 start\\]\\]?', ''), ('\\[?\\[upward arrow\\]\\]?', '↑'), ('\\[?\\[varepsilon\\]\\]?', 'ε'), ('\\[?\\[x with combining tilde\\]\\]?', 'X̃')]¶ Map placeholder text to unicode characters.
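Each entry in `CHAR_REPLACEMENTS` pairs a regular expression with its replacement, and the optional outer brackets let one pattern match both `[name]` and `[[name]]`. A minimal sketch of applying such pairs with `re.sub` follows. It copies three entries from the list above, and `substitute` is a hypothetical helper; the library's own `Substitutor` may work differently.

```python
import re

# Three entries copied from the documented list, written as raw strings.
CHAR_REPLACEMENTS = [
    (r'\[?\[small alpha\]\]?', 'α'),
    (r'\[?\[degree\]\]?', '°'),
    (r'\[?\[rightward arrow\]\]?', '→'),
]


def substitute(text, replacements=CHAR_REPLACEMENTS):
    """Apply every (pattern, replacement) pair to the text in order."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


print(substitute('heated to 80 [degree]C, giving the [[small alpha]]-isomer'))
# → heated to 80 °C, giving the α-isomer
```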
-
chemdataextractor.scrape.pub.rsc.RSC_IMG_CHARS= {'2041': '^', '224a': '≈', 'e001': '=', 'e002': '≡', 'e003': '≣', 'e006': '=', 'e007': '≡', 'e009': '>', 'e00a': '<', 'e00c': '⚟', 'e00d': '⚞', 'e010': '┌', 'e011': '┐', 'e012': '└', 'e013': '┘', 'e038': '⬡', 'e059': '◍', 'e05a': '◍', 'e069': '▩', 'e077': '⬓', 'e082': '⬘', 'e083': '⬙', 'e084': '⟐', 'e090': '┄', 'e091': '┄', 'e0a2': 'γ̇', 'e0b3': 'μ͂', 'e0b7': 'ρ͂', 'e0c2': 'α̅', 'e0c3': 'β̅', 'e0c5': 'δ̅', 'e0c6': 'ε̅', 'e0c9': 'θ̅', 'e0cb': 'κ̅', 'e0cc': 'λ̅', 'e0cd': 'μ̅', 'e0ce': 'v̅', 'e0d1': 'ρ̅', 'e0d4': 'τ̅', 'e0d5': 'ν̅', 'e0d6': 'ϕ̅', 'e0d7': 'φ̅', 'e0d8': 'χ̅', 'e0da': 'ν̅', 'e0db': 'Φ̃', 'e0dd': 'γ̃', 'e0de': 'ε̃', 'e0e0': 'μ̃', 'e0e1': 'ṽ', 'e0e4': 'ρ̃', 'e0e7': 'ε⃗', 'e0e9': 'μ⃗', 'e0eb': '⦵', 'e0ec': '|', 'e0ed': '|', 'e0ee': '3/2', 'e0f1': '𝌂', 'e0f5': 'ν', 'e0f6': '⟿', 'e100': '┆', 'e103': '★', 'e107': 'ε͂', 'e108': 'ῆ', 'e109': 'κ͂', 'e10d': 'σ̃', 'e110': 'η̃', 'e112': '𝒢', 'e113': '𝈙', 'e116': '⤳', 'e117': '━', 'e11a': 'λ͂', 'e11b': 'χ̃', 'e11f': '5/2', 'e120': '5/4', 'e124': '⬢', 'e131': 'ν̃', 'e132': 'Γ͂', 'e13d': '⬟', 'e142': 'ℋ', 'e144': 'ℒ', 'e146': 'ℓ', 'e170': '𝕄', 'e175': 'ℝ', 'e177': '𝕋', 'e17e': '𝖀', 'e18f': '𝕽', 'e1c0': '⬡', 'e520': '𝒜', 'e523': '𝒟', 'e529': '𝒥', 'e52d': '𝒩', 'e52f': '𝒫', 'e531': 'ℛ', 'e533': '𝒯'}¶ Map image URL components to unicode characters.
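The keys of `RSC_IMG_CHARS` are short codes that appear inside image URLs. The sketch below shows one way such a lookup could work. The URL shape (a file named `char_<code>.gif`), the pattern, and the `image_to_char` helper are all assumptions made for illustration; only the three dictionary entries come from the listing above.

```python
import re

# Three entries copied from the documented mapping.
RSC_IMG_CHARS = {'224a': '≈', 'e001': '=', 'e002': '≡'}

# Assumed URL shape: the code is the file stem after "char_".
CODE_PATTERN = re.compile(r'char_([0-9a-f]{4})\.gif$')


def image_to_char(src):
    """Return the Unicode character for an image URL, or None if unknown."""
    match = CODE_PATTERN.search(src)
    if match is None:
        return None
    return RSC_IMG_CHARS.get(match.group(1))


print(image_to_char('https://example.org/images/entities/char_e002.gif'))
# → ≡
```

Returning `None` for unrecognised URLs lets a caller leave ordinary figures in place and replace only the character images.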
-
chemdataextractor.scrape.pub.rsc.strip_rsc_html= <chemdataextractor.scrape.clean.Cleaner object>¶ HTML stripper that kills superscript references and anything with style="display: none;" (typically tooltips).
-
chemdataextractor.scrape.pub.rsc.strip_cit_html= <chemdataextractor.scrape.clean.Cleaner object>¶ HTML stripper that also kills text from buttons in references.
-
chemdataextractor.scrape.pub.rsc.rsc_substitute= <chemdataextractor.text.processors.Substitutor object>¶ Substitutor that replaces RSC escape codes with the actual unicode character
-
chemdataextractor.scrape.pub.rsc.parse_rsc_html(htmlstring)[source]¶ Messy RSC HTML needs this special parser to fix problems before creating selector.
-
chemdataextractor.scrape.pub.rsc.replace_rsc_img_chars(document)[source]¶ Replace image characters with unicode equivalents.
-
chemdataextractor.scrape.pub.rsc.space_references(document)[source]¶ Ensure a space around reference links, so there’s a gap when they are removed.
-
class chemdataextractor.scrape.pub.rsc.RscRssDocument(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Document information from RSC RSS feed.
-
doi¶ A string field.
-
title¶ A string field.
-
authors¶ A string field.
-
landing_url¶ A field with optional URL processing.
-
process_title= <chemdataextractor.text.processors.Chain object>¶
-
fields= {'authors': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.rsc.RscRssScraper[source]¶
Bases: chemdataextractor.scrape.scraper.RssScraper
Scraper for RSC RSS feeds.
-
entity¶ alias of
RscRssDocument
-
-
class chemdataextractor.scrape.pub.rsc.RscSearchDocument(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Document information from RSC search results page.
-
doi¶ A string field.
-
title¶ A string field.
-
landing_url¶ A field with optional URL processing.
-
pdf_url¶ A field with optional URL processing.
-
html_url¶ A field with optional URL processing.
-
journal¶ A string field.
-
abstract¶ A string field.
-
clean_title= <chemdataextractor.text.processors.Chain object>¶
-
process_doi= <chemdataextractor.text.processors.LAdd object>¶
-
process_title= <chemdataextractor.text.processors.Chain object>¶
-
process_landing_url= <chemdataextractor.text.processors.Chain object>¶
-
process_pdf_url= <chemdataextractor.text.processors.Chain object>¶
-
process_html_url= <chemdataextractor.text.processors.Chain object>¶
-
process_abstract= <chemdataextractor.text.processors.Chain object>¶
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.rsc.RscSearchScraper(max_wait_time=30, driver=None)[source]¶
Bases: chemdataextractor.scrape.scraper.SearchScraper
Scraper for RSC search results.
-
entity¶ alias of
RscSearchDocument
-
root= '.capsule.capsule--article'¶
-
-
class chemdataextractor.scrape.pub.rsc.RscLandingSupplement(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
-
name¶ A string field.
-
url¶ A field with optional URL processing.
-
fields= {'name': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>}¶
-
-
class chemdataextractor.scrape.pub.rsc.RscLandingDocument(selector)[source]¶
Bases: chemdataextractor.scrape.entity.DocumentEntity
Document information from RSC landing page.
-
supplements¶ A field that contains another Entity.
-
process_abstract= <chemdataextractor.text.processors.Chain object>¶
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'supplements': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.rsc.RscLandingScraper[source]¶
Bases: chemdataextractor.scrape.scraper.UrlScraper
Scraper for RSC Landing pages.
-
entity¶ alias of
RscLandingDocument
-
-
class chemdataextractor.scrape.pub.rsc.RscChemicalMention(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
-
text¶ A string field.
-
chemspider_id¶ A string field.
-
inchi¶ A string field.
-
clean_text= <chemdataextractor.text.processors.Chain object>¶
-
process_text= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_chemspider_id= <chemdataextractor.text.processors.Chain object>¶
-
process_inchi= <chemdataextractor.text.processors.Chain object>¶
-
fields= {'chemspider_id': <chemdataextractor.scrape.fields.StringField object>, 'inchi': <chemdataextractor.scrape.fields.StringField object>, 'text': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.rsc.RscImage(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Embedded image. Includes both Schemes and Figures.
-
url¶ A field with optional URL processing.
-
label¶ A string field.
-
reference¶ A string field.
-
caption¶ A string field.
-
clean_caption= <chemdataextractor.text.processors.Chain object>¶
-
process_caption= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>}¶
-
-
class chemdataextractor.scrape.pub.rsc.RscTable(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Table within document.
-
reference¶ A string field.
-
label¶ A string field.
-
caption¶ A string field.
-
src¶ A string field.
-
clean_src= <chemdataextractor.text.processors.Chain object>¶
-
clean_caption= <chemdataextractor.text.processors.Chain object>¶
-
process_caption= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.rsc.RscHtmlDocument(selector)[source]¶
Bases: chemdataextractor.scrape.entity.DocumentEntity
-
title¶ A string field.
-
abstract¶ A string field.
-
pdf_url¶ A field with optional URL processing.
-
html_url¶ A field with optional URL processing.
-
landing_url¶ A field with optional URL processing.
-
clean_title= <chemdataextractor.text.processors.Chain object>¶
-
clean_abstract= <chemdataextractor.text.processors.Chain object>¶
-
process_title= <chemdataextractor.text.processors.Chain object>¶
-
process_abstract= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.rsc.RscHtmlScraper[source]¶
Bases: chemdataextractor.scrape.scraper.UrlScraper
Scraper for RSC Landing pages.
-
entity¶ alias of
RscHtmlDocument
-
.scrape.pub.springer¶
Tools for scraping documents from Springer, Biomed Central and Chemistry Central XML files.
-
chemdataextractor.scrape.pub.springer.strip_springer_xml= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that also kills equations/formulas.
-
chemdataextractor.scrape.pub.springer.strip_springer_abstract_xml= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that also kills headings
-
chemdataextractor.scrape.pub.springer.tidy_springer_references(document)[source]¶ Remove punctuation around references like brackets, commas, hyphens.
-
class chemdataextractor.scrape.pub.springer.SpringerHtmlDocument(selector)[source]¶
Bases: chemdataextractor.scrape.entity.DocumentEntity
Scraper for Springer HTML articles
-
title¶ A string field.
-
abstract¶ A string field.
-
journal¶ A string field.
-
process_html_url= <chemdataextractor.text.processors.RAdd object>¶
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.springer.SpringerXmlAuthor(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Author information from a Springer XML file.
-
firstname¶ A string field.
-
middlename¶ A string field.
-
lastname¶ A string field.
-
suffix¶ A string field.
-
email¶ A string field.
-
process_email= <chemdataextractor.text.processors.Discard object>¶
-
fields= {'email': <chemdataextractor.scrape.fields.StringField object>, 'firstname': <chemdataextractor.scrape.fields.StringField object>, 'lastname': <chemdataextractor.scrape.fields.StringField object>, 'middlename': <chemdataextractor.scrape.fields.StringField object>, 'suffix': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.springer.SpringerXmlImage(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Figure information from a Springer XML file.
-
label¶ A string field.
-
caption¶ A string field.
-
reference¶ A string field.
-
clean_caption= <chemdataextractor.scrape.clean.Cleaner object>¶
-
process_caption= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.springer.SpringerXmlTable(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Table information from a Springer XML file.
-
label¶ A string field.
-
caption¶ A string field.
-
reference¶ A string field.
-
src¶ A string field.
-
clean_caption= <chemdataextractor.scrape.clean.Cleaner object>¶
-
process_caption= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class chemdataextractor.scrape.pub.springer.SpringerXmlDocument(selector)[source]¶
Bases: chemdataextractor.scrape.entity.Entity
Document information from a Springer XML file.
-
ui¶ A string field.
-
doi¶ A string field.
-
title¶ A string field.
-
authors¶ A field that contains another Entity.
-
journal¶ A string field.
-
firstpage¶ A string field.
-
year¶ An integer number field.
-
volume¶ A string field.
-
issue¶ A string field.
-
issn¶ A string field.
-
landing_url¶ A field with optional URL processing.
-
abstract¶ A string field.
-
published_year¶ An integer number field.
-
published_month¶ An integer number field.
-
published_day¶ An integer number field.
-
accepted_year¶ An integer number field.
-
accepted_month¶ An integer number field.
-
accepted_day¶ An integer number field.
-
received_year¶ An integer number field.
-
received_month¶ An integer number field.
-
received_day¶ An integer number field.
-
license¶ A field with optional URL processing.
-
figures¶ A field that contains another Entity.
-
schemes¶ A field that contains another Entity.
-
tables¶ A field that contains another Entity.
-
headings¶ A string field.
-
paragraphs¶ A string field.
-
clean_title= <chemdataextractor.scrape.clean.Cleaner object>¶
-
clean_abstract= <chemdataextractor.text.processors.Chain object>¶
-
clean_headings= <chemdataextractor.scrape.clean.Cleaner object>¶
-
clean_paragraphs= <chemdataextractor.text.processors.Chain object>¶
-
process_abstract= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_headings= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_paragraphs= <chemdataextractor.text.processors.Chain object>¶
-
process_license= <chemdataextractor.text.processors.Chain object>¶
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'accepted_day': <chemdataextractor.scrape.fields.IntField object>, 'accepted_month': <chemdataextractor.scrape.fields.IntField object>, 'accepted_year': <chemdataextractor.scrape.fields.IntField object>, 'authors': <chemdataextractor.scrape.fields.EntityField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'published_day': <chemdataextractor.scrape.fields.IntField object>, 'published_month': <chemdataextractor.scrape.fields.IntField object>, 'published_year': <chemdataextractor.scrape.fields.IntField object>, 'received_day': <chemdataextractor.scrape.fields.IntField object>, 'received_month': <chemdataextractor.scrape.fields.IntField object>, 'received_year': <chemdataextractor.scrape.fields.IntField object>, 'schemes': <chemdataextractor.scrape.fields.EntityField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'ui': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>, 'year': <chemdataextractor.scrape.fields.IntField object>}¶
-
.scrape.pub.elsevier¶
Tools for scraping documents from Elsevier.
copyright: Copyright 2017 by Callum Court.
license: MIT, see LICENSE file for more details.
-
chemdataextractor.scrape.pub.elsevier.CHAR_REPLACEMENTS= [('\\[?\\[1 with combining macron\\]\\]?', '1̄'), ('\\[?\\[2 with combining macron\\]\\]?', '2̄'), ('\\[?\\[3 with combining macron\\]\\]?', '3̄'), ('\\[?\\[4 with combining macron\\]\\]?', '4̄'), ('\\[?\\[approximate\\]\\]?', '≈'), ('\\[?\\[bottom\\]\\]?', '⊥'), ('\\[?\\[c with combining tilde\\]\\]?', 'C̃'), ('\\[?\\[capital delta\\]\\]?', 'Δ'), ('\\[?\\[capital lambda\\]\\]?', 'Λ'), ('\\[?\\[capital omega\\]\\]?', 'Ω'), ('\\[?\\[capital phi\\]\\]?', 'Φ'), ('\\[?\\[capital pi\\]\\]?', 'Π'), ('\\[?\\[capital psi\\]\\]?', 'Ψ'), ('\\[?\\[capital sigma\\]\\]?', 'Σ'), ('\\[?\\[caret\\]\\]?', '^'), ('\\[?\\[congruent with\\]\\]?', '≅'), ('\\[?\\[curly or open phi\\]\\]?', 'ϕ'), ('\\[?\\[dagger\\]\\]?', '†'), ('\\[?\\[dbl greater-than\\]\\]?', '≫'), ('\\[?\\[dbl vertical bar\\]\\]?', '‖'), ('\\[?\\[degree\\]\\]?', '°'), ('\\[?\\[double bond, length as m-dash\\]\\]?', '='), ('\\[?\\[double bond, length half m-dash\\]\\]?', '='), ('\\[?\\[double dagger\\]\\]?', '‡'), ('\\[?\\[double equals\\]\\]?', '≧'), ('\\[?\\[double less-than\\]\\]?', '≪'), ('\\[?\\[double prime\\]\\]?', '″'), ('\\[?\\[downward arrow\\]\\]?', '↓'), ('\\[?\\[fraction five-over-two\\]\\]?', '5/2'), ('\\[?\\[fraction three-over-two\\]\\]?', '3/2'), ('\\[?\\[gamma\\]\\]?', 'γ'), ('\\[?\\[greater-than-or-equal\\]\\]?', '≥'), ('\\[?\\[greater, similar\\]\\]?', '≳'), ('\\[?\\[gt-or-equal\\]\\]?', '≥'), ('\\[?\\[i without dot\\]\\]?', 'ı'), ('\\[?\\[identical with\\]\\]?', '≡'), ('\\[?\\[infinity\\]\\]?', '∞'), ('\\[?\\[intersection\\]\\]?', '∩'), ('\\[?\\[iota\\]\\]?', 'ι'), ('\\[?\\[is proportional to\\]\\]?', '∝'), ('\\[?\\[leftrightarrow\\]\\]?', '↔'), ('\\[?\\[leftrightarrows\\]\\]?', '⇄'), ('\\[?\\[less-than-or-equal\\]\\]?', '≤'), ('\\[?\\[less, similar\\]\\]?', '≲'), ('\\[?\\[logical and\\]\\]?', '∧'), ('\\[?\\[middle dot\\]\\]?', '·'), ('\\[?\\[not equal\\]\\]?', '≠'), ('\\[?\\[parallel\\]\\]?', '∥'), ('\\[?\\[per thousand\\]\\]?', '‰'), 
('\\[?\\[prime or minute\\]\\]?', '′'), ('\\[?\\[quadruple bond, length as m-dash\\]\\]?', '≣'), ('\\[?\\[radical dot\\]\\]?', ' ̇'), ('\\[?\\[ratio\\]\\]?', '∶'), ('\\[?\\[registered sign\\]\\]?', '®'), ('\\[?\\[reverse similar\\]\\]?', '∽'), ('\\[?\\[right left arrows\\]\\]?', '⇄'), ('\\[?\\[right left harpoons\\]\\]?', '⇌'), ('\\[?\\[rightward arrow\\]\\]?', '→'), ('\\[?\\[round bullet, filled\\]\\]?', '•'), ('\\[?\\[sigma\\]\\]?', 'σ'), ('\\[?\\[similar\\]\\]?', '∼'), ('\\[?\\[small alpha\\]\\]?', 'α'), ('\\[?\\[small beta\\]\\]?', 'β'), ('\\[?\\[small chi\\]\\]?', 'χ'), ('\\[?\\[small delta\\]\\]?', 'δ'), ('\\[?\\[small eta\\]\\]?', 'η'), ('\\[?\\[small gamma, Greek, dot above\\]\\]?', 'γ̇'), ('\\[?\\[small kappa\\]\\]?', 'κ'), ('\\[?\\[small lambda\\]\\]?', 'λ'), ('\\[?\\[small micro\\]\\]?', 'µ'), ('\\[?\\[small mu \\]\\]?', 'μ'), ('\\[?\\[small nu\\]\\]?', 'ν'), ('\\[?\\[small omega\\]\\]?', 'ω'), ('\\[?\\[small phi\\]\\]?', 'φ'), ('\\[?\\[small pi\\]\\]?', 'π'), ('\\[?\\[small psi\\]\\]?', 'ψ'), ('\\[?\\[small tau\\]\\]?', 'τ'), ('\\[?\\[small theta\\]\\]?', 'θ'), ('\\[?\\[small upsilon\\]\\]?', 'υ'), ('\\[?\\[small xi\\]\\]?', 'ξ'), ('\\[?\\[small zeta\\]\\]?', 'ζ'), ('\\[?\\[space\\]\\]?', ' '), ('\\[?\\[square\\]\\]?', '□'), ('\\[?\\[subset or is implied by\\]\\]?', '⊂'), ('\\[?\\[summation operator\\]\\]?', '∑'), ('\\[?\\[times\\]\\]?', '×'), ('\\[?\\[trade mark sign\\]\\]?', '™'), ('\\[?\\[triple bond, length as m-dash\\]\\]?', '≡'), ('\\[?\\[triple bond, length half m-dash\\]\\]?', '≡'), ('\\[?\\[triple prime\\]\\]?', '‴'), ('\\[?\\[upper bond 1 end\\]\\]?', ''), ('\\[?\\[upper bond 1 start\\]\\]?', ''), ('\\[?\\[upward arrow\\]\\]?', '↑'), ('\\[?\\[varepsilon\\]\\]?', 'ε'), ('\\[?\\[x with combining tilde\\]\\]?', 'X̃')]¶ Map placeholder text to unicode characters.
-
chemdataextractor.scrape.pub.elsevier.elsevier_substitute= <chemdataextractor.text.processors.Substitutor object>¶ Substitutor that replaces Elsevier placeholder text with the actual Unicode character.
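As an illustration of how a list of (pattern, replacement) pairs such as CHAR_REPLACEMENTS can be applied, here is a minimal sketch using only the standard library re module. It is not the Substitutor implementation; the function name substitute and the shortened REPLACEMENTS list are assumptions made for this example, with three pairs copied from the list above.

```python
import re

# Three pairs copied from CHAR_REPLACEMENTS above; the full list is much longer.
REPLACEMENTS = [
    (r'\[?\[small alpha\]\]?', 'α'),
    (r'\[?\[degree\]\]?', '°'),
    (r'\[?\[rightward arrow\]\]?', '→'),
]


def substitute(text, replacements=REPLACEMENTS):
    """Apply each (pattern, replacement) pair in order (hypothetical helper)."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


print(substitute('[[small alpha]]-helix at 25[degree]C [rightward arrow] product'))
```

Each pattern accepts the placeholder with single or double square brackets, which is why the outer brackets are optional in the regular expressions.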
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierSearchDocument(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Document information from Elsevier API search results.
-
test¶ A string field.
-
fields= {'test': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierSearchScraper[source]¶ Bases:
chemdataextractor.scrape.scraper.UrlScraper
Scraper for Elsevier search results.
-
entity¶ alias of
ElsevierSearchDocument
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierImage(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Embedded figure. Includes both Schemes and Figures.
-
caption¶ A string field.
-
image_url¶ A string field.
-
process_caption= <chemdataextractor.text.processors.Chain object>¶
-
process_image_url= <chemdataextractor.text.processors.LAdd object>¶
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'image_url': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierTableData(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Embedded row data from document tables.
-
rows¶ A string field.
-
fields= {'rows': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierTable(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Table within document.
-
title¶ A string field.
-
column_headings¶ A string field.
-
data¶ A field that contains another Entity.
-
caption¶ A string field.
-
process_title= <chemdataextractor.text.processors.Chain object>¶
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'column_headings': <chemdataextractor.scrape.fields.StringField object>, 'data': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierHtmlDocument(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.DocumentEntity
Scraper of document information from Elsevier HTML papers.
-
doi¶ A string field.
-
title¶ A string field.
-
authors¶ A string field.
-
abstract¶ A string field.
-
journal¶ A string field.
-
volume¶ A string field.
-
copyright¶ A string field.
-
headings¶ A string field.
-
sub_headings¶ A string field.
-
html_url¶ A field with optional URL processing.
-
paragraphs¶ A string field.
-
figures¶ A field that contains another Entity.
-
published_date¶ A string field.
-
citations¶ A string field.
-
tables¶ A field that contains another Entity.
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'citations': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.StringField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'sub_headings': <chemdataextractor.scrape.fields.StringField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierHtmlScraper[source]¶ Bases:
chemdataextractor.scrape.scraper.UrlScraper
Scraper for Elsevier HTML paper pages.
-
entity¶ alias of
ElsevierHtmlDocument
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierXmlImage(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
-
caption¶ A string field.
-
label¶ A string field.
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierXmlTableData(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
-
rows¶ A string field.
-
fields= {'rows': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierXmlTable(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
-
label¶ A string field.
-
caption¶ A string field.
-
column_headings¶ A field that contains another Entity.
-
data¶ A field that contains another Entity.
-
fields= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'column_headings': <chemdataextractor.scrape.fields.EntityField object>, 'data': <chemdataextractor.scrape.fields.EntityField object>, 'label': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.ElsevierXmlDocument(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Scraper for Elsevier XML articles.
-
doi¶ A string field.
-
title¶ A string field.
-
authors¶ A string field.
-
abstract¶ A string field.
-
journal¶ A string field.
-
volume¶ A string field.
-
issue¶ A string field.
-
pages¶ A string field.
-
firstpage¶ A string field.
-
lastpage¶ A string field.
-
copyright¶ A string field.
-
publisher¶ A string field.
-
headings¶ A string field.
-
url¶ A field with optional URL processing.
-
paragraphs¶ A string field.
-
figures¶ A field that contains another Entity.
-
published_date¶ A string field.
-
citations¶ A string field.
-
tables¶ A field that contains another Entity.
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'citations': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'pages': <chemdataextractor.scrape.fields.StringField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'published_date': <chemdataextractor.scrape.fields.StringField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
process_abstract= <chemdataextractor.text.processors.Chain object>¶
-
.scrape.base¶
Abstract base classes that define the interface for Scrapers, Fields, Crawlers, etc.
-
class
chemdataextractor.scrape.base.BaseScraper[source]¶ Bases:
object
Abstract Scraper class from which all Scrapers inherit.
-
root= None¶ CSS selector or XPath expression that returns the root of each entity.
-
root_xpath= False¶ Whether the root is an XPath expression instead of a CSS selector.
-
create_session()[source]¶ Override to set up default data (e.g. headers, authentication) on each request.
-
entity¶ The Entity to scrape.
-
make_request(url, data)[source]¶ Make an HTTP request.
Parameters: - url – The URL to get.
- data – Query data.
Returns: The response to the request.
Return type: requests.Response
-
-
class
chemdataextractor.scrape.base.BaseEntityProcessor[source]¶ Bases:
object
Abstract EntityProcessor class from which all EntityProcessors inherit.
-
process_entity(entity)[source]¶ Process an Entity. Return None to filter Entity from the pipeline.
Parameters: entity (chemdataextractor.scrape.entity.Entity) – The Entity to process.
Returns: The processed Entity.
Return type: Entity or None
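The "return None to filter" contract can be sketched without the library. The following is a minimal illustration, assuming entities are plain dicts and processors are plain functions; run_pipeline, require_doi and tidy_title are hypothetical names, not part of chemdataextractor.

```python
def run_pipeline(entities, processors):
    """Pass each entity through every processor; drop it if any returns None."""
    for entity in entities:
        for processor in processors:
            entity = processor(entity)
            if entity is None:
                break  # filtered out, skip the remaining processors
        else:
            # Only reached when no processor returned None.
            yield entity


def require_doi(entity):
    """Filter out entities without a DOI (hypothetical processor)."""
    return entity if entity.get('doi') else None


def tidy_title(entity):
    """Collapse runs of whitespace in the title (hypothetical processor)."""
    return dict(entity, title=' '.join(entity['title'].split()))


records = [
    {'doi': '10.1000/a', 'title': '  Some   title '},
    {'doi': None, 'title': 'No DOI'},
]
print(list(run_pipeline(records, [require_doi, tidy_title])))
```

Because a filtered entity stops the inner loop immediately, later processors never need to handle None.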
-
-
class
chemdataextractor.scrape.base.BaseEntity[source]¶ Bases:
object
Abstract Entity class from which all Entities inherit.
-
class
chemdataextractor.scrape.base.EntityMeta[source]¶ Bases:
abc.ABCMeta
Metaclass for Entity.
-
class
chemdataextractor.scrape.base.BaseField(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Bases:
object
Base class for all fields.
-
name= None¶
-
__init__(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Parameters: - selection (string) – The CSS selector or XPath expression used to select the content to scrape.
- xpath (bool) – (Optional) Whether selection is an XPath expression instead of a CSS selector. Default False.
- re – (Optional) Regular expression to apply to scraped content.
- all (bool) – (Optional) Whether to scrape all occurrences instead of just the first. Default False.
- default – (Optional) The default value for this field if none is set.
- null (bool) – (Optional) Include in serialized output even if value is None. Default False.
- raw (bool) – (Optional) Whether to scrape the raw HTML/XML instead of the text contents. Default False.
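To make the interplay of re, all and default more concrete, here is a rough standard-library sketch. It assumes the selection has already produced a list of candidate strings, and that a regular expression returns its first capture group when one is defined; both are assumptions for illustration, and select_values is a hypothetical helper rather than the BaseField implementation.

```python
import re as regex


def select_values(candidates, re=None, all=False, default=None):
    """Mimic the re/all/default options on a list of scraped strings.

    The parameter names mirror the BaseField signature, which is why the
    builtin ``all`` is shadowed inside this function.
    """
    values = []
    for text in candidates:
        if re is None:
            values.append(text)
            continue
        match = regex.search(re, text)
        if match:
            # Prefer the first capture group when the pattern defines one.
            values.append(match.group(1) if match.groups() else match.group(0))
    if all:
        return values
    return values[0] if values else default


print(select_values(['Vol. 12', 'Vol. 13'], re=r'Vol\. (\d+)'))
print(select_values(['Vol. 12', 'Vol. 13'], re=r'Vol\. (\d+)', all=True))
print(select_values([], default='n/a'))
```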
-
.scrape.clean¶
Tools for cleaning up XML/HTML by removing tags entirely or replacing with their contents.
-
class
chemdataextractor.scrape.clean.Cleaner(**kwargs)[source]¶ Bases:
object
Clean HTML or XML by removing tags completely or replacing with their contents.
A Cleaner instance provides a clean_markup method:
    from chemdataextractor.scrape.clean import Cleaner
    cleaner = Cleaner()
    htmlstring = '<html><body><script>alert("test")</script><p>Some text</p></body></html>'
    print(cleaner.clean_markup(htmlstring))
A Cleaner instance is also a callable that can be applied to lxml document trees:
    import lxml.etree
    tree = lxml.etree.fromstring(htmlstring)
    cleaner(tree)
    print(lxml.etree.tostring(tree))
Elements that are matched by kill_xpath are removed entirely, along with their contents. By default, kill_xpath matches all script and style tags, as well as comments and processing instructions.
Elements that are matched by strip_xpath are replaced with their contents. By default, no elements are stripped. A common use-case is to set strip_xpath to .//*, which specifies that all elements should be stripped.
Elements that are matched by allow_xpath are excepted from stripping, even if they are also matched by strip_xpath. This is useful when setting strip_xpath to strip all tags, allowing a few exceptions to be specified by allow_xpath.
-
kill_xpath= './/script | .//style | .//comment() | .//processing-instruction() | .//*[@style="display:none;"]'¶
-
strip_xpath= None¶
-
allow_xpath= None¶
-
fix_whitespace= True¶
-
namespaces= {'dc': 'http://purl.org/dc/elements/1.1/', 'prism': 'http://prismstandard.org/namespaces/basic/2.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns', 're': 'http://exslt.org/regular-expressions', 'set': 'http://exslt.org/sets', 'xml': 'http://www.w3.org/XML/1998/namespace'}¶
-
__init__(**kwargs)[source]¶ Behaviour can be customized by overriding attributes in a subclass or setting them in the constructor.
Parameters: - kill_xpath (string) – XPath expression for tags to remove along with their contents.
- strip_xpath (string) – XPath expression for tags to replace with their contents.
- allow_xpath (string) – XPath expression for tags to except from strip_xpath.
- fix_whitespace (bool) – Normalize whitespace to a single space and ensure newlines around block elements.
- namespaces (dict) – Namespace prefixes to register for the XPaths.
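The difference between killing, stripping and allowing can be sketched with the standard library alone. This is a simplified illustration, not the Cleaner implementation: it matches on plain tag names instead of XPath expressions, ignores attributes, escaping and whitespace fixing, and the function name render is hypothetical.

```python
import xml.etree.ElementTree as ET


def render(elem, kill, strip, allow):
    """Serialize elem, dropping killed tags and unwrapping stripped ones."""
    if elem.tag in kill:
        return ''  # the element and everything inside it are removed
    # Rebuild the contents; a child's tail text always survives,
    # even when the child itself is killed.
    inner = (elem.text or '') + ''.join(
        render(child, kill, strip, allow) + (child.tail or '')
        for child in elem
    )
    if elem.tag in strip and elem.tag not in allow:
        return inner  # the tag is replaced by its contents
    return '<{0}>{1}</{0}>'.format(elem.tag, inner)


root = ET.fromstring(
    '<body><script>alert(1)</script>'
    '<p>Some <b>bold</b> and <i>italic</i> text</p></body>'
)
print(render(root, kill={'script'}, strip={'b', 'i'}, allow={'i'}))
```

Here the script element disappears with its contents, b is unwrapped, and i is kept because it is named in allow even though it is also in strip.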
-
-
chemdataextractor.scrape.clean.clean= <chemdataextractor.scrape.clean.Cleaner object>¶ A default Cleaner instance, which kills comments, processing instructions, script tags, style tags.
-
chemdataextractor.scrape.clean.clean_markup= <bound method Cleaner.clean_markup of <chemdataextractor.scrape.clean.Cleaner object>>¶ Convenience function for applying clean to a string.
-
chemdataextractor.scrape.clean.clean_html= <bound method Cleaner.clean_html of <chemdataextractor.scrape.clean.Cleaner object>>¶ Convenience function for applying clean to an HTML string.
-
chemdataextractor.scrape.clean.strip= <chemdataextractor.scrape.clean.Cleaner object>¶ A Cleaner instance that is configured to strip all tags, replacing them with their text contents.
-
chemdataextractor.scrape.clean.strip_markup= <bound method Cleaner.clean_markup of <chemdataextractor.scrape.clean.Cleaner object>>¶ Convenience function for applying strip to a string.
-
chemdataextractor.scrape.clean.strip_html= <bound method Cleaner.clean_html of <chemdataextractor.scrape.clean.Cleaner object>>¶ Convenience function for applying strip to an HTML string.
.scrape.csstranslator¶
Extend cssselect to improve handling of pseudo-elements.
This is derived from csstranslator.py in the Scrapy project. The original file is available at: https://github.com/scrapy/scrapy/blob/master/scrapy/selector/csstranslator.py
The original file was released under the BSD license:
Copyright (c) Scrapy developers. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of Scrapy nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
class
chemdataextractor.scrape.csstranslator.CdeXPathExpr(path='', element='*', condition='', star_prefix=False)[source]¶ Bases:
cssselect.xpath.XPathExpr
-
textnode= False¶
-
attribute= None¶
-
-
class
chemdataextractor.scrape.csstranslator.CssXmlTranslator[source]¶ Bases:
chemdataextractor.scrape.csstranslator.TranslatorMixin, cssselect.xpath.GenericTranslator
-
class
chemdataextractor.scrape.csstranslator.CssHTMLTranslator(xhtml=False)[source]¶ Bases:
chemdataextractor.scrape.csstranslator.TranslatorMixin, cssselect.xpath.HTMLTranslator
.scrape.entity¶
An entity to extract.
-
class
chemdataextractor.scrape.entity.Entity(selector)[source]¶ Bases:
chemdataextractor.scrape.base.BaseEntity
-
fields= {}¶
-
-
class
chemdataextractor.scrape.entity.EntityList(*entities)[source]¶ Bases:
collections.abc.Sequence
Wrapper around a list of Entities to facilitate operations on all at once.
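The general shape of a Sequence-based wrapper can be sketched as follows. This is only an illustration of the pattern, assuming records are plain dicts; RecordList and its serialize method are hypothetical and do not reproduce the EntityList API.

```python
from collections.abc import Sequence


class RecordList(Sequence):
    """Hypothetical read-only wrapper that applies an operation to every record."""

    def __init__(self, *records):
        self._records = list(records)

    def __getitem__(self, index):
        return self._records[index]

    def __len__(self):
        return len(self._records)

    def serialize(self):
        """Return a plain-dict copy of every wrapped record in one call."""
        return [dict(record) for record in self._records]


papers = RecordList({'doi': '10.1000/a'}, {'doi': '10.1000/b'})
print(len(papers), papers[0], papers.serialize())
```

Implementing only __getitem__ and __len__ is enough for Sequence to supply iteration, membership tests, index() and count().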
-
class
chemdataextractor.scrape.entity.DocumentEntity(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Generic document entity.
-
doi¶ A string field.
-
title¶ A string field.
-
authors¶ A string field.
-
published_date¶ A datetime field. Depends on python-dateutil.
-
online_date¶ A datetime field. Depends on python-dateutil.
-
journal¶ A string field.
-
volume¶ A string field.
-
issue¶ A string field.
-
firstpage¶ A string field.
-
lastpage¶ A string field.
-
abstract¶ A string field.
-
publisher¶ A string field.
-
issn¶ A string field.
-
language¶ A string field.
-
copyright¶ A string field.
-
license¶ A field with optional URL processing.
-
html_url¶ A field with optional URL processing.
-
pdf_url¶ A field with optional URL processing.
-
landing_url¶ A field with optional URL processing.
-
process_title= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_journal= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_publisher= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_abstract= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
.scrape.fields¶
Fields to define on an entity.
-
class
chemdataextractor.scrape.fields.StringField(selection, lower=False, upper=False, strip=False, **kwargs)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A string field.
-
class
chemdataextractor.scrape.fields.UrlField(selection, strip_querystring=False, **kwargs)[source]¶ Bases:
chemdataextractor.scrape.fields.StringField
A field with optional URL processing.
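What a strip_querystring option might do can be sketched with urllib.parse from the standard library. The helper name strip_querystring below is an assumption for illustration and is not taken from the UrlField source; in particular, how the library treats URL fragments is not stated in this reference.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_querystring(url):
    """Return url without its query string (hypothetical helper)."""
    parts = urlsplit(url)
    # SplitResult is a namedtuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(query=''))


print(strip_querystring('https://example.org/article?id=1&ref=rss'))
```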
-
class
chemdataextractor.scrape.fields.EntityField(entity, selection, **kwargs)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A field that contains another Entity.
-
class
chemdataextractor.scrape.fields.IntField(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
An integer number field.
-
class
chemdataextractor.scrape.fields.FloatField(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A floating point number field.
-
class
chemdataextractor.scrape.fields.BoolField(selection, true=re.compile('true|yes|1', re.IGNORECASE), false=re.compile('false|no|0', re.IGNORECASE), **kwargs)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A boolean field type.
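The default true and false patterns in the signature above suggest how scraped text could be mapped to a boolean. The sketch below reuses those two patterns; matching at the start of the string and returning None for unrecognized text are assumptions, and to_bool is a hypothetical helper rather than the BoolField implementation.

```python
import re

# Patterns copied from the BoolField signature above.
TRUE_RE = re.compile('true|yes|1', re.IGNORECASE)
FALSE_RE = re.compile('false|no|0', re.IGNORECASE)


def to_bool(value):
    """Map scraped text to True, False, or None if neither pattern matches."""
    if TRUE_RE.match(value):
        return True
    if FALSE_RE.match(value):
        return False
    return None


print(to_bool('Yes'), to_bool('NO'), to_bool('maybe'))
```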
-
class
chemdataextractor.scrape.fields.DateTimeField(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A datetime field. Depends on python-dateutil.
.scrape.scraper¶
Concrete classes for scraping and searching.
-
class
chemdataextractor.scrape.scraper.HtmlFormat[source]¶ Bases:
chemdataextractor.scrape.base.BaseFormat
Process HTML response and return a Selector.
-
class
chemdataextractor.scrape.scraper.XmlFormat[source]¶ Bases:
chemdataextractor.scrape.base.BaseFormat
Process XML response and return a Selector.
-
namespaces= None¶
-
-
class
chemdataextractor.scrape.scraper.UrlScraper[source]¶ Bases:
chemdataextractor.scrape.scraper.GetRequester, chemdataextractor.scrape.scraper.HtmlFormat, chemdataextractor.scrape.base.BaseScraper
Scraper that takes a URL as input.
-
class
chemdataextractor.scrape.scraper.RssScraper[source]¶ Bases:
chemdataextractor.scrape.scraper.XmlFormat, chemdataextractor.scrape.scraper.UrlScraper
RSS scraper.
-
root= 'item'¶
-
namespaces= {'atom': 'http://www.w3.org/2005/Atom', 'feedburner': 'http://rssnamespace.org/feedburner/ext/1.0'}¶
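The role of root and namespaces can be illustrated with the standard library's ElementTree: the root expression picks one node per entity, and namespace prefixes let expressions reach elements such as atom:link. This is a sketch under those assumptions, not the RssScraper code; the sample feed and the functions scrape_items and feed_self_link are made up for the example.

```python
import xml.etree.ElementTree as ET

RSS = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example feed</title>
    <atom:link href="https://example.org/feed.xml" rel="self"/>
    <item><title>First paper</title><link>https://example.org/1</link></item>
    <item><title>Second paper</title><link>https://example.org/2</link></item>
  </channel>
</rss>"""

# Same atom prefix as in the namespaces attribute above.
NAMESPACES = {'atom': 'http://www.w3.org/2005/Atom'}


def scrape_items(xml_text, root='item'):
    """Build one dict per element whose tag equals root."""
    tree = ET.fromstring(xml_text)
    return [
        {'title': node.findtext('title'), 'link': node.findtext('link')}
        for node in tree.iter(root)
    ]


def feed_self_link(xml_text):
    """Look up the namespaced atom:link element of the feed."""
    tree = ET.fromstring(xml_text)
    link = tree.find('.//atom:link', NAMESPACES)
    return link.get('href') if link is not None else None


print(scrape_items(RSS))
print(feed_self_link(RSS))
```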
-
-
class
chemdataextractor.scrape.scraper.SearchScraper[source]¶ Bases:
chemdataextractor.scrape.scraper.GetRequester, chemdataextractor.scrape.scraper.HtmlFormat, chemdataextractor.scrape.base.BaseScraper
Scraper that takes a search query as input.
-
class
chemdataextractor.scrape.scraper.SearchResult[source]¶ Bases:
object
Class to handle results from a search query to websites, regardless of the scraping method used.
-
selector¶ Process the result of the search, giving a selector
Returns: The result of the search.
Return type: selector
-
.scrape.selector¶
Tool for selecting content from HTML or XML using CSS or XPath expressions.
-
class
chemdataextractor.scrape.selector.Selector(root, fmt='html', translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, namespaces=None)[source]¶ Bases:
object
Tool for selecting content from HTML or XML using XPath selectors.
-
__init__(root, fmt='html', translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, namespaces=None)[source]¶ Initialize self. See help(type(self)) for accurate signature.
-
classmethod
from_text(text, base_url=None, parser=<class 'lxml.html.HTMLParser'>, translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, fmt='html', namespaces=None, encoding=None)[source]¶
-
classmethod
from_response(response, parser=<class 'lxml.html.HTMLParser'>, translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, fmt='html', namespaces=None)[source]¶
-
path¶ Absolute path to the root of this selector.
-
tag¶ Tag name of the root of this selector.
-
-
class
chemdataextractor.scrape.selector.SelectorList(*selectors)[source]¶ Bases:
collections.abc.Sequence
Wrapper around a list of Selectors to allow selecting from all at once.