.scrape

Scrapers for the various data sources

Declarative scraping framework for extracting structured data from HTML and XML documents.
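To illustrate what "declarative" means here, the following is a minimal standard-library sketch and not ChemDataExtractor's own implementation. `StringField`, `IntField`, `Entity` and `Article` below are simplified stand-ins written for this example, and the XML snippet is invented. The idea is that a class body only declares which values to extract and where they live, while the base class does the extraction.

```python
import xml.etree.ElementTree as ET


class StringField:
    """Hypothetical stand-in: extracts the text found at an ElementTree path."""

    def __init__(self, path):
        self.path = path

    def convert(self, text):
        return text.strip()

    def extract(self, root):
        node = root.find(self.path)
        if node is None or node.text is None:
            return None
        return self.convert(node.text)


class IntField(StringField):
    """Hypothetical stand-in: like StringField, but converts the text to int."""

    def convert(self, text):
        return int(text.strip())


class Entity:
    """Collects every field declared on the subclass into a dict."""

    def __init__(self, root):
        self.root = root

    def serialize(self):
        result = {}
        for name, field in vars(type(self)).items():
            if isinstance(field, StringField):
                result[name] = field.extract(self.root)
        return result


class Article(Entity):
    # The class body only declares what to extract and where to find it.
    title = StringField("front/title")
    pmid = IntField("front/pmid")


SAMPLE = "<article><front><title> Example title </title><pmid>12345</pmid></front></article>"
record = Article(ET.fromstring(SAMPLE)).serialize()
print(record)
```

The entity classes documented below follow the same general shape: each attribute is a field, and the `fields` mapping lists them by name.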

chemdataextractor.scrape.BLOCK_ELEMENTS = {'address', 'article', 'aside', 'audio', 'blockquo...

Block level HTML elements

chemdataextractor.scrape.INLINE_ELEMENTS = {'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'blink...

Inline level HTML elements

.scrape.pub

Scraping tools for specific publishers.

.scrape.pub.nlm

Tools for scraping documents from NLM Journal Archiving and Interchange DTD XML files.

chemdataextractor.scrape.pub.nlm.strip_pmc_xml = <chemdataextractor.scrape.clean.Cleaner object>

XML stripper that kills reference links, footnote links, equations, footnotes

chemdataextractor.scrape.pub.nlm.strip_pmc_abstract_xml = <chemdataextractor.scrape.clean.Cleaner object>

XML stripper that also kills headings

chemdataextractor.scrape.pub.nlm.strip_pmc_paragraph_xml = <chemdataextractor.scrape.clean.Cleaner object>

XML stripper that also kills tables and figures

chemdataextractor.scrape.pub.nlm.space_labels(document)[source]

Ensure space around bold compound labels.

chemdataextractor.scrape.pub.nlm.tidy_nlm_references(document)[source]

Remove punctuation around references like brackets, commas, hyphens.
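As a rough illustration of this kind of clean-up, the sketch below strips brackets, commas and hyphens that surround citation elements, presumably so that no stray punctuation is left once the reference links themselves are removed. It uses only the standard library and is not the function's actual code. Treating `xref` elements with `ref-type="bibr"` as the references, and the helper names `is_reference` and `tidy_references`, are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

BEFORE = re.compile(r"[\[(]+$")            # opening brackets just before a reference
AFTER = re.compile(r"^[\]),\-\u2013]+")    # closing brackets, commas, hyphens just after


def is_reference(element):
    # Assumption for this sketch: citations are <xref ref-type="bibr"> elements.
    return element.tag == "xref" and element.get("ref-type") == "bibr"


def tidy_references(root):
    for parent in root.iter():
        previous = None  # previous sibling, or None for the first child
        for child in list(parent):
            if is_reference(child):
                # Text before a child lives in parent.text or in the previous sibling's tail.
                if previous is None:
                    parent.text = BEFORE.sub("", parent.text or "")
                else:
                    previous.tail = BEFORE.sub("", previous.tail or "")
                child.tail = AFTER.sub("", child.tail or "")
            previous = child
    return root


SAMPLE = (
    '<p>as reported [<xref ref-type="bibr">1</xref>, '
    '<xref ref-type="bibr">2</xref>-<xref ref-type="bibr">4</xref>] previously.</p>'
)
root = tidy_references(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```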

class chemdataextractor.scrape.pub.nlm.NlmXmlAuthor(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Author information from NLM XML file.

givennames

A string field.

lastname

A string field.

email

A string field.

process_givennames = <chemdataextractor.text.normalize.Normalizer object>
process_lastname = <chemdataextractor.text.normalize.Normalizer object>
fields = {'email': <chemdataextractor.scrape.fields.StringField object>, 'givennames': <chemdataextractor.scrape.fields.StringField object>, 'lastname': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.nlm.NlmXmlImage(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Figure information from NLM XML file.

label

A string field.

caption

A string field.

reference

A string field.

clean_caption = <chemdataextractor.text.processors.Chain object>
process_caption = <chemdataextractor.text.normalize.Normalizer object>
fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.nlm.NlmXmlTable(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Table information from NLM XML file.

label

A string field.

caption

A string field.

reference

A string field.

src

A string field.

clean_caption = <chemdataextractor.text.processors.Chain object>
process_caption = <chemdataextractor.text.normalize.Normalizer object>
fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.nlm.NlmXmlDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Document information from an NLM XML file.

doi

A string field.

pmid

An integer number field.

pmcid

An integer number field.

title

A string field.

authors

A field that contains another Entity.

journal_title

A string field.

journal_abbreviation

A string field.

publisher

A string field.

volume

A string field.

firstpage

A string field.

lastpage

A string field.

issue

A string field.

issn

A string field.

coden

A string field.

abstract

A string field.

online_year

An integer number field.

online_month

An integer number field.

online_day

An integer number field.

published_year

An integer number field.

published_month

An integer number field.

published_day

An integer number field.

accepted_year

An integer number field.

accepted_month

An integer number field.

accepted_day

An integer number field.

received_year

An integer number field.

received_month

An integer number field.

received_day

An integer number field.

license

A field with optional URL processing.

clean_title = <chemdataextractor.scrape.clean.Cleaner object>
clean_abstract = <chemdataextractor.scrape.clean.Cleaner object>
process_title = <chemdataextractor.text.normalize.Normalizer object>
process_publisher = <chemdataextractor.text.normalize.Normalizer object>
process_abstract = <chemdataextractor.text.normalize.Normalizer object>
fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'accepted_day': <chemdataextractor.scrape.fields.IntField object>, 'accepted_month': <chemdataextractor.scrape.fields.IntField object>, 'accepted_year': <chemdataextractor.scrape.fields.IntField object>, 'authors': <chemdataextractor.scrape.fields.EntityField object>, 'coden': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal_abbreviation': <chemdataextractor.scrape.fields.StringField object>, 'journal_title': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_day': <chemdataextractor.scrape.fields.IntField object>, 'online_month': <chemdataextractor.scrape.fields.IntField object>, 'online_year': <chemdataextractor.scrape.fields.IntField object>, 'pmcid': <chemdataextractor.scrape.fields.IntField object>, 'pmid': <chemdataextractor.scrape.fields.IntField object>, 'published_day': <chemdataextractor.scrape.fields.IntField object>, 'published_month': <chemdataextractor.scrape.fields.IntField object>, 'published_year': <chemdataextractor.scrape.fields.IntField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'received_day': <chemdataextractor.scrape.fields.IntField object>, 'received_month': <chemdataextractor.scrape.fields.IntField object>, 'received_year': <chemdataextractor.scrape.fields.IntField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}

.scrape.pub.rsc

Tools for scraping documents from The Royal Society of Chemistry.

chemdataextractor.scrape.pub.rsc.CHAR_REPLACEMENTS = [('\\[?\\[1 with combining macron\\]\\]?', '1̄'), ...

Map placeholder text to unicode characters.
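A table of this shape can be applied with ordinary regular-expression substitution. The sketch below is an assumption about how such (pattern, replacement) pairs might be used, not the library's `Substitutor` code. It contains only the single entry visible in the truncated listing above, and the `substitute` helper is hypothetical.

```python
import re

# Only the first entry is visible in the truncated listing;
# the rest of the real table is not reproduced here.
CHAR_REPLACEMENTS = [
    (r"\[?\[1 with combining macron\]\]?", "1\u0304"),
]


def substitute(text, replacements=CHAR_REPLACEMENTS):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, replacement in replacements:
        text = re.sub(pattern, replacement, text)
    return text


cleaned = substitute("space group P[[1 with combining macron]]")
print(cleaned)
```

The optional outer brackets in the pattern mean that both the single-bracket and the double-bracket form of the placeholder are replaced.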

chemdataextractor.scrape.pub.rsc.RSC_IMG_CHARS = {'2041': '^', '224a': '≈', 'e001': '=', 'e002': '≡...

Map image URL components to unicode characters.

chemdataextractor.scrape.pub.rsc.strip_rsc_html = <chemdataextractor.scrape.clean.Cleaner object>

HTML stripper that kills superscript references and anything with style="display: none;" (typically tooltips).

chemdataextractor.scrape.pub.rsc.strip_cit_html = <chemdataextractor.scrape.clean.Cleaner object>

HTML stripper that also kills text from buttons in references.

chemdataextractor.scrape.pub.rsc.rsc_substitute = <chemdataextractor.text.processors.Substitutor obj...

Substitutor that replaces RSC escape codes with the actual unicode character

chemdataextractor.scrape.pub.rsc.parse_rsc_html(htmlstring)[source]

Messy RSC HTML needs this special parser to fix problems before creating selector.

chemdataextractor.scrape.pub.rsc.replace_rsc_img_chars(document)[source]

Replace image characters with unicode equivalents.
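The sketch below shows one way such a replacement could work on an HTML string, using the four `RSC_IMG_CHARS` entries visible in the truncated listing earlier in this module. It is an illustration only: the real function operates on a parsed document, and the image URL shape (`char_XXXX.gif`) and the helper name `replace_img_chars` are assumptions made for this example.

```python
import re

# The four entries visible in the truncated RSC_IMG_CHARS listing.
RSC_IMG_CHARS = {"2041": "^", "224a": "\u2248", "e001": "=", "e002": "\u2261"}

# Assumed URL shape for this sketch: the character code is the file name
# of a GIF, e.g. <img src="/images/entities/char_e001.gif">.
IMG_PATTERN = re.compile(r'<img[^>]*src="[^"]*char_([0-9a-f]{4})\.gif"[^>]*>')


def replace_img_chars(html):
    """Swap recognised character images for text; leave others untouched."""

    def swap(match):
        return RSC_IMG_CHARS.get(match.group(1), match.group(0))

    return IMG_PATTERN.sub(swap, html)


html = 'a <img src="/images/entities/char_e001.gif" alt="[double bond]"> b'
print(replace_img_chars(html))
```

Images whose code is not in the mapping are left in place rather than dropped, so unknown characters remain visible for later inspection.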

chemdataextractor.scrape.pub.rsc.space_references(document)[source]

Ensure a space around reference links, so there’s a gap when they are removed.

class chemdataextractor.scrape.pub.rsc.RscRssDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Document information from RSC RSS feed.

doi

A string field.

title

A string field.

authors

A string field.

landing_url

A field with optional URL processing.

process_title = <chemdataextractor.text.processors.Chain object>
finalize_doi(value)[source]

Derive the DOI from the GUID.

fields = {'authors': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.rsc.RscRssScraper[source]

Bases: chemdataextractor.scrape.scraper.RssScraper

Scraper for RSC RSS feeds.

entity

alias of RscRssDocument

class chemdataextractor.scrape.pub.rsc.RscSearchDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Document information from RSC search results page.

doi

A string field.

title

A string field.

landing_url

A field with optional URL processing.

pdf_url

A field with optional URL processing.

html_url

A field with optional URL processing.

journal

A string field.

abstract

A string field.

clean_title = <chemdataextractor.text.processors.Chain object>
process_doi = <chemdataextractor.text.processors.LAdd object>
process_title = <chemdataextractor.text.processors.Chain object>
process_landing_url = <chemdataextractor.text.processors.Chain object>
process_pdf_url = <chemdataextractor.text.processors.Chain object>
process_html_url = <chemdataextractor.text.processors.Chain object>
process_abstract = <chemdataextractor.text.processors.Chain object>
fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.rsc.RscSearchScraper(max_wait_time=30, driver=None)[source]

Bases: chemdataextractor.scrape.scraper.SearchScraper

Scraper for RSC search results.

entity

alias of RscSearchDocument

root = '.capsule.capsule--article'
__init__(max_wait_time=30, driver=None)[source]
Parameters:
  • driver (selenium.webdriver) – driver from which results will be scraped.

  • max_wait_time (float) – Maximum time spent waiting for the page to load. (seconds)

Because RSC does not accept plain HTTP requests, Selenium is used to load the pages. By default, the Firefox webdriver is used.

class chemdataextractor.scrape.pub.rsc.RscLandingSupplement(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

name

A string field.

url

A field with optional URL processing.

fields = {'name': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>}
class chemdataextractor.scrape.pub.rsc.RscLandingDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.DocumentEntity

Document information from RSC landing page.

supplements

A field that contains another Entity.

process_abstract = <chemdataextractor.text.processors.Chain object>
fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'supplements': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.rsc.RscLandingScraper[source]

Bases: chemdataextractor.scrape.scraper.UrlScraper

Scraper for RSC Landing pages.

entity

alias of RscLandingDocument

class chemdataextractor.scrape.pub.rsc.RscChemicalMention(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

text

A string field.

chemspider_id

A string field.

inchi

A string field.

clean_text = <chemdataextractor.text.processors.Chain object>
process_text = <chemdataextractor.text.normalize.Normalizer object>
process_chemspider_id = <chemdataextractor.text.processors.Chain object>
process_inchi = <chemdataextractor.text.processors.Chain object>
fields = {'chemspider_id': <chemdataextractor.scrape.fields.StringField object>, 'inchi': <chemdataextractor.scrape.fields.StringField object>, 'text': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.rsc.RscImage(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Embedded image. Includes both Schemes and Figures.

url

A field with optional URL processing.

label

A string field.

reference

A string field.

caption

A string field.

clean_caption = <chemdataextractor.text.processors.Chain object>
process_caption = <chemdataextractor.text.normalize.Normalizer object>
fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>}
class chemdataextractor.scrape.pub.rsc.RscTable(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Table within document.

reference

A string field.

label

A string field.

caption

A string field.

src

A string field.

clean_src = <chemdataextractor.text.processors.Chain object>
clean_caption = <chemdataextractor.text.processors.Chain object>
process_caption = <chemdataextractor.text.normalize.Normalizer object>
fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.rsc.RscHtmlDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.DocumentEntity

title

A string field.

abstract

A string field.

pdf_url

A field with optional URL processing.

html_url

A field with optional URL processing.

landing_url

A field with optional URL processing.

clean_title = <chemdataextractor.text.processors.Chain object>
clean_abstract = <chemdataextractor.text.processors.Chain object>
process_title = <chemdataextractor.text.processors.Chain object>
process_abstract = <chemdataextractor.text.normalize.Normalizer object>
fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.rsc.RscHtmlScraper[source]

Bases: chemdataextractor.scrape.scraper.UrlScraper

Scraper for RSC HTML article pages.

entity

alias of RscHtmlDocument

.scrape.pub.springer

Tools for scraping documents from Springer, Biomed Central and Chemistry Central XML files.

chemdataextractor.scrape.pub.springer.strip_springer_xml = <chemdataextractor.scrape.clean.Cleaner object>

XML stripper that also kills equations/formulas.

chemdataextractor.scrape.pub.springer.strip_springer_abstract_xml = <chemdataextractor.scrape.clean.Cleaner object>

XML stripper that also kills headings

chemdataextractor.scrape.pub.springer.tidy_springer_references(document)[source]

Remove punctuation around references like brackets, commas, hyphens.

class chemdataextractor.scrape.pub.springer.SpringerHtmlDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.DocumentEntity

Scraper for Springer HTML articles

title

A string field.

abstract

A string field.

journal

A string field.

process_html_url = <chemdataextractor.text.processors.RAdd object>
fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.springer.SpringerXmlAuthor(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Author information from a Springer XML file.

firstname

A string field.

middlename

A string field.

lastname

A string field.

suffix

A string field.

email

A string field.

process_email = <chemdataextractor.text.processors.Discard object>
fields = {'email': <chemdataextractor.scrape.fields.StringField object>, 'firstname': <chemdataextractor.scrape.fields.StringField object>, 'lastname': <chemdataextractor.scrape.fields.StringField object>, 'middlename': <chemdataextractor.scrape.fields.StringField object>, 'suffix': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.springer.SpringerXmlImage(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Figure information from a Springer XML file.

label

A string field.

caption

A string field.

reference

A string field.

clean_caption = <chemdataextractor.scrape.clean.Cleaner object>
process_caption = <chemdataextractor.text.normalize.Normalizer object>
fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.springer.SpringerXmlTable(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Table information from a Springer XML file.

label

A string field.

caption

A string field.

reference

A string field.

src

A string field.

clean_caption = <chemdataextractor.scrape.clean.Cleaner object>
process_caption = <chemdataextractor.text.normalize.Normalizer object>
fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.springer.SpringerXmlDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Document information from a Springer XML file.

ui

A string field.

doi

A string field.

title

A string field.

authors

A field that contains another Entity.

journal

A string field.

firstpage

A string field.

year

An integer number field.

volume

A string field.

issue

A string field.

issn

A string field.

landing_url

A field with optional URL processing.

abstract

A string field.

published_year

An integer number field.

published_month

An integer number field.

published_day

An integer number field.

accepted_year

An integer number field.

accepted_month

An integer number field.

accepted_day

An integer number field.

received_year

An integer number field.

received_month

An integer number field.

received_day

An integer number field.

license

A field with optional URL processing.

figures

A field that contains another Entity.

schemes

A field that contains another Entity.

tables

A field that contains another Entity.

headings

A string field.

paragraphs

A string field.

clean_title = <chemdataextractor.scrape.clean.Cleaner object>
clean_abstract = <chemdataextractor.text.processors.Chain object>
clean_headings = <chemdataextractor.scrape.clean.Cleaner object>
clean_paragraphs = <chemdataextractor.text.processors.Chain object>
process_abstract = <chemdataextractor.text.normalize.Normalizer object>
process_headings = <chemdataextractor.text.normalize.Normalizer object>
process_paragraphs = <chemdataextractor.text.processors.Chain object>
process_license = <chemdataextractor.text.processors.Chain object>
fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'accepted_day': <chemdataextractor.scrape.fields.IntField object>, 'accepted_month': <chemdataextractor.scrape.fields.IntField object>, 'accepted_year': <chemdataextractor.scrape.fields.IntField object>, 'authors': <chemdataextractor.scrape.fields.EntityField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'published_day': <chemdataextractor.scrape.fields.IntField object>, 'published_month': <chemdataextractor.scrape.fields.IntField object>, 'published_year': <chemdataextractor.scrape.fields.IntField object>, 'received_day': <chemdataextractor.scrape.fields.IntField object>, 'received_month': <chemdataextractor.scrape.fields.IntField object>, 'received_year': <chemdataextractor.scrape.fields.IntField object>, 'schemes': <chemdataextractor.scrape.fields.EntityField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'ui': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>, 'year': <chemdataextractor.scrape.fields.IntField object>}

.scrape.pub.elsevier

Tools for scraping documents from Elsevier.

copyright:

Copyright 2017 by Callum Court.

license:

MIT, see LICENSE file for more details.

chemdataextractor.scrape.pub.elsevier.CHAR_REPLACEMENTS = [('\\[?\\[1 with combining macron\\]\\]?', '1̄'), ...

Map placeholder text to unicode characters.

chemdataextractor.scrape.pub.elsevier.elsevier_substitute = <chemdataextractor.text.processors.Substitutor obj...

Substitutor that replaces Elsevier escape codes with the actual unicode character

class chemdataextractor.scrape.pub.elsevier.ElsevierSearchDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Document information from Elsevier API search results.

test

A string field.

fields = {'test': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.elsevier.ElsevierSearchScraper[source]

Bases: chemdataextractor.scrape.scraper.UrlScraper

Scraper for Elsevier search results.

entity

alias of ElsevierSearchDocument

make_request(url)[source]

Make an HTTP GET request.

Parameters:

url – The URL to get.

Returns:

The response to the request.

Return type:

requests.Response

run(url)[source]

Request URL, scrape response and return an EntityList.

class chemdataextractor.scrape.pub.elsevier.ElsevierImage(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Embedded figure. Includes both Schemes and Figures.

caption

A string field.

image_url

A string field.

process_caption = <chemdataextractor.text.processors.Chain object>
process_image_url = <chemdataextractor.text.processors.LAdd object>
fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'image_url': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.elsevier.ElsevierTableData(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Embedded row data from document tables

rows

A string field.

fields = {'rows': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.elsevier.ElsevierTable(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Table within document.

title

A string field.

column_headings

A string field.

data

A field that contains another Entity.

caption

A string field.

process_title = <chemdataextractor.text.processors.Chain object>
fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'column_headings': <chemdataextractor.scrape.fields.StringField object>, 'data': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.elsevier.ElsevierHtmlDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.DocumentEntity

Scraper of document information from Elsevier html papers

doi

A string field.

title

A string field.

authors

A string field.

abstract

A string field.

journal

A string field.

volume

A string field.

copyright

A string field.

headings

A string field.

sub_headings

A string field.

html_url

A field with optional URL processing.

paragraphs

A string field.

figures

A field that contains another Entity.

published_date

A string field.

citations

A string field.

tables

A field that contains another Entity.

fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'citations': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.StringField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'sub_headings': <chemdataextractor.scrape.fields.StringField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.elsevier.ElsevierHtmlScraper[source]

Bases: chemdataextractor.scrape.scraper.UrlScraper

Scraper for Elsevier HTML paper pages

entity

alias of ElsevierHtmlDocument

class chemdataextractor.scrape.pub.elsevier.ElsevierXmlImage(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

caption

A string field.

label

A string field.

fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.elsevier.ElsevierXmlTableData(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

rows

A string field.

fields = {'rows': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.elsevier.ElsevierXmlTable(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

label

A string field.

caption

A string field.

column_headings

A field that contains another Entity.

data

A field that contains another Entity.

fields = {'caption': <chemdataextractor.scrape.fields.StringField object>, 'column_headings': <chemdataextractor.scrape.fields.EntityField object>, 'data': <chemdataextractor.scrape.fields.EntityField object>, 'label': <chemdataextractor.scrape.fields.StringField object>}
class chemdataextractor.scrape.pub.elsevier.ElsevierXmlDocument(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Scraper for Elsevier XML articles

doi

A string field.

title

A string field.

authors

A string field.

abstract

A string field.

journal

A string field.

volume

A string field.

issue

A string field.

pages

A string field.

firstpage

A string field.

lastpage

A string field.

copyright

A string field.

publisher

A string field.

headings

A string field.

url

A field with optional URL processing.

paragraphs

A string field.

figures

A field that contains another Entity.

published_date

A string field.

citations

A string field.

tables

A field that contains another Entity.

fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'citations': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'pages': <chemdataextractor.scrape.fields.StringField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'published_date': <chemdataextractor.scrape.fields.StringField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}
process_abstract = <chemdataextractor.text.processors.Chain object>

.scrape.base

Abstract base classes that define the interface for Scrapers, Fields, Crawlers, etc.

class chemdataextractor.scrape.base.BaseScraper[source]

Bases: object

Abstract Scraper class from which all Scrapers inherit.

root = None

CSS selector or XPath expression that returns the root of each entity.

root_xpath = False

Whether the root is an XPath expression instead of a CSS selector.

__init__()[source]
create_session()[source]

Override to set up default data (e.g. headers, authentication) on each request.

name()[source]

A unique name for this scraper.

entity

The Entity to scrape.

process_entity(entity)[source]

Override to process each entity.

make_request(url, data)[source]

Make an HTTP request.

Parameters:
  • url – The URL to get.

  • data – Query data.

Returns:

The response to the request.

Return type:

requests.Response

process_response(response)[source]

Return a Selector for the given response.

Parameters:

response (requests.Response) – The response object.

Return type:

Selector

get_roots(selector)[source]
class chemdataextractor.scrape.base.BaseFormat[source]

Bases: object

process_response(response)[source]

Return a Selector for the given response.

Parameters:

response (requests.Response) – The response object.

Return type:

Selector

class chemdataextractor.scrape.base.BaseRequester[source]

Bases: object

make_request(url, data)[source]

Make an HTTP request.

Parameters:
  • url – The URL to get.

  • data – Query data.

Returns:

The response to the request.

Return type:

requests.Response

class chemdataextractor.scrape.base.BaseEntityProcessor[source]

Bases: object

Abstract EntityProcessor class from which all EntityProcessors inherit.

process_entity(entity)[source]

Process an Entity. Return None to filter Entity from the pipeline.

Parameters:

entity (chemdataextractor.scrape.entity.Entity) – The Entity to process.

Returns:

The processed Entity.

Return type:

Entity or None
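
The filtering contract described above (return the entity to keep it, return None to drop it) can be sketched as follows. This is a minimal illustration that uses plain dictionaries in place of Entity objects; run_pipeline, require_doi and normalize_title are hypothetical names, not part of the library.

```python
def run_pipeline(entities, processors):
    """Pass each entity through every processor; drop it once one returns None."""
    kept = []
    for entity in entities:
        for process in processors:
            entity = process(entity)
            if entity is None:
                break  # filtered out, so the remaining processors are skipped
        else:
            kept.append(entity)  # reached only when no processor returned None
    return kept


def require_doi(entity):
    # Hypothetical processor: filter out records that have no DOI.
    return entity if entity.get("doi") else None


def normalize_title(entity):
    # Hypothetical processor: collapse runs of whitespace in the title.
    return dict(entity, title=" ".join(entity["title"].split()))


records = [
    {"doi": "10.1000/example", "title": "  A   study  "},
    {"doi": None, "title": "No identifier"},
]
processed = run_pipeline(records, [require_doi, normalize_title])
print(processed)
```

Only the first record survives, because require_doi returns None for the second one before normalize_title is reached.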

class chemdataextractor.scrape.base.BaseEntity[source]

Bases: object

Abstract Entity class from which all Entities inherit.

class chemdataextractor.scrape.base.EntityMeta[source]

Bases: abc.ABCMeta

Metaclass for Entity.
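
The entity classes in this reference all expose a fields dictionary keyed by attribute name. One common way to build such a dictionary is a metaclass that collects field objects at class creation time. The sketch below shows that general pattern only; Field, CollectFields, Document and Article are hypothetical stand-ins, and it is an assumption that EntityMeta works along these lines.

```python
class Field:
    """Stand-in for a field definition; name is filled in by the metaclass."""

    def __init__(self, selection):
        self.selection = selection
        self.name = None


class CollectFields(type):
    """Gather Field attributes (including inherited ones) into cls.fields."""

    def __new__(mcs, name, bases, namespace):
        cls = super().__new__(mcs, name, bases, namespace)
        fields = {}
        # Start from the fields of parent classes, most distant ancestor first.
        for base in reversed(cls.__mro__[1:]):
            fields.update(getattr(base, "fields", {}))
        # Then add the fields declared directly on this class.
        for attr, value in namespace.items():
            if isinstance(value, Field):
                value.name = attr
                fields[attr] = value
        cls.fields = fields
        return cls


class Document(metaclass=CollectFields):
    doi = Field("meta[name=doi]")
    title = Field("h1")


class Article(Document):
    volume = Field(".volume")


print(sorted(Article.fields))
```

With this pattern a subclass only declares its additional fields and still reports the inherited ones.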

class chemdataextractor.scrape.base.BaseField(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]

Bases: object

Base class for all fields.

name = None
__init__(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]
Parameters:
  • selection (string) – The CSS selector or XPath expression used to select the content to scrape.

  • xpath (bool) – (Optional) Whether selection is an XPath expression instead of a CSS selector. Default False.

  • re – (Optional) Regular expression to apply to scraped content.

  • all (bool) – (Optional) Whether to scrape all occurrences instead of just the first. Default False.

  • default – (Optional) The default value for this field if none is set.

  • null (bool) – (Optional) Include in serialized output even if value is None. Default False.

  • raw (bool) – (Optional) Whether to scrape the raw HTML/XML instead of the text contents. Default False.
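
How the re, all and default options might combine can be sketched on a list of already-selected text values. This is a toy model under stated assumptions, not the library's implementation: scrape_values is a hypothetical helper, and the choice to return default when nothing matches (even with all=True) is an assumption.

```python
import re as regex_module


def scrape_values(texts, re=None, all=False, default=None):
    """Apply an optional regular expression, then return the first or all values.

    The parameter names re and all mirror the documented signature.
    """
    values = []
    for text in texts:
        if re is not None:
            match = regex_module.search(re, text)
            if match is None:
                continue  # nothing is scraped from this occurrence
            text = match.group(0)
        values.append(text)
    if not values:
        return default  # assumption: default applies whenever nothing was found
    return values if all else values[0]


texts = ["Vol. 12", "Vol. 13", "no number"]
first = scrape_values(texts, re=r"\d+")
every = scrape_values(texts, re=r"\d+", all=True)
fallback = scrape_values([], default="n/a")
print(first, every, fallback)
```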

scrape(selector, cleaner=None, processor=None)[source]

Scrape the value for this field from the selector.

serialize(value)[source]

Serialize this field.

process(value)[source]

Override to perform custom processing of a value.

.scrape.clean

Tools for cleaning up XML/HTML by removing tags entirely or replacing with their contents.

class chemdataextractor.scrape.clean.Cleaner(**kwargs)[source]

Bases: object

Clean HTML or XML by removing tags completely or replacing with their contents.

A Cleaner instance provides a clean_markup method:

from chemdataextractor.scrape.clean import Cleaner

cleaner = Cleaner()
htmlstring = '<html><body><script>alert("test")</script><p>Some text</p></body></html>'
print(cleaner.clean_markup(htmlstring))

A Cleaner instance is also a callable that can be applied to lxml document trees:

import lxml.etree

tree = lxml.etree.fromstring(htmlstring)
cleaner(tree)
print(lxml.etree.tostring(tree))

Elements that are matched by kill_xpath are removed entirely, along with their contents. By default, kill_xpath matches all script and style tags, comments and processing instructions, and any element whose style attribute is exactly display:none;.

Elements that are matched by strip_xpath are replaced with their contents. By default, no elements are stripped. A common use-case is to set strip_xpath to .//*, which specifies that all elements should be stripped.

Elements that are matched by allow_xpath are excepted from stripping, even if they are also matched by strip_xpath. This is useful when setting strip_xpath to strip all tags, allowing a few exceptions to be specified by allow_xpath.
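
The difference between killing, stripping and allowing can be illustrated with a small standard-library sketch. It is a simplification, not the Cleaner implementation: it matches elements by tag name rather than by XPath, does not escape text, and the function name clean and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET


def clean(elem, kill=(), strip=(), allow=()):
    """Serialize elem, dropping killed tags and unwrapping stripped ones."""
    if elem.tag in kill:
        return ""  # the element and everything inside it are removed
    inner = elem.text or ""
    for child in elem:
        inner += clean(child, kill, strip, allow)
        inner += child.tail or ""  # text that follows a child is always kept
    if elem.tag in strip and elem.tag not in allow:
        return inner  # the tag is replaced by its contents
    return "<{0}>{1}</{0}>".format(elem.tag, inner)


root = ET.fromstring(
    "<p>Some <b>bold</b> and <i>italic</i><script>x()</script> text</p>"
)
result = clean(root, kill={"script"}, strip={"p", "b", "i"}, allow={"i"})
print(result)
```

Here script disappears with its contents, p and b are unwrapped, and i survives because it is allowed even though it is also listed for stripping.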

kill_xpath = './/script | .//style | .//comment() | .//processing-instruction() | .//*[@style="display:none;"]'
strip_xpath = None
allow_xpath = None
fix_whitespace = True
process_xpaths = {}
namespaces = {'dc': 'http://purl.org/dc/elements/1.1/', 'prism': 'http://prismstandard.org/namespaces/basic/2.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns', 're': 'http://exslt.org/regular-expressions', 'set': 'http://exslt.org/sets', 'xml': 'http://www.w3.org/XML/1998/namespace'}
__init__(**kwargs)[source]

Behaviour can be customized by overriding attributes in a subclass or setting them in the constructor.

Parameters:
  • kill_xpath (string) – XPath expression for tags to remove along with their contents.

  • strip_xpath (string) – XPath expression for tags to replace with their contents.

  • allow_xpath (string) – XPath expression for tags to except from strip_xpath.

  • fix_whitespace (bool) – Normalize whitespace to a single space and ensure newlines around block elements.

  • namespaces (dict) – Namespace prefixes to register for the XPaths.

clean_html(html)[source]

Apply Cleaner to HTML string or document and return a cleaned string or document.

clean_markup(markup, parser=None)[source]

Apply Cleaner to markup string or document and return a cleaned string or document.

chemdataextractor.scrape.clean.clean = <chemdataextractor.scrape.clean.Cleaner object>

A default Cleaner instance, which kills comments, processing instructions, script tags and style tags.

chemdataextractor.scrape.clean.clean_markup = <bound method Cleaner.clean_markup of <chemdataext...

Convenience function for applying clean to a string.

chemdataextractor.scrape.clean.clean_html = <bound method Cleaner.clean_html of <chemdataextra...

Convenience function for applying clean to an HTML string.

chemdataextractor.scrape.clean.strip = <chemdataextractor.scrape.clean.Cleaner object>

A Cleaner instance that is configured to strip all tags, replacing them with their text contents.

chemdataextractor.scrape.clean.strip_markup = <bound method Cleaner.clean_markup of <chemdataext...

Convenience function for applying strip to a string.

chemdataextractor.scrape.clean.strip_html = <bound method Cleaner.clean_html of <chemdataextra...

Convenience function for applying strip to an HTML string.

.scrape.csstranslator

Extend cssselect to improve handling of pseudo-elements.

This is derived from csstranslator.py in the Scrapy project. The original file is available at: https://github.com/scrapy/scrapy/blob/master/scrapy/selector/csstranslator.py

The original file was released under the BSD license:

Copyright (c) Scrapy developers. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  3. Neither the name of Scrapy nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

class chemdataextractor.scrape.csstranslator.CdeXPathExpr(path='', element='*', condition='', star_prefix=False)[source]

Bases: cssselect.xpath.XPathExpr

textnode = False
attribute = None
classmethod from_xpath(xpath, textnode=False, attribute=None)[source]
join(combiner, other)[source]
class chemdataextractor.scrape.csstranslator.TranslatorMixin[source]

Bases: object

xpath_element(selector)[source]
xpath_pseudo_element(xpath, pseudo_element)[source]
xpath_attr_functional_pseudo_element(xpath, function)[source]
xpath_text_simple_pseudo_element(xpath)[source]

Support selecting text nodes using ::text pseudo-element
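
What a pseudo-element extension has to do can be sketched with a toy translator that handles only bare tag selectors. The function css_to_xpath is hypothetical, and the ::attr(name) syntax is an assumption carried over from the Scrapy convention that this module is derived from; the real translators build on cssselect and cover the full selector grammar.

```python
def css_to_xpath(css):
    """Translate 'tag', 'tag::text' or 'tag::attr(name)' into an XPath string."""
    selector, _, pseudo = css.partition("::")
    xpath = "descendant-or-self::" + selector
    if not pseudo:
        return xpath  # plain element selector
    if pseudo == "text":
        return xpath + "/text()"  # select the text nodes of the element
    if pseudo.startswith("attr(") and pseudo.endswith(")"):
        return xpath + "/@" + pseudo[len("attr("):-1]  # select an attribute
    raise ValueError("unsupported pseudo-element: " + pseudo)


print(css_to_xpath("p::text"))
print(css_to_xpath("a::attr(href)"))
```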

class chemdataextractor.scrape.csstranslator.CssXmlTranslator[source]

Bases: chemdataextractor.scrape.csstranslator.TranslatorMixin, cssselect.xpath.GenericTranslator

class chemdataextractor.scrape.csstranslator.CssHTMLTranslator(xhtml=False)[source]

Bases: chemdataextractor.scrape.csstranslator.TranslatorMixin, cssselect.xpath.HTMLTranslator

.scrape.entity

An entity to extract.

class chemdataextractor.scrape.entity.Entity(selector)[source]

Bases: chemdataextractor.scrape.base.BaseEntity

fields = {}
__init__(selector)[source]
Parameters:

selector (Selector) – The selector to scrape.

classmethod scrape(selector, root, xpath=False)[source]

Return EntityList for the given selector.

serialize()[source]

Convert Entity to python dictionary.

to_json(*args, **kwargs)[source]

Convert Entity to JSON.
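
The relationship between serialize and to_json can be sketched with a small stand-in class. ToyEntity is hypothetical and holds values that are already scraped. Omitting None values follows the documented default of the null option on fields; forwarding extra arguments to json.dumps is an assumption.

```python
import json


class ToyEntity:
    """Minimal stand-in for an entity whose field values are already known."""

    fields = ("doi", "title", "volume")

    def __init__(self, **values):
        self.values = values

    def serialize(self):
        # Fields whose value is None are left out of the dictionary.
        return {
            name: self.values[name]
            for name in self.fields
            if self.values.get(name) is not None
        }

    def to_json(self, *args, **kwargs):
        # Assumption: extra arguments are passed straight to json.dumps.
        return json.dumps(self.serialize(), *args, **kwargs)


article = ToyEntity(doi="10.1000/example", title="A study")
print(article.to_json(sort_keys=True))
```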

class chemdataextractor.scrape.entity.EntityList(*entities)[source]

Bases: collections.abc.Sequence

Wrapper around a list of Entities to facilitate operations on all at once.

__init__(*entities)[source]

Initialize self. See help(type(self)) for accurate signature.

serialize()[source]

Serialize to a list of python dictionaries.

to_json(*args, **kwargs)[source]

Convert EntityList to JSON.
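
Because EntityList derives from collections.abc.Sequence, it only needs indexing and a length to gain iteration, membership tests and the other read-only sequence methods. A minimal sketch of that design, with hypothetical ToyEntityList and Record classes standing in for the real ones:

```python
import json
from collections.abc import Sequence


class Record:
    """Stand-in for an entity; serialize returns a plain dictionary."""

    def __init__(self, **data):
        self.data = data

    def serialize(self):
        return dict(self.data)


class ToyEntityList(Sequence):
    """Read-only wrapper that applies an operation to every item at once."""

    def __init__(self, *entities):
        self._entities = list(entities)

    def __getitem__(self, index):
        return self._entities[index]

    def __len__(self):
        return len(self._entities)

    def serialize(self):
        return [entity.serialize() for entity in self._entities]

    def to_json(self, *args, **kwargs):
        return json.dumps(self.serialize(), *args, **kwargs)


results = ToyEntityList(Record(title="First"), Record(title="Second"))
print(len(results), results.to_json())
```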

class chemdataextractor.scrape.entity.DocumentEntity(selector)[source]

Bases: chemdataextractor.scrape.entity.Entity

Generic document entity.

doi

A string field.

title

A string field.

authors

A string field.

published_date

A datetime field. Depends on python-dateutil.

online_date

A datetime field. Depends on python-dateutil.

journal

A string field.

volume

A string field.

issue

A string field.

firstpage

A string field.

lastpage

A string field.

abstract

A string field.

publisher

A string field.

issn

A string field.

language

A string field.

copyright

A string field.

license

A field with optional URL processing.

html_url

A field with optional URL processing.

pdf_url

A field with optional URL processing.

landing_url

A field with optional URL processing.

process_title = <chemdataextractor.text.normalize.Normalizer object>
process_journal = <chemdataextractor.text.normalize.Normalizer object>
process_publisher = <chemdataextractor.text.normalize.Normalizer object>
process_authors = <chemdataextractor.text.normalize.Normalizer object>
process_abstract = <chemdataextractor.text.normalize.Normalizer object>
fields = {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}

.scrape.fields

Fields to define on an entity.

class chemdataextractor.scrape.fields.StringField(selection, lower=False, upper=False, strip=False, **kwargs)[source]

Bases: chemdataextractor.scrape.base.BaseField

A string field.

__init__(selection, lower=False, upper=False, strip=False, **kwargs)[source]
Parameters:
  • lower (bool) – (Optional) Whether to lowercase the string. Default False.

  • upper (bool) – (Optional) Whether to uppercase the string. Default False.

  • strip (bool) – (Optional) Whether to strip whitespace from start/end. Default False.

process(value)[source]

Override to perform custom processing of a value.

class chemdataextractor.scrape.fields.UrlField(selection, strip_querystring=False, **kwargs)[source]

Bases: chemdataextractor.scrape.fields.StringField

A field with optional URL processing.

__init__(selection, strip_querystring=False, **kwargs)[source]
Parameters:

strip_querystring – (Optional) Whether to remove the querystring. Default False.

process(value)[source]

Override to perform custom processing of a value.
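
Removing a query string can be sketched with the standard library alone. This is an illustration of the idea behind strip_querystring, not the UrlField code; the helper name and the decision to keep the fragment are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_querystring(url):
    """Return url without its query string, keeping the other components."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", parts.fragment))


cleaned = strip_querystring("https://example.org/article/123?utm_source=rss&page=2")
print(cleaned)
```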

class chemdataextractor.scrape.fields.EntityField(entity, selection, **kwargs)[source]

Bases: chemdataextractor.scrape.base.BaseField

A field that contains another Entity.

__init__(entity, selection, **kwargs)[source]
Parameters:

entity – The embedded entity.

scrape(selector, cleaner=None, processor=None)[source]

Scrape the value for this field from the selector.

class chemdataextractor.scrape.fields.IntField(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]

Bases: chemdataextractor.scrape.base.BaseField

An integer number field.

process(value)[source]

Convert value to an int.

class chemdataextractor.scrape.fields.FloatField(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]

Bases: chemdataextractor.scrape.base.BaseField

A floating point number field.

process(value)[source]

Convert value to a float.

class chemdataextractor.scrape.fields.BoolField(selection, true=re.compile('true|yes|1', re.IGNORECASE), false=re.compile('false|no|0', re.IGNORECASE), **kwargs)[source]

Bases: chemdataextractor.scrape.base.BaseField

A boolean field type.

__init__(selection, true=re.compile('true|yes|1', re.IGNORECASE), false=re.compile('false|no|0', re.IGNORECASE), **kwargs)[source]
Parameters:
  • true – Regular expression match that evaluates to True.

  • false – Regular expression match that evaluates to False.

process(value)[source]

Override to perform custom processing of a value.
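
The default true and false patterns from the signature above can be applied as in the following sketch. The function to_bool is hypothetical, and two details are assumptions: that the patterns are tested with match (anchored at the start of the string only), and that a value matching neither pattern yields None.

```python
import re

TRUE = re.compile("true|yes|1", re.IGNORECASE)
FALSE = re.compile("false|no|0", re.IGNORECASE)


def to_bool(value, true=TRUE, false=FALSE):
    """Map a scraped string to True, False, or None when it matches neither."""
    if value is None:
        return None
    if true.match(value):
        return True
    if false.match(value):
        return False
    return None


print(to_bool("Yes"), to_bool("NO"), to_bool("maybe"))
```

Because match only anchors the start of the string, a value such as "10" would count as true under these assumptions; a stricter variant could use fullmatch instead.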

class chemdataextractor.scrape.fields.DateTimeField(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]

Bases: chemdataextractor.scrape.base.BaseField

A datetime field. Depends on python-dateutil.

process(value)[source]

Override to perform custom processing of a value.

serialize(value)[source]

Serialize this field.

.scrape.scraper

Concrete classes for scraping and searching.

class chemdataextractor.scrape.scraper.HtmlFormat[source]

Bases: chemdataextractor.scrape.base.BaseFormat

Process HTML response and return a Selector.

process_response(response)[source]

Return a Selector for the given response.

Parameters:

response (requests.Response) – The response object.

Return type:

Selector

class chemdataextractor.scrape.scraper.XmlFormat[source]

Bases: chemdataextractor.scrape.base.BaseFormat

Process XML response and return a Selector.

namespaces = None
process_response(response)[source]

Return a Selector for the given response.

Parameters:

response (requests.Response) – The response object.

Return type:

Selector

class chemdataextractor.scrape.scraper.GetRequester[source]

Bases: chemdataextractor.scrape.base.BaseRequester

make_request(session, url, **kwargs)[source]

Make an HTTP GET request.

Parameters:

url – The URL to get.

Returns:

The response to the request.

Return type:

requests.Response

class chemdataextractor.scrape.scraper.PostRequester[source]

Bases: chemdataextractor.scrape.base.BaseRequester

make_request(session, url, **kwargs)[source]

Make an HTTP POST request.

Parameters:
  • url – The URL to post to.

  • data – The data to post.

Returns:

The response to the request.

Return type:

requests.Response

class chemdataextractor.scrape.scraper.UrlScraper[source]

Bases: chemdataextractor.scrape.scraper.GetRequester, chemdataextractor.scrape.scraper.HtmlFormat, chemdataextractor.scrape.base.BaseScraper

Scraper that takes a URL as input.

process_url(url)[source]

Override to filter or process input URL prior to making request.

run(url)[source]

Request URL, scrape response and return an EntityList.
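
The hook methods documented for scrapers suggest a template-method flow in which run calls each overridable step in turn. The sketch below shows that flow offline with a canned response. ToyUrlScraper is hypothetical, the order of the calls is an assumption inferred from the method names, and plain lists and dictionaries stand in for Selector, EntityList and Entity.

```python
import re


class ToyUrlScraper:
    """Each step is a separate method so a subclass can override just one."""

    def process_url(self, url):
        return url.strip()

    def make_request(self, url):
        # Stand-in for an HTTP GET: returns canned markup so no network is needed.
        return "<ul><li>alpha</li><li></li><li>beta</li></ul>"

    def process_response(self, response):
        return response  # a real scraper would build a Selector here

    def get_roots(self, selector):
        # One root per entity; here, the content of each list item.
        return re.findall(r"<li>(.*?)</li>", selector)

    def process_entity(self, entity):
        # Returning None drops the entity (here: items with an empty name).
        return entity if entity["name"] else None

    def run(self, url):
        url = self.process_url(url)
        selector = self.process_response(self.make_request(url))
        entities = []
        for root in self.get_roots(selector):
            entity = self.process_entity({"name": root})
            if entity is not None:
                entities.append(entity)
        return entities


results = ToyUrlScraper().run(" https://example.org/list ")
print(results)
```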

class chemdataextractor.scrape.scraper.RssScraper[source]

Bases: chemdataextractor.scrape.scraper.XmlFormat, chemdataextractor.scrape.scraper.UrlScraper

RSS scraper

root = 'item'
namespaces = {'atom': 'http://www.w3.org/2005/Atom', 'feedburner': 'http://rssnamespace.org/feedburner/ext/1.0'}
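
With root set to 'item', each item element of a feed becomes the root of one entity, and the registered prefixes make namespaced elements addressable. The same idea is shown here with the standard library on an inline feed. The feed content and variable names are made up for the example, and this sketch does not use the RssScraper class itself.

```python
import xml.etree.ElementTree as ET

FEED = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Example feed</title>
    <atom:link href="https://example.org/feed.xml" rel="self"/>
    <item><title>First article</title><link>https://example.org/1</link></item>
    <item><title>Second article</title><link>https://example.org/2</link></item>
  </channel>
</rss>"""

# Prefix-to-URI mapping, as in the namespaces attribute above.
NAMESPACES = {"atom": "http://www.w3.org/2005/Atom"}

channel = ET.fromstring(FEED).find("channel")

# One dictionary per item element, the equivalent of one entity per root.
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in channel.iter("item")
]

# A namespaced element is found through its registered prefix.
feed_url = channel.find("atom:link", NAMESPACES).get("href")
print(items, feed_url)
```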
class chemdataextractor.scrape.scraper.SearchScraper[source]

Bases: chemdataextractor.scrape.scraper.GetRequester, chemdataextractor.scrape.scraper.HtmlFormat, chemdataextractor.scrape.base.BaseScraper

Scraper that takes a search query as input.

process_query(query)[source]

Override to filter or process input query prior to making request.

Override to implement search. Take query input and return a SearchResult.

run(query, page=1)[source]
class chemdataextractor.scrape.scraper.SearchResult[source]

Bases: object

Class to handle results from a search query to websites, regardless of method of scraping used.

selector

Process the result of the search, giving a selector

Returns:

The result of the search

Return type:

selector

class chemdataextractor.scrape.scraper.SeleniumSearchResult(driver)[source]

Bases: object

Search results when using Selenium for scraping

__init__(driver)[source]
Parameters:

driver (selenium.webdriver) – driver from which results will be scraped.

selector
class chemdataextractor.scrape.scraper.ResponseSearchResult(response)[source]

Bases: object

Search results when using the requests library for scraping

__init__(response)[source]
Parameters:

response (requests.Response) – HTML response for results

selector

.scrape.selector

Tool for selecting content from HTML or XML using CSS or XPath expressions.

class chemdataextractor.scrape.selector.Selector(root, fmt='html', translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, namespaces=None)[source]

Bases: object

Tool for selecting content from HTML or XML using XPath selectors.

__init__(root, fmt='html', translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, namespaces=None)[source]

Initialize self. See help(type(self)) for accurate signature.

classmethod from_text(text, base_url=None, parser=<class 'lxml.html.HTMLParser'>, translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, fmt='html', namespaces=None, encoding=None)[source]
classmethod from_html_text(text, base_url=None, namespaces=None, encoding=None)[source]
classmethod from_xml_text(text, base_url=None, namespaces=None, encoding=None)[source]
classmethod from_response(response, parser=<class 'lxml.html.HTMLParser'>, translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, fmt='html', namespaces=None)[source]
classmethod from_html(response, namespaces=None)[source]
classmethod from_xml(response, namespaces=None)[source]
path

Absolute path to the root of this selector.

tag

Tag name of the root of this selector.

xpath(query)[source]
css(query)[source]
re(regex)[source]
extract(cleaner=None, raw=False)[source]
class chemdataextractor.scrape.selector.SelectorList(*selectors)[source]

Bases: collections.abc.Sequence

Wrapper around a list of Selectors to allow selecting from all at once.

__init__(*selectors)[source]

Initialize self. See help(type(self)) for accurate signature.

xpath(xpath)[source]
re(regex)[source]
extract(cleaner=None, raw=False)[source]
extract_first(cleaner=None, raw=False, default=None)[source]
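
The difference between extract and extract_first can be sketched with the standard library's limited XPath support. The two helper functions below are hypothetical stand-ins that operate on lists of ElementTree elements; they ignore the cleaner and raw options and only illustrate the first-match-or-default behaviour suggested by the signatures.

```python
import xml.etree.ElementTree as ET

DOC = ET.fromstring(
    "<article><title>A <i>short</i> study</title>"
    "<p>First paragraph.</p><p>Second paragraph.</p></article>"
)


def extract(elements):
    """Text content of every matched element, with nested tags flattened."""
    return ["".join(element.itertext()) for element in elements]


def extract_first(elements, default=None):
    """Text of the first match, or default when nothing matched."""
    texts = extract(elements)
    return texts[0] if texts else default


paragraphs = extract(DOC.findall(".//p"))
title = extract_first(DOC.findall("./title"))
missing = extract_first(DOC.findall(".//abstract"), default="")
print(paragraphs, title, repr(missing))
```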