.scrape¶
Scrapers for the various data sources
Declarative scraping framework for extracting structured data from HTML and XML documents.
-
chemdataextractor.scrape.
BLOCK_ELEMENTS
= {'address', 'article', 'aside', 'audio', 'blockquo...¶ Block level HTML elements
-
chemdataextractor.scrape.
INLINE_ELEMENTS
= {'a', 'abbr', 'acronym', 'b', 'bdo', 'big', 'blink...¶ Inline level HTML elements
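As a rough illustration of why the two sets are kept apart, the sketch below breaks lines at block-level boundaries and lets inline elements flow within a line. It uses only the standard library; the `BLOCK` and `INLINE` subsets, the `TextExtractor` class and `extract_text` are hypothetical names for this example and are not how the library itself consumes these constants.

```python
from html.parser import HTMLParser

# Hypothetical subsets for illustration; the library's real sets are longer.
BLOCK = {"address", "article", "aside", "div", "p"}
INLINE = {"a", "abbr", "b", "big"}  # inline tags never introduce a line break


class TextExtractor(HTMLParser):
    """Collect text, breaking lines at block-level element boundaries."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in BLOCK:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)

    def text(self):
        # Trim each line and drop the empty ones left by nested blocks.
        lines = (line.strip() for line in "".join(self.parts).splitlines())
        return "\n".join(line for line in lines if line)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Here the `<b>` label stays on the same line as its sentence, while each `<p>` starts a new line.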
.scrape.pub¶
Scraping tools for specific publishers.
.scrape.pub.nlm¶
Tools for scraping documents from NLM Journal Archiving and Interchange DTD XML files.
-
chemdataextractor.scrape.pub.nlm.
strip_pmc_xml
= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that kills reference links, footnote links, equations, footnotes
-
chemdataextractor.scrape.pub.nlm.
strip_pmc_abstract_xml
= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that also kills headings
-
chemdataextractor.scrape.pub.nlm.
strip_pmc_paragraph_xml
= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that also kills tables and figures
-
chemdataextractor.scrape.pub.nlm.
space_labels
(document)[source]¶ Ensure space around bold compound labels.
-
chemdataextractor.scrape.pub.nlm.
tidy_nlm_references
(document)[source]¶ Remove punctuation around references like brackets, commas, hyphens.
-
class
chemdataextractor.scrape.pub.nlm.
NlmXmlAuthor
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Author information from an NLM XML file.
-
givennames
¶ A string field.
-
lastname
¶ A string field.
-
email
¶ A string field.
-
process_givennames
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_lastname
= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields
= {'email': <chemdataextractor.scrape.fields.StringField object>, 'givennames': <chemdataextractor.scrape.fields.StringField object>, 'lastname': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.nlm.
NlmXmlImage
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Figure information from an NLM XML file.
-
label
¶ A string field.
-
caption
¶ A string field.
-
reference
¶ A string field.
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.nlm.
NlmXmlTable
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Table information from an NLM XML file.
-
label
¶ A string field.
-
caption
¶ A string field.
-
reference
¶ A string field.
-
src
¶ A string field.
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.nlm.
NlmXmlDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Document information from an NLM XML file.
-
doi
¶ A string field.
-
pmid
¶ An integer number field.
-
pmcid
¶ An integer number field.
-
title
¶ A string field.
-
authors
¶ A field that contains another Entity.
-
journal_title
¶ A string field.
-
journal_abbreviation
¶ A string field.
-
publisher
¶ A string field.
-
volume
¶ A string field.
-
firstpage
¶ A string field.
-
lastpage
¶ A string field.
-
issue
¶ A string field.
-
issn
¶ A string field.
-
coden
¶ A string field.
-
abstract
¶ A string field.
-
online_year
¶ An integer number field.
-
online_month
¶ An integer number field.
-
online_day
¶ An integer number field.
-
published_year
¶ An integer number field.
-
published_month
¶ An integer number field.
-
published_day
¶ An integer number field.
-
accepted_year
¶ An integer number field.
-
accepted_month
¶ An integer number field.
-
accepted_day
¶ An integer number field.
-
received_year
¶ An integer number field.
-
received_month
¶ An integer number field.
-
received_day
¶ An integer number field.
-
license
¶ A field with optional URL processing.
-
clean_title
= <chemdataextractor.scrape.clean.Cleaner object>¶
-
clean_abstract
= <chemdataextractor.scrape.clean.Cleaner object>¶
-
process_title
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_publisher
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_abstract
= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'accepted_day': <chemdataextractor.scrape.fields.IntField object>, 'accepted_month': <chemdataextractor.scrape.fields.IntField object>, 'accepted_year': <chemdataextractor.scrape.fields.IntField object>, 'authors': <chemdataextractor.scrape.fields.EntityField object>, 'coden': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal_abbreviation': <chemdataextractor.scrape.fields.StringField object>, 'journal_title': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_day': <chemdataextractor.scrape.fields.IntField object>, 'online_month': <chemdataextractor.scrape.fields.IntField object>, 'online_year': <chemdataextractor.scrape.fields.IntField object>, 'pmcid': <chemdataextractor.scrape.fields.IntField object>, 'pmid': <chemdataextractor.scrape.fields.IntField object>, 'published_day': <chemdataextractor.scrape.fields.IntField object>, 'published_month': <chemdataextractor.scrape.fields.IntField object>, 'published_year': <chemdataextractor.scrape.fields.IntField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'received_day': <chemdataextractor.scrape.fields.IntField object>, 'received_month': <chemdataextractor.scrape.fields.IntField object>, 'received_year': <chemdataextractor.scrape.fields.IntField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
.scrape.pub.rsc¶
Tools for scraping documents from The Royal Society of Chemistry.
-
chemdataextractor.scrape.pub.rsc.
CHAR_REPLACEMENTS
= [('\\[?\\[1 with combining macron\\]\\]?', '1̄'), ...¶ Map placeholder text to unicode characters.
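A minimal sketch of how a list of (pattern, replacement) pairs like this could be applied, assuming each first element is a regular expression. Only the first pair below comes from the listing above; the second pair and the `substitute` helper are hypothetical, added for illustration.

```python
import re

# The first pair is taken from the listing above; the second is a made-up
# entry that only shows the list can hold several pairs.
REPLACEMENTS = [
    (r"\[?\[1 with combining macron\]\]?", "1\u0304"),
    (r"\[?\[hypothetical alpha\]\]?", "\u03b1"),
]

# Compile once so repeated calls do not re-parse the patterns.
COMPILED = [(re.compile(pattern), replacement) for pattern, replacement in REPLACEMENTS]


def substitute(text):
    """Replace every placeholder in text with its unicode character."""
    for pattern, replacement in COMPILED:
        text = pattern.sub(replacement, text)
    return text
```

Because the outer brackets are optional in the pattern, both `[1 with combining macron]` and `[[1 with combining macron]]` collapse to the same character.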
-
chemdataextractor.scrape.pub.rsc.
RSC_IMG_CHARS
= {'2041': '^', '224a': '≈', 'e001': '=', 'e002': '≡...¶ Map image URL components to unicode characters.
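The idea of mapping image URL components to characters can be sketched as below. The four codes are the ones visible in the truncated listing above; the URL shape (a four-character hex code as the `.gif` file name), the regular expression and the `replace_img_chars` helper are assumptions made for this example, not the behaviour of the library's `replace_rsc_img_chars`.

```python
import re

# Codes visible in the truncated listing above.
IMG_CHARS = {"2041": "^", "224a": "\u2248", "e001": "=", "e002": "\u2261"}

# Assumed URL shape: the character code is the file name, e.g. ".../224a.gif".
IMG_TAG = re.compile(
    r'<img\b[^>]*\bsrc="[^"]*/([0-9a-f]{4})\.gif"[^>]*>', re.IGNORECASE
)


def replace_img_chars(html):
    """Swap character images for their unicode equivalents."""

    def swap(match):
        code = match.group(1).lower()
        # Leave unknown images untouched rather than guessing.
        return IMG_CHARS.get(code, match.group(0))

    return IMG_TAG.sub(swap, html)
```

Images whose code is not in the mapping are left in place, so ordinary figures are never dropped by accident.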
-
chemdataextractor.scrape.pub.rsc.
strip_rsc_html
= <chemdataextractor.scrape.clean.Cleaner object>¶ HTML stripper that kills superscript references and anything with style="display: none;" (typically tooltips)
-
chemdataextractor.scrape.pub.rsc.
strip_cit_html
= <chemdataextractor.scrape.clean.Cleaner object>¶ HTML stripper that also kills text from buttons in references.
-
chemdataextractor.scrape.pub.rsc.
rsc_substitute
= <chemdataextractor.text.processors.Substitutor obj...¶ Substitutor that replaces RSC escape codes with the actual unicode character
-
chemdataextractor.scrape.pub.rsc.
parse_rsc_html
(htmlstring)[source]¶ Messy RSC HTML needs this special parser to fix problems before creating selector.
-
chemdataextractor.scrape.pub.rsc.
replace_rsc_img_chars
(document)[source]¶ Replace image characters with unicode equivalents.
-
chemdataextractor.scrape.pub.rsc.
space_references
(document)[source]¶ Ensure a space around reference links, so there’s a gap when they are removed.
-
class
chemdataextractor.scrape.pub.rsc.
RscRssDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Document information from RSC RSS feed.
-
doi
¶ A string field.
-
title
¶ A string field.
-
authors
¶ A string field.
-
landing_url
¶ A field with optional URL processing.
-
process_title
= <chemdataextractor.text.processors.Chain object>¶
-
fields
= {'authors': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscRssScraper
[source]¶ Bases:
chemdataextractor.scrape.scraper.RssScraper
Scraper for RSC RSS feeds.
-
entity
¶ alias of
RscRssDocument
-
-
class
chemdataextractor.scrape.pub.rsc.
RscSearchDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Document information from RSC search results page.
-
doi
¶ A string field.
-
title
¶ A string field.
-
landing_url
¶ A field with optional URL processing.
-
pdf_url
¶ A field with optional URL processing.
-
html_url
¶ A field with optional URL processing.
-
journal
¶ A string field.
-
abstract
¶ A string field.
-
clean_title
= <chemdataextractor.text.processors.Chain object>¶
-
process_doi
= <chemdataextractor.text.processors.LAdd object>¶
-
process_title
= <chemdataextractor.text.processors.Chain object>¶
-
process_landing_url
= <chemdataextractor.text.processors.Chain object>¶
-
process_pdf_url
= <chemdataextractor.text.processors.Chain object>¶
-
process_html_url
= <chemdataextractor.text.processors.Chain object>¶
-
process_abstract
= <chemdataextractor.text.processors.Chain object>¶
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscSearchScraper
(max_wait_time=30, driver=None)[source]¶ Bases:
chemdataextractor.scrape.scraper.SearchScraper
Scraper for RSC search results.
-
entity
¶ alias of
RscSearchDocument
-
root
= '.capsule.capsule--article'¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscLandingSupplement
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
-
name
¶ A string field.
-
url
¶ A field with optional URL processing.
-
fields
= {'name': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>}¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscLandingDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.DocumentEntity
Document information from RSC landing page.
-
supplements
¶ A field that contains another Entity.
-
process_abstract
= <chemdataextractor.text.processors.Chain object>¶
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'supplements': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscLandingScraper
[source]¶ Bases:
chemdataextractor.scrape.scraper.UrlScraper
Scraper for RSC Landing pages.
-
entity
¶ alias of
RscLandingDocument
-
-
class
chemdataextractor.scrape.pub.rsc.
RscChemicalMention
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
-
text
¶ A string field.
-
chemspider_id
¶ A string field.
-
inchi
¶ A string field.
-
clean_text
= <chemdataextractor.text.processors.Chain object>¶
-
process_text
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_chemspider_id
= <chemdataextractor.text.processors.Chain object>¶
-
process_inchi
= <chemdataextractor.text.processors.Chain object>¶
-
fields
= {'chemspider_id': <chemdataextractor.scrape.fields.StringField object>, 'inchi': <chemdataextractor.scrape.fields.StringField object>, 'text': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscImage
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Embedded image. Includes both Schemes and Figures.
-
url
¶ A field with optional URL processing.
-
label
¶ A string field.
-
reference
¶ A string field.
-
caption
¶ A string field.
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>}¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscTable
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Table within document.
-
reference
¶ A string field.
-
label
¶ A string field.
-
caption
¶ A string field.
-
src
¶ A string field.
-
clean_src
= <chemdataextractor.text.processors.Chain object>¶
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscHtmlDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.DocumentEntity
-
title
¶ A string field.
-
abstract
¶ A string field.
-
pdf_url
¶ A field with optional URL processing.
-
html_url
¶ A field with optional URL processing.
-
landing_url
¶ A field with optional URL processing.
-
clean_title
= <chemdataextractor.text.processors.Chain object>¶
-
clean_abstract
= <chemdataextractor.text.processors.Chain object>¶
-
process_title
= <chemdataextractor.text.processors.Chain object>¶
-
process_abstract
= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.rsc.
RscHtmlScraper
[source]¶ Bases:
chemdataextractor.scrape.scraper.UrlScraper
Scraper for RSC HTML article pages.
-
entity
¶ alias of
RscHtmlDocument
-
.scrape.pub.springer¶
Tools for scraping documents from Springer, Biomed Central and Chemistry Central XML files.
-
chemdataextractor.scrape.pub.springer.
strip_springer_xml
= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that also kills equations/formulas.
-
chemdataextractor.scrape.pub.springer.
strip_springer_abstract_xml
= <chemdataextractor.scrape.clean.Cleaner object>¶ XML stripper that also kills headings
-
chemdataextractor.scrape.pub.springer.
tidy_springer_references
(document)[source]¶ Remove punctuation around references like brackets, commas, hyphens.
-
class
chemdataextractor.scrape.pub.springer.
SpringerHtmlDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.DocumentEntity
Scraper for Springer HTML articles
-
title
¶ A string field.
-
abstract
¶ A string field.
-
journal
¶ A string field.
-
process_html_url
= <chemdataextractor.text.processors.RAdd object>¶
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.springer.
SpringerXmlAuthor
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Author information from a Springer XML file.
-
firstname
¶ A string field.
-
middlename
¶ A string field.
-
lastname
¶ A string field.
-
suffix
¶ A string field.
-
email
¶ A string field.
-
process_email
= <chemdataextractor.text.processors.Discard object>¶
-
fields
= {'email': <chemdataextractor.scrape.fields.StringField object>, 'firstname': <chemdataextractor.scrape.fields.StringField object>, 'lastname': <chemdataextractor.scrape.fields.StringField object>, 'middlename': <chemdataextractor.scrape.fields.StringField object>, 'suffix': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.springer.
SpringerXmlImage
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Figure information from a Springer XML file.
-
label
¶ A string field.
-
caption
¶ A string field.
-
reference
¶ A string field.
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.springer.
SpringerXmlTable
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Table information from a Springer XML file.
-
label
¶ A string field.
-
caption
¶ A string field.
-
reference
¶ A string field.
-
src
¶ A string field.
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>, 'reference': <chemdataextractor.scrape.fields.StringField object>, 'src': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.springer.
SpringerXmlDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Document information from a Springer XML file.
-
ui
¶ A string field.
-
doi
¶ A string field.
-
title
¶ A string field.
-
authors
¶ A field that contains another Entity.
-
journal
¶ A string field.
-
firstpage
¶ A string field.
-
year
¶ An integer number field.
-
volume
¶ A string field.
-
issue
¶ A string field.
-
issn
¶ A string field.
-
landing_url
¶ A field with optional URL processing.
-
abstract
¶ A string field.
-
published_year
¶ An integer number field.
-
published_month
¶ An integer number field.
-
published_day
¶ An integer number field.
-
accepted_year
¶ An integer number field.
-
accepted_month
¶ An integer number field.
-
accepted_day
¶ An integer number field.
-
received_year
¶ An integer number field.
-
received_month
¶ An integer number field.
-
received_day
¶ An integer number field.
-
license
¶ A field with optional URL processing.
-
figures
¶ A field that contains another Entity.
-
schemes
¶ A field that contains another Entity.
-
tables
¶ A field that contains another Entity.
-
headings
¶ A string field.
-
paragraphs
¶ A string field.
-
clean_title
= <chemdataextractor.scrape.clean.Cleaner object>¶
-
clean_abstract
= <chemdataextractor.text.processors.Chain object>¶
-
clean_headings
= <chemdataextractor.scrape.clean.Cleaner object>¶
-
clean_paragraphs
= <chemdataextractor.text.processors.Chain object>¶
-
process_abstract
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_headings
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_paragraphs
= <chemdataextractor.text.processors.Chain object>¶
-
process_license
= <chemdataextractor.text.processors.Chain object>¶
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'accepted_day': <chemdataextractor.scrape.fields.IntField object>, 'accepted_month': <chemdataextractor.scrape.fields.IntField object>, 'accepted_year': <chemdataextractor.scrape.fields.IntField object>, 'authors': <chemdataextractor.scrape.fields.EntityField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'published_day': <chemdataextractor.scrape.fields.IntField object>, 'published_month': <chemdataextractor.scrape.fields.IntField object>, 'published_year': <chemdataextractor.scrape.fields.IntField object>, 'received_day': <chemdataextractor.scrape.fields.IntField object>, 'received_month': <chemdataextractor.scrape.fields.IntField object>, 'received_year': <chemdataextractor.scrape.fields.IntField object>, 'schemes': <chemdataextractor.scrape.fields.EntityField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'ui': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>, 'year': <chemdataextractor.scrape.fields.IntField object>}¶
-
.scrape.pub.elsevier¶
Tools for scraping documents from Elsevier.
- copyright:
Copyright 2017 by Callum Court.
- license:
MIT, see LICENSE file for more details.
-
chemdataextractor.scrape.pub.elsevier.
CHAR_REPLACEMENTS
= [('\\[?\\[1 with combining macron\\]\\]?', '1̄'), ...¶ Map placeholder text to unicode characters.
-
chemdataextractor.scrape.pub.elsevier.
elsevier_substitute
= <chemdataextractor.text.processors.Substitutor obj...¶ Substitutor that replaces Elsevier escape codes with the actual unicode character
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierSearchDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Document information from Elsevier API search results.
-
test
¶ A string field.
-
fields
= {'test': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierSearchScraper
[source]¶ Bases:
chemdataextractor.scrape.scraper.UrlScraper
Scraper for Elsevier search results.
-
entity
¶ alias of
ElsevierSearchDocument
-
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierImage
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Embedded figure. Includes both Schemes and Figures.
-
caption
¶ A string field.
-
image_url
¶ A string field.
-
process_image_url
= <chemdataextractor.text.processors.LAdd object>¶
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'image_url': <chemdataextractor.scrape.fields.StringField object>}¶
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierTableData
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Embedded row data from document tables
-
rows
¶ A string field.
-
fields
= {'rows': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierTable
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Table within document.
-
title
¶ A string field.
-
column_headings
¶ A string field.
-
data
¶ A field that contains another Entity.
-
caption
¶ A string field.
-
process_title
= <chemdataextractor.text.processors.Chain object>¶
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'column_headings': <chemdataextractor.scrape.fields.StringField object>, 'data': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierHtmlDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.DocumentEntity
Scraper of document information from Elsevier HTML papers
-
doi
¶ A string field.
-
title
¶ A string field.
-
authors
¶ A string field.
-
abstract
¶ A string field.
-
journal
¶ A string field.
-
volume
¶ A string field.
-
copyright
¶ A string field.
-
headings
¶ A string field.
-
sub_headings
¶ A string field.
-
html_url
¶ A field with optional URL processing.
-
paragraphs
¶ A string field.
-
figures
¶ A field that contains another Entity.
-
published_date
¶ A string field.
-
citations
¶ A string field.
-
tables
¶ A field that contains another Entity.
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'citations': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.StringField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'sub_headings': <chemdataextractor.scrape.fields.StringField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierHtmlScraper
[source]¶ Bases:
chemdataextractor.scrape.scraper.UrlScraper
Scraper for Elsevier HTML paper pages
-
entity
¶ alias of
ElsevierHtmlDocument
-
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierXmlImage
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
-
caption
¶ A string field.
-
label
¶ A string field.
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'label': <chemdataextractor.scrape.fields.StringField object>}¶
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierXmlTableData
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
-
rows
¶ A string field.
-
fields
= {'rows': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierXmlTable
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
-
label
¶ A string field.
-
caption
¶ A string field.
-
column_headings
¶ A field that contains another Entity.
-
data
¶ A field that contains another Entity.
-
fields
= {'caption': <chemdataextractor.scrape.fields.StringField object>, 'column_headings': <chemdataextractor.scrape.fields.EntityField object>, 'data': <chemdataextractor.scrape.fields.EntityField object>, 'label': <chemdataextractor.scrape.fields.StringField object>}¶
-
-
class
chemdataextractor.scrape.pub.elsevier.
ElsevierXmlDocument
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Scraper for Elsevier XML articles
-
doi
¶ A string field.
-
title
¶ A string field.
-
authors
¶ A string field.
-
abstract
¶ A string field.
-
journal
¶ A string field.
-
volume
¶ A string field.
-
issue
¶ A string field.
-
pages
¶ A string field.
-
firstpage
¶ A string field.
-
lastpage
¶ A string field.
-
copyright
¶ A string field.
-
publisher
¶ A string field.
-
headings
¶ A string field.
-
url
¶ A field with optional URL processing.
-
paragraphs
¶ A string field.
-
figures
¶ A field that contains another Entity.
-
published_date
¶ A string field.
-
citations
¶ A string field.
-
tables
¶ A field that contains another Entity.
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'citations': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'figures': <chemdataextractor.scrape.fields.EntityField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'headings': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'pages': <chemdataextractor.scrape.fields.StringField object>, 'paragraphs': <chemdataextractor.scrape.fields.StringField object>, 'published_date': <chemdataextractor.scrape.fields.StringField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'tables': <chemdataextractor.scrape.fields.EntityField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'url': <chemdataextractor.scrape.fields.UrlField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
process_abstract
= <chemdataextractor.text.processors.Chain object>¶
-
.scrape.base¶
Abstract base classes that define the interface for Scrapers, Fields, Crawlers, etc.
-
class
chemdataextractor.scrape.base.
BaseScraper
[source]¶ Bases:
object
Abstract Scraper class from which all Scrapers inherit.
-
root
= None¶ CSS selector or XPath expression that returns the root of each entity.
-
root_xpath
= False¶ Whether the root is an XPath expression instead of a CSS selector.
-
create_session
()[source]¶ Override to set up default data (e.g. headers, authentication) on each request.
-
entity
¶ The Entity to scrape.
-
make_request
(url, data)[source]¶ Make an HTTP request.
- Parameters:
url – The URL to get.
data – Query data.
- Returns:
The response to the request.
- Return type:
requests.Response
-
-
class
chemdataextractor.scrape.base.
BaseEntityProcessor
[source]¶ Bases:
object
Abstract EntityProcessor class from which all EntityProcessors inherit.
-
process_entity
(entity)[source]¶ Process an Entity. Return None to filter Entity from the pipeline.
- Parameters:
entity (chemdataextractor.scrape.entity.Entity) – The Entity to process.
- Returns:
The processed Entity.
- Return type:
chemdataextractor.scrape.entity.Entity or None
-
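The filtering contract described for process_entity (return the Entity to keep it, return None to drop it) can be illustrated with a small standalone sketch. It does not use chemdataextractor itself: plain dicts stand in for Entities, and require_doi, lowercase_doi and run_pipeline are hypothetical names chosen for this illustration only.

```python
def require_doi(entity):
    """Filter step: return None to drop entities that have no DOI."""
    return entity if entity.get("doi") else None


def lowercase_doi(entity):
    """Transform step: return a modified copy of the entity."""
    return {**entity, "doi": entity["doi"].lower()}


def run_pipeline(entities, processors):
    """Pass each entity through every processor; skip it once one returns None."""
    for entity in entities:
        for process in processors:
            entity = process(entity)
            if entity is None:
                break  # filtered out, later processors never see it
        else:
            yield entity


scraped = [{"doi": "10.1000/ABC"}, {"title": "No DOI here"}]
kept = list(run_pipeline(scraped, [require_doi, lowercase_doi]))
print(kept)
```

Only the first record survives, because the second is filtered out before the lowercase step runs.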
-
class
chemdataextractor.scrape.base.
BaseEntity
[source]¶ Bases:
object
Abstract Entity class from which all Entities inherit.
-
class
chemdataextractor.scrape.base.
EntityMeta
[source]¶ Bases:
abc.ABCMeta
Metaclass for Entity.
-
class
chemdataextractor.scrape.base.
BaseField
(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Bases:
object
Base class for all fields.
-
name
= None¶
-
__init__
(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ - Parameters:
selection (string) – The CSS selector or XPath expression used to select the content to scrape.
xpath (bool) – (Optional) Whether selection is an XPath expression instead of a CSS selector. Default False.
re – (Optional) Regular expression to apply to scraped content.
all (bool) – (Optional) Whether to scrape all occurrences instead of just the first. Default False.
default – (Optional) The default value for this field if none is set.
null (bool) – (Optional) Include in serialized output even if value is None. Default False.
raw (bool) – (Optional) Whether to scrape the raw HTML/XML instead of the text contents. Default False.
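As a rough illustration of how the re, all and default options might combine, the sketch below post-processes a list of already-selected text values. It is an assumption about the behaviour, not the library's implementation; extract_field and its parameter names pattern and scrape_all are hypothetical.

```python
import re


def extract_field(values, pattern=None, scrape_all=False, default=None):
    """Apply an optional regex to selected text values, then pick one or all.

    `values` is the text of each node matched by the selection.
    """
    if pattern is not None:
        regex = re.compile(pattern)
        matched = []
        for value in values:
            m = regex.search(value)
            if m:
                # Prefer the first capture group when the pattern defines one.
                matched.append(m.group(1) if regex.groups else m.group(0))
        values = matched
    if not values:
        return default  # nothing selected, or nothing matched the regex
    return list(values) if scrape_all else values[0]


pages = ["pp. 12-18", "pp. 20"]
first_page = extract_field(pages, pattern=r"(\d+)")
all_pages = extract_field(pages, pattern=r"(\d+)", scrape_all=True)
missing = extract_field([], default="n/a")
```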
-
.scrape.clean¶
Tools for cleaning up XML/HTML by removing tags entirely or replacing with their contents.
-
class
chemdataextractor.scrape.clean.
Cleaner
(**kwargs)[source]¶ Bases:
object
Clean HTML or XML by removing tags completely or replacing with their contents.
A Cleaner instance provides a clean_markup method:
    from chemdataextractor.scrape.clean import Cleaner
    cleaner = Cleaner()
    htmlstring = '<html><body><script>alert("test")</script><p>Some text</p></body></html>'
    print(cleaner.clean_markup(htmlstring))
A Cleaner instance is also a callable that can be applied to lxml document trees:
    import lxml.etree
    tree = lxml.etree.fromstring(htmlstring)
    cleaner(tree)
    print(lxml.etree.tostring(tree))
Elements that are matched by kill_xpath are removed entirely, along with their contents. By default, kill_xpath matches all script and style tags, comments, processing instructions, and elements whose style attribute is exactly "display:none;".
Elements that are matched by strip_xpath are replaced with their contents. By default, no elements are stripped. A common use case is to set strip_xpath to .//*, which specifies that all elements should be stripped.
Elements that are matched by allow_xpath are exempt from stripping, even if they are also matched by strip_xpath. This is useful when setting strip_xpath to strip all tags, allowing a few exceptions to be specified by allow_xpath.
-
kill_xpath
= './/script | .//style | .//comment() | .//processing-instruction() | .//*[@style="display:none;"]'¶
-
strip_xpath
= None¶
-
allow_xpath
= None¶
-
fix_whitespace
= True¶
-
process_xpaths
= {}¶
-
namespaces
= {'dc': 'http://purl.org/dc/elements/1.1/', 'prism': 'http://prismstandard.org/namespaces/basic/2.0/', 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns', 're': 'http://exslt.org/regular-expressions', 'set': 'http://exslt.org/sets', 'xml': 'http://www.w3.org/XML/1998/namespace'}¶
-
__init__
(**kwargs)[source]¶ Behaviour can be customized by overriding attributes in a subclass or setting them in the constructor.
- Parameters:
kill_xpath (string) – XPath expression for tags to remove along with their contents.
strip_xpath (string) – XPath expression for tags to replace with their contents.
allow_xpath (string) – XPath expression for tags to except from strip_xpath.
fix_whitespace (bool) – Normalize whitespace to a single space and ensure newlines around block elements.
namespaces (dict) – Namespace prefixes to register for the XPaths.
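The difference between killing, stripping and allowing can be sketched with the standard library alone. This is not the Cleaner implementation: it matches plain tag names instead of XPath expressions, and the helper names kill, strip and _push_text are hypothetical.

```python
import xml.etree.ElementTree as ET


def _push_text(parent, index, text):
    """Attach text just before position `index` among parent's children."""
    if not text:
        return
    if index == 0:
        parent.text = (parent.text or "") + text
    else:
        prev = parent[index - 1]
        prev.tail = (prev.tail or "") + text


def kill(elem, tags):
    """Remove matching elements with their contents, keeping trailing text."""
    for child in list(elem):
        if child.tag in tags:
            _push_text(elem, list(elem).index(child), child.tail)
            elem.remove(child)
        else:
            kill(child, tags)


def strip(elem, tags, allow=()):
    """Replace matching elements with their contents, except allowed tags."""
    for child in list(elem):
        strip(child, tags, allow)  # handle descendants first
        if child.tag in tags and child.tag not in allow:
            index = list(elem).index(child)
            _push_text(elem, index, child.text)
            for offset, grandchild in enumerate(list(child)):
                elem.insert(index + offset, grandchild)  # hoist children
            _push_text(elem, list(elem).index(child), child.tail)
            elem.remove(child)


source = "<p>Some <b>bold <i>and italic</i></b> text<script>x()</script>!</p>"

doc = ET.fromstring(source)
kill(doc, {"script"})
strip(doc, {"b", "i"}, allow={"i"})
kept_italic = ET.tostring(doc, encoding="unicode")

doc = ET.fromstring(source)
kill(doc, {"script"})
strip(doc, {"b", "i"})
plain = ET.tostring(doc, encoding="unicode")
```

In the first run the i element survives because it is allowed; in the second run both b and i are replaced by their text.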
-
-
chemdataextractor.scrape.clean.
clean
= <chemdataextractor.scrape.clean.Cleaner object>¶ A default Cleaner instance, which kills comments, processing instructions, script tags, style tags.
-
chemdataextractor.scrape.clean.
clean_markup
= <bound method Cleaner.clean_markup of <chemdataext...¶ Convenience function for applying
clean
to a string.
-
chemdataextractor.scrape.clean.
clean_html
= <bound method Cleaner.clean_html of <chemdataextra...¶ Convenience function for applying
clean
to an HTML string.
-
chemdataextractor.scrape.clean.
strip
= <chemdataextractor.scrape.clean.Cleaner object>¶ A Cleaner instance that is configured to strip all tags, replacing them with their text contents.
-
chemdataextractor.scrape.clean.
strip_markup
= <bound method Cleaner.clean_markup of <chemdataext...¶ Convenience function for applying
strip
to a string.
-
chemdataextractor.scrape.clean.
strip_html
= <bound method Cleaner.clean_html of <chemdataextra...¶ Convenience function for applying
strip
to an HTML string.
.scrape.csstranslator¶
Extend cssselect to improve handling of pseudo-elements.
This is derived from csstranslator.py in the Scrapy project. The original file is available at: https://github.com/scrapy/scrapy/blob/master/scrapy/selector/csstranslator.py
The original file was released under the BSD license:
Copyright (c) Scrapy developers. All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
Neither the name of Scrapy nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-
class
chemdataextractor.scrape.csstranslator.
CdeXPathExpr
(path='', element='*', condition='', star_prefix=False)[source]¶ Bases:
cssselect.xpath.XPathExpr
-
textnode
= False¶
-
attribute
= None¶
-
-
class
chemdataextractor.scrape.csstranslator.
CssXmlTranslator
[source]¶ Bases:
chemdataextractor.scrape.csstranslator.TranslatorMixin
,cssselect.xpath.GenericTranslator
-
class
chemdataextractor.scrape.csstranslator.
CssHTMLTranslator
(xhtml=False)[source]¶ Bases:
chemdataextractor.scrape.csstranslator.TranslatorMixin
,cssselect.xpath.HTMLTranslator
.scrape.entity¶
An entity to extract.
-
class
chemdataextractor.scrape.entity.
Entity
(selector)[source]¶ Bases:
chemdataextractor.scrape.base.BaseEntity
-
fields
= {}¶
-
-
class
chemdataextractor.scrape.entity.
EntityList
(*entities)[source]¶ Bases:
collections.abc.Sequence
Wrapper around a list of Entities to facilitate operations on all at once.
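The general pattern of a Sequence wrapper that supports operations on all items at once might look like the sketch below. It is an illustration only: RecordList and its pluck method are hypothetical, and dicts stand in for Entities.

```python
from collections.abc import Sequence


class RecordList(Sequence):
    """Read-only sequence of records with one bulk operation."""

    def __init__(self, *records):
        self._records = list(records)

    def __getitem__(self, index):
        return self._records[index]

    def __len__(self):
        return len(self._records)

    def pluck(self, key):
        """Collect the value of `key` from every record at once."""
        return [record.get(key) for record in self._records]


records = RecordList({"doi": "10.1000/a"}, {"doi": "10.1000/b"})
dois = records.pluck("doi")
```

Subclassing collections.abc.Sequence only requires __getitem__ and __len__; iteration, membership tests, index and count come from the mixin methods.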
-
class
chemdataextractor.scrape.entity.
DocumentEntity
(selector)[source]¶ Bases:
chemdataextractor.scrape.entity.Entity
Generic document entity.
-
doi
¶ A string field.
-
title
¶ A string field.
-
authors
¶ A string field.
-
published_date
¶ A datetime field. Depends on python-dateutil.
-
online_date
¶ A datetime field. Depends on python-dateutil.
-
journal
¶ A string field.
-
volume
¶ A string field.
-
issue
¶ A string field.
-
firstpage
¶ A string field.
-
lastpage
¶ A string field.
-
abstract
¶ A string field.
-
publisher
¶ A string field.
-
issn
¶ A string field.
-
language
¶ A string field.
-
copyright
¶ A string field.
-
license
¶ A field with optional URL processing.
-
html_url
¶ A field with optional URL processing.
-
pdf_url
¶ A field with optional URL processing.
-
landing_url
¶ A field with optional URL processing.
-
process_title
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_journal
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_publisher
= <chemdataextractor.text.normalize.Normalizer object>¶
-
process_abstract
= <chemdataextractor.text.normalize.Normalizer object>¶
-
fields
= {'abstract': <chemdataextractor.scrape.fields.StringField object>, 'authors': <chemdataextractor.scrape.fields.StringField object>, 'copyright': <chemdataextractor.scrape.fields.StringField object>, 'doi': <chemdataextractor.scrape.fields.StringField object>, 'firstpage': <chemdataextractor.scrape.fields.StringField object>, 'html_url': <chemdataextractor.scrape.fields.UrlField object>, 'issn': <chemdataextractor.scrape.fields.StringField object>, 'issue': <chemdataextractor.scrape.fields.StringField object>, 'journal': <chemdataextractor.scrape.fields.StringField object>, 'landing_url': <chemdataextractor.scrape.fields.UrlField object>, 'language': <chemdataextractor.scrape.fields.StringField object>, 'lastpage': <chemdataextractor.scrape.fields.StringField object>, 'license': <chemdataextractor.scrape.fields.UrlField object>, 'online_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'pdf_url': <chemdataextractor.scrape.fields.UrlField object>, 'published_date': <chemdataextractor.scrape.fields.DateTimeField object>, 'publisher': <chemdataextractor.scrape.fields.StringField object>, 'title': <chemdataextractor.scrape.fields.StringField object>, 'volume': <chemdataextractor.scrape.fields.StringField object>}¶
-
.scrape.fields¶
Fields to define on an entity.
-
class
chemdataextractor.scrape.fields.
StringField
(selection, lower=False, upper=False, strip=False, **kwargs)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A string field.
-
class
chemdataextractor.scrape.fields.
UrlField
(selection, strip_querystring=False, **kwargs)[source]¶ Bases:
chemdataextractor.scrape.fields.StringField
A field with optional URL processing.
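What an option like strip_querystring could do to a scraped URL is sketched below with the standard library. This is an assumption about the intent, not the field's actual code; in particular, keeping the fragment is a choice made for this sketch, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_querystring(url):
    """Return `url` without its query string, leaving other parts intact."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(query=""))


cleaned = strip_querystring("https://example.org/article?id=1&ref=rss#sec2")
print(cleaned)
```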
-
class
chemdataextractor.scrape.fields.
EntityField
(entity, selection, **kwargs)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A field that contains another Entity.
-
class
chemdataextractor.scrape.fields.
IntField
(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
An integer number field.
-
class
chemdataextractor.scrape.fields.
FloatField
(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A floating point number field.
-
class
chemdataextractor.scrape.fields.
BoolField
(selection, true=re.compile('true|yes|1', re.IGNORECASE), false=re.compile('false|no|0', re.IGNORECASE), **kwargs)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A boolean field type.
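The default true and false patterns in the signature above can be tried out directly. The sketch below assumes the scraped text is tested against the true pattern first and then the false pattern; parse_bool is a hypothetical helper, not part of the library.

```python
import re

# Default patterns copied from the BoolField signature.
TRUE = re.compile('true|yes|1', re.IGNORECASE)
FALSE = re.compile('false|no|0', re.IGNORECASE)


def parse_bool(text, true=TRUE, false=FALSE):
    """Map text to True, False, or None when neither pattern matches."""
    text = text.strip()
    if true.match(text):
        return True
    if false.match(text):
        return False
    return None


results = [parse_bool(value) for value in ("Yes", " NO ", "maybe")]
```

Because re.match only anchors at the start of the string, a value such as "1000" would also count as true in this sketch; fullmatch would be the stricter choice.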
-
class
chemdataextractor.scrape.fields.
DateTimeField
(selection, xpath=False, re=None, all=False, default=None, null=False, raw=False)[source]¶ Bases:
chemdataextractor.scrape.base.BaseField
A datetime field. Depends on python-dateutil.
.scrape.scraper¶
Concrete classes for scraping and searching.
-
class
chemdataextractor.scrape.scraper.
HtmlFormat
[source]¶ Bases:
chemdataextractor.scrape.base.BaseFormat
Process HTML response and return a Selector.
-
class
chemdataextractor.scrape.scraper.
XmlFormat
[source]¶ Bases:
chemdataextractor.scrape.base.BaseFormat
Process XML response and return a Selector.
-
namespaces
= None¶
-
-
class
chemdataextractor.scrape.scraper.
UrlScraper
[source]¶ Bases:
chemdataextractor.scrape.scraper.GetRequester
,chemdataextractor.scrape.scraper.HtmlFormat
,chemdataextractor.scrape.base.BaseScraper
Scraper that takes a URL as input.
-
class
chemdataextractor.scrape.scraper.
RssScraper
[source]¶ Bases:
chemdataextractor.scrape.scraper.XmlFormat
,chemdataextractor.scrape.scraper.UrlScraper
RSS scraper
-
root
= 'item'¶
-
namespaces
= {'atom': 'http://www.w3.org/2005/Atom', 'feedburner': 'http://rssnamespace.org/feedburner/ext/1.0'}¶
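To show what a root of 'item' and a namespace mapping mean for an RSS feed, the sketch below uses xml.etree.ElementTree from the standard library rather than RssScraper. The feed content and URLs are made up for the example.

```python
import xml.etree.ElementTree as ET

# Same prefix-to-URI mapping as the namespaces attribute above.
NAMESPACES = {
    "atom": "http://www.w3.org/2005/Atom",
    "feedburner": "http://rssnamespace.org/feedburner/ext/1.0",
}

FEED = """<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
<channel>
  <title>Example feed</title>
  <atom:link href="https://example.org/feed.xml" rel="self"/>
  <item><title>First</title><link>https://example.org/1</link></item>
  <item><title>Second</title><link>https://example.org/2</link></item>
</channel>
</rss>"""

feed = ET.fromstring(FEED)

# Each <item> element is the root of one scraped record.
items = [
    {"title": item.findtext("title"), "link": item.findtext("link")}
    for item in feed.iter("item")
]

# Prefixed lookups need the namespace mapping to resolve "atom:".
self_link = feed.find("channel/atom:link", NAMESPACES).get("href")
```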
-
-
class
chemdataextractor.scrape.scraper.
SearchScraper
[source]¶ Bases:
chemdataextractor.scrape.scraper.GetRequester
,chemdataextractor.scrape.scraper.HtmlFormat
,chemdataextractor.scrape.base.BaseScraper
Scraper that takes a search query as input.
-
class
chemdataextractor.scrape.scraper.
SearchResult
[source]¶ Bases:
object
Class to handle results from a search query to websites, regardless of method of scraping used.
-
selector
¶ Process the result of the search, giving a selector.
- Returns:
The result of the search
- Return type:
selector
-
.scrape.selector¶
Tool for selecting content from HTML or XML using CSS or XPath expressions.
-
class
chemdataextractor.scrape.selector.
Selector
(root, fmt='html', translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, namespaces=None)[source]¶ Bases:
object
Tool for selecting content from HTML or XML using XPath selectors.
-
__init__
(root, fmt='html', translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, namespaces=None)[source]¶
-
classmethod
from_text
(text, base_url=None, parser=<class 'lxml.html.HTMLParser'>, translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, fmt='html', namespaces=None, encoding=None)[source]¶
-
classmethod
from_response
(response, parser=<class 'lxml.html.HTMLParser'>, translator=<class 'chemdataextractor.scrape.csstranslator.CssHTMLTranslator'>, fmt='html', namespaces=None)[source]¶
-
path
¶ Absolute path to the root of this selector.
-
tag
¶ Tag name of the root of this selector.
-
-
class
chemdataextractor.scrape.selector.
SelectorList
(*selectors)[source]¶ Bases:
collections.abc.Sequence
Wrapper around a list of Selectors to allow selecting from all at once.