Scraping Structured Data
ChemDataExtractor contains a scrape package for extracting structured information from HTML and XML files. This is most useful for obtaining bibliographic data, but it can be used for any kind of data that has been marked up with HTML or XML tags in source documents.
Included Scrapers
ChemDataExtractor comes with ready-made scraping tools for web pages on the RSC and ACS web sites, as well as for XML files in the NLM JATS format used by PubMed Central and others:
>>> from chemdataextractor.scrape import Selector
>>> from chemdataextractor.scrape.pub.rsc import RscHtmlDocument
>>>
>>> htmlstring = open('rsc_example.html').read()
>>> sel = Selector.from_text(htmlstring)
>>> scrape = RscHtmlDocument(sel)
>>> print(scrape.publisher)
Royal Society of Chemistry
>>> scrape.serialize()
{'publisher': 'Royal Society of Chemistry', 'language': 'en', 'title': 'The Title'}
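A similar scraper exists for NLM JATS XML. Here is a minimal sketch along the same lines; the NlmXmlDocument name reflects the chemdataextractor.scrape.pub.nlm module, the filename is hypothetical, and the serialized fields depend entirely on the input file:
>>> from chemdataextractor.scrape.pub.nlm import NlmXmlDocument
>>> xmlstring = open('jats_example.xml').read()
>>> sel = Selector.from_text(xmlstring)
>>> scrape = NlmXmlDocument(sel)
>>> scrape.serialize()
{...}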
Custom Scrapers
As an example, here is a very simple HTML file that we want to scrape some data from:
<html>
  <head>
    <title>Example document</title>
    <meta name="citation_publication_date" content="2016-10-03">
  </head>
  <body>
    <p class="abstract">Abstract goes here...</p>
    <p class="para">Another paragraph here...</p>
  </body>
</html>
Defining an Entity
To use the scrape package, we define an Entity that contains Fields that describe how to extract the desired content in a declarative fashion:
from chemdataextractor.scrape import Entity, StringField, DateTimeField

class ExampleDocument(Entity):
    title = StringField('title')
    abstract = StringField('.abstract')
    date_published = DateTimeField('meta[name="citation_publication_date"]::attr("content")')
Each field uses a CSS selector to describe where to find the data in the document.
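Following the pattern used with RscHtmlDocument above, the entity is then applied to a selector over the example HTML. A brief sketch; the filename is hypothetical and the datetime shown is an assumption about how DateTimeField parses the value:
>>> htmlstring = open('example.html').read()
>>> sel = Selector.from_text(htmlstring)
>>> doc = ExampleDocument(sel)
>>> doc.title
'Example document'
>>> doc.date_published
datetime.datetime(2016, 10, 3, 0, 0)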
XPath Expressions
It is possible to use XPath expressions instead of CSS selectors, if desired. Just add the parameter xpath=True to the field arguments:
date_published = DateTimeField('//meta[@name="citation_publication_date"]/@content', xpath=True)
Processors
Processors perform transformations on the extracted text.
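The documentation is brief here, so the following is only a sketch. One pattern seen in ChemDataExtractor's bundled scrapers is a process_<fieldname> method on the entity that receives the raw extracted value and returns a transformed one; treat the hook name as an assumption about the Entity API rather than a documented guarantee:
from chemdataextractor.scrape import Entity, StringField

class ExampleDocument(Entity):
    title = StringField('title')

    def process_title(self, value):
        # Assumed hook: receives the raw extracted string; here we
        # collapse runs of whitespace before the value is stored.
        return ' '.join(value.split())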
The Selector
The Selector is inspired by the Selector in Scrapy, the web scraping framework. It provides a convenient unified interface for ‘selecting’ parts of XML and HTML documents for extraction. Entity classes make use of it behind the scenes, but for simple cases, it can be quicker and easier to use it directly to extract information.
Create a selector from a file:
>>> htmlstring = open('rsc_example.html').read()
>>> sel = Selector.from_text(htmlstring)
Now, instead of passing the selector to an Entity, you can query it with CSS:
>>> sel.css('head')
This returns a SelectorList, meaning you can chain queries. Call extract() or extract_first() on the returned SelectorList to get the extracted content:
>>> sel.css('head').css('title').extract_first()
'Example document'
>>> sel.css('p').extract()
['Abstract goes here...', 'Another paragraph here...']
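Selectors can also be queried with XPath directly. A brief sketch, assuming the xpath() method mirrors css() in the same way the xpath=True field option mirrors CSS selection:
>>> sel.xpath('//title').extract_first()
'Example document'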
Cleaners
Readers such as RscHtmlReader specify a number of cleaners, as shown below. Cleaners attempt to fix systematic formatting errors in the HTML/XML. A classic problem is spacing around references. For example, some HTML may look like:
<div>
  <p>This is a result that was retrieved from
    <a><sup><span class="sup_ref">[1]</span></sup></a>.
  </p>
</div>
When parsing, ChemDataExtractor will output:
Paragraph(text='This is a result that was retrieved from[1].', ...)
So we need a cleaner whose job is to put a space between text and references.
In the RscHtmlReader class, we specify a list of cleaners to act on the text:
cleaners = [clean, replace_rsc_img_chars, space_references]
and the corresponding space_references cleaner looks like:
def space_references(document):
    """Ensure a space around reference links, so there's a gap when they are removed."""
    for ref in document.xpath('.//a/sup/span[@class="sup_ref"]'):
        # Walk up from the span to the enclosing <a> element
        a = ref.getparent().getparent()
        if a is not None:
            atail = a.tail or ''
            # Only add a space if the text following the link doesn't
            # already start with punctuation or whitespace
            if not atail.startswith(')') and not atail.startswith(',') and not atail.startswith(' '):
                a.tail = ' ' + atail
    return document
Note that we don’t need to call the cleaner explicitly, as this is handled by the BaseReader class.
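A custom cleaner can be wired into your own reader in the same way. A minimal sketch, assuming HtmlReader is the generic base class and that the clean import path below matches the package layout:
from chemdataextractor.reader import HtmlReader
from chemdataextractor.scrape.clean import clean  # assumed import path

class MyHtmlReader(HtmlReader):
    # Cleaners are applied in order to the parsed tree before extraction.
    cleaners = [clean, space_references]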