.biblio

Miscellaneous tools for parsing bibliographic information, such as BibTeX files and author names.

.biblio.bibtex

BibTeX parser.

class chemdataextractor.biblio.bibtex.BibtexParser(data, **kwargs)[source]

Bases: object

A class for parsing a BibTeX string into JSON or a Python data structure.

Example usage:

from chemdataextractor.biblio.bibtex import BibtexParser

with open('example.bib', 'r') as f:
    bib = BibtexParser(f.read())
    bib.parse()
    print(bib.records_list)
    print(bib.json)

__init__(data, **kwargs)[source]

Initialize BibtexParser with data.

Optional metadata passed as keyword arguments is included in the JSON output, e.g. collection, label, description, id, owner, created, modified, source.

Example usage:

from datetime import datetime

bib = BibtexParser(data, created=str(datetime.utcnow()), owner='mcs07')

parse()[source]

Parse self.data and store the parsed BibTeX to self.records.

classmethod parse_names(names)[source]

Parse a string of names separated by “and” like in a BibTeX authors field.
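As a rough illustration of what splitting an authors field involves, the sketch below separates a string on the word "and" while leaving any "and" inside curly brackets untouched. The helper name split_bibtex_names is hypothetical and this is not the library's implementation; it only shows the general idea under the assumption that names are whitespace-delimited and braces are balanced.

```python
def split_bibtex_names(names):
    """Split an authors string on 'and', ignoring 'and' inside curly braces.

    Hypothetical helper for illustration; not part of chemdataextractor.
    """
    parts = []
    current = []
    depth = 0  # current curly-brace nesting level
    for token in names.split():
        if depth == 0 and token == 'and':
            # A top-level 'and' ends the current name.
            if current:
                parts.append(' '.join(current))
            current = []
            continue
        # Update nesting so that braced groups stay together.
        depth += token.count('{') - token.count('}')
        current.append(token)
    if current:
        parts.append(' '.join(current))
    return parts


authors = split_bibtex_names(
    'Knuth, Donald E. and {Barnes and Noble, Inc.} and Leslie Lamport'
)
```

Each resulting string could then be handed to a name parser such as PersonName with from_bibtex=True.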

size

Return the number of records parsed.

records_list

Return the records as a list of dictionaries.

metadata

Return metadata for the parsed collection of records.

json

Return a list of records as a JSON string. Follows the BibJSON convention.
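For orientation, a BibJSON-style collection is commonly laid out as a single object holding collection metadata alongside a list of record objects. The sketch below builds such a string with the standard json module. The function name to_bibjson and the sample record are hypothetical, and the exact keys that BibtexParser emits are not specified here, so treat the layout as an assumption.

```python
import json


def to_bibjson(records, **metadata):
    """Serialize records and metadata into a BibJSON-style collection.

    Hypothetical helper; the 'metadata'/'records' layout is an assumption
    about the convention, not a statement of the library's exact output.
    """
    collection = {'metadata': metadata, 'records': records}
    return json.dumps(collection, indent=2, sort_keys=True)


# Illustrative record; field names are assumptions.
records = [
    {
        'type': 'article',
        'title': 'An example title',
        'author': [{'name': 'Ada Lovelace'}],
        'year': '1843',
    }
]
output = to_bibjson(records, collection='examples', owner='someone')
```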

chemdataextractor.biblio.bibtex.parse_bibtex(data)[source]

.biblio.person

Tools for parsing people’s names from strings into various name components.

class chemdataextractor.biblio.person.PersonName(fullname=None, from_bibtex=False)[source]

Bases: dict

Class for parsing a person’s name into its constituent parts.

Parses a name string into title, firstname, middlename, nickname, prefix, lastname, suffix.

Example usage:

p = PersonName('von Beethoven, Ludwig')

PersonName acts like a dict:

import json

print(p)
print(p['firstname'])
print(json.dumps(p))

Name components can also be accessed as attributes:

print(p.lastname)

Instances can be reused by setting the name property:

p.name = 'Henry Ford Jr. III'
print(p)

Two PersonName objects are equal if every name component matches exactly. For fuzzy matching, use the could_be method. This returns True for names that are not explicitly inconsistent.

This class was written with the intention of parsing BibTeX author names, so name components enclosed within curly brackets will not be split.
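The combination of dict behaviour and attribute access described above can be pictured with a small dict subclass. The class AttrDict below is a hypothetical stand-in, not PersonName itself, and it performs no name parsing. It only shows how one object can support both p['lastname'] and p.lastname while remaining JSON-serializable.

```python
import json


class AttrDict(dict):
    """A dict whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


p = AttrDict(firstname='Ludwig', prefix='von', lastname='Beethoven')
as_json = json.dumps(p, sort_keys=True)
```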

__init__(fullname=None, from_bibtex=False)[source]

Initialize with a name string.

Parameters:
  • fullname (str) – The person’s name.

  • from_bibtex (bool) – (Optional) Whether the fullname parameter is in BibTeX format. Default False.

could_be(other)[source]

Return True if the other PersonName is not explicitly inconsistent.
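One way to read "not explicitly inconsistent" is that two names conflict only when a component present in both clearly differs. The sketch below applies that idea to plain dicts of name components. The function could_be_same and its rules (case-insensitive comparison, an initial matching a full name) are assumptions made for illustration and may differ from what could_be actually does.

```python
COMPONENTS = ('title', 'firstname', 'middlename', 'nickname',
              'prefix', 'lastname', 'suffix')


def could_be_same(a, b):
    """Return True unless a component present in both names conflicts.

    Hypothetical rules, not the library's implementation.
    """
    for key in COMPONENTS:
        x, y = a.get(key), b.get(key)
        if not x or not y:
            continue  # missing on one side is not a contradiction
        x = x.lower().rstrip('.')
        y = y.lower().rstrip('.')
        if x == y:
            continue
        # Treat a single-letter initial as consistent with a full name.
        if (len(x) == 1 and y.startswith(x)) or (len(y) == 1 and x.startswith(y)):
            continue
        return False
    return True


full = {'firstname': 'Ludwig', 'prefix': 'von', 'lastname': 'Beethoven'}
abbreviated = {'firstname': 'L.', 'lastname': 'Beethoven'}
other = {'firstname': 'Karl', 'lastname': 'Beethoven'}
```

Under these rules the abbreviated form is compatible with the full name, while a different first name is treated as a conflict.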

fullname

.biblio.xmp

Parse metadata stored as XMP (Extensible Metadata Platform).

This is commonly embedded within PDF documents, and can be extracted using the PDFMiner framework.

More information is available on the Adobe website:

class chemdataextractor.biblio.xmp.XmpParser(ns_map={'http://crossref.org/crossmark/1.0/': 'crossmark', 'http://ns.adobe.com/pdf/1.3/': 'pdf', 'http://ns.adobe.com/pdfx/1.3/': 'pdfx', 'http://ns.adobe.com/xap/1.0/': 'xap', 'http://ns.adobe.com/xap/1.0/mm/': 'xapmm', 'http://ns.adobe.com/xap/1.0/rights/': 'rights', 'http://prismstandard.org/namespaces/basic/2.0/': 'prism', 'http://purl.org/dc/elements/1.1/': 'dc', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#': 'rdf', 'http://www.w3.org/XML/1998/namespace': 'xml'})[source]

Bases: object

A parser that converts an XMP metadata string into a Python dictionary.

Usage:

parser = XmpParser()
metadata = parser.parse(xmpstring)

Common namespaces are abbreviated in the output using the definitions in xmp.NS_MAP. If an abbreviation for a namespace is not defined in NS_MAP, the full URL is used as the key in the output dictionary. It is possible to override NS_MAP when initializing the parser:

parser = XmpParser(ns_map={'http://www.w3.org/XML/1998/namespace': 'xml'})
metadata = parser.parse(xmpstring)

__init__(ns_map={'http://crossref.org/crossmark/1.0/': 'crossmark', 'http://ns.adobe.com/pdf/1.3/': 'pdf', 'http://ns.adobe.com/pdfx/1.3/': 'pdfx', 'http://ns.adobe.com/xap/1.0/': 'xap', 'http://ns.adobe.com/xap/1.0/mm/': 'xapmm', 'http://ns.adobe.com/xap/1.0/rights/': 'rights', 'http://prismstandard.org/namespaces/basic/2.0/': 'prism', 'http://purl.org/dc/elements/1.1/': 'dc', 'http://www.w3.org/1999/02/22-rdf-syntax-ns#': 'rdf', 'http://www.w3.org/XML/1998/namespace': 'xml'})[source]

Initialize the parser. The optional ns_map argument overrides the default namespace abbreviations.

parse(xmp)[source]

Run parser and return a dictionary of all the parsed metadata.
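The namespace abbreviation described for this class can be sketched as a lookup with a fallback to the full URL. The example assumes tags arrive in the "{namespace}name" form used by xml.etree.ElementTree. The function split_tag and the shortened map are hypothetical and are not taken from the parser's source.

```python
# Shortened, illustrative map; the real default is shown in the signature above.
NS_MAP = {
    'http://purl.org/dc/elements/1.1/': 'dc',
    'http://ns.adobe.com/pdf/1.3/': 'pdf',
}


def split_tag(tag, ns_map=NS_MAP):
    """Split a '{namespace}name' tag into (namespace key, local name).

    The namespace key is the abbreviation if one is defined in ns_map,
    otherwise the full namespace URL. Hypothetical helper.
    """
    if tag.startswith('{'):
        uri, name = tag[1:].split('}', 1)
        return ns_map.get(uri, uri), name
    return None, tag  # tag carries no namespace


known = split_tag('{http://purl.org/dc/elements/1.1/}title')
unknown = split_tag('{http://example.org/ns/}custom')
```

Passing a different ns_map changes only the abbreviations; unmapped namespaces still fall back to their full URL.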

chemdataextractor.biblio.xmp.parse_xmp(xmp)[source]

Shorthand function for parsing an XMP string into a Python dictionary.