Afternoon Project: HTML Scraping

My project this afternoon was to turn bibliographies, published as web pages in one particular format, into structured data that is more amenable to further computational processing. In this case, we want to write further scripts that find each item from our bibliography in the HathiTrust Digital Library dataset that MSU hosts locally.

Here’s what the original bibliography looked like:

As usual, the primary stumbling block was malformed (or just weirdly-formed) HTML. I used the Python module BeautifulSoup to read and parse the HTML, but unfortunately the “author name” section of the bibliography was not housed in its own HTML element. That meant using a bit of additional string processing to get at the relevant text. 
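
To give a sense of what that string processing can look like, here is a minimal sketch. It is not the code from this project: the markup in SAMPLE_HTML is a hypothetical layout (each entry in a p element, the title in an i element, the author as bare text before it and the year as bare text after it), and the name parse_entries is made up for illustration.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup: the author is bare text, not wrapped in its own element.
SAMPLE_HTML = """
<div id="bibliography">
  <p>Alcott, Louisa May, 1832-1888. <i>Moods</i>. 1865.</p>
  <p>Alcott, Louisa May, 1832-1888. <i>Hospital sketches</i>. 1863.</p>
</div>
"""


def parse_entries(html):
    """Return a list of {'author', 'title', 'year'} dicts, one per entry."""
    soup = BeautifulSoup(html, "html.parser")
    entries = []
    for paragraph in soup.find_all("p"):
        title_tag = paragraph.find("i")
        if title_tag is None:
            continue  # not a bibliography entry in the assumed layout

        # The text node just before the title holds the author string.
        before = title_tag.previous_sibling
        author = before.strip().rstrip(".").strip() if isinstance(before, str) else ""

        # The text node just after the title holds the year.
        after = title_tag.next_sibling
        match = re.search(r"\b\d{4}\b", after) if isinstance(after, str) else None

        entries.append({
            "author": author,
            "title": title_tag.get_text(strip=True),
            "year": match.group(0) if match else "",
        })
    return entries


entries = parse_entries(SAMPLE_HTML)
for entry in entries:
    print(entry)
```

Stripping the trailing period from the author text is a shortcut that suits this sample; a bibliography with authors ending in an initial would need a gentler rule.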

The resulting object is a list of items taking the following form (as a list of Python dictionaries), which could easily be stored as JSON, or iterated through to perform other tasks:

[{'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'Moods',
  'year': u'1865'},
 {'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'Hospital sketches',
  'year': u'1863'},
 {'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'Work: A Story of Experience',
  'year': u'1873'},
 {'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'On Picket Duty, and Other Tales',
  'year': u'1864'},
 {'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'Hospital Sketches and Camp and Fireside Stories',
  'year': u'1869'},
 {'author': u'Aldrich, Thomas Bailey, 1836-1907',
  'title': u"Daisy's Necklace: and What Came of It. (A Literary Episode.)",
  'year': u'1857'}]

Well, none too pretty, but quite a bit more useful.
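
As a rough illustration of the "store as JSON, or iterate" idea, here is a small sketch using only the standard library. The three items are copied from the list above; the variable names and the grouping by author are just an example of a follow-on task, not something taken from the project code.

```python
import json
from collections import defaultdict

# A few items in the same shape as the list above.
items = [
    {"author": "Alcott, Louisa May, 1832-1888", "title": "Moods", "year": "1865"},
    {"author": "Alcott, Louisa May, 1832-1888", "title": "Hospital sketches", "year": "1863"},
    {"author": "Aldrich, Thomas Bailey, 1836-1907",
     "title": "Daisy's Necklace: and What Came of It. (A Literary Episode.)",
     "year": "1857"},
]

# Serialize to a JSON string (json.dump would write to an open file instead).
json_text = json.dumps(items, indent=2)

# Reading it back gives an equal list of dictionaries.
restored = json.loads(json_text)

# Iterate to perform another task: here, collecting titles under each author.
titles_by_author = defaultdict(list)
for item in restored:
    titles_by_author[item["author"]].append(item["title"])

for author, titles in titles_by_author.items():
    print(author, "->", "; ".join(titles))
```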

Thomas Padilla and I are planning to use these bibliographies to build datasets (of full text and metadata) for students and researchers interested in doing text analysis. This code is very much written to work with bibliographies in this particular form, but similar work could be undertaken to handle a diverse range of biblio-formats.

Code available at GitHub.