Afternoon Project: HTML Scraping

My project this afternoon was to turn bibliographies published as web pages in a particular format into structured data that would be more amenable to further computational processing. In this case, we’re interested in doing further scripting to find each item from our bibliography in the HathiTrust Digital Library dataset MSU hosts locally.

Here’s what the original bibliography looked like:

As usual, the primary stumbling block was malformed (or just weirdly-formed) HTML. I used the Python module BeautifulSoup to read and parse the HTML, but unfortunately the “author name” section of the bibliography was not housed in its own HTML element. That meant using a bit of additional string processing to get at the relevant text. 
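
As a rough illustration of that extra string processing, here is a minimal sketch. It assumes, hypothetically, that the text of one entry has already been pulled out of the parsed HTML (for example with BeautifulSoup's get_text()) and that it reads "Author, life dates. Title. Year."; the real pages may well be laid out differently. The names ENTRY_PATTERN and parse_entry are made up for this example and are not taken from the actual script.

```python
import re

# Hypothetical entry layout (an assumption, not the real page format):
#   "Alcott, Louisa May, 1832-1888. Moods. 1865."
ENTRY_PATTERN = re.compile(
    r"^(?P<author>.+?\d{4}-\d{4})\.\s+"  # author, ending in life dates
    r"(?P<title>.+?)\.?\s+"              # title, with an optional closing period
    r"(?P<year>\d{4})\.?$"               # four-digit publication year
)


def parse_entry(text):
    """Split one bibliography entry into author, title, and year.

    Returns a dictionary, or None if the text does not fit the assumed layout.
    """
    match = ENTRY_PATTERN.match(text.strip())
    if match is None:
        return None
    return match.groupdict()


entry = parse_entry("Alcott, Louisa May, 1832-1888. Moods. 1865.")
```

Anchoring the author on the life dates, rather than on the first period, is one way to cope with names that contain initials; entries without life dates would need a different rule.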

The result is a list of Python dictionaries of the following form, which could easily be stored as JSON or iterated over to perform other tasks:

[{'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'Moods',
  'year': u'1865'},
 {'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'Hospital sketches',
  'year': u'1863'},
 {'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'Work: A Story of Experience',
  'year': u'1873'},
 {'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'On Picket Duty, and Other Tales',
  'year': u'1864'},
 {'author': u'Alcott, Louisa May, 1832-1888',
  'title': u'Hospital Sketches and Camp and Fireside Stories',
  'year': u'1869'},
 {'author': u'Aldrich, Thomas Bailey, 1836-1907',
  'title': u"Daisy's Necklace: and What Came of It. (A Literary Episode.)",
  'year': u'1857'}]

Well, none too pretty, but quite a bit more useful.
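
To make the "stored as JSON, or iterated through" point concrete, here is a small sketch using Python's standard json module. The three records are copied from the output above; the grouping by author is just one example of a follow-on task, and the variable names are chosen for illustration.

```python
import json

# A few records in the same shape as the scraped output.
bibliography = [
    {"author": "Alcott, Louisa May, 1832-1888", "title": "Moods", "year": "1865"},
    {"author": "Alcott, Louisa May, 1832-1888", "title": "Hospital sketches", "year": "1863"},
    {"author": "Aldrich, Thomas Bailey, 1836-1907",
     "title": "Daisy's Necklace: and What Came of It. (A Literary Episode.)",
     "year": "1857"},
]

# Serialize to a JSON string, then read it back to confirm nothing is lost.
as_json = json.dumps(bibliography, indent=2)
restored = json.loads(as_json)

# Iterate through the items, here collecting titles under each author.
titles_by_author = {}
for item in restored:
    titles_by_author.setdefault(item["author"], []).append(item["title"])
```

Writing as_json to a file would give a dataset that later scripts can load without re-scraping the page.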

Thomas Padilla and I are planning to use these bibliographies to build datasets (of full text and metadata) for students and researchers interested in doing text analysis. This code is very much written to work with bibliographies in this particular form, but similar work could be undertaken to handle a diverse range of biblio-formats.

Code available at GitHub.


Formative DH Formalism

Following today’s stimulating DH panel “Getting Started in the Digital Humanities: A Multidisciplinary Perspective” at the Michigan State Libraries, I was left contemplating my own initial interest in DH.  Where did it come from? As an English undergrad and grad student in the early to mid aughts, I gave nary a thought to digital methods. It was only as a first-year student in a library and information science program that I began to see “technical skills” as, at least, a valuable addition to my resume. In an entry-level Python programming course, the instructor (the great Vetle Torvik) assigned us an open-ended project: Take what you’ve learned of Python and make something with it.

I realized at that moment that my natural inclinations took me back to literary studies. I also realized that my literary interests were especially compatible with digital work. I ended up writing a simple program that would re-create, in digital form, the analogue algorithm Jackson Mac Low used to generate his poem “Call Me Ishmael.” The algorithm works by generating an acrostic for each word in the first line of Moby-Dick (“Call me Ishmael”), such that the first line of the poem spells “call,” the second spells “me,” and the third spells “ishmael.” The process repeats, creating a series of 3-line stanzas, each of which spells “Call Me Ishmael.” What’s more, the words used to fill out this poetic form are also taken from Moby-Dick, in the order they appear in the text. It is a beautifully simple technique for capturing something of the ambience of Moby-Dick, but in a completely original form, leaving nothing (or everything) to chance.
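
The procedure described above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the original script: it assumes that each letter of the seed line is filled by the next word in the source text that begins with that letter, and that an unfinished stanza is dropped when the source runs out. The function name acrostic_stanzas and its parameters are invented for the example.

```python
import re


def acrostic_stanzas(seed_line, source_text, max_stanzas=1):
    """Build acrostic stanzas from source_text, guided by seed_line.

    Each seed word yields one poem line; each letter of that word is
    filled by the next source word (in reading order) starting with it.
    """
    seed_words = re.findall(r"[a-z]+", seed_line.lower())
    source_words = re.findall(r"[A-Za-z']+", source_text)
    position = 0  # index of the next unused source word
    stanzas = []
    for _ in range(max_stanzas):
        stanza = []
        for seed_word in seed_words:
            line = []
            for letter in seed_word:
                # Skip ahead to the next source word that starts with this letter.
                while (position < len(source_words)
                       and not source_words[position].lower().startswith(letter)):
                    position += 1
                if position == len(source_words):
                    return stanzas  # source exhausted: drop the unfinished stanza
                line.append(source_words[position])
                position += 1
            stanza.append(" ".join(line))
        stanzas.append(stanza)
    return stanzas


toy_source = ("cats and lions love milk even in summer heat; "
              "many animals eat leaves")
poem = acrostic_stanzas("Call me Ishmael", toy_source)
```

With the toy source text, poem holds a single stanza whose three lines spell out "call," "me," and "ishmael" by their initial letters; passing the full text of a novel and a larger max_stanzas would produce a longer poem.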

The program I wrote allowed the user to submit any text, which would then be transformed into a poem following the same pattern. Of course, it’s a rather simple program (even if a bit hard to explain), but enlightening to me nonetheless. I quickly learned that the algorithm worked better with short, pithy opening lines than with sentences of even moderate length. Jane Eyre’s “There was no possibility of taking a walk that day” already expands into rather an unwieldy poem. But I also realized at that point how compatible an interest in something like generative literature is with what we know as the digital humanities. Authors affiliated with the Oulipo school, for instance, would write works according to rules that guided the structure, diction, plot, and character of their works. In general, rules are highly amenable to digital instantiation and manipulation. Take this online manifestation of Raymond Queneau’s canonical One Hundred Thousand Billion Sonnets as an example. Digital techniques can push us not only to reflect on the computationally tractable modes of authorship of the past, but also to imagine new forms and outlets for criticism and creativity. Christian Bök’s more recent investigations stretch the idea of literature, and of writing, in even more extreme directions that suggest utopian/dystopian post-human futures.

Ultimately, DH, for me, is about estrangement, or, as the Russian formalists had it, ostranenie: to make language or the world strange again through the application of new frames, arrangements, constructions, combinations. I suggest we keep this principle of estrangement in mind as we define, and do work in, the digital humanities.

(In case anyone is interested, I’ll post the code for my Python script later in the day or tomorrow.)

Digital Library Programmer at MSU
