As the reader may or may not know, my main ongoing project is Orange Textable, a software tool that lets the user do “visual programming” for text analysis (the project homepage is here and the documentation here; it’s an add-on for Orange Canvas).
Now the reader most surely knows that today is Day of DH, and since I could spare some time to be part of this, I thought I’d give an example of what Textable lets you do on, you got it, Day of DH data. Not all the details, but still, enough to give an idea of how this Textable thing works. In particular, we’ll mine the Day of DH RSS stream and visualize the distribution of user actions (comment, join, etc.) reported therein.
It all starts with a data import widget instance (i.e. a small blue circle representing a computational unit) placed on a blank design area called the canvas:
Double-clicking a widget instance opens its interface, as shown on the right-hand side of the above figure. In the case of a URLs instance like this one, we essentially need to specify where to fetch the data on the internet and what encoding it uses.
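Outside of Textable, this step boils down to downloading a resource and decoding its bytes with a user-specified encoding. Here is a minimal sketch of that, assuming a plain HTTP fetch; the function names `fetch_text` and `decode_payload` are mine, not Textable’s:

```python
from urllib.request import urlopen

def decode_payload(raw: bytes, encoding: str) -> str:
    """Turn raw bytes into text using the encoding declared by the user."""
    return raw.decode(encoding)

def fetch_text(url: str, encoding: str = "utf-8") -> str:
    """Download a resource and decode it; a wrong encoding raises UnicodeDecodeError."""
    with urlopen(url) as response:
        return decode_payload(response.read(), encoding)
```

Picking the wrong encoding is the classic failure mode here, which is why the widget asks for it explicitly rather than guessing.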
RSS streams are encoded in XML, and this one is made up of text blocks such as this one:
<item>
  <guid isPermaLink="false">82cb38ef79e1059723835b141f896114</guid>
  <title>Ranti Junus joined the group DH in the Curriculum</title>
  <link>http://dayofdh2014.matrix.msu.edu/activity/p/902/</link>
  <pubDate>Tue, 08 Apr 2014 13:12:18 +0000</pubDate>
  <slash:comments>0</slash:comments>
</item>
To extract those items from the stream, we’ll connect our first instance to an instance of another widget, namely Segment:
Here we do need a bit of code — let’s call this the cost of flexibility. In particular, we need to know the syntax of regular expressions (aka regexes), so that we can tell our Segment instance that the segments we’re interested in are the text blocks enclosed between <item> and </item> in the RSS stream. There are 50 such segments, as the widget reports.
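In plain Python, the same extraction can be sketched with the `re` module; the regex below mirrors what the Segment instance does, and the toy `rss` string stands in for the real 50-item stream:

```python
import re

# A toy stand-in for the RSS stream fetched earlier (the real one has 50 items).
rss = """<channel>
<item><title>First action</title></item>
<item><title>Second action</title></item>
</channel>"""

# DOTALL lets '.' cross line breaks, so multi-line items are captured whole;
# the lazy quantifier '.*?' stops at the *nearest* closing </item> tag.
items = re.findall(r"<item>(.*?)</item>", rss, re.DOTALL)
```

The lazy quantifier matters: a greedy `.*` would swallow everything from the first `<item>` to the last `</item>` in one gulp.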
The next step is arguably the most difficult one. We want to extract from each “item” the action that is being reported (in the above example, it would be joined). To do that, we’ll use another instance of Segment. This time, however, we’ll need to switch on the “advanced settings” in order to access some more powerful features…
As usual with instances of Segment, we must describe the segments to be extracted with a regex: here we state that these segments should begin with a <title> XML tag and, shortly after it, contain one of a number of verb forms (wrote, became, posted, changed, and so on).
But there’s more to it: in each such segment, we’ll extract the verb form and use it as an annotation value associated with a specific key, namely the key action. (Note that the parameter &1 in the Annotation value field above means “the text block that corresponds to the first pair of parentheses in the regex”.)
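In regex terms, Textable’s &1 corresponds to the first capture group — `group(1)` in Python’s `re` module. A simplified sketch (the verb alternation is abridged, and the pattern is my guess at the actual one, not a copy of it):

```python
import re

# Simplified version of the title regex; the alternation lists only a few verbs.
pattern = re.compile(r"<title>[^<]*?\b(wrote|joined|posted|became|changed)\b")

title = "<title>Ranti Junus joined the group DH in the Curriculum</title>"
match = pattern.search(title)
# group(1) plays the role of Textable's &1: the first parenthesized group.
action = match.group(1) if match else None
```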
Now we can specifically access this piece of information by saying, e.g., “count the frequency of each possible action value in the data”. Let’s do just that, by means of an instance of the Count widget:
The interface is rather self-explanatory in this case: basically we’re counting values associated with the action annotation key in the items segmentation that we’ve previously built using Segment.
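Stripped of the widget interface, the counting step amounts to tallying annotation values, which `collections.Counter` does in one line (the sample list below is made up for illustration, not taken from the actual stream):

```python
from collections import Counter

# Made-up action annotations standing in for the 50 extracted items.
actions = ["wrote", "posted", "wrote", "joined", "commented", "wrote"]

counts = Counter(actions)
# most_common() orders actions by decreasing frequency.
ranking = counts.most_common()
```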
Then after going through an instance of Convert (whose only purpose is to reorganize the data in a way that’s not very interesting to describe here), we can finally display the result using an instance of Distributions:
Did it change between the moment where I started this post and the moment where I’m finishing it? Let’s find out:
Well, not much change in this period — maybe a few more posts and a few fewer comments — but writing remains the most frequent action, followed by posting.
Rather than an earth-shattering discovery about Day of DH, I hope this example shows that Orange Textable lets the user perform quite specific text analyses (almost) without writing a single line of code (regexes aside). And I’d very much like to hear DHers’ opinions about this.