All posts by Jürgen Knauth

Doing some refactoring on our dictionary server

In our project SeNeReKo we use two corpora: an Old Egyptian corpus and a Pali corpus. We aim to enhance both corpora in such a way that each word is associated with a part-of-speech tag. Unfortunately this is not that easy: due to the historic nature of these corpora – they are two to three thousand years old – we have to deal with a large variety of morphological forms, especially in Pali.

To address this complexity in Pali we need to maintain a dictionary with various information about words. To be able to actively work with the data we need a server that is capable of managing the data associated with each word in such a way that these associations can be modified very easily. As classic database systems do not allow the flexibility required here, we chose to create our own server application. It is based on MongoDB and NodeJS. The NoSQL database MongoDB maintains loosely structured, document-oriented data; it is used to actually store the data. A NodeJS server is used as a front end to that database in order …

  1. … to control access to this data,
  2. to provide some custom features that implement the REST API,
  3. to give even more flexibility than could be achieved with MongoDB alone,
  4. and – most importantly – to enhance performance dramatically.

(I will give you more information about this in some other post.) This server enables us to perform easy machine-based processing of our dictionary data. (This way we hope to considerably reduce the need for manual editing.)
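To make the idea of a thin layer in front of loosely structured data a bit more concrete, here is a minimal sketch. It is not our actual server code: an in-memory `Map` stands in for the MongoDB collection, and the names `DictionaryStore`, `Role`, `get` and `update`, as well as the sample values, are assumptions made up for this illustration.

```typescript
// Hypothetical sketch: an in-memory Map stands in for the MongoDB collection.
type Role = "reader" | "editor";

// A dictionary entry is loosely structured: only the lemma is required,
// any other property may be attached or changed later.
interface DictionaryEntry {
  lemma: string;
  [property: string]: unknown;
}

class DictionaryStore {
  private entries = new Map<string, DictionaryEntry>();

  // Read access is open to every role; a copy is returned so callers
  // cannot modify the stored entry behind the store's back.
  get(lemma: string): DictionaryEntry | undefined {
    const entry = this.entries.get(lemma);
    return entry ? { ...entry } : undefined;
  }

  // Write access is restricted to editors. Properties are merged into
  // the existing entry, so new associations need no schema change.
  update(role: Role, lemma: string, properties: Record<string, unknown>): DictionaryEntry {
    if (role !== "editor") {
      throw new Error("write access denied");
    }
    const current = this.entries.get(lemma) ?? { lemma };
    const updated: DictionaryEntry = { ...current, ...properties, lemma };
    this.entries.set(lemma, updated);
    return { ...updated };
  }
}

// Illustrative values only.
const store = new DictionaryStore();
store.update("editor", "dhamma", { pos: "noun" });
store.update("editor", "dhamma", { gender: "masculine" });
```

The point of the sketch is only the shape of the design: access checks and merging logic live in the layer in front of the store, while the entries themselves stay free-form.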

As the performance results are excellent and everything works well, it is now time to do some refactoring on the server. That’s what I am currently working on. But let me give you a bit more background on what refactoring is and why it is so important.

If you develop software – especially in larger projects – it is customary to add more and more features. Projects typically are feature driven: features are the “driving force” that brings progress to a project. Progress is typically measured by the number of code lines or features implemented. This is basically no different in scientific projects.

Unfortunately this is not enough. Projects cannot be measured only by looking at features (or the number of code lines). Features are only superficial characteristics: they are the directly visible properties, but there are more properties that need to be taken into consideration in order to judge the quality of a software product on its way to a usable state.

If you only look at features you miss the entire internal structure of the software. Though this structure is typically not visible, it is of fundamental importance. It is the quality of this structure that decides whether you will succeed in completing your work with a good software product or whether you will end up with a buggy system. Therefore not only the features (or the number of code lines) but also the structure of the software requires your attention. Software developers with many years of experience know about this.

Screenshot of one of our forms showing a dictionary entry

At some point in a project it can be wise to stop implementing features and to modify the structure of the software instead. In such a stage, changes must be applied to the software that do not result in new features: these changes are therefore not directly visible from the outside. One may be tempted to think that if these changes cannot be seen from the outside, they may not be important. But this perspective is a fatal one: for software, a really good internal architectural structure is mandatory. Then and only then can new features be added in the future without breaking the existing implementation and without endangering the maintainability of the code.

Without modification of the structure your software will get messed up internally by hacks, and you will end up with an unmaintainable and unmodifiable piece of software – and will certainly fail in the long run. A good software design is crucial to project success. A design of good quality must be maintained all the time. (If you don’t follow this principle, sooner or later you will inevitably learn the hard way why it is so important.) Therefore this internal structure needs much more attention than one might think at first.

Think of it like this: if you build a (real) house, you first create a plan, then you build the walls, the floors, the roof and everything else according to plan. Most of these things can be specified in advance. For smaller houses following this pattern is no big deal. But for a larger complex of buildings this is not that easy: during construction the need for change arises frequently. The larger the project is, the harder it is to plan every step in advance in great detail. Therefore the architects try to build the complex in such a way that some changes can be tolerated even after construction has already started.

Developing software is a bit like raising a complex of buildings. But compared with such hardware, software is much more flexible. While in real buildings no one would think of modifying basic structures like main walls or even foundations after they have been built, software must deal with these kinds of changes: as software is based on source code, not on concrete, everything can be changed at almost any time. Typically requirements change over time and new ideas arise. Therefore there is never a point in time at which nobody wishes for massive changes to a software product.

Because of that it is sometimes important to do refactoring on your software: restructuring software without implementing new features. That’s what I am currently doing with our dictionary server. This way I will get more developer-friendly code, so that new features can be added easily. In our case: exporting dictionary entries as TEI. For ease of working with the dictionary data, all information is stored in JSON structures. But in the end some applications in digital humanities will want to access the data in a TEI-conformant way. Refactoring the dictionary software will allow us to end up with a good server architecture, and this good architecture will later enable us to implement such TEI data exports very easily in a plugin-like fashion.
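What "plugin-like" could mean here can be sketched in a few lines. This is only an illustration under assumptions, not the server's real design: the `Entry` shape, the `registerExporter` and `exportEntry` functions and the sample values are hypothetical. The TEI elements used (`entry`, `form`, `orth`, `gramGrp`, `pos`, `sense`, `def`) come from the TEI dictionaries module, but a real export would need far more care than this.

```typescript
// Hypothetical entry shape; the real JSON structure is not shown in this post.
interface Entry {
  lemma: string;
  pos?: string;
  definition?: string;
}

// An exporter plugin turns one entry into a string in some target format.
type Exporter = (entry: Entry) => string;

const exporters = new Map<string, Exporter>();

// Plugins register themselves under a format name.
function registerExporter(format: string, exporter: Exporter): void {
  exporters.set(format, exporter);
}

// The core only looks up the plugin; it knows nothing about TEI itself.
function exportEntry(format: string, entry: Entry): string {
  const exporter = exporters.get(format);
  if (!exporter) {
    throw new Error(`no exporter registered for format "${format}"`);
  }
  return exporter(entry);
}

// Escape the characters that are special in XML text content.
function escapeXml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// TEI plugin: maps the JSON fields onto TEI dictionary elements.
registerExporter("tei", (entry) => {
  const parts = [`<form type="lemma"><orth>${escapeXml(entry.lemma)}</orth></form>`];
  if (entry.pos) {
    parts.push(`<gramGrp><pos>${escapeXml(entry.pos)}</pos></gramGrp>`);
  }
  if (entry.definition) {
    parts.push(`<sense><def>${escapeXml(entry.definition)}</def></sense>`);
  }
  return `<entry>${parts.join("")}</entry>`;
});

// Illustrative values only.
const tei = exportEntry("tei", { lemma: "dhamma", pos: "noun" });
// tei === '<entry><form type="lemma"><orth>dhamma</orth></form><gramGrp><pos>noun</pos></gramGrp></entry>'
```

With a structure like this, a further export format would be one more `registerExporter` call, and the code that stores and serves the JSON entries would not have to change.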


Severe security problem: All users should update their Linux systems immediately

It’s a bit sad that I cannot start this blog with a somewhat less technical entry, but the severity of this problem must be brought to the attention of all users. As explained here, there is a major security issue in OpenSSL, which is used by various software packages including the web server Apache, SSH and others. It is mandatory that all users administrating Linux systems update their systems immediately.

The Day of DH is about what digital humanists do. Sometimes updating computer systems is one of the tasks: for example, in our project “SeNeReKo” we maintain our own web server to host important data for all project members. As we maintain this server ourselves to reduce administrative overhead, once in a while a few classic server administration tasks (like the one described above) have to be performed. In this case these are really simple. So simple that writing a blog entry about them takes much more effort than performing the actual system update.