Doing some refactoring on our dictionary server

In our project SeNeReKo we use two corpora: an Old Egyptian corpus and a Pali corpus. We aim to enhance both corpora in such a way that each word is associated with a part-of-speech tag. Unfortunately this is not that easy: Due to the historic nature of these corpora (they are two to three thousand years old) we have to deal with a large variety of morphological forms, especially in Pali.

To address this complexity in Pali we need to maintain a dictionary with various information about words. To be able to actively work with the data we need a server that is capable of managing the data associated with each word in such a way that these associations can be modified very easily. As classic database systems do not allow the flexibility required here, we chose to create our own server application to address this. It is based on MongoDB and NodeJS. The NoSQL database MongoDB maintains loosely structured, document-oriented data and is used to actually store the data. A NodeJS server is used as a frontend to that database in order …

  1. … to control access to this data,
  2. to implement the REST API together with some custom features,
  3. to give even more flexibility than could be achieved with MongoDB alone,
  4. and, most importantly, to enhance performance dramatically.

(I will give you more information about this in some other post.) This server enables us to perform easy machine-based processing of our dictionary data. (This way we hope to considerably reduce the need for manual editing.)
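To make the idea of such a frontend a bit more tangible, here is a minimal sketch in JavaScript. It is not our actual server code: an in-memory Map stands in for the MongoDB collection, and the names `getEntry`, `setAssociation`, the roles and the fields of the sample entry are assumptions chosen purely for illustration.

```javascript
// Hypothetical sketch: an in-memory Map stands in for the MongoDB collection.
const entries = new Map();

// An illustrative, loosely structured entry. Apart from `lemma`,
// no field is assumed to be mandatory.
entries.set("dhamma", { lemma: "dhamma", pos: "noun" });

// Hypothetical roles that are allowed to modify entries.
const WRITE_ROLES = new Set(["editor", "admin"]);

// Entries are plain JSON data, so a JSON round trip is enough for a deep copy.
function copyOf(entry) {
  return JSON.parse(JSON.stringify(entry));
}

// Read access: hand out a copy so callers cannot change stored data directly.
function getEntry(lemma) {
  const entry = entries.get(lemma);
  return entry ? copyOf(entry) : null;
}

// Write access: attach or replace one piece of information on an entry,
// but only for users whose role permits writing.
function setAssociation(user, lemma, key, value) {
  if (!WRITE_ROLES.has(user.role)) {
    throw new Error(`user ${user.name} may not modify entries`);
  }
  const entry = entries.get(lemma);
  if (!entry) {
    throw new Error(`no entry for lemma ${lemma}`);
  }
  entry[key] = value;
  return copyOf(entry);
}
```

The point of the sketch is only that all reads and writes pass through a thin layer in front of the store, which is where access control and custom features can live.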

As the performance results are excellent and everything works well, it is now time to do some refactoring on the server: That’s what I am currently working on. But let me give you a bit more background information on what refactoring is and why it is so important.

If you develop software, especially in larger projects, it is customary to add more and more features. Projects typically are feature-driven: Features are the “driving force” that brings progress to a project. Progress is typically measured by the number of lines of code or features implemented: This is basically no different in scientific projects.

Unfortunately this is not enough. Projects cannot be measured only by looking at features (or the number of lines of code). Features are only superficial characteristics. They are the directly visible properties, but there are more properties that need to be taken into consideration in order to judge the quality of a software product on its way to a usable state.

If you only look at features you miss the entire internal structure of the software: Though this structure is typically not visible, it is of fundamental importance. It is the quality of this structure that decides whether you will succeed in completing your work with a good software product or whether you will end up with a buggy system. Therefore not only the features (or the number of lines of code) but also the structure of the software requires your attention. Software developers with many years of experience know about this.

Screenshot of one of our forms showing a dictionary entry

At some point in a project it can be wise to not implement any more features, but to modify the structure of the software instead. In such a stage changes must be applied to the software that do not result in new features: These changes are therefore not directly visible from the outside. One may be tempted to think that if these changes cannot be seen from the outside, they may not be important. But this perspective is a fatal one: For software a really good internal architecture is mandatory. Then and only then can new features be added in the future without breaking the existing implementation and without endangering the maintainability of the code.

Without modification of the structure your software will get messed up internally by hacks and you will end up with an unmaintainable and unmodifiable piece of software, which will certainly fail in the long run. A good software design is crucial to project success. A design of good quality must be maintained all the time. (If you don’t follow this principle, sooner or later you will inevitably learn the hard way why it is so important.) Therefore this internal structure needs much more attention than one might think at first.

Think of it like this: If you build a (real) house, you first create a plan, then you build the walls, the floors, the roof and everything else according to plan. Most of these things can be specified in advance. For smaller houses following this pattern is no big deal. But for a larger complex of buildings this is not that easy: During construction the need for change arises frequently. The larger the project, the harder it is to plan every step in advance in great detail. Therefore the architects try to build the complex in such a way that some changes can be tolerated even after construction has already started.

Software is a bit like raising a complex of buildings. But compared with such hardware, software is much more flexible: While in real buildings no one would think of modifying basic structures like main walls or even foundations after they have been built, software must deal with these kinds of changes: As software is based on source code, not on concrete, everything can be changed at almost any time. Typically requirements change over time and new ideas arise: Therefore there is no point in time at which there won’t be a desire for massive changes to a software product.

Because of that it is sometimes important to do refactoring on your software: restructuring software without implementing new features. That’s what I am currently doing with our dictionary server. This way I will get more developer-friendly code: Then new features can be added easily. In our case: exporting dictionary entries as TEI. For ease of working with the dictionary data all information is stored in JSON structures. But in the end some applications in digital humanities will want to access the data in a TEI-conformant way. Refactoring the dictionary software will allow us to end up with a good server architecture, and this good architecture will later enable us to implement such TEI data exports in a plugin-like fashion very easily.
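To illustrate what “plugin-like” could mean here, the following JavaScript sketch shows one possible shape: a small registry of export functions, with a TEI exporter registered as one plugin among potentially many. This is an assumption about a possible design, not the server’s real architecture. The names `registerExporter` and `exportEntry` and the JSON fields `lemma` and `pos` are hypothetical; the elements `entry`, `form`, `orth`, `gramGrp` and `pos` are taken from the TEI dictionaries module.

```javascript
// Registry of export plugins, keyed by format name (hypothetical design).
const exporters = new Map();

// Register an export function for a given format.
function registerExporter(format, exportFn) {
  exporters.set(format, exportFn);
}

// Look up the plugin for the requested format and apply it to one entry.
function exportEntry(format, entry) {
  const exportFn = exporters.get(format);
  if (!exportFn) {
    throw new Error(`no exporter registered for format: ${format}`);
  }
  return exportFn(entry);
}

// Escape the characters that are special in XML text content.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;");
}

// The TEI plugin: maps the assumed JSON fields `lemma` and `pos`
// to a minimal TEI dictionary entry.
registerExporter("tei", (entry) =>
  [
    "<entry>",
    `  <form type="lemma"><orth>${escapeXml(entry.lemma)}</orth></form>`,
    `  <gramGrp><pos>${escapeXml(entry.pos)}</pos></gramGrp>`,
    "</entry>",
  ].join("\n")
);

console.log(exportEntry("tei", { lemma: "dhamma", pos: "noun" }));
```

With a structure like this, the core of the server never needs to know about TEI: a further export format would only mean one more call to `registerExporter`.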