
ICSSEA 2012–Kandutsch, Kuschnig, Warum, Fliedl, Winkler

Requirements Engineering as a Stepwise Process of Information Merging

Horst Kandutsch
AHK e.U., Am Föhrenwald 10, 9201 Krumpendorf a. W., Austria
Tel: +43 (676) 611 46 03

Johann Kuschnig
LIFE Lakeside IT Forschung & Entwicklung GmbH, Lakeside B03, A-9020 Klagenfurt a. W., Austria
Tel: +43 (463) 287 277

Manuel Warum
LIFE Lakeside IT Forschung & Entwicklung GmbH, Lakeside B03, A-9020 Klagenfurt a. W., Austria
Tel: +43 (463) 287 277

Günther Fliedl
Alpen-Adria-Universität Klagenfurt, Department of Applied Computer Science, Universitaetsstraße 65-67, A-9020 Klagenfurt
Tel: +43 (463) 2700 3733 – Fax: +43 (463) 2700 993733

Christian Winkler
Alpen-Adria-Universität Klagenfurt, Department of Linguistics and Computational Linguistics, Universitaetsstraße 65-67, A-9020 Klagenfurt
Tel: +43 (463) 2700 2814 – Fax: +43 (463) 2700 992814

ABSTRACT

Throughout the requirements engineering process, several documents are created over time. During the development process, requirements are subject to change, causing redundancy or ambiguity. The fact that specifications are usually incomplete, and that terminology shifts as awareness builds during the Requirements Engineering Process (REP), raises a strong need for linguistically enhanced information merging. In order to tackle these issues, several pre-processing tools have been developed in the environment of AVAnguide. The information merging process assembles all the respective requirements, thus enabling AVAnguide to compare the complete and unadulterated requirements with the corresponding source code, analyzed using Mono.Cecil.



AVAnguide stands for "Automatisches Validieren von Anforderungen" ('Automatic Validation of Requirements').





1. INTRODUCTION

In the domain of requirements engineering we have recognised a strong need for comparing textual requirements specifications with source code. Textual specifications, however, are usually incomplete, and beyond that, terminological inconsistency proliferates throughout the Requirements Engineering Process (REP). In order to address these problems, the AVAnguide toolset has been implemented. AVAnguide is a combination of different computational linguistic tools and static code analysis using Mono.Cecil. We use AVAnguide and Mono.Cecil for comparing textual specifications with their respective implementations. As input for this comparison process we expect a set of different text files (documents, textual presentation slides, and so on) resulting from the REP, together with the output of the static code analysis.

Throughout the requirements engineering process, most likely several documents will be created over time. As progress is made, and additional specifications or clarifications are written down, older and newer documents might not always agree with respect to terminology or in describing facts. A promising way to tackle this problem is the use of an information merging component, as described in section 2. Merging individual pieces of information brings along several issues to be considered: the use of co-referential expressions, anaphora and ellipses might obfuscate carriers of meaning, and synonymous terms may have been adopted in the course of time, making it difficult to elicit the correct meaning. LICORA, which stands for Linguistic Coreference Analysis, is a tool that handles these challenges. It is presented in section 3.

In section 4 we describe the AVAnguide linguistic tool sets and their corresponding concepts in more detail, before drawing a conclusion in section 5. Reflecting the state of the art and some of our current tool implementations, we describe the tools that are presented in the demo sessions of this conference.



2. INFORMATION MERGING

All software creation processes have one thing in common: during development, requirements are subject to change, causing redundancy or ambiguity. To solve this problem, an information merging component brings all the respective requirements together, thus enabling AVAnguide to compare the complete and unadulterated requirements with the corresponding source code. In other words, the main goal of the information merging component is to combine requirements from different sources, such as specification documents, notes and mailings, into one document which is well formulated and machine readable. The process of information merging faces several challenges when dealing with inherently heterogeneous documents: several parts of the documents – such as the table of contents, list of figures, salutations or figure captions – might not have any relevance; ideally, the information merging component identifies logically contradicting statements as well.

The merging process happens sequentially and hierarchically. It starts on the macroscopic level of documents, continues to drill down to the level of paragraphs, and lastly operates on individual sentences as single units of work. During the process of merging individual headlines, the proposed approach uses topic search algorithms and comparisons of headlines to insert paragraphs into the main document. Once all paragraphs have been inserted into the main document, single sentences are compared and merged in a similar fashion. At this stage, semantic analysis is performed to identify irregularities such as:

- multiple mentions of the same fact,
- contradictions,
- complementary information.

Through the reduction of sentences, it is possible to compare content. LICORA solves the problem of identifying anaphoric or co-referential expressions. Afterwards, pairwise comparison shows whether content is introduced twice; in this case, all but one sentence is automatically removed from the resulting document. If a contradiction is found, a detailed explanation of the issue is presented to a human expert (such as the user), providing options to resolve the situation. These interactions are handled by a so-called human-computer interface (HCI). If information is discovered that is identified as complementary to already existing topics or statements, this content is added after the respective sentence in the main document.

2.1 Process description

The entire process of merging individual documents can be summarized in six steps:

1. Create an XML representation of all documents
2. Perform linguistic analysis
3. Correction
4. Coreference resolution
5. Finding an anchor document
6. Merge
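The pairwise duplicate detection described above can be sketched as follows. This is an illustrative reduction of sentences to content-word signatures; the function names and the stop-word list are our own assumptions, not AVAnguide internals.

```python
# Illustrative sketch: detect duplicated content by reducing sentences to
# their content-word "signature" and comparing pairwise.

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "it"}

def reduce_sentence(sentence: str) -> frozenset:
    """Reduce a sentence to a bag of lower-cased content words."""
    tokens = [t.strip(".,;:!?").lower() for t in sentence.split()]
    return frozenset(t for t in tokens if t and t not in STOP_WORDS)

def remove_duplicates(sentences: list) -> list:
    """Keep only the first sentence of every group with the same reduction."""
    seen, kept = set(), []
    for s in sentences:
        signature = reduce_sentence(s)
        if signature not in seen:
            seen.add(signature)
            kept.append(s)
    return kept
```

In the real pipeline, contradictions and complementary information would additionally be routed to the HCI rather than silently dropped.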

These steps are performed in sequence to create a single document from many input documents.

2.2 XML Representation

In this step, the specified set of original documents is converted into XML: the content of each document is transformed into an XML representation that contains some additional meta-information. Such a transformation is required so that a uniform file format is available during the processing steps; input documents, on the other hand, can be of any file format for which a transformation is supported. Among other details, the included meta-information comprises the number of pages, the creation date, the creator and the word count; some file formats offer further information, which is included as well, such as the sender and recipient of e-mails. Furthermore, a separate XML file is created in addition to the XML document concerned. This so-called log file contains word frequency lists, identified topics of the document, and additional analytical data that are not part of the main document.

2.3 Linguistic Analysis

This step performs basic Natural Language Processing operations, such as tokenization and tagging. The respective stem of each token is added to the XML document as well. This piece of information constitutes the central part on which the next processing step is based.

2.4 Correction

Correction is necessary because document parts such as the table of contents, cover, or salutations are not required for the following steps. The removed items are added to the log file as deleted content, so they are never discarded entirely. If a list of abbreviations (e.g. in a glossary) is found, all occurrences of abbreviations are replaced by the full entry, so that abbreviated terms require no further lookups in the following steps.
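The XML representation step (2.2) might be sketched minimally as below. The element and attribute names are illustrative assumptions, not AVAnguide's actual schema.

```python
# Minimal sketch of converting a plain-text document into an XML
# representation with meta-information (creator, creation date, word count).
import xml.etree.ElementTree as ET
from datetime import date

def to_xml(text: str, creator: str, created: date) -> ET.Element:
    root = ET.Element("document")
    meta = ET.SubElement(root, "meta")
    ET.SubElement(meta, "creator").text = creator
    ET.SubElement(meta, "created").text = created.isoformat()
    ET.SubElement(meta, "wordCount").text = str(len(text.split()))
    body = ET.SubElement(root, "body")
    # Blank-line-separated chunks are treated as paragraphs.
    for para in filter(None, (p.strip() for p in text.split("\n\n"))):
        ET.SubElement(body, "paragraph").text = para
    return root
```

A real converter would additionally extract format-specific fields (e.g. sender and recipient of e-mails) and emit the separate log file.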


LICORA is developed in the Master's thesis of Manuel Warum. The human-computer interface is a part of AVAnguide that presents problems to the user, allowing him/her to resolve issues by choosing among proposed solutions without having to alter the original document.



2.5 Coreference Resolution

In this step, coreference resolution is performed; it is the topic of Manuel Warum's Master's thesis.

2.6 Finding an Anchor Document

The anchor document is the main document into which all information is merged. Finding it is the last step before the actual merging starts. In the context of requirements engineering, the requirements specification forms the main document. If such a document is found, it is marked as the anchor document. Otherwise, if no requirements specification document is identified, the largest document is marked as the anchor, for the simple reason that it offers the most content.

2.7 Merge

This processing step merges any number of documents into a single one. Initially, every document is annotated so that every paragraph is assigned a bag of topics. This bag is filled with nouns, ordered by importance (the most likely nouns occur first). These topics consist of nouns from headlines as well as nouns from the current paragraph. Short mails or paragraphs of three to four sentences have fewer nouns for a topic description, which can easily lead to a problem: one mail or paragraph could be assigned to too many other paragraphs, which means that such text passages must be handled separately. In such cases, a possible solution is to look at earlier mails or follow-ups to the respective mail; for paragraphs, a good approach is to include the paragraphs situated directly above and below the one in question. After identifying topics, the next step is to enrich the anchor document with all other documents. The oldest documents are added first to form a chronologically intact history of requirements. Inserting a paragraph into the anchor document is done by topic similarity: similarity is given if most nouns in the topic bags match. As for this step, further testing in practical environments is required.
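The topic-bag matching used for inserting a paragraph into the anchor document can be sketched as follows; the similarity measure and the threshold are illustrative assumptions.

```python
# Sketch of topic-bag matching: a paragraph is placed next to the anchor
# paragraph whose bag of topic nouns overlaps most, if the overlap is
# strong enough. Threshold and normalisation are assumptions.

def topic_similarity(bag_a: set, bag_b: set) -> float:
    """Fraction of shared topic nouns relative to the smaller bag."""
    if not bag_a or not bag_b:
        return 0.0
    return len(bag_a & bag_b) / min(len(bag_a), len(bag_b))

def best_insert_position(paragraph_bag: set, anchor_bags: list, threshold: float = 0.5):
    """Index of the most similar anchor paragraph, or None if nothing matches."""
    scored = [(topic_similarity(paragraph_bag, bag), i)
              for i, bag in enumerate(anchor_bags)]
    score, index = max(scored)
    return index if score >= threshold else None
```

A return value of None corresponds to the problematic case discussed above, where surrounding paragraphs or mail threads must be consulted.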
After completing the anchor document, the sentence merging process can be started. Currently, this process is not fully defined and thus requires further research. After the merging process has completed, the resulting document is sent to the AVAnguide coverage process, which compares the source code and the resulting document.



3. LICORA

As AVAnguide employs statistics on carriers of meaning, pronouns and different types of expressions for the same entity can skew and distort frequency-based metrics. To mitigate this issue, a machine-learning algorithm based on co-/dis-reference decisions was developed, building on previous research carried out in the project SüKRE and on work by Ng and Cardie. This standalone component is called LICORA (Linguistic Coreference Analysis). Its steps are illustrated in Figure 1, with more detailed explanations following below.





[Figure 1 shows LICORA's processing pipeline: Preprocessing (tagging and tokenization) → Markable Detection → Feature Extraction → Co-/Dis-Reference Decision. Feature Extraction comprises morphological, lexical and semantic features, such as type of word, subject or object role, Levenshtein distance, capitalization, semantic relatedness and semantic categories.]

Figure 1: Macroscopic view of LICORA's processing pipeline

LICORA uses a slightly adapted version of chunking – the process of identifying individual phrases in a given sentence – for the detection of so-called markables. By Kessler's definition [6], markables are expressions that might be coreferent with another expression; for the current purpose, we use noun phrases provided by shallow parsing algorithms. As a post-processing step, possessive modifiers and embedded pronominal tokens are extracted and treated as individual, potentially subordinate markables. For instance, "my uncle's house" would be treated as three individual markables – "my", "my uncle" and "my uncle's house" – as each of these entities can be coreferent on its own.

Once all markables in a given document have been identified, each one is paired with any number of antecedent markables. The number of pairs generated this way is restricted; tests have shown that a maximum distance of five sentences or around forty markables between both elements of a pair is sufficient for most texts. All markable pairs are then analysed individually in a two-step process: first, a set of features is extracted from these pairs, ranging from positional features over morphological aspects to semantic metrics. Some features have been taken directly from previous notable work by SüKRE, Ng and Cardie, whereas other features have been adapted or introduced after a range of viability tests. There are several different kinds of lexical, syntactic, morphological and semantic features; below are a few examples to indicate the range of features employed in the feature extraction process:

- Type: the type of the markable in question – pronoun, proper name, common noun or otherwise. This is an important factor in some circumstances, especially if the latter markable is a pronoun and the former is not.
- Closest agreeing markable: inspects whether both markables have no other markables in between them and agree in number and gender.
- Acronym: whether one markable can be interpreted as an acronym of the other; this feature only attempts to identify whether the concatenated initial letters of one markable's words equal the other markable's image.
- Overlap: counts the relative number of words in the intersection between the content words of both markables; the traditional definition of a content word includes nouns, adjectives, and other carriers of meaning except pronouns. For LICORA's purpose, pronouns are treated as content words as well.
- Definiteness: inspects the definiteness of either markable using specific keywords such as "a" and "the".
- Agreement: inspects whether both markables agree in number and/or gender using lexical datasets as well as lexical patterns (e.g. "-s") and keywords (e.g. "Mr", "Mrs"). If no gender can be extracted using name databases, keyword search or lexical lookups, the noun is presumed to be of any gender and automatically agrees with any other markable.
- Semantic categories: attempts to assign broad semantic categories to both markables (e.g. item, location, person) based on ontological databases. Once categories have been identified, LICORA establishes a semantic category agreement if an overlap between both markables' categories exists.
- Semantic relatedness: using ontologies such as NELL (from the Read The Web project) and WordNet, these features attempt to measure the shortest distance between both semantic concepts within the ontology using Dijkstra's algorithm; for this, some types of relations (such as synonymy or hyponymy) have lower path costs than others, so that synonymous relations are preferred.
- Positioning: in addition to other features, these measure the distance between both markables; this includes the number of intermediate noun phrases as well as the number of intermediate sentences.
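A few of the features above can be sketched compactly; the exact definitions used in LICORA may differ.

```python
# Hedged sketches of three pair features: Levenshtein distance (lexical),
# acronym detection, and content-word overlap.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[len(b)]

def is_acronym(short: str, phrase: str) -> bool:
    """True if `short` equals the initial letters of the words of `phrase`."""
    initials = "".join(w[0] for w in phrase.split()).lower()
    return short.lower() == initials

def overlap(words_a: set, words_b: set) -> float:
    """Relative number of shared content words (pronouns included)."""
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```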

After features have been extracted, they are relayed as a dataset to Weka, a Java-based data mining application developed by the University of Waikato. This dataset is classified using previously trained classification models, with each pair being classified as either coreferent or disreferent.
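As a minimal stand-in for the Weka classification step, the following sketch labels a markable pair's numeric feature vector by a 1-nearest-neighbour rule over previously labelled pairs; AVAnguide itself relies on Weka's trained models, not on this toy classifier.

```python
# Toy stand-in for Weka: classify a markable pair's feature vector as
# "coreferent" or "disreferent" by 1-nearest-neighbour over labelled data.

def classify_pair(features, training_data):
    """training_data: list of (feature_vector, label) tuples."""
    def distance(v, w):
        return sum((x - y) ** 2 for x, y in zip(v, w))
    _, label = min((distance(features, vec), label) for vec, label in training_data)
    return label
```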



4. THE AVANGUIDE LINGUISTIC TOOL SETS

The tools described in the previous sections enable the user to efficiently merge requirements specification documents into a single coherent document. All these implementations are part of a chain of tool sets within the AVAnguide project. The linguistic tools involved are divided into three tool sets, which are presented in the following subsections.

4.1 The linguistic "MinToolSet"

Aiming at extracting semantic features coded by nouns and noun phrases, the linguistic MinToolSet initially carries out the following language processing steps:

- Tagging
- Chunking
- Predicate structure identification
- Lemmatizing
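For illustration, a toy version of the tokenization and lemmatizing steps on English text might look like the sketch below; real taggers, chunkers and lemmatizers are of course far more sophisticated, and the suffix rules shown are purely illustrative assumptions.

```python
# Toy sketch of two MinToolSet steps: tokenization and rule-based
# lemmatizing via naive English suffix stripping.
import re

def tokenize(text: str) -> list:
    return re.findall(r"[A-Za-z]+", text)

def lemmatize(token: str) -> str:
    """Strip a few common English inflection suffixes (heuristic only)."""
    word = token.lower()
    for suffix, repl in (("ies", "y"), ("sses", "ss"), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

def pipeline(text: str) -> list:
    return [lemmatize(t) for t in tokenize(text)]
```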

After that, a further statistical component, called Frequencer, calculates the relevancy metric R of any extracted and lemmatized noun or entity. Using power laws, AVAnguide can draw conclusions about the relative R within the set of requirements specifications and calculates the most probable usage of the current or a similar string in the resulting source code of the project's implementation. For instance, a string is considered similar if it represents an abbreviation or an acronym; the string "pwd" as a method argument could be interpreted as an abbreviated form of "password", for which a counterpart might exist in the natural language requirements. This is achieved by a special component using commonly employed devocalization and abbreviation techniques as well as a lookup table of frequently used abbreviations. The output of the MinToolSet contains representations of class and attribute candidates found in the source code. Subsequent processing of the source code, and especially its quantification with the help of the Hits algorithm [5], allows a comparison between these representations. The result of this comparison (c′) is one part of the AVAnguide Coverage metric C.

4.2 The linguistic "MedToolSet"

After the reduction of requirements and source code in the MinToolSet, the MedToolSet calculates further relevant meta-information. Along with a quantification of operands and operators, for example the calculation of the Halstead Volume [3] or the cyclomatic complexity according to McCabe [8], the MedToolSet draws conclusions from the syntactic adjacency of semantic features in the requirements documents. An entity E1 is considered with its relevance as a function f1 based on the relative frequency of this entity in the documents. Likewise, other entities E2, E3, E4 in the syntactic proximity of E1, including their respective relevance metrics, can be observed. Figure 2 illustrates the directed graph of these entities. The direction of the edges indicates the syntactic order of appearance; for example, the entity E1 is often followed by E2, and so on. The result is a semantic network, obtained without using large ontologies. The original sentences resulting in the graph in Figure 2 were: "Administrators (E3) of cost statistics (E4) are also administrators (E3) of the users' (E1) time registration system (E2). The users' hours of work are one of the key performance indicators of cost statistics." The corresponding source code graph of the resulting software system might show a similar shape: the class "adm" (C3), with its function based on hub and authority f(h,a)3, calls the class user (C1) and also a class cost statistic (C4).

Figure 2: Requirements and source code graph of a similar concept
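The adjacency graph underlying Figure 2 can be reconstructed from a document's entity sequence by counting direct successions; entity extraction itself is assumed to have happened already.

```python
# Illustrative reconstruction of the entity adjacency graph: count how often
# one entity directly follows another in the extracted entity sequence.
from collections import Counter

def adjacency_graph(entity_sequence: list) -> Counter:
    """Directed edge counts: (predecessor, successor) -> frequency."""
    return Counter(zip(entity_sequence, entity_sequence[1:]))
```

Together with per-entity frequencies (the relevance functions f1, f2, ...), such edge counts yield the relevance-weighted semantic network described above.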

According to these facts and the similarity of the rankings – induced by frequency in the natural language documents and by Hits [5] for the classes in the source code – AVAnguide determines c″, an additional part of the Coverage C.

4.3 The linguistic "MaxToolSet"

The subsequent MaxToolSet extracts semantics with the aim of identifying user roles, functionalities and the completeness of the requirements. The implemented hypothesis follows the basic principle that the verb is not only the head of the sentence, but also predominantly responsible for its meaning. In terms of requirements documents, AVAnguide also presumes that each verb can be reduced to one of the following CRUDn classes. These classes (create, read, update, delete) represent the basic operations of data-centered business application requirements; these verb types are supplemented with modal and possessive verbs, as shown in the following example.

"Administrators are allowed to create new user accounts. A user account consists of its name, his/her unique identifier, password and different facultative data fields."

After applying the different tool sets, we obtain the following result:

"administrator/NN CRUD5(CRUD1) user/NN account/NN user/NN CRUD6 name/NN, identifier/NN, password/NN"

In this example, the indicator CRUD5 is particularly noteworthy. The verb class in this case represents a security guideline, which should be reflected in the source code of the application as well. Quite obviously, by making use of well-defined verb classes, AVAnguide is able to simplify the semantic determination of natural language. The following table lists these classes and some representatives as initial seeds for completion via set expansion algorithms like SEAL [4].




Main Representatives   Initial Seeds
Create (CRUD1)         Make, produce, assemble, establish, fabricate, constitute, draw
Read (CRUD2)           See, detect, gather, scan, sense, look, behold, gaze, view, glare
Update (CRUD3)         Edit, adapt, handle, calculate, process, revise, treat, work, attend
Delete (CRUD4)         Erase, kill, cancel, clear, destroy, eliminate, eradicate, reset, undo
Modal (CRUD5)          Permit, enable, able, approve, admit, can, may, must, have to
Possessive (CRUD6)     Have, hold, own, exhibit, belong, feature, keep, retain, remand
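Using seed lists like those in the table, a verb-to-CRUDn lookup might be sketched as below. The seed sets shown are abridged and slightly extended (e.g. "allow", "consist") for illustration; in AVAnguide they would be completed via SEAL-style set expansion rather than hard-coded.

```python
# Sketch of mapping verbs to CRUDn classes via seed lists; the seed sets
# here are abridged/extended illustrations, not AVAnguide's actual lists.

CRUD_SEEDS = {
    "CRUD1": {"create", "make", "produce", "assemble", "establish"},
    "CRUD2": {"read", "see", "detect", "gather", "scan", "view"},
    "CRUD3": {"update", "edit", "adapt", "handle", "process", "revise"},
    "CRUD4": {"delete", "erase", "cancel", "clear", "destroy", "reset"},
    "CRUD5": {"permit", "enable", "approve", "allow", "may", "must"},
    "CRUD6": {"have", "hold", "own", "consist", "belong", "feature"},
}

def crud_class(verb: str):
    """Return the CRUDn label of a verb, or None if it is not covered."""
    v = verb.lower()
    for label, seeds in CRUD_SEEDS.items():
        if v in seeds:
            return label
    return None
```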

The final hypothesis of the current work thus coincides with that of an already established approach using verb classes [2]: each class implements the same dependence structure among its members. This is indeed the clue to solving the challenge of dependence parsing. According to Ágel [2], it is not possible to automatically determine whether a declaration is facultative or obligatory. Nevertheless, reducing this problem with the help of the CRUDn classes enables us to implement a strategy for each of these six classes. Furthermore, this approach should be quite suitable for detecting errors in requirements documents, identifying potential violations of guidelines set forth by the rules of Rupp [7], as well as for gaining a list of functional requirements. By identifying a function or method for each entry of this list, AVAnguide can determine the c‴ metric. Considering security issues (e.g. via corresponding interfaces for the different users), it can calculate the Coverage C of requirements documents and the corresponding source code of data-centered business applications. This forms a first software metric with a focus on the functional completeness of a software system [1].



5. CONCLUSION

In the current paper, requirements engineering has been considered as a stepwise process of information merging. The AVAnguide platform combines different computational linguistic tools with static code analysis using Mono.Cecil in order to compare a specification with the developed implementation of the project concerned. For coreference analysis, LICORA (Linguistic Coreference Analysis), a machine-learning-based component, has been developed. Further processing is carried out by three concatenated AVAnguide tool sets. By making use of well-defined verb classes, AVAnguide is able to simplify the semantic determination of natural language. This approach is quite suitable for detecting errors in requirements documents, identifying potential violations of guidelines, as well as for gaining a list of functional requirements.





REFERENCES

[1] Abran, A. (2010). Software Metrics and Software Metrology. John Wiley & Sons, 2010, ISBN 9780470597200.
[2] Ágel, V. (2006). Dependenz und Valenz: Ein internationales Handbuch der zeitgenössischen Forschung, Band 2. Walter de Gruyter, 2006, ISBN 9783110199840.
[3] Burgin, M. S. (2005). Super-Recursive Algorithms. Springer, 2005, ISBN 9780387955698.
[4] Dalvi, B., Callan, J., Cohen, W. (2011). Entity List Completion Using Set Expansion Techniques. Language Technologies Institute, Carnegie Mellon University, Pittsburgh, 2011.
[5] Kemper, A., Eickler, A. (2011). Datenbanksysteme: Eine Einführung. Oldenbourg, 8th edition, ISBN 9783486598346.
[6] Kessler, S. W. (2010). Analysis and Visualization of Coreference Features. Master's thesis, Institut für Visualisierung und Interaktive Systeme, Universität Stuttgart.
[7] Rupp, C. (2007). Requirements-Engineering und -Management: Professionelle, iterative Anforderungsanalyse für die Praxis. Hanser, 2007, ISBN 9783446405097.
[8] Russ, M. (2008). Development of R&D KPIs in SW Development and SW Test. GRIN, 2008, ISBN 9783640138296.

