
PROJECT TITLE: Automated extraction and presentation of patient scenarios from a database for research into health inequalities

- By Sharadha Jayaraman (Bachelor of Science in Software Engineering, Aston University, Birmingham, United Kingdom)

April 2013

ACKNOWLEDGEMENTS: I would like to thank the following people for their help in the production of this application: Dr. Christopher Buckingham, my supervisor and the client, who has guided me through this project. Dr. Tony Beaumont and Dr. Ian Nabney, for their support with the interface design. Ms. Katrina Samperi, for her programming support and insights into the project implementation. Mr. Ashish Kumar, for providing me with some implementation aid. Mr. Sachin Hariharan, for providing feedback at every stage of development. Mr. Jay Bhosle, for advising me on Swing and database connections and for testing the application. Finally, my family and friends, who have lent some programming support as well as suggestions which contributed to the success of this project.

Table of Contents

1. Introduction
2. Context & Research
2.1 GRiST: Galatean Risk Screening Tool
2.2 MyGRiST: Galatean Risk and Safety Tool
2.3 GRiST Database
2.4 Designing Experiments
2.4.1 What is Design of Experiments?
2.4.2 Components of DOE
2.4.3 DOE Process
2.4.4 Advantages and Disadvantages of DOE
2.5 Hypothesis/Significance Testing
2.5.1 Need for Experimentation
2.5.2 What are the tests for Hypotheses?
2.5.3 Statistical Packages
3. Requirements Analysis
3.1 User Requirements
3.2 Functional Requirements
3.2.1 Mind Maps
3.2.2 FreeMind: Software Tool for developing powerful Mind Maps
3.2.3 Requirements Analysis
3.3 Non-Functional Requirements
4. Design
4.1 The Parent Interface
4.2 Human Computer Interface Principles- Part I
4.3 The Waterfall Model
4.4 Human Computer Interface Principles- Part II
4.5 Class Diagrams
4.6 Database Design- Part I
4.7 Designing the non-functional requirements
4.8 Applying software engineering practices
5. Implementation
5.1 Designing GUI with Java Swing
5.2 MySQL
5.2.1 Database Design- Part II
5.2.2 Querying the database
5.2.3 OpenCSV
5.2.4 Database Management
5.3 Usability Testing
6. Evaluation
6.1 Iterative Waterfall Model- Is it best suited for the application?
6.2 Revisiting Requirements
6.2.1 Extending the application
6.3 Limitations
6.3.1 Undelivered Requirements & Proposed Solutions
6.3.2 Alternative Tools and Techniques
7. Conclusion
7.1 Techniques Learnt
7.2 Future Work
8. References
9. Appendix

ABSTRACT

GRiST is a mental health clinical tool, freely available online, that helps at-risk patients organise their lives coherently through risk minimisation. The tool presents a questionnaire of comprehensive questions about an individual and his/her risk history. Clinicians use the tool to understand patient risks and formulate “action plans” to help patients reorder their lives and prevent further harm to self. As much as these assessments are valued by clinicians and medical institutions, they may be biased by factors such as gender, age and ethnic background. It is therefore imperative that the database be comprehensively investigated for biases and that any biases found be eliminated. Doing so could also identify the factors which trigger these prejudices. An efficient way of determining such factors is to interpret the data in the database visually. The application built through this project aids the GRiST project by automating sample creation for investigating health inequalities: experimenters can create samples and export them to CSV files for statistical analysis. The report discusses in detail the context and rationale for developing the application, the requirements elicited, their design and implementation in an object-oriented programming language, and the evaluation, limitations and conclusions.

Chapter 1: INTRODUCTION

This project is associated with the Galatean Risk and Safety Tool (GRiST), which, as described by Dr. Christopher Buckingham, is “a sophisticated clinical decision support system for mental-health risk screening, assessment, and management.” [1]

This risk assessment tool (GRiST) is designed to assess risks among individuals and thereby estimate their severity. It has been deployed in NHS and many other clinics in Birmingham. The tool records user assessments and stores them in a mental-health repository. These are then analysed by clinicians, who provide judgements for the corresponding risks recorded. In this project, I have attempted to analyse and study these judgements. An individual can then use the risk assessments to consciously organise and manage his/her life, based on what the risk(s) may be.

The rationale behind developing this application is to determine whether there are biases in clinicians’ judgements. Are the assessments biased towards males? Do more females than males self-harm? If so, have they been treated differently on the basis of their gender or age? Questions of this kind arise when we want to analyse the data stored in the GRiST database. The aim of developing such an application is to locate and help eliminate biases in these assessments. The application lets the user set up samples from the GRiST database. The user can query the database and set up sample populations, customised to his/her requirements, which can then be used for investigating prejudices based on gender, age, ethnicity, etc. The user can later perform statistical analysis on these samples.

The activities undertaken in this project are: (i) developing a base understanding of the GRiST system and of setting up sample populations for statistical analysis, (ii) eliciting requirements for the software application and determining the flexible functionalities offered to the user to establish unbiased samples, (iii) implementing the tool with the functionalities drawn up during the requirements stage, (iv) evaluating the final product and (v) investigating its scope for improvement and the further work which could be undertaken to improve the tool.

Chapter 2: CONTEXT & RESEARCH

This project is centered on the Galatean Risk and Safety Tool, which is used to assess patient risks. The aim of developing this tool is to promote management of patient risks and clinicians’ judgements for optimising mental-health care. It is important to understand the assessments and what factors influence a clinician’s perceptions. This section details the GRiST system, its evolution and the need for testing the database for biases.

2.1 GRiST: Galatean Risk Screening Tool

GRiST is a mental health clinical decision support tool designed to help manage the possible risks in an individual’s life, including those affecting the people who depend on that individual. Its questions are all about THE INDIVIDUAL i.e. you: your life, your risk history, your personality, your relationships with others, your job, your feelings and emotions, etc. It also helps guide an individual on the journey to recovery. The questions listed on GRiST are often very specific and individualistic. They are also exhaustive, being the questions which risk clinicians usually ask to elicit risk data for analysis. GRiST was initially a questionnaire-based venture which was later developed into an electronic tool for easier access and smoother management of data. The questionnaire followed the style:

Figure 1. GRiST Questionnaire [2]

The GRiST questionnaire was later transformed into a mental health tool, found online on the GRiST website.

2.2 MyGRiST: Galatean Risk and Safety Tool

The myGRiST tool was deployed across various clinics in 2009. The questionnaire was transformed into an online tool and patient assessments were stored in a database. The online version was adopted because it fostered not only efficient management of patient data but also an effective medium for clinicians to offer their opinions judiciously. It also offered more universal access to expert judgements and greater flexibility in meeting end-user requirements. At present, the database has over 20,000 assessments [3]. MyGRiST helps users monitor their own risks at home (self-assessment) and manage their lives more effectively. Users can store their risk histories without having to repeat information they have already shared. Several versions of the tool have been developed or are under development, viz. Working Age (18-65 years), Older Adults and Younger People, along with specialised versions for Learning Disabilities, etc. These versions span a vast array of user profiles and help individuals sort key risk issues at their convenience. There are also user groups by accessibility range, viz. myGRiST (the functional tool, widely deployed in clinics in Birmingham; this group is for GRiST developers), GRiST Demo (a demo group which helps users test the tool) and Anonymous Users (which provides access to users without the need to register).

Figure 2. MyGRiST Tool GUI [4]

The tool is better than the paper-based version in terms of navigating through the questions. The user needs to answer only the relevant questions, which the tool selects based on how much information the user provides. For the clinicians, however, the activities remain the same. They provide risk judgements for each risk. They are also required to provide any additional comments and plans of action for the users pertaining to their risks, except that these steps are now automated. This helps document the clinician-patient transaction better, as it is more evident where the data comes from and how the patients are evaluated.

The GRiST project is a very ambitious one and has been developed over the years. The knowledge engineering tool developed to promote myGRiST has been developed in Java. Because this application is closely linked to analysing data in the GRiST database, it was reasonable to adapt the software application being developed to the existing one, i.e. the myGRiST tool. A second reason for taking this step was that the pre-existing tool interacts directly with the database to record user assessments. This was beneficial because the current application needs some information from the stored database values to locate biases. Hence, the application developed as a part of this project is an extension of the pre-existing tool. In summary, the components taken from the pre-existing tool for the application under development (primarily in the User Interface) are:

XMLs: The user interface reads (and parses) two XML files which contain the various risk-related questions as well as the codes for each node (used primarily for database storage/communication and management).


Common Assessment Tree: The tree structure used for navigational purposes. The users can navigate the tree to answer specific questions. This tree also contains the valid answer ranges for each question, whatever its type (nominal, date-day, month, year, number, etc.). The tree is also essential because it updates itself dynamically according to the client assessment. That is, if an end-user has a saved assessment, then the next time the user logs in, the tree must be updated with the saved answers so that the user does not have to fill in his/her information all over again.


Question Tree: The tree with the question texts and unique codes. The question bank loads every time the user begins a fresh assessment or repeats or resumes his/her assessment.

The primary objective of using this pre-existing knowledge engineering tool is to understand the application better and build it more efficiently for its specific purpose, rather than starting from scratch and focusing on the intricacies of parsing the node codes and value ranges. It was treated as open-source software which could be customised to the current project requirements. This framework yields a better end-product than a custom tool developed from the beginning, which would have drawn focus away from the purpose of the application: analysing the database and the risk assessments stored in it.

2.3 GRiST Database

The GRiST database has over 20,000 assessments stored. It is a complex model which contains patient data and risk assessments given by clinicians. The database is designed so that each node has a corresponding code and each code has an associated question. These questions are answered by the users and the answers are recorded against the related code. There are 10 types of questions, including scale questions (taking values in the range 0-10), number questions (taking numeric/integer values), day, month and year questions (taking days, months and years as valid values), filter and layer questions (taking “YES”, “NO”, “NULL” or “DK” as valid answers) and header questions. The answers to day, month and year questions are taken in as integer values but are parsed into specific values so that they complement the risk assessment easily. The other fields in the database correspond to the various node codes, in the form “gen-gender”, “gen-relationship”, “gen-ethnicity”, “suic-patt-att”, “suic”, etc. These identify the questions associated with them, such as “To what extent are you feeling anxious or fearful?”, “When did you last try to end your life?”, “Relationship Status”, “Gender”, “Ethnicity”, etc. Recorded against these codes are the assessments, i.e. the answers filled in by individuals. The overall structure of the database is outlined below:

Figure 3. Table structure of the GRiST database.

The database forms the core of the rationale for developing this application. The database has many assessments stored in it. But is there an efficient way of analysing the data? Is there reasonable justification for the assessments provided for certain risks? If so, who benefits? Is it the males or the females? Is it the youth or the aged? Such questions may arise if the data in the database is read plainly. If the data is not analysed appropriately, existing inequalities in the recorded judgements may remain undiscovered unless a clean way of examining them is devised, i.e. visualising the data in some manner. Since so many assessments are stored in a single repository, there is a need to examine what influences clinicians when they provide their opinions. A good way of doing this is to produce some statistics, with graphs to complement them, and to track patterns in the data. It then becomes easier to gauge the assessments, the biases in these assessments, if any, and where they may be directed. If there are biases in the assessments recorded, then they need to be eliminated. An efficient way of achieving this is to sift through the records until equal samples are compared in terms of demographics and other risk factors.

The aim of developing this application is not to perform significance testing, as that may be carried out in any statistical package such as SPSS. The aim is rather to set these samples up SO THAT they can be analysed using statistical packages. Hence, the functionality offered to the user is to build sample populations that are devoid of biases. This can be achieved if the correct risk factors are picked and the samples are balanced in terms of the data recorded for those factors, which is the driving force for developing the application. However, what is the significance of setting up such sample populations and running experiments on them? Why is this necessary? The next section discusses design of experiments, the process, and its advantages and disadvantages.
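The idea of comparing equal samples can be illustrated with a minimal Java sketch. It is not the application's actual code: the class `BalancedSampler`, the `Assessment` type, the method `balanceBy` and the answer values used in the usage note are hypothetical names introduced here for illustration, and the assumption is that each assessment exposes its answers keyed by node code (such as "gen-gender"). The sketch groups records by one factor and trims every group to the size of the smallest.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch: these names are illustrative, not taken from the GRiST code base.
final class BalancedSampler {

    /** One stored assessment, reduced to an id and its answers keyed by node code. */
    static final class Assessment {
        final int id;
        final Map<String, String> answers;

        Assessment(int id, Map<String, String> answers) {
            this.id = id;
            this.answers = answers;
        }
    }

    /**
     * Groups assessments by the answer to one node code (e.g. "gen-gender"),
     * then trims every group to the size of the smallest one, so that the
     * groups being compared contain the same number of records.
     */
    static Map<String, List<Assessment>> balanceBy(
            List<Assessment> all, String nodeCode, long seed) {
        Map<String, List<Assessment>> groups = new LinkedHashMap<>();
        for (Assessment a : all) {
            String value = a.answers.get(nodeCode);
            if (value == null) {
                continue; // skip records with no answer for this node
            }
            groups.computeIfAbsent(value, k -> new ArrayList<>()).add(a);
        }

        int smallest = Integer.MAX_VALUE;
        for (List<Assessment> group : groups.values()) {
            smallest = Math.min(smallest, group.size());
        }

        Random random = new Random(seed); // a fixed seed keeps the draw repeatable
        Map<String, List<Assessment>> balanced = new LinkedHashMap<>();
        for (Map.Entry<String, List<Assessment>> e : groups.entrySet()) {
            List<Assessment> shuffled = new ArrayList<>(e.getValue());
            Collections.shuffle(shuffled, random);
            balanced.put(e.getKey(), new ArrayList<>(shuffled.subList(0, smallest)));
        }
        return balanced;
    }
}
```

For example, calling `BalancedSampler.balanceBy(assessments, "gen-gender", 42L)` on 3 records answered "MALE" and 5 answered "FEMALE" (assumed answer values) would return two groups of 3 records each. Balancing on a single factor is the simplest case; matching on several factors at once would need a composite key.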

2.4 Designing Experiments

In the medical domain, it is essential to design efficient experiments to understand human behaviour. Psychology, especially, deals with experimenting with humans to derive various outcomes of their behaviour and reflexes. Why do people react in a certain manner in certain situations? Why does someone feel suicidal? What are the various domains of abuse faced by individuals, and who falls victim to such abuse? How susceptible are people to changes around them? These are much sought-after issues of interest in this domain. GRiST follows a similar approach, and this application in particular is developed to investigate these aspects of the stored assessments. Hence, it is vital to design experiments intelligently. This section discusses experimental design in depth.

2.4.1 What is Design of Experiments (DOE)?

According to WebFinance Inc. [5], Design of Experiments is a: “Statistical technique used in quality control for planning, conducting, analyzing, and interpreting sets of experiments aimed at making sound decisions without incurring a too high cost or taking too much time.”

Also, The Free Dictionary (Medical Dictionary) by Farlex Inc. [6] explains experimental design in research as: “(in research) a study design used to test cause-and-effect relationships between variables. The classic experimental design specifies an experimental group and a control group. The independent variable is administered to the experimental group and not to the control group, and both groups are measured on the same dependent variable. Subsequent experimental designs have used more groups and more measurements over longer periods. True experiments must have control, randomization, and manipulation.”

Usually, an experimental design or design of experiments (abbr. DOE) is, according to Seltman [7], a “careful balancing of several features including “power”, generalizability, various forms of “validity”, practicality and cost. Often an improvement in one of these features has a detrimental effect on other features.” This means that greater accuracy achieved for one feature could come at the expense of others.

Experiments that analyse a process or a cause are usually conducted under controlled circumstances. They help identify the process inputs which have a major impact on the outputs, and the input levels which produce the desired results. It is, therefore, very important to design an experiment carefully to obtain accurate results with minimal scope for error. It is also advisable to draw up a clear design so that statistical tests can point out any errors with respect to biases at later stages.

2.4.2 Components of DOE

Any experimental design may broadly consist of the following components, which should be assessed before setting up experiments:

Set of hypotheses: An experimental hypothesis sets the basis for research into a particular scenario for which you conduct an experiment and subsequently observe meaningful results. It may be a general statement or a complex one, as specific as the study demands. If the hypotheses can be stated in very specific terms, the experiment often can be designed to provide critical and convincing tests that distinguish among them. A sample hypothesis in its simplest form for the project could be “There may be significant differences between males and females with respect to self-harm.” A more complex and specific hypothesis could be “A positive correlation may exist between gender and the number of suicide attempts in the past one year.”

Experimental tests: An experiment could comprise various tests which help the experimenter analyse data and formulate sensible conclusions. There are many types of tests and distributions designed for such purposes, such as the chi-squared test and distribution, t-tests and the t-distribution, F-statistics and the F-distribution, etc. These tests perform mathematical calculations on the sample data collected and yield p-values and confidence levels, which help the experimenter obtain meaningful results and explain them.

Analysis strategies: An important aspect of conducting an experiment is to analyse the sample being tested. It is advisable to strategise the analysis in the design phase. These may be questions like “what will be my sample populations?”, “what tests will I perform on them to obtain inferences?”, “how will I set them up?” etc.

2.4.3 DOE Process

The process of DOE can be summarised in the following steps:

Define Problem(s) or the topic of study: Some questions we can ask include: Why are we interested in conducting this study? What do we want to obtain from the experiment? Is there any similar evidence from previous studies that may help us? For example, what are experimenters interested in and why?

Determine Objectives: We determine objectives of the experiment where we look at the problems more closely, and determine the elements of interest. For example, is it the demographic criterion which influences a clinician’s judgement more? etc.

Brainstorm: We brainstorm the various possibilities and scenarios pertaining to the experiment and what it would evaluate. This may also be done by the group of researchers conducting the research. We may come up with various methods of conducting the experiments, such as surveys, visual analysis, etc., which may require external participation. In this project, the sample population will be derived from the database.

Design Experiment: In this stage, a hypothesis (null or alternative) is drafted, a design model is chosen to conduct the experiment, independent and dependent variables are assessed, the various tests to be performed on the inputs are constructed and the possible errors are evaluated.

Analyse Data: After the experiment is carried out as designed, the data is collected for analysis at this stage. Possible results are formulated. This project is centered on providing automation aid to researchers at this stage where they want to analyse the data.

Interpret Results: The results obtained are then articulated to form solid conclusions. In this project, the results may be in the form of a graphical visualisation supported by some statistical figures highlighting the population chosen for comparison for biases.

Verify Predicted Results: Based on the errors found in step 4, the results are verified again and the evidence is presented as conclusions. Any scope for uncertainty is resolved. This stage goes beyond the scope of the project; however, data is set up in a manner that can be read effectively into a statistical package for inferences.

Figure 4. DOE Process [8]

2.4.4 Advantages and Disadvantages of DOE

A major advantage of experimental design is that it can help establish causality (a cause-and-effect relationship) if conducted thoughtfully. For example, “a psychologist may conduct an experiment on finding the degree to which children watching violent TV shows tend to aggress at their classmates in the playground.” [9]. A good experiment allows the experimenter to manipulate the conditions sufficiently. Here, this may mean controlling the exposure of children to violent television to some extent. One way of doing this is to split the children into an “experimental group” [10] and a “control group” [11], where the former is exposed to violence and the latter is not. The groups can then be studied individually and compared for accurate results.

Secondly, with a good experimental design, an experimenter can randomly assign children to these groups to test various conditions of study and eliminate underlying factors. That is to say, if children exposed to violent television DID turn out to be more aggressive than those who were not, then one can conclude that the violent television was indeed the cause of the increased violence. Hence, random assignment of samples (here, participants) to the levels of the independent variable also serves as a major advantage.

An application-specific example could be one where the user is interested in knowing whether there is a bias between males and females regarding the number of suicides attempted. In this particular case, let us assume that our data has 1000 records. The data is filtered to form 2 groups, viz. “Males” and “Females”. Next, the user would sift through the suicide attempts for each category and observe that the number of males who have made 10 or more suicide attempts is 430 and that of females is 570. In this data, more females than males fall into that category. The end-user can then compare these groups to locate biases. This is one way of assigning data to groups by filtering. It may serve as an advantage with the right experimental design.

There are disadvantages, however. Firstly, it is difficult to conduct good experiments as they “require a lot of resources and human energy.” [12] Secondly, designing experiments well requires a lot of cleverness and experience. Another major disadvantage is that experiments may rely on relationships which are themselves biased.
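The random assignment of participants to an experimental group and a control group, described earlier in this section, can be sketched in Java as follows. This is an illustrative sketch only: the class `RandomAssignment`, the method `split` and the use of integer participant ids are assumptions made here, not part of the GRiST application. The sketch shuffles the participants with a seeded random number generator and cuts the shuffled list in half.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: class and method names are illustrative only.
final class RandomAssignment {

    /**
     * Randomly splits participants into two groups of (near) equal size.
     * Index 0 of the result is the experimental group, index 1 the control group.
     * When the number of participants is odd, the control group gets the extra one.
     */
    static List<List<Integer>> split(List<Integer> participantIds, long seed) {
        List<Integer> shuffled = new ArrayList<>(participantIds);
        // Shuffling a copy leaves the caller's list untouched; the seed makes
        // the assignment reproducible for later verification.
        Collections.shuffle(shuffled, new Random(seed));

        int half = shuffled.size() / 2;
        List<List<Integer>> groups = new ArrayList<>();
        groups.add(new ArrayList<>(shuffled.subList(0, half)));
        groups.add(new ArrayList<>(shuffled.subList(half, shuffled.size())));
        return groups;
    }
}
```

Because assignment depends only on the shuffle and not on any attribute of the participant, underlying factors should be spread across both groups by chance. Filtering records by gender, as in the application-specific example, is a different operation: it forms groups by an existing attribute rather than by random assignment.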

2.5 Hypothesis/Significance Testing

The important components in an experiment are: a hypothesis, research objectives, and analysis and interpretation of data. The objectives of the experiment may cover many things, such as the basis of study, relevant evidence from ‘similar’ experiments undertaken in the past, the factors affecting the hypothesis, etc. Factors can be of two types: dependent factors/variables and independent factors/variables. Dependent variables are those whose state changes in response to other variables of the experiment. These may be recognised upfront while designing the experiment or detected in the later stages as underlying factors. Independent variables are those whose state is not affected by other variables of the study. These must be defined while designing the experiment. In this project, there are several dependent variables, hence it is vital to gauge them and design the samples intelligently.

2.5.1 Need for Experimentation

Usually, design of experiments is driven by the need to experiment or research. Experiments are fundamentally conducted in a particular discipline to study a particular phenomenon and analyse the outcomes. They address questions such as “Is there a relationship between humans and apes (say)? If there is, why is it important to study it? How would it be beneficial to draw significant conclusions from it?” Without experiments, numerous issues would remain undiscovered or undefined from a scientific perspective.

The need for experimentation also arises when an interesting area of study (or an entity within it) “is believed” to have some correlation with another entity. Researchers may be interested in locating what relations exist and how the entities affect each other. This “belief” is usually “hypothesised” as a statement which is then tested for accuracy.

In terms of the project domain, experimenters might want to mine such an enormous database for various dependencies and clinician biases. Researchers might propose that the database is suggestive of gender biases or other such confounding variables. They may state a hypothesis which they want to test for significance. The aim of the project is to automate the process of helping experimenters set up accurate samples which can be analysed for significant differences. If any confounding variables are discovered by comparing samples, a process must exist to eliminate such factors and to mine the dataset further for any parallel underlying factors possibly affecting the dataset returned. In other words, the application developed will provide an interface for researchers to conduct an experiment on the GRiST database for any behaviour they want to test. Once the application has helped automate the creation of samples, the samples can be exported to a CSV file in the data format required for significance testing. The statistical packages compatible with the files generated through the application are discussed in the coming sections.
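As an illustration of the export step, the following is a minimal Python sketch of writing a sample to CSV with a header row, which the statistical packages discussed later rely on to identify columns. The column names and rows are hypothetical and are not taken from the GRiST database; the function name `export_sample` is likewise an assumption made for this example.

```python
import csv
import io


def export_sample(rows, header, out):
    """Write a sample to `out` as CSV, with the column headers first."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)   # header row so the package can name the columns
    writer.writerows(rows)    # one line per record in the sample


# Hypothetical sample: the column names are illustrative only.
header = ["patient_id", "gender", "risk_score"]
rows = [(1, "MALE", 4), (2, "FEMALE", 6)]

buffer = io.StringIO()        # stands in for a file opened with open(path, "w", newline="")
export_sample(rows, header, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the same function could be handed a real file object.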


2.5.2 What are the tests for Hypotheses?

Tests for hypotheses involve two statements: the null hypothesis (H₀) and the alternative hypothesis (Hₐ). The tests assess how well the data support one hypothesis relative to the other. It is never possible to infer that either hypothesis is completely true or false; the outcome is stated in terms of H₀, which is either rejected in favour of Hₐ or not rejected. The null hypothesis is rejected when the probability of observing the data under H₀ (the p-value) falls below a fixed significance level α; otherwise it is retained. Usually, the null hypothesis is a negation of the research hypothesis. For example, if our research hypothesis states that “the number of single males who are depressed is more than the number of single females”, the null hypothesis would state that there is no such difference. The idea is to show that the observed data would be very unlikely if the null hypothesis were true, which lends support to the research hypothesis. In the medical and public health domain, the accepted probability of concluding that a relationship between two variables exists when it actually does not is usually about 0.01 (a 99% confidence level).

In this project, a clear research hypothesis can be established for defining appropriate experiments. The user is interested in setting up well-matched subsamples for comparison. He/she may also be interested in investigating the factors which could imply biases in the samples. Hence, it is important to state hypotheses unambiguously, as this helps set up sensible samples. The user would also be interested in eliminating the nodes which signify biases from the samples before exporting them and assessing them through a statistical package. The samples should be set up such that the statistical package indicates as little bias as possible.

2.5.3 Statistical Packages

A proposed end point for the application is exporting the samples created to statistical packages for bias analysis. It could be argued that the application should not only create these samples but also analyse the differences between them, as this would be a viable extension to the rationale. However, there are statistical packages and software applications which have been specifically designed with the statistical algorithms and tests/distributions needed to identify differences in a dataset provided to them. Given this, it would not be of much use to re-design such an application (or even extend it), because the process is quite complex and justice may not be done to either of the parts, viz. forming sample populations and then testing them for significance.

As part of the background research, some powerful statistical packages were identified to help direct the final course of the application. How do these packages expect the data to be structured before it can be analysed? How are CSVs, in particular, imported into such packages? What data format do the packages require in order to read the data and test for differences? What types of tests are conducted on the CSV data for statistical significance? These questions were researched. The two main packages highlighted are SPSS Statistics and R.

a) SPSS: SPSS is a very powerful statistical tool from IBM for efficient analysis of samples and significance testing. The package requires the files to contain column headers, a point worth noting when exporting samples. These column headers help identify samples in the provided dataset. The statistics in the software which can be used for identifying biases are, in particular, bivariate statistics, which encompass means, t-tests, ANOVA, correlations, and parametric and nonparametric tests. Means are the averages calculated across the sample based on a particular condition. These conditions (or, more accurately, cases) can be set in SPSS by selecting Data => Select Cases (in the UI). The user can set the cases to be used and the ones to be ignored. Once the cases are selected, the average can be calculated in each case and, depending on the weights of each factor, results can be concluded.

For Student t-tests, three types can be performed: one-sample, independent samples and paired samples. In a one-sample t-test, we first find the mean of a factor, say gender = MALE. We then hypothesise what the factor should weigh, depending on the mean of the various values of the factor. Then we calculate the t-value and, based on this value, conclude how significant the data is, i.e. whether there are significant biases in the dataset. After setting a mean value/test value for the parameter, an alpha value (see footnote 1) for the hypothesis is set. Then, by selecting Analyze => Compare Means => One-Sample T Test in the UI, we can configure the factor (in this case MALE) with a test value (say 1, or any other value). Once these configurations are set, two sets of results are displayed: one evaluated with the test value (giving the t-value) and one calculated without the test value (generic information). The t-value depends on the number of observations picked from the male sample, while the critical value it is compared against depends on the α value used. Comparing the two helps the researchers determine whether to reject the null hypothesis. The t-value obtained is compared with a table of critical t-values and, if |t_calculated| > t_critical, the null hypothesis is rejected in favour of the research hypothesis; otherwise it cannot be rejected, as the data failed to show any statistically significant difference. This test would be more applicable to the application as an extension, as the researchers would be interested in supporting their hypothesis. Since SPSS already provides a powerful framework to carry out such analysis, the application developed in this project will centre on automating the process of sample creation and data structuring of the samples.

Footnote 1: In statistical terms, alpha (α) is the significance level set for a hypothesis test, i.e. the accepted probability of rejecting the null hypothesis when it is actually true. For example, for a 95% confidence level, α is set to 0.05. A result is statistically significant when its p-value is below α.

b) R: While researching t-tests for SPSS, it was inferred that this test would be the most appropriate for the experimenters to compare sample factors for biases. Another statistical package explored was The R Project. The process of calculating the t-value in R is slightly different. As R is a programming language for statistics, we need to write function calls such as plot(density(male)) for visualisation. The male and female groups are headers of the data, hence it is vital to have these in the export file. To compare the groups, we perform a t-test by writing a function call such as t.test(male, female, conf.level = 0.95) [13]. This sets the α value (here 0.05) and, in the end, generates a t-value based on the difference group1 - group2 and the standard deviations and means calculated. If the magnitude of the computed t-value is greater than the tabulated t-value (similar to SPSS), then we reject the null hypothesis. Hence, producing the right file structure is of extreme importance in helping the experimenters using this application to identify and eliminate these biases.

Having reviewed the concepts covered in the background, it is well established why such an application needs to be developed and how it can be used in the medical domain for data analysis.
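The t-statistics discussed in section 2.5.3 can be illustrated with a short Python sketch. It is not the procedure used internally by SPSS or R; it only shows the arithmetic behind a one-sample t-value and a two-sample (Welch) t-value. The scores below are hypothetical, and the critical value is the standard two-tailed table entry for α = 0.05 with 3 degrees of freedom.

```python
import math
import statistics


def one_sample_t(values, test_value):
    """Return (t, degrees of freedom) for a one-sample t-test."""
    n = len(values)
    standard_error = statistics.stdev(values) / math.sqrt(n)
    return (statistics.mean(values) - test_value) / standard_error, n - 1


def welch_t(group1, group2):
    """Return (t, degrees of freedom) for Welch's two-sample t-test."""
    n1, n2 = len(group1), len(group2)
    v1 = statistics.variance(group1) / n1   # squared standard error, group 1
    v2 = statistics.variance(group2) / n2   # squared standard error, group 2
    t = (statistics.mean(group1) - statistics.mean(group2)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical scores, not taken from the GRiST database.
male = [5, 7, 9]
female = [2, 4, 6]

t_one, df_one = one_sample_t([2, 4, 4, 6], test_value=3)
t_two, df_two = welch_t(male, female)

# Two-tailed critical value for alpha = 0.05 and 3 degrees of freedom,
# taken from a standard t table.
T_CRITICAL_DF3 = 3.182
reject_null = abs(t_one) > T_CRITICAL_DF3
print(t_one, df_one, t_two, df_two, reject_null)
```

In the one-sample case the computed t-value is smaller than the critical value, so the null hypothesis would not be rejected for these illustrative numbers.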


Chapter 3: REQUIREMENTS ANALYSIS

This chapter deals, in detail, with the requirements elicitation process carried out with the application's end-user, Dr. Christopher Buckingham. The key areas of interest were eliciting the core requirements of the application, the key functional requirements and the non-functional requirements. The concept of the “Mind Map”, which adopts the hierarchical anatomy of Unified Modelling Language (UML) principles, is also highlighted.

3.1 User Requirements

Since the application is part of a large project called GRiST, its requirements align themselves with those of the parent. User requirements, in very simple terms, are the “requirements of the end-user” or “what the users want their system to do”. In more technical terms, the phrase means “providing our users the right to structure the functionality for the system”. According to Process Impact’s “Glossary of Requirements Engineering Terms” [14], user requirements can be defined as “User goals or tasks that users must be able to perform with a system, or statements of the user’s expectations of system quality.”

For technical engineers, formulating user requirements is usually problematic for several reasons. Firstly, user requirements change frequently as technology updates itself and users’ needs change. Secondly, users may sometimes be ambiguous (or not technical enough) when communicating these requirements to engineers. They may be unaware of the computing environment and the capabilities of the system, and so unable to propose optimum requirements. Thirdly, since the system boundaries are almost always ill-defined and vague, they tend to confuse the users rather than elucidate the objectives. To overcome these problems, Sommerville and Sawyer [15], in their book “Requirements Engineering: A Good Practice Guide”, discuss some “Guidelines” for eliciting clear user requirements. These are enumerated in Table 1.

Table 1. Requirements Elicitation Guidelines proposed and how they were followed

Principle 1. “Assess System Feasibility”: Sommerville and Sawyer argue [16] that a business case must be established for the requirements to be consistent with the system objectives. There is a need to conceive a reason for developing the system and how it could benefit [...] its environment. Questions like “Do we really need this system?”, “What would the consequences be if we did not develop this system?”, “[...] technology limitations which we face?” etc. need to be answered.
How this principle was implemented: This principle was clearly established when the rationale for the project was contrived. The requirement for such an application to be developed was made clear and, although business cases weren’t developed to investigate the feasibility, it was evident that with such an [...] enormous database, analysis [...] it would be vital for its growth and easier processing in future.

Principle 2. “Identify System Stakeholders”: Sommerville and Sawyer also suggest [17] that a “view-point oriented approach” be adopted and further urge designers to “collect requirements from multiple viewpoints”. By doing so, others may feel more involved in developing the system and can offer clearer objectives. It is, hence, vital to establish end-users at this stage.
How this principle was implemented: This principle was articulated at this stage with regular discussions carried out with the client, i.e. Dr. Christopher Buckingham. Detailed inspection of what was required from this application was achieved. Since the application has only one end-user, multiple viewpoints were not necessary.

Principle 3. “Look for domain constraints”: Sommerville and Sawyer [18] propose that any domain constraints which could influence our system should be acknowledged at this stage. These constraints need to be looked into to conform [...] policies. The process must not be a “one-off” and should be guided by domain experts.
How this principle was implemented: As the project was supported by a well-formed framework which already complied with the rules in the medical domain, the requirements were also driven by these constraints.

Principle 4. “Use scenarios to Elicit Requirements”: Sommerville and Sawyer [19] further suggest that the use of examples could aid in improving the designers’ understanding of [...] as [...] demonstrating them to our end-users. These scenarios help expose additional system services [...] overlooked. In practice, they also suggest that [...] flowcharts and sequence diagrams which assist in identifying the flow-control of the system can be helpful.
How this principle was implemented: While [...] application, [...] for cases [...] this of [...] assessments) were cited as core guidelines to supplement design at later stages. These scenarios were devised by the end-user to consolidate the application requirements in a more efficient manner.

3.2 Functional Requirements

Functional requirements list the functionality of the system and/or its components. What does the end-user want the system to do? What does the end-user want it to look like? What communication is essential for the user to perform his/her task efficiently? These questions are answered by outlining the core system functionalities using diagrams such as UML diagrams and mind maps. Functional requirements are often expressed as “system should do <requirement>”. Sometimes, functional requirements (in agile processes) are also elicited through user stories. Steffan Surdek, user experience lead in IBM [20], proposed that the stories take the form:

“As a <role> I want to <goal> to achieve <business value>”

Broadly customising this idea to our application, a user story would be formed as follows:

“As an END-USER I want to be able to set up subsamples so that I can compare them for biases”


The key difference between user requirements (also known as business requirements) and functional requirements is the level of detail involved in preparing them. The diagram below highlights this hierarchy.

Figure 4. User Requirements vs. Functional Requirements in the increasing order of details [21]

Functional requirements are typically user requirements decomposed to their simplest form. With respect to the application, a high-level user requirement would be “I want to be able to setup sample populations to observe underlying biases”, while a functional requirement would list the steps of how this is done: “Navigate knowledge tree to select nodes”, “Process the nodes selected”, “Generate test results for the nodes selected”, “Analyse samples”, “Eliminate underlying biases”. There are, in practice, more complex sub-steps involved in performing each of the tasks.

3.2.1 Mind Maps

According to [22], a mind map is a: “Graphical technique for visualizing connections between several ideas or pieces of information. Each idea or fact is written down and then linked by lines or curves to its major or minor (or following or previous) idea or fact, thus creating a web of relationships. Developed by the UK researcher Tony Buzan in his 1972 book 'Use Your Head,' mind mapping is used in note taking, brainstorming, problem solving, and project planning. Like other mapping techniques its purpose is to focus attention, and to capture and frame knowledge to facilitate sharing of ideas and concepts.”


It is, hence, a well-defined data structure for sketching requirements in a very effortless manner. Mind maps can be used to do the following:

- outline/design the application framework
- illustrate relationships and system interactions
- create an “information web” which can be easily read

It could well be asked why this approach was undertaken over the classic use-case and other significant UML diagrams. Figure 5 provides an illustration of the similarities between mind maps and conventional use cases:

Figure 5. Comparison between mind map [23] and use case [24] devised for an ATM system to highlight similarities in their fundamental structures

Hence, the reasons why this approach for eliciting requirements was chosen over designing UML diagrams were:

- First and foremost, the end-user was comfortable and experienced with mind maps. The client had represented ideas on mind maps in the past and felt most confident about assembling requirements articulately using this technique.
- Secondly, mind maps follow the same hierarchical structure as class diagrams and other UML diagrams. Hence they can be associated with the UML diagrams. Also, the notations can be interchanged to produce either of the schemas.


Some advantages of using mind maps are:

- The nodes on a mind map highlight a LINEAR SEQUENCE or FLOW OF EVENTS that each requirement would follow to be achieved in its entirety.
- In the early stages of requirements elicitation and design, mind maps are powerful tools to structure and re-structure the functionality of the system.

Some disadvantages of using mind maps can be:

- For larger applications, they can become unreadable and confusing.
- There are other approaches, such as agile techniques which adopt “user stories”, which take precedence in some projects.
- Sometimes, when done incorrectly, mind maps may not bring out as much detail in the system functionality as they are intended to.

3.2.2 FreeMind: Software Tool for developing powerful Mind Maps

The software used to create the mind map was FreeMind [25], [26]. FreeMind is open-source mind mapping software used to develop robust mind maps. It has gained popularity in recent years following development work which has improved its productivity. It offers the acclaimed [27] “one-click "fold / unfold" and "follow link" operations”. The tool was recommended by the client and was used to develop the requirements for the application. The following figure [28] illustrates a mind map formulated using the software.

Figure 6. Mind Map of “Computer Knowledge” depicted in FreeMind

3.2.3 Requirements Analysis

The mind map highlighted the core functionalities of the system, as already discussed in the previous sections. These include functions to “create selection criteria”, “create selection criteria for subsamples”, “process selection criteria”, “experimental design functionality”, and “generate samples”. The mind map developed for the application was modelled on an example [29] for a clinical system, found on the Internet. This was used as a guide only.

Figure 7. Mind map for a generic clinical system





i) Create selection criteria

Purpose: This functionality helps the end-user to create criteria for selecting the sample population. This functionality is essential to create the first level of samples for analysis.

Create selection criteria for parent population

Description:
a. Navigate [...] Node: Navigating the knowledge tree [...] and select any node which the [...] needs to be included in this criteria. We can open all nodes or close all nodes.
b. Select [...] tool [...] the [...] select [...] nodes from the knowledge structure that they traverse. At a concept node, the relative influence (RI) i.e. weight of each node is displayed along with the questions for each node. There are two types of questions [...] layer [...] are related to other concepts) and filter questions (which filter out the questions related to the nodes based on [...] answers with a YES or a NO). At the leaf node, the user can either select the node or see the question related to the node, as an answer is mandatory for such nodes. At the data node, the user can view and select much more specific data for each node. The value-mgs, questions and RIs are displayed at this point.
c. Select data associated with criteria: the user can select specific [...] value ranges for datum nodes. [...] selecting the relationship [...] THAN”, “LESS THAN EQUAL TO”, “NULL”, “DK”, etc. which will become an operator in the SQL query and an answer value such as “0-10”, “YES”, “NO” and other specific ranges.

Justification:
a. Navigating knowledge tree is an important requirement as [...] conditions based on which parent populations are derived. These can be accessed through the knowledge structure.
b. This requirement is useful in informing the user about the various types of nodes in the knowledge structure as it is very expansive. It is also essential [...] understands [...] between these nodes as some of them correspond to root-level [...] contain values in the database while the higher level layers need not essentially contain information. Hence, the user knows exactly where to derive the values from.
c. The user needs to be able to specify the value he/she wants to retrieve from the database and [...] varying [...] this [...] on [...] relationship operators, they need to be specified as well.

Create selection criteria for subsample

Description: The user can select the nodes in a similar way as that done for the parent population. The only difference is that the variables used can only be those that were in the parent population query i.e. query the parent population. By specifying all these values, the user can again query the database and build a dynamic query to organise subsamples, derived from the parent population.

Justification: This requirement would help the user filter the parent population down even further, helping them create more judicious [...] samples will primarily be used for comparison for biases.

ii) Process selection criteria

Purpose: [...] the [...] this functionality helps the user process them coherently and articulate them to retrieve data from the data store. The user can [...] queries.

Create query

Description: [...] generated dynamically on the server host which can be visualized by the user. He/she can view the criteria selected i.e. the nodes selected and apply operators such as ‘AND’ and ‘OR’ [...]

Justification: This requirement helps the user specify what he/she wants from the datastore. Is it an amalgamation (AND operator) of the nodes selected? Or is it a distinction (OR operator) between nodes selected? The user may want them both at the same time, hence this requirement is important.

Generate samples

Description: The query that is formed in the processing phase is executed to retrieve data from the database. The results (in terms of the rows retrieved) are displayed to the user, after which he/she can check for biases. After the results are displayed, the user can:
a) Compare [...] sample [...] samples for biases as it is the [...]
b) Create criteria to remove biases: to eliminate biases, the user can reset subsamples for comparison. These can be chosen from either the parent population or the subsample population. Hence, samples and biases can be “equalised”.
Once the user has observed and [...] biases, he/she can save the samples, which will be the end point of the application. The data exported will be provided in the format required by any statistical package for import and the significances can be observed in them.

Justification: This functionality is extremely important as it sets the end point of the application i.e. the automation [...] queries must be executable so that [...] retrieved by the application for differences. Following that, it is necessary that the user is able to observe these differences in terms of the rows returned. [...] automation process. By the end of these activities, the data exported would be in the format that any statistical package requires it to be, to be directly imported for statistical analysis. As discussed [...] Chapter 1, the [...] analysis is not required to be performed by the application as there are packages which have [...] complex algorithms to analyse samples [...] differences.

Experimental design

Description: To test hypotheses, the user can write down a research hypothesis which can direct him/her to select nodes and values associated with the nodes to set up sagacious samples. The user can test this hypothesis for biases. The results are then reported to the user in terms of graphs such as pie charts, bar and line graphs so that the distinction of biases is evident. Based on these graphs, the populations can be refined for bias elimination. It is also essential to support these graphs with some statistics such as mean, median, standard deviation, quartile ranges, etc. so that the plots are cohesive with the population retrieved.

Justification: This functionality is important because it helps the user analyse the samples for biases, which is the rationale for building the application. This is [...] numerical figures (such as the number of rows returned for a sample population) would be of little consequence in terms of [...] quantity i.e. bias. Hence, it is vital to provide the user with some [...] would help them visualise (graphically) the differences between [...] biases.
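The “Create query” and “Generate samples” requirements above can be illustrated with a minimal Python sketch of building a parameterised query from selected criteria joined by AND or OR. The table name `assessments`, its columns and the rows are hypothetical and do not reflect the GRiST schema; an in-memory SQLite database stands in for the real data store, and `build_query` is a name invented for this example.

```python
import sqlite3

ALLOWED_OPERATORS = {"=", "<", "<=", ">", ">="}
ALLOWED_CONNECTIVES = {"AND", "OR"}


def build_query(table, criteria, connective="AND"):
    """Build a parameterised SELECT from (column, operator, value) criteria."""
    if connective not in ALLOWED_CONNECTIVES or not table.isidentifier():
        raise ValueError("unsupported connective or table name")
    clauses, params = [], []
    for column, operator, value in criteria:
        # Identifiers and operators are validated; values are bound as parameters.
        if operator not in ALLOWED_OPERATORS or not column.isidentifier():
            raise ValueError("unsupported column or operator")
        clauses.append(f"{column} {operator} ?")
        params.append(value)
    sql = f"SELECT * FROM {table} WHERE " + f" {connective} ".join(clauses)
    return sql, params


# Hypothetical data standing in for the real database.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE assessments (patient_id INTEGER, gender TEXT, age INTEGER)")
connection.executemany(
    "INSERT INTO assessments VALUES (?, ?, ?)",
    [(1, "MALE", 25), (2, "MALE", 40), (3, "FEMALE", 35), (4, "MALE", 31)],
)

criteria = [("gender", "=", "MALE"), ("age", ">=", 30)]
and_sql, and_params = build_query("assessments", criteria, "AND")
or_sql, or_params = build_query("assessments", criteria, "OR")

and_rows = connection.execute(and_sql, and_params).fetchall()
or_rows = connection.execute(or_sql, or_params).fetchall()
print(and_sql, len(and_rows), len(or_rows))
```

Binding the answer values as parameters, rather than concatenating them into the SQL string, is a design choice of this sketch to keep user-selected values from altering the query structure.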

Thus, the mind map clearly and concisely states the design components which were to be implemented in the later stages. An important task was to realise the design of the user interface with these requirements and, subsequently, to integrate the provided parent design structure with the developing application. These issues are addressed in the next section.
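The supporting statistics named in the experimental design requirement (mean, median, standard deviation and quartile ranges) can be sketched with Python's standard library as follows. The values are hypothetical and `describe` is a helper name assumed for this illustration; it is not part of the application.

```python
import statistics


def describe(values):
    """Return the summary statistics that would accompany a sample plot."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,                     # interquartile range
    }


# Hypothetical scores for one sample population.
summary = describe([1, 2, 3, 4, 5, 6, 7])
print(summary)
```

Note that `statistics.quantiles` requires Python 3.8 or later and uses the "exclusive" method by default; other packages may compute quartiles slightly differently.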

3.3 Non-Functional Requirements

According to Lessons From History [30], non-functional requirements are defined as:

“A non-functional requirement is a statement of how a system must behave, it is a constraint upon the systems behavior.”

The key difference between functional and non-functional requirements is what they define about the system. While the former define the functionality of the system and of its specific components, the latter focus on the operation, cohesiveness and behaviour of the system. Hence, attributes like system response time, scalability, availability, interoperability, performance, robustness, safety, testability, software compatibility, etc. comprise the system’s non-functional requirements.

The application developed as part of this project highlights software compatibility. Software compatibility typically carries the connotation of “compatibility between the various software tools used to develop the application”. However, in this case, software compatibility is at a lower level, where integration of code is highlighted. The previous chapter emphasised that the application being developed is part of a vast project and that a decision was made to extend the application to support this parent project. Hence, certain components which were already part of the project were utilised to drive the application towards a more efficient end-product.

Adapting the application to the code provided as a base involved a good deal of effort, both in the intellectual energy expended and in the time invested in reading and understanding the framework supplied. The foremost challenge was to understand the structure provided, which was very generic. This was tackled by studying it for a few days, understanding the higher and lower level functionalities and, once the data structure was better understood, customising it further. The second challenge was to comprehend the architecture used to develop the pre-existing framework, as it was vast and extensive.

When the architecture was clearly understood, the third challenge arose. This involved integrating the application design and architecture cohesively with the framework provided. It was essential to align the application design decisions with the data skeleton provided beforehand (which involved coupling the user interface with data processing). There were some flaws in the code provided, as it became quite arduous to read and understand beyond a point. Feedback received from some colleagues on the issue was similar in nature, i.e. poor cohesion and coupling. It was also poorly commented and little documentation was provided. However, these issues were rectified with extensive comments appended along with some augmentations to improve the “code”.

Chapter 4: DESIGN

This chapter details the core design decisions taken for the application developed. It covers the various tools utilised to accomplish them, the user interface design aligned with the fundamental human-computer interface principles, and the design of the database from the data provided. Also covered is the non-functional requirements design, which essentially involves adapting the interface design to the pre-existing framework provided. Some software engineering skills applied are also highlighted.

4.1 The Parent Interface

In this section, the prime focus is to align the requirements with the user interface. The generic tool provided beforehand assumed the structure highlighted in figure 8. The design provided already had some base code written and, as per the functional requirements, this code was extended. Various classes were added to the classes provided and these were modelled using some basic class diagrams.

Figure 8. The parent framework i.e. “shell” provided

The parent interface also provided the functionalities (along with code) of reading nodes and recording them, as indicated in figures 9 and 10. This tool was provided as open source software, to be used and adapted for the developing application; only the navigation of the knowledge tree was used.

Figure 9. Reading (selecting) a node from the knowledge tree

Figure 10. Recording a node from the knowledge tree

As is evident, the user interface to be developed needed to be aligned with the structure provided. There was also a design compatibility issue. It was prudent to realise the design in such a manner as to maintain consistency of the user experience and interface.

4.2 Human Computer Interface Principles- Part I

The primary understanding of human-computer interaction can be summarised with the definition presented by Hewett, et al. [31] in 1992:

“human-computer interaction (HCI) is the discipline concerned with the design, evaluation, and implementation of interactive computing systems for human use and with the study of major phenomena surrounding them”

The fundamental aspect that drives HCI is the usability of the user interface. This aspect must be carefully considered when designing applications. Questions arise such as “Is the interface user-friendly?”, “Are all the components organised in the interface simple and holistic in themselves?” and “Are the components named distinctly, so as to guide the users through the processes to achieve their goals?” These issues are addressed with a thoughtful user interface.

Usability is a measure of how easily the user can interact with the interface. The 5 components [32] of usability are learnability (how easy is it for the users to get familiar with the system?), efficiency (how quickly do users achieve their goals?), memorability (how easily can users re-establish proficiency after a period of not using the system?), errors (how many errors do users commit and how easily do they recover from them?) and satisfaction (how satisfied are users with the system?). These characteristics of usability were incorporated into the design in the approaches listed in Table 2.

Table 2. Discussing the HCI usability components and how they were accomplished

Component: “Learnability” [33]
Design integration: This feature was implemented with a simple interface with minimal components. Hence, the first time the user uses the application, he/she is familiar with the components presented in the frame. Tool tips could also help enhance their learning experience.

Component: “Efficiency” [34]
Design integration: This [...] maintaining a consistent user-interface “feel” throughout the application. It also required keeping the new design homogeneous with the tool provided.

Component: “Memorability” [35]
Design integration: This feature is also conformed to by achieving efficiency and uniformity in the user interface.

Component: “Errors” [36]
Design integration: This attribute was evaluated through some feedback received in the implementation and testing stages, and the number of errors was significantly low.

Component: “Satisfaction” [37]
Design integration: As done with errors, this detail was also covered in later stages and amounted to a significantly high value.

4.3 The Waterfall Model The waterfall model has been one of the first software engineering models developed in the industry. Although, for several years, it has followed a linear approach, recent developments in the model have also introduced “Iterative Waterfall Model” which permits revisiting previous phases in the model. The documentation for such a model is usually well-defined for each stage. Margaret Rouse defines [38] waterfall model as: “a popular version of the systems development life cycle model for software engineering. Often considered the classic approach to the systems development life cycle, the waterfall model describes a development method that is linear and sequential. Waterfall development has distinct goals for each phase of development.” The model is not used as frequently due to its disciplined approach except in smaller projects where the requirements and design are well-defined. Figure 11 shows the evolution of this primitive model over the years.

Figure 11. Development of the Waterfall model [39] into Iterative Waterfall Model [40]

It could be asked why this model was adopted to develop the project. There are two core reasons for the decision. Firstly, as seen in the requirements analysis phase, the requirements elicited were clear and well-defined. They were thoroughly discussed with the client and the design was also approved. A draft of the interface design was shown to the client, who was in favour of it. Since the goals for the project were established beforehand, an easier method to realise them was to use the waterfall model, as it provides a linear framework in which to carry out events. Any changes that become necessary can be accommodated by revisiting earlier phases, as the more recent variants of the model allow. Secondly, the project was already supplied with a pre-defined framework. The framework contributed significantly to the application being developed, so the design and UI were already guided by a parent source. It was only reasonable to employ a model which could complement and utilise this design in easing the implementation. The waterfall model suited this process of development well. The non-functional design (explained in the later sections) also had to be considered to support the development. This included designing and architecting the extension to the application so as to maintain the user experience throughout. Integrating homogeneity was crucial to adapting the application to the supplied data framework. There are, however, some disadvantages of the waterfall model. These can be listed as:

• It is usually problematic to articulate user requirements in the requirements phase, and generally some changes have to be incorporated later. This backward integration, though introduced into the model over the years, is very difficult to realise in practice.

• It is costly to revisit earlier phases in case of errors and inconsistencies.

However, some types of projects which are suited for this model include: 

Projects involving database interactivity, for example, GRiST and other commercial software applications.

“In development of E-commerce website or portal.” [41]

“In Development of network protocol software.” [42]

4.4 Human Computer Interface Principles- Part II
Now that a model has been established for developing the application, this section explores the ways in which the model aligns itself with core HCI principles. Although, according to HCI principles, the model is a non-User Centered design², the reasons for using such a model have been recognised above. Some user design principles that were implemented in the initial stages of development can be aligned with 4 out of 5 of the HCI principles [44] which focus on users and tasks. Table 3 enumerates them.

Principles | Design Integration
“users’ tasks and goals are driving force behind development” [45] | This principle drives the development model. The client was consulted regularly for inputs into defining requirements. Hence, user goals were of paramount importance.
“users’ behaviour and context of use are studied and the system is designed to support them” [46] | Since a parent framework already supported the developing application, it seemed to align itself with the user priorities. Also, since the design of the framework was maintained alongside the developing application, an understanding […] established well before designing the interface.
“users’ characteristics are captured and designed for” [47] | As mentioned in the second principle, the design was homogeneous across the application and this homogeneity was established after recognising user limitations and priorities.
“all design decisions are taken within the context of users, their work, and their environment” [48] | This principle was implemented through regular client meetings for every decision made and approved.

Table 3. Various UCD principles and their adherence

² In HCI terms, User Centered Design (abbr. UCD) is defined as “design framework that enables interaction designers to build more usable systems” [43]

4.5 Class Diagrams
Class diagrams in UML are diagrams with universal notations which help visually depict the classes designed and the interactions between them. More technically, a class diagram is defined as “an illustration of the relationships and source code dependencies among classes in the Unified Modeling Language (UML)” [49]. Class diagrams are extremely important when representing class-object communications. The notation used here is a rectangle with two partitions: one for the variables in a class and a second for the class methods. Figure 12 represents the classic notation for a class diagram. The tool used to develop the class diagram for this project was [50], an online tool which helps render class diagrams with in-built components.

Figure 12. Class Notation in

A few other components which were largely used to depict interactions were Interfaces, Entities and packages (figure 13).

Figure 13. Interface and Package notations in formed other core components

The class diagram was produced keeping in mind the various activities that could occur and the interactions between these activities. Some classes needed to be designed as interfaces, while others would need to be integrated with GUI components. These decisions were made carefully so that they would also fit the parent code that was provided.



Analysing the class diagram: There were six core packages established which would contain classes that coherently build up the application. These were:
1. “gristDatabaseUtilities” package: The core classes which manage the database interactions are present in this package. These classes comply with the “generating samples” requirement drawn up in the previous phase. There would be 3 classes: one to set up the parent population, a second to filter the parent population based on the user selections or conditions, and a third to export the subsamples after the graphical evaluations. Each class would have methods to build up these interactions.
o ParentPopulation: This class would be continued from the parent code given. The parent code assimilates the user conditions for setting up a SELECT query to access the database, so a list structure would be used to accumulate them. The other requirements would be a query, built as a string, and a count of the number of rows retrieved. This is the basic interaction paradigm used throughout. The methods designed were one to build the string that creates a table, and getters for the query and the rows. It is worth noting that the diagram also indicates the table used by the class for communication and querying, which is the table (investigated in the next section) that stores all the assessments i.e. grist_parent_population.
o SubsamplesPopulation: This class is similar to its parent i.e. ParentPopulation. The list being called is UserSplitSelection. It has methods to assimilate the list elements into a query and to retrieve the rows from the table as well. The only difference is the table being queried. At this stage, the parent_population table is queried to create a subsample table. In general, a new filtered table is formed at each step, holding the records that match the user’s selection.
o ExportSubsamples: This class is designed to export the subsamples created, after some graphical visualisation and once the user is satisfied that the samples are devoid of biases. The user would provide a filename for the samples being exported as a CSV file, in addition to the query being formed and the rows retrieved. This is the end point of the application. The table being exported is subsamples_population.


2. Interfaces package: To comply with the design adopted by the parent code, some core interfaces would need to be created. These are required to read and record the data and pass it on to the database classes as lists which encapsulate the user conditions. This package contains classes which support the “sample creation” requirement analysed in the previous phase. This package could include:
o UserSplit: This interface would create an object which records the node codes, the node relationships and the answers selected by the user. It would also contain some other methods required by other classes.
o UserSelectedSplit: This interface could hold a list of UserSplit conditions in a data structure, to be collated with the database queries to retrieve records. This is also done to complement the parent code.
o SubmitButtonListener: A button click event would have to be described to finally gather the list of user conditions and pass it to the back-end database queries.
3. “NodeChooserInternal” package: This package would need to trigger some communication to help the user add and remove conditions. Conditions needed to be added by navigating the knowledge tree (extending the “create samples” requirement). This was provided by the parent code and hence adapted to create the following classes:
o CatSplitNode: This class would be designed as a dialog box which is presented when the user wants to view and navigate the knowledge tree. The user would pick a node and the corresponding node needs to be recorded. Hence, a class would be designed to perform these events.
o RemoveConditionListener: An interface which can be invoked and overridden when a condition is removed. This was created because the condition removed would not merely be a single object (although this is the parameter that would be passed) but a collection of objects, or a list, which would need to be deleted as a whole. Hence, providing an interface would let a class define its own set of objects to be removed.
4. “UserSplitUtilities” package: Conforming to the parent design, it would be essential to integrate the user interface with classes that implement (or override and define) the methods and take user inputs. These classes would comply with the “process samples” requirement elicited in the previous phase. This package would be designed to interact with the user through the following classes:

o SplitCondition: This class would be similar to the code of the GUI provided by the parent classes. It would contain methods to collect the nodes selected by the user, their codes, the relationship selected and the corresponding answer values, which can then be captured in a data structure such as a list to query the database. If the user intends to remove a condition, this would be handled through the RemoveConditionListener interface (which the class would hence be required to implement) to remove the particular node and its attributes as set by the user. This class may be designed to register just one instance of a user condition.
o SplitChooser: A main class which could use the SplitCondition components would be essential at this stage, as the user may intend that multiple conditions be stored for querying the database. Hence, a new list of conditions (integrated with the user interface) would be vital to collect all user conditions. This class would also need to implement the remove-condition action listener.
5. “graphUtilities” package: This package would be a collection of graph libraries and the classes which use them, conforming to the “experimental design” functionality in the mind map. It is intended to contain all the classes which interact with the database to produce graphs. Two types of graphs would be presented: line graphs and pie charts. The classes could include:
o LineGraph: This class could formulate a query to retrieve a specific number of rows to be plotted on the graph. It would also need a question type, to detect ordinal questions, as distinct graphs would need to be produced for different types of question nodes. As an illustration, assume that the nodes chosen by the end user suit a comparison of suicide attempts between males and females. First, the user would query the database to retrieve results for patients who have higher suicide rates. Next, the user would query the new table for the number of males and females and would aim to visualise the records returned. In this case, a pie chart would suffice, as only two attributes (which are nominal) are compared against one another. If an age dimension (which is an ordinal value) were added to the number of males and females being queried, then a line graph would be more apt to depict the results (age range on the x-axis and gender on the y-axis). After much thought, it was concluded that the question type would best help the application recognise the attributes chosen for comparison. Hence, the graph classes would specifically aim to capture such information.

o PieChart: Similar to LineGraph, a class would be required to detect the nominal values, along with a query being formed to retrieve values from the database. Both classes would be accessible in the Split class.
6. “Main” package: This package would encapsulate the main classes to run the application. It would primarily define the interaction between the GUI, the database and the visualisation. The main class would communicate with the parent_population table, and the Split class would communicate with the subsamples table and the various chart classes designed (as they are derived from the subsamples).
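The rule described for the graph classes (nominal attributes only: pie chart; any ordinal attribute: line graph) can be sketched as below. This is an illustrative sketch only, not the project's code: the names ChartChooserSketch, QuestionType, ChartKind and chooseChart are hypothetical and do not appear in the class diagram.

```java
import java.util.Arrays;
import java.util.List;

class ChartChooserSketch {
    /** Hypothetical question types; the text only distinguishes nominal from ordinal nodes. */
    enum QuestionType { NOMINAL, ORDINAL }

    /** Hypothetical chart kinds mirroring the PieChart and LineGraph classes. */
    enum ChartKind { PIE_CHART, LINE_GRAPH }

    /** Returns LINE_GRAPH as soon as one compared attribute is ordinal, PIE_CHART otherwise. */
    static ChartKind chooseChart(List<QuestionType> attributeTypes) {
        for (QuestionType type : attributeTypes) {
            if (type == QuestionType.ORDINAL) {
                return ChartKind.LINE_GRAPH;
            }
        }
        return ChartKind.PIE_CHART;
    }

    public static void main(String[] args) {
        // Gender only (nominal): a pie chart suffices.
        System.out.println(chooseChart(Arrays.asList(QuestionType.NOMINAL)));
        // Gender plus an age range (ordinal): a line graph is more apt.
        System.out.println(chooseChart(Arrays.asList(QuestionType.NOMINAL, QuestionType.ORDINAL)));
    }
}
```

Keeping the decision in one small method means the GUI only has to pass on the question types of the selected nodes.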

4.6 Database Design- Part I
As discussed in Chapter 2: Context & Research, the GRiST database is a plain file of recorded assessments. It forms the core of the application, as it is investigated for biases and used to eliminate any biases which are observed. The database was provided by the client as a simple flat text file. Figure 14 illustrates the flat file.

a) The node codes in the knowledge tree. Each node is assigned a code against which the assessment for that particular node is recorded.

b) The values recorded for each node (i.e. assessments). These include a variety of data types ranging from integers to doubles and Strings.

Figure 14. Components in the flat file with appropriate labels

As evident from the figure above, the nodes are assigned node-codes. The hierarchy is illustrated in figure 15 below.

Node in the knowledge tree <associated with> Node code <associated with> Assessment Record (integer, double, String or NULL)

Figure 15. GRiST Database hierarchy

Hence, for each node in the knowledge tree, a code exists which is stored in the database for assessments. Before any actions could be taken, however, it was necessary to turn this flat file into a schematic database design so that users could query it. This was done using the popular MySQL platform, which ensured that the database could be queried easily and that users would have some familiarity with the database query language, SQL. The table was designed such that all the columns corresponded to the node-codes (Figure 14, (a)). There were 158 node codes listed in the flat file and these were incorporated into the database structure. The names of the codes were altered slightly in order to conform to the MySQL naming system. The hyphens (“-”) in code names made up of more than one word, for example suic-past-att, were changed to underscores (“_”), i.e. suic_past_att. This modification helped create a cohesive table with all assessments stored in it. A great deal of time was invested in designing this table in MySQL. With the table schema established, the table needed to be populated. Two approaches were tested for the import process: one where the flat file (a space separated variable file, henceforth “ssv file”) was loaded directly into the database, and a second where the ssv file was first converted to a comma separated variable file (henceforth “csv file”) and then loaded into the database. The second methodology worked better for the application, as the first failed to import the ssv file because a line break or a space could not be established between some entries. The second methodology was decomposed into two steps. Step 1] Converting the ssv file into a csv file: This was done by loading the text file into Microsoft Excel and following its custom conversion steps. Once the text file was converted, it was saved to the desired location. Step 2] Importing action: The new csv file was then imported into the database. A query and program were written in Java to realise this. This query populated the table designed for the assessments. The program and query written to execute the import action are explored in the next chapter.
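The renaming of node codes into MySQL column names described above can be sketched in a few lines of Java. This is a hedged illustration, not the author's code: the class and method names are hypothetical, and giving every column the type VARCHAR(255) is an assumption made only to keep the example short (the real table stores integers, doubles and Strings).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NodeCodeColumns {
    /** Converts a node code such as suic-past-att into a column name such as suic_past_att. */
    static String toColumnName(String nodeCode) {
        return nodeCode.trim().replace('-', '_');
    }

    /** Builds a CREATE TABLE statement with one column per node code (column type is assumed). */
    static String createTableSql(String tableName, List<String> nodeCodes) {
        List<String> columns = new ArrayList<>();
        for (String code : nodeCodes) {
            columns.add(toColumnName(code) + " VARCHAR(255)");
        }
        return "CREATE TABLE " + tableName + " (" + String.join(", ", columns) + ")";
    }

    public static void main(String[] args) {
        System.out.println(toColumnName("suic-past-att"));
        System.out.println(createTableSql("grist_parent_population",
                Arrays.asList("suic-past-att", "gen-gender")));
    }
}
```

With 158 node codes, generating the statement from the header row of the flat file in this way would avoid typing each column by hand.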

4.7 Designing the non-functional requirements
The non-functional requirements, as already discussed in the previous chapter, centred on maintaining homogeneity across the application through thoughtful integration of the design provided by the parent framework with the design to be developed in the application. On closer examination, this raised a number of questions, chiefly “How can the design of the developing application be guided by this parent design?” and “How can they be cohesive?” The parent framework was understood to have been provided to act as a guide to the emerging application design. These requirements were realised as follows:

• Ensuring that the design of the evolving application was compatible with its parent: This challenge was overcome by investigating the design approach adopted by the developer of the parent application. It was intimidating initially, but as this understanding deepened, the design was appreciated and extended in the main application. It proved a viable solution for capturing user inputs and querying the database based on them. The design was complex, as the next chapter highlights, but also thoughtful and efficient.

• Making certain that the architecture used to develop the application was also compatible with its parent: The second challenge was to absorb the architecture used by the parent developer. The idea was to align that architecture with what would be added to the application, since the designs were so interlinked. The architecture followed was that of an open source software which contains interfaces to the core classes which communicate with each other. This architecture is explained in detail in the next chapter; however, it is important to note that it was understood and applied to the extended application.

• Ascertaining that the design and architecture could be extended, broadly, in future: The third challenge was to keep the design and architecture as generic as possible so that any future developments to the application could be implemented with ease. The design was kept generic and simple, with the required information (processed or otherwise) displayed in simple labels, etc. The architecture was also kept comprehensible, and the code was extensively commented to point out any complexities.


4.8 Applying software engineering practices
Some good software engineering practices implemented in developing the application were:

• Developing a project in a team: Unlike many final year projects, this project was part of a much larger endeavour, and a lot of effort had already been contributed to the inception of the tool. This application would add more value to the tool, along with a new dimension to assessing patients. The application was developed as part of a large team working toward excellence. It was a learning experience to work among others, to understand the developers’ architecture and design, and to carefully align the project application with the parent. This was a good approach toward development.

• Extending the code generically for future developments: The implementation (discussed in the next chapter) was kept very simple. It was also commented extensively to help readers comprehend the functionality better. Hence, for future developments, it could be used as a framework for other applications linked to it.


Chapter 5. IMPLEMENTATION This chapter explains, in detail, the various implementation techniques used to create complex queries and graphs for visualising the database. The various tools utilised in this phase have been listed and detailed in various sections. Code snippets have also been added wherever necessary, along with some pseudo code written where required. Some testing strategies are also discussed.

5.1 Designing GUI with Java Swing
The GUI for the application was designed and developed using Java Swing. Cory Janssen for Techopedia defines [51] Swing as: “a lightweight Java graphical user interface (GUI) widget toolkit that includes a rich set of widgets. It is part of the Java Foundation Classes (JFC) and includes several packages for developing rich desktop applications in Java. Swing includes built-in controls such as trees, image buttons, tabbed panes, sliders, toolbars, color choosers, tables, and text areas to display HTTP or rich text format (RTF). Swing components are written entirely in Java and thus are platform-independent.” Swing is a Java GUI toolkit which provides developers with powerful tools to create interactive user interfaces. As the application was intended to be a standalone one, developed in Java, the Swing toolkit was used to design the UI. The components of the toolkit used in this application included JLabels, JDialogs, JFrame, JPanels, JButtons, JComboBox, JCheckBox, etc. Swing also offers various layouts to arrange the components in a panel. These include FlowLayout (default), GridLayout, GridBagLayout, SpringLayout, BorderLayout and BoxLayout among others. The most frequently used layouts in this application are GridLayout, FlowLayout, BorderLayout and BoxLayout. Figure 16 highlights some Swing components.


Figure 16. Swing components with (top to bottom, left to right) JMenu, JLabels, JCheckBox, JPanel with BorderLayout, JList, JRadioButtons, JTabbedPane, JSlider, JTextField, JPasswordField, JComboBox and JTextArea [52]

The interface for the application integrated the processing of some data with Swing components. Although this may not be the best practice, it was required for this application as the interface changed dynamically as the data was processed at various stages. As discussed in the previous chapter (please refer to Chapter 4, section 4.1, pg. 29), the design of the application was supported by a pre-defined user interface which was also developed in Swing. The right panel, which reads ‘[user space]’, was left to the user who would extend the design. Some changes were implemented in the UI provided. The user was now given the functionality to print the hypothesis when the application began. This is illustrated in figure 17.

Figure 17. Components added to the main screen The interface on the left panel is laid down as follows:


Preconditions Panel: This is a JPanel which is fit into the left panel of the main frame i.e. the application frame. It contains all the components used to load the tree, read user inputs and record them.

Array list of GUIs: The precondition panel also enumerates an array list of GUIs. This is to say that the panel presents a collection of data which is incorporated into Swing GUI components. Anonymous inner classes form the core for retrieving array list elements and packing them again into an array list.

Submit button: The submit button is placed in the south of the Preconditions panel. The submit clicked event generates the following action:












------------------------------------------------------------------------------
/* […] arrayList when the submit button is clicked */
public void submitClicked(UserSelection userSelection);
------------------------------------------------------------------------------

The code for the design indicated in figures 8, 9 and 10 in Chapter 4 (please refer section 4.3, pgs. 29-30) include the components listed above. The other complexities include JDialog which appears on clicking the “Select Node” button. The inclusion of the "View Question" button is an extension to the parent. For selecting a node, the following conditions are implemented: if the node is a filter or layer question then the “OK” and “View Question” buttons are disabled (default). When a leaf node is reached, then the “OK” and “View Question” buttons are both enabled.

Figure 18. a) Default settings of OK and View Question buttons; b) OK and View Question buttons enabled when leaf node is selected; c) View Question button helps user view question related to node


The code for establishing whether the selected node was a leaf node, and for enabling the OK and View Question buttons, is:
------------------------------------------------------------------------------
/* […] getLastPathComponent() i.e. the last component in the path */
XMLTreeNode n = (XMLTreeNode) path.getLastPathComponent();
// check if n is a leaf node
if (n.isLeaf()) {
    // enable the buttons
    viewQuestionButton.setEnabled(true);
    okButton.setEnabled(true);
} else {
    // set false, as default
    viewQuestionButton.setEnabled(false);
    okButton.setEnabled(false);
}
------------------------------------------------------------------------------
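The leaf check above can be tried in isolation with the standard Swing tree classes. The sketch below is a minimal, self-contained illustration and makes two assumptions: it uses javax.swing.tree.DefaultMutableTreeNode in place of the project's XMLTreeNode, and the two-node tree (a hypothetical “suicide” parent with the suic-past-att leaf) is invented for the example.

```java
import javax.swing.tree.DefaultMutableTreeNode;
import javax.swing.tree.TreeNode;
import javax.swing.tree.TreePath;

class LeafSelectionSketch {
    /** Returns true when the last component of the selected path is a leaf node. */
    static boolean buttonsEnabledFor(TreePath path) {
        Object last = path.getLastPathComponent();
        return last instanceof TreeNode && ((TreeNode) last).isLeaf();
    }

    public static void main(String[] args) {
        // Hypothetical two-level knowledge tree: one parent node with one leaf question.
        DefaultMutableTreeNode root = new DefaultMutableTreeNode("suicide");
        DefaultMutableTreeNode leaf = new DefaultMutableTreeNode("suic-past-att");
        root.add(leaf);

        TreePath toRoot = new TreePath(root.getPath());
        TreePath toLeaf = new TreePath(leaf.getPath());

        System.out.println(buttonsEnabledFor(toRoot)); // parent node: buttons stay disabled
        System.out.println(buttonsEnabledFor(toLeaf)); // leaf node: buttons are enabled
    }
}
```

In the application, the boolean would simply be passed to setEnabled on the OK and View Question buttons, which collapses the if/else into two calls.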

Once the node is selected, then the node is recorded (in the button in the UI and an array list in the back-end) as demonstrated in figure 10 in Chapter 4 (please refer section 4.3, pg. 30). Once the submit button is clicked, the array list is populated with the node-code, relationship with the node and answer value (both the latter are selected from drop down menus). The application now starts to develop from what has been provided i.e. with database querying.

5.2 MySQL
MySQL, as asserted by Oracle Corporation, is “the most popular Open Source SQL database management system” [53] which offers efficient data management, flexible query manipulations, and instant data retrieval. It functions with SQL i.e. “Structured Query Language”, which is the most common database query language used to manage data. The GRiST database was set up in MySQL as it was freely available for download [54] on their website. The workbench version 5.2 CE was used to design and populate the table. It was easy to use and quite a powerful platform for querying databases. Since SQL was a familiar query language, and was compatible with MySQL, it was predominantly used to manage all the back-end interactions.

Figure 19. The SQL Development Environment which was principally used in the application.

5.2.1 Database Design- Part II Chapter 4 briefly discussed about the database design and the import actions needed to be implemented in order to set this data up (please refer Chapter 4, section 4.6, pg. 40). This section further extends to the realisation of the database design into code.

Figure 20. The GRiST data was imported into the “mygrist_samples” database The database was designed in the MySQL workbench which provided a well-defined environment to create databases and tables. The assessment data provided by the client was built into the table “grist_parent_population”. The table was quite extensive, with 158 columns and over 22000 rows. A simple query which can be written to import a csv file into a database would be:











The filename is specified in quotes along with the ‘,’ (comma) for a comma-separated variable file and ‘\t’ for tab separated variable file and so on. In addition, we may also include how lines are terminated (‘\n’). This would help the data be linearly populated in each column. The other optional fields in the query include fields/columns enclosed by ‘character’, escaped by ‘char’, lines starting by ‘string’, etc. The complete syntax can be viewed in [54]. For the application, the program written for importing the csv file is demonstrated in Figures 21 and 22.
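The options described above correspond to MySQL's LOAD DATA INFILE statement. As a minimal sketch of how such a statement could be assembled in Java, the snippet below builds the basic form with a comma field separator and a newline line terminator. The class name, method name and the file name assessments.csv are hypothetical, and this is not the program shown in Figures 21 and 22.

```java
class LoadDataQuerySketch {
    /**
     * Builds a basic LOAD DATA INFILE statement for a comma-separated file.
     * Optional clauses such as ENCLOSED BY or ESCAPED BY are left out for brevity.
     */
    static String loadDataSql(String csvPath, String tableName) {
        return "LOAD DATA INFILE '" + csvPath + "' "
                + "INTO TABLE " + tableName + " "
                + "FIELDS TERMINATED BY ',' "
                + "LINES TERMINATED BY '\\n'"; // Java "\\n" reaches MySQL as the two characters \n
    }

    public static void main(String[] args) {
        // assessments.csv is a placeholder file name used only for this example.
        System.out.println(loadDataSql("assessments.csv", "grist_parent_population"));
    }
}
```

The resulting string would be passed to a JDBC Statement's executeUpdate method, which returns the number of rows loaded. For a tab-separated file, only the FIELDS TERMINATED BY value would change.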

Figure 21. Defining the system file containing the assessments and establishing a connection to the database. In line 15, the text file which has been converted to CSV has been defined so that it can be used conveniently in the method later. The constructor initialises database connectivity with the JDBC/ODBC driver connection tested in a try/catch block. The method which builds the query is:

Figure 22. The method written to import csv file into the database


Lines 37 and 38 build the query to be executed in the back-end, where ‘grist_parent_population’ is the name of the table created. Line 40 executes the update and returns the number of rows updated by this query i.e. 22,845. Once this program was written, it was executed and the database was set up, ready to be queried. The query could also be extended with “LINES TERMINATED BY ‘\n’”; however, this was not required. Once the database was designed, simple queries were run against the table, such as:
SELECT * FROM mygrist_samples.grist_parent_population;
SELECT COUNT(*) FROM mygrist_samples.grist_parent_population WHERE gen_gender = 'MALE';

This was quite efficiently conducted in the MySQL workbench. It was important to test various queries on the database so that it returned accurate results for forming more complex queries in the application.

5.2.2 Querying the Database Once the database was designed, it was ready to be integrated with the application GUI. Some complex queries were formed in the back-end. A query to create a new table which filtered the data based on the conditions that the user chose i.e. the conditions stored in the userSelection array list was required to be created. The formation of the query would take the following form: String








grist_parent_population WHERE ” +arrayList_elements+ “)”;

This was realised in Java as illustrated in the code below:
------------------------------------------------------------------------------
String query = "SELECT * FROM mygrist_samples.grist_parent_population WHERE ";
// check that the arrayList has at least one condition
if (items.size() > 0) {
    // iterate through the arrayList
    for (int i = 0; i < items.size(); i++) {
        // retrieve the node code selected by the user
        String correspondingCode = items.get(i).nodeCode();
        /* the code supplied by the user is in the form gen-gender, but the
           columns are saved as gen_gender in the database, hence a
           conversion is required */
        String columnName = correspondingCode.replaceAll("-", "_");
        /* the same conversion is done with relationships, as user inputs
           take the form "IS EQUAL TO", etc. and need to be changed to "=" */
        Enumerations enumRel = new Enumerations();
        String correspondingRelationship =
                enumRel.getMathOperator(items.get(i).getRelationship());
        /* concatenate the WHERE condition selected by the user onto the
           query string */
        query += columnName + " " + correspondingRelationship + " "
                + items.get(i).getAnswer();
    }
}
------------------------------------------------------------------------------
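The conversion of relationship phrases into SQL operators, performed above by getMathOperator, could look roughly like the following. This is a hedged sketch, not the project's Enumerations class: only “IS EQUAL TO” is taken from the text, while the phrases “IS GREATER THAN” and “IS LESS THAN” and the class name RelationshipOperators are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class RelationshipOperators {
    private static final Map<String, String> OPERATORS = new HashMap<>();
    static {
        OPERATORS.put("IS EQUAL TO", "=");     // phrase quoted in the text
        OPERATORS.put("IS GREATER THAN", ">"); // assumed phrase
        OPERATORS.put("IS LESS THAN", "<");    // assumed phrase
    }

    /** Maps a relationship phrase chosen in the drop-down menu to its SQL operator. */
    static String getMathOperator(String relationship) {
        String operator = OPERATORS.get(relationship.trim().toUpperCase(Locale.ROOT));
        if (operator == null) {
            throw new IllegalArgumentException("Unknown relationship: " + relationship);
        }
        return operator;
    }

    public static void main(String[] args) {
        System.out.println(getMathOperator("IS EQUAL TO"));
        System.out.println(getMathOperator("is less than"));
    }
}
```

Rejecting unknown phrases early keeps a malformed operator from ever reaching the query string.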

The table could then be queried as: String query= “SELECT * FROM parent_population”

Not only was the query formation important, but the “AND” and “OR” operators also had to be placed at the right points, according to standard SQL rules, so as to retrieve the correct samples. The program needed to be sensitive to 2 conditions:

i) The node-codes in the array list: If the node-codes for more than one precondition are the same, then an “OR” needs to be appended after each precondition. This ensures that each node-code is matched against one value at a time (unlike AND, which checks whether two node-codes have the desired values at the same time).

HOW “OR” WORKS: For two conditions to be compared, this operator returns true if either of the conditions is true. Hence the truth table for the operator can be given as:

Condition 1 (A) | Condition 2 (B) | A “OR” B returns
True | True | True
True | False | True
False | True | True
False | False | False

HOW “AND” WORKS: For two conditions being compared, this operator returns true if and only if both of the conditions are true. Hence the truth table for the operator can be given as:

Condition 1 (A)    Condition 2 (B)    A “AND” B returns
true               true               true
true               false              false
false              true               false
false              false              false


Having established this, the query builder needed to compare the node-codes with one another. If two conditions had the same code, the column could not equal two different values simultaneously, so an AND operator would return no rows from the database; in such cases only OR could work. Hence, the query formulated also needed to accommodate this aspect of the operators to retrieve the right rows. A third constraint, as highlighted in the requirements, was the “IN BETWEEN” operator, which was broken down into “greater than” and “less than”. Hence, if the codes were equal but the relationships were unequal (as in the case of IN BETWEEN), an AND operator is appended. The code addressing these issues to build a coherent query is described below.
---------------------------------------------------------------------
/* Compare the (i+1)th element with the i-th element in the array list,
   i.e. the 2nd element with the 1st, the 3rd with the 2nd, and so on. */
for (int j = i + 1; j < items.size(); j++) {
    // get the (i+1)th code for comparison
    String code = items.get(j).nodeCode();
    // get the (i+1)th relationship for comparison of equality
    String relation = enumRelation.getMathOperator(items.get(j).getRelationship());
    // comparison condition to decide between AND and OR
    if (code.equals(correspondingCode) && relation.equals(correspondingRelationship)) {
        // same code and same relationship: append an OR
        query += " OR ";
    } else {
        // different code, or same code with a different relationship: append an AND
        query += " AND ";
    }
    // only the immediately following element is compared
    break;
}
---------------------------------------------------------------------
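To make the joining rule concrete, the following is a minimal, self-contained sketch that applies it to a whole list of conditions. The class WhereClauseSketch and its nested Condition type are hypothetical names introduced for illustration; conditions are assumed to have already been converted to column names and SQL operators.

```java
import java.util.List;

final class WhereClauseSketch {
    /** One user condition, already converted to a column name and an SQL operator. */
    static final class Condition {
        final String column;
        final String operator;
        final String value;

        Condition(String column, String operator, String value) {
            this.column = column;
            this.operator = operator;
            this.value = value;
        }
    }

    /** Joins each condition to the next with OR or AND, following the rule described above. */
    static String buildQuery(String table, List<Condition> conditions) {
        StringBuilder query = new StringBuilder("SELECT * FROM ").append(table);
        for (int i = 0; i < conditions.size(); i++) {
            Condition current = conditions.get(i);
            query.append(i == 0 ? " WHERE " : "")
                 .append(current.column).append(' ')
                 .append(current.operator).append(' ')
                 .append(current.value);
            if (i + 1 < conditions.size()) {
                Condition next = conditions.get(i + 1);
                boolean sameCode = next.column.equals(current.column);
                boolean sameRelationship = next.operator.equals(current.operator);
                // Same code and same relationship: OR. Anything else: AND.
                query.append(sameCode && sameRelationship ? " OR " : " AND ");
            }
        }
        return query.toString();
    }
}
```

One point worth keeping in mind when reading the generated string: in SQL, AND binds more tightly than OR, so a query of the form A OR B AND C is evaluated as A OR (B AND C). Wrapping each OR group in parentheses would make the intended grouping explicit.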

This query produced the visualisation in the UI shown in Figure 23.

Figure 23. Visually representing the select query

Upon careful inspection, we observe that the first and second conditions are identical in codes and relations, hence an OR is appended. The third, fourth and fifth conditions differ from the ones before them, so an AND is appended. Note that the fourth and fifth conditions share the same code but have different relationships (read as gen-mood-swings IN BETWEEN the range 1 and 5), which is why an AND, and not an OR, joins them. The final output would be the number of records retrieved for the query.

Similar queries were built to set up subsamples. The variation when subsamples were created was that the application displays not the records retrieved, but the number of records retrieved for each condition registered by the user. As an example, suppose the user has queried the parent population to set up samples for patients whose angry emotions are greater than 2. The user now intends to query the number of males and females in the sample returned. This essentially indicates a split, i.e. gender, which will dominate the sample comparisons. The split conditions chosen by the user are again wrapped in an array list, much like the preconditions, and combined into the database query in a similar manner. However, as mentioned, the retrieval of records is modified. The user is now interested in knowing the number of records returned for each split condition chosen, rather than the records returned for the subsamples table. To obtain the number of records for a particular condition, the following query is written in SQL:

SELECT COUNT(*) FROM tableName WHERE [condition(s)];

The COUNT(*) function returns the number of records matching the defined conditions in the database. To integrate this with the application, it was realised that a string array of count queries (one per condition in the array list) would need to be executed one by one to retrieve the count for each condition. The following code depicts the function:
------------------------------------------------------------------------------
String[] getCountOfWhereCondition() {
    /* Array of count queries: each element stores the COUNT(*) query
       for one condition in the array list. */
    String[] countQuery = new String[items.size()];
    for (int i = 0; i < items.size(); i++) {
        // code and relationship conversions occur as in the previous listings
        String code = items.get(i).nodeCode().replaceAll("-", "_");
        String relationship =
                new Enumerations().getMathOperator(items.get(i).getRelationship());
        // build the count query
        countQuery[i] = "SELECT COUNT(*) FROM mygrist_subsamples WHERE "
                + code + " " + relationship + " " + items.get(i).getAnswer();
    }
    return countQuery;
}
------------------------------------------------------------------------------


The count queries are stored in an array and are executed one by one as follows:
------------------------------------------------------------------------------
// store the queries returned by the method in a variable
String[] countQueries = getCountOfWhereCondition();
// integer array holding the count returned by each query
int[] rowsIns = new int[countQueries.length];
// loop through the array to execute the queries
// (statement is assumed to be an open java.sql.Statement)
for (int i = 0; i < countQueries.length; i++) {
    // execute each COUNT(*) query
    ResultSet resultSet = statement.executeQuery(countQueries[i]);
    // retrieve the count for printing
    while (resultSet.next()) {
        // store the value in rowsIns
        rowsIns[i] = resultSet.getInt("COUNT(*)");
    }
}
// print the number of records for each condition
for (int i = 0; i < rowsIns.length; i++) {
    System.out.println(rowsIns[i]);
}
------------------------------------------------------------------------------

Figure 24 highlights the results obtained on execution.


Figure 24. Count Query Visualisation

5.2.3 OpenCSV

The final step of the application is to export the samples into a CSV file. OpenCSV [56] is a CSV parser for Java. This API was used mainly to export the subsample population to CSV after the user has created the desired population. After some research into MySQL export events, it was deduced that the standard export query

… mygrist_samples.subsample_population FIELDS TERMINATED BY ‘,’ LINES TERMINATED BY ‘\n’

in MySQL did not provide enough functionality to export the data with column names, which is how the statistical applications require the CSV data to be. This query could only export the sample data, i.e. rows without column headers, which would be of little use on their own. To export the data with column headers, MySQL stored procedures would have had to be written. The steps involved writing a procedure to retrieve the database information and iterate through the columns of the specified table to retrieve the column names and column data; procedure calls would then be made to run it, after which the data would be written to the CSV file. Since this option seemed quite tedious given the limited time available, OpenCSV was used to write the data into a CSV file. Although this approach had seemed the more elaborate one at first, it proved to be a better option than any other considered.

The class mainly used for this purpose was the CSVWriter [57] class. This class has methods to write data line by line or as a chunk, through writeNext(String[] nextLine) and writeAll(ResultSet …), which were predominantly used to accomplish accurate exports. The writeNext() function was used to write the column names of the table into the CSV file first, followed by the writeAll() function, which sequentially wrote the records from the Result Set into the CSV file.
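The header-first layout described above can be illustrated without any external library. The sketch below is not the OpenCSV-based code used in the project: it is a standard-library stand-in, with the hypothetical class name CsvExportSketch, that only shows the order of writing (column names first, then one line per record) and the usual quoting of fields.

```java
import java.util.List;

final class CsvExportSketch {
    /** Wraps a field in double quotes and doubles any quote inside it. */
    static String quote(String field) {
        String safe = field == null ? "" : field;
        return "\"" + safe.replace("\"", "\"\"") + "\"";
    }

    /** Builds one comma-separated line, terminated by a newline. */
    static String toLine(String[] fields) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                line.append(',');
            }
            line.append(quote(fields[i]));
        }
        return line.append('\n').toString();
    }

    /** Writes the column names first, followed by every record. */
    static String toCsv(String[] columnNames, List<String[]> rows) {
        StringBuilder csv = new StringBuilder(toLine(columnNames));
        for (String[] row : rows) {
            csv.append(toLine(row));
        }
        return csv.toString();
    }
}
```

With an empty list of rows, toCsv returns only the header line, so a caller could check whether any records exist before writing a file at all.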

The user is offered the functionality to enter a filename for saving the subsamples. This was implemented because two files cannot be saved under the same name; hence, the files to be created are named at run-time by the user.

5.2.4 Database Management

The final phase in implementation was to integrate database management. It was important to identify the lifetime of the tables created and deleted. Since two tables were created through the application, it was imperative to delete them at some point so that duplicate table errors may be avoided. After careful consideration of the design and the structure of the application classes, it was realised that a good point to perform this deletion would be at the end of the application, i.e. when exports are completed. This was a good stage to delete the tables, as their data would not be required after the samples had been created, as long as the main assessment table is present. The .csv file would also contain the necessary samples established by the user. Hence, when the user restarts the application, he/she needn’t worry about SQL table exceptions. This was integrated in the code by adding a constructor to the database classes solely for the purpose of table deletions. This was done because the pre-defined constructors took array list objects as parameters, which would be unnecessary in this situation. These constructors were called right after the button click event was triggered for exporting the subsamples:
------------------------------------------------------------------------------
// add a listener to the button that exports the samples
okButton.addActionListener(new ActionListener() {
    public void actionPerformed(ActionEvent e) {
        // instantiate the class which exports samples; it also takes the
        // filename from the user
        ExportSubsamples export = new ExportSubsamples(filename);
        // display success message
        displaySuccessMessage("Export Successful!");
        // delete tables: note the difference in constructors (no arguments)
        ParentPopulation parentPopulation = new ParentPopulation();
        SubsamplePopulation subsamplePopulation = new SubsamplePopulation();
    }
});
------------------------------------------------------------------------------
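What the deletion constructors do internally is not shown in this report. As a hedged illustration only, a clean-up step of this kind could be sketched with plain JDBC as follows; the class TableCleanupSketch and its methods are hypothetical, and the table names parent_population and subsample_population are taken from the queries quoted elsewhere in this chapter.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

final class TableCleanupSketch {
    /** Builds a DROP statement, accepting only plain identifiers as table names. */
    static String dropStatement(String table) {
        if (!table.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            throw new IllegalArgumentException("Not a plain table name: " + table);
        }
        // IF EXISTS avoids an error when the table was never created
        return "DROP TABLE IF EXISTS " + table;
    }

    /** Drops every named table over an already open connection. */
    static void dropAll(Connection connection, String... tables) throws SQLException {
        try (Statement statement = connection.createStatement()) {
            for (String table : tables) {
                statement.executeUpdate(dropStatement(table));
            }
        }
    }
}
```

A call such as dropAll(connection, "parent_population", "subsample_population") after a successful export would then leave only the main assessment table in place.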

Once the table deletions had been performed, the application run was complete, with the subsamples ready to be processed through a statistical package such as SPSS for further analysis.

5.3 Usability Testing

The interface of the application was tested with 3 users. They were asked to use the application and “speak aloud” their thoughts. The sequence of tasks was:

Steps                                                                           Ideal Task Time
1. Home Screen- Print a Hypothesis                                              10 seconds
2. Click on the “Select Node” button to bring up a pop-up                       5 seconds
3. Select the desired node from the list of nodes                               30 seconds for each condition selected
   View Questions                                                               7 seconds per node
   Choose a relationship                                                        7 seconds
   Choose an answer value                                                       7 seconds
4. For adding a condition, select the “Add Precondition” button                 5 seconds
6. Click on the “Submit” button to visualise the query                          25 seconds
7. Once the query is approved, click on the “Submit Query” button               7 seconds for processing
8. Click on the “Select split factor” button to bring up a pop-up i.e. JDialog  5 seconds
9. Select the desired node from the list of nodes                               20 seconds for each condition selected
   View Questions                                                               5 seconds per node
   Choose a relationship                                                        5 seconds
   Choose an answer value                                                       5 seconds
10. For adding a condition, select the “Add split conditions” button            4 seconds
11. Click on the “Submit” button to visualise the query                         7 seconds
12. Process the query and click on the “Create Subsamples” button               10 seconds
13. Click on the “Export Subsamples” button                                     4 seconds
14. Name the CSV file                                                           7 seconds
15. Click on the “OK” button                                                    3 seconds
Total task time:                                                                2 minutes 96 seconds

The following results were observed:

User 1 (Task Time: 3 minutes 30 seconds)
Observations: This user struggled with tasks 1, 3 and 9. The user, when thinking out loud, was confused as to which node to pick since the tree was so comprehensive. A lot of time was also invested in selecting the corresponding … answer value. The average time taken for these tasks was over a minute (for choosing one node) as opposed to 93 seconds of ideal task time. The other areas of effort included exporting samples to CSV, as the user did not know where the samples were getting saved.
Suggested improvements: To print a hypothesis, set an active text when the home page is displayed. The application … JOptionPane to capture the … dynamically … question … set in the … background. A JFileChooser can be implemented to save the CSV file.

User 2 (Task Time: 3 minutes 50 seconds)
Observations: This user was stressed about tasks 3 and 9. The user, when thinking … was … the … answer values to choose for the subsamples. The average time taken for these tasks was 200 minute (for choosing … ) as opposed to 93 seconds (per node) of ideal task time. The other areas of struggle were to comprehend the SELECT queries which were set for visualisation. … conducted with ease though.
Suggested improvements: A clearer tree structure … which would particularly highlight leaf nodes, as that is what is selectable, while … nodes … means … visualisation such as a real-time query builder, so that it is easier to comprehend the nodes selected and how they build in the text area.

User 3 (Task Time: 3 minutes)
Observations: This user struggled with tasks 1 and 3. The user also found it harder to understand the button click actions as there … many … As with user 1, this user also took efforts to navigate the tree and select 4 nodes (about 500 … including selecting the relationships and … values). However, unlike the other 2 users, this user grew more familiar with … tree structure while setting subsamples (100 seconds for 2 nodes selected, much lower than the estimated time per node).
Suggested improvements: Set tool tips for buttons so that … actions … clearer. For printing a hypothesis, … which can bring up a text area can be implemented, which will be hidden until required by the user to view. The tree navigation could … through drop boxes for … nodes … relationship and answer boxes. The export can be done … JFileChooser.

In light of the user comments, and bearing in mind the time constraint, tool tips were set on the buttons, while the JFileChooser and the smoothing of tree navigation remain areas of future work.


Chapter 6. EVALUATION

This chapter discusses the outcomes of implementing the application, its drawbacks and some further work which could be undertaken, given the time. It discusses the ways in which the implemented application met some requirements and the ways in which it did not meet others. Some steps are proposed to meet the requirements which, due to time constraints, couldn’t be realised in the final deliverable. It also reflects on the processes and the overall methodology, and on how they could be altered to deliver the finished product in a more efficient manner.

6.1 Iterative Waterfall Model- Is it best suited for the application?

In the design phase, various drawbacks of the iterative waterfall model were discussed. These hindered the development of the application, as the model deferred the implementation to the end. This was not a wise decision, as implementing the application was as important as eliciting requirements and designing the classes. However, drafting some core classes through class diagrams in the Design phase gave a clearer understanding of the implementation, which may not have been possible with other development models such as iterative or Agile processes, as they place more emphasis on prototyping.

Another key drawback of adopting the iterative waterfall approach was the insufficient time available for the testing that a quality end-product requires. Although the application performs in the intended manner, it would have been sturdier had unit testing been conducted (an area of future work). For such testing, user stories could be drafted for the test strategy and coded in frameworks such as JBehave and JUnit. JBehave lets users write test cases in ubiquitous language through annotations such as ‘@Given’, ‘@When’ and ‘@Then’, so that business-end clients could use such templates to state their requirements, which technical personnel can then adopt directly in code.

Another drawback of the model was that, apart from usability testing, other non-functional requirements such as flexibility, robustness and reliability of the application could not be attained. This would, however, be another major area for future development. Hence, evaluating the overall development process used for this project suggests that the end-product could have been more holistic in many more aspects than what has been accomplished. An iterative or agile model would also have been beneficial in terms of not only user involvement and feedback but also shorter design, implementation and testing phases, which could have produced a more balanced application with a better design architecture.

6.2 Revisiting Requirements

There are several ways in which the application has met the pre-defined client requirements. These include: i) creating selection criteria for the subsample population through knowledge tree navigation, selecting the various types of nodes and choosing the corresponding answer ranges and relationships; ii) generating samples for the parent population and subsample populations, which was the CORE requirement (due to time constraints, observing samples for biases is yet to be implemented, but samples can be saved in a CSV format, which was another core requirement for experimental design functionality); iii) processing samples by formulating appropriate queries through suitable positioning of the ‘AND’ and ‘OR’ operators in the queries returned. It is worth noting that although the samples do not comply with some of the experimental design functionality drafted in the Requirements Analysis phase, alternative solutions have been provided to help observe samples for biases in any statistical software package. Bearing in mind the time constraints, functionality such as selectively equalising subsample data could not be attained; however, some solutions are proposed to achieve this, and the application is designed to accommodate such changes. We look, in some depth, at how these requirements have been executed in the final phase of implementation:

i) Creating selection criteria for subsamples: Although the concept of tree navigation was supplied by some existing code, it was replicated to offer the user the functionality to select a split factor based on demographics or any other node in the knowledge tree. The split conditions were collected in lists, similar to the parent design, but the data handling at this stage was much more filtered than in the code supplied at the conception of the application (the parent code). For example, to justify this requirement, we can ask questions such as “I have already formed the parent population. But how do I proceed further?” or “I want to create a new population DERIVED from the parent population to observe if these conditions produce some significant differences in the health assessments stored; can I record a split condition based on gender (say) to observe how many males and females are returned from the parent samples?”. The answer to such questions would be “yes”; this functionality helps the user record splits on the parent population to observe the various statistics in the parent population. Hence, the application, as an extension of the parent code, did meet the requirement of helping the user set up populations for some split conditions for bias analysis.


ii) Generate samples for parent and subsample populations: The application focused on the various means by which parent and subsample populations could be established from user conditions. This was extremely important, as the rationale for the application was to query the database to create samples for comparison. The application was aided by a pre-defined framework which only helped record user conditions through navigating the knowledge tree. However, this would be futile if the information could not be used in any manner. The extended application realised the process that follows after the user records his/her selection(s). Forming the various complex queries from the user requirements, establishing database connections and accessing the database to form samples (all activities automated) formed the central idea of the extended application. For example, queries were formed at the following stages: firstly, in creating the parent population from the preconditions selected by the user; secondly, in creating the subsample population based on the rows returned by the parent population; thirdly, in counting the number of records returned for each split condition; and fourthly, in writing the data to CSV files. These complex queries form the engine driving the application. Some database management was also performed based on client feedback. The tables created were deleted once the samples had been exported to a CSV file. This was a good point to delete the tables, firstly because all the necessary information was available in the CSV file and secondly because it was the end point of the application. Hence, I believe that it is the right stage to delete the tables so that the next time the application is run, SQL exceptions are avoided.


iii) Processing samples through formulating appropriate queries: This requirement was fulfilled by querying the table using both operators. Initially, it was thought that either one of the operators would be powerful enough to build suitable queries to create samples. However, there were various dependencies which prevented that. If we were to work merely with ANDs, then choosing splits where the codes are the same, just with different answer values, would return no rows from the table. However, there were codes like gen-age which required an “IN BETWEEN” operation to be performed. This essentially means that age can be between two values (greater than value1 and less than value2, or vice versa). Hence, although the node-codes are the same, they can take an “in between” value for some ordinal questions and still be joined by an AND operator. Bearing in mind these constraints in the requirements, it was inferred that working with solely one type of operator would not be feasible for retrieving the desired samples. It was necessary to incorporate both into the queries being built.
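The breakdown of “IN BETWEEN” into a “greater than” and a “less than” condition joined by AND can be sketched as follows. This is an illustration under stated assumptions rather than the project's code: the class BetweenSketch is a hypothetical name, the bounds are assumed to be integers, and both bounds are treated as exclusive, as in the age example quoted in this chapter.

```java
final class BetweenSketch {
    /**
     * Expands "column IN BETWEEN a and b" into two conditions joined by AND.
     * The bounds may be given in either order.
     */
    static String expandBetween(String column, int first, int second) {
        int lower = Math.min(first, second);
        int upper = Math.max(first, second);
        return column + " > " + lower + " AND " + column + " < " + upper;
    }
}
```

For example, expandBetween("gen_age", 50, 35) produces the same condition as expandBetween("gen_age", 35, 50), which covers the “or vice versa” case mentioned above.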

6.2.1 Extending the Application

After evaluating the various ways in which the application meets the user requirements, it is worth mentioning the contribution of this tool to the medical domain. The primary idea behind developing the tool was to analyse user-created samples for biases. Once these biases were eliminated, the samples could be exported in a suitable data format to be analysed for significance. Hence, the application will, in future, contribute to GRiST as a means to mine the database for clinician judgements. It can also contribute to the medical domain more generally by helping researchers to import the samples directly into a statistical package for significance testing. The rationale of the project stated that the aim was not to conduct experimental analysis on the samples created by the user, as packages already exist for that, but to create samples suited to being processed directly through statistical software packages such as SPSS and R. Figure 25 compares the CSV file structures that these statistical packages accept for import and further analysis. The figure also shows how their requirements are fulfilled by the structure of the CSV file exported by the application.

Figure 25. From L-R (top to bottom): A comparison between the data format acceptable for import in SPSS [58], R [59] and the CSV data created by the application

From the figure, we can observe that both the statistical packages researched, i.e. ‘Statistical Product and Service Solutions (SPSS Statistics)’ and the ‘R Statistical Package’, require the column headers to be present as the first line of the data. The application writes the data into the CSV file in exactly this way. Hence, the samples exported by the application can now be tested in any such statistical package for observing biases. Since MySQL offered only a lengthy method for achieving this, OpenCSV was used to export the data in the required format. It is worth noting that although the functionality of comparing samples for biases could not be realised completely, this step compensates for it, as the user can now use the application as a medium to compare various samples given the data.

6.3 Limitations

The following section evaluates the various limitations of the application, with some factors contributing to such drawbacks. It highlights the requirements that couldn’t be completed, the proposed solutions to some of these limitations and the alternative tools and techniques which could be used to achieve the application targets efficiently.

6.3.1 Undelivered Requirements & Proposed Solutions

The requirements analysis chapter throws light on the various requirements the user needed the application to fulfil. The previous section highlighted the requirements which were satisfied. However, some of the requirements couldn’t be met due to time constraints, which limited insight into the techniques for realising them. These included: analysing samples for biases, creating criteria to remove biases and reporting results to the user in a graphical manner (or an alternate visualisation). These are discussed in detail below.

i) Analysing samples for biases: Because the model adopted caused the implementation phase to be carried out at the end, the application had the major drawback of setting up only one set of samples for comparison. That is, the tables can be queried only once before they are deleted at the end of the application, when the samples are exported. Given the time, I would try to alter this significantly to let the user re-query the table multiple times, storing data in temporary memory or Views rather than in permanent Tables. Another solution would be to offer users the choice to name new tables as and when they want to form different samples on the same data. This would allow more room for comparison between samples. A clear method of implementing this would be:

Step 1) When the user selects the “Submit Query” button after choosing preconditions and receiving the records for the query, a “Resample Data” button could be positioned whose on-click event would trigger a message to the user such as “Do you want to resample the data?” with ‘yes’ and ‘no’ options.

Step 2) When the user selects ‘no’ (button.getSource()==‘No’), the next step would be performed, i.e. selecting subsamples. If button.getSource()==‘Yes’, then a text field would be positioned under the “Resample Data” button to prompt the user to enter a view name and select a button named “Choose Preconditions”.

Step 3) The on-click event for “Choose Preconditions” could cause a new frame to appear, much like “Submit Query” in the UI does, with the Preconditions UI appearing. The user can select multiple conditions on the previous table formed (the parent_population table) and observe the results returned. Another “Submit Query” button could offer the user the option of resampling the data by the same procedure until button.getSource()==‘No’.

A similar approach can be adopted for resampling the subsamples. I believe that changing tables to views would help release many database resources.

ii) Creating criteria to remove biases: During the implementation of the application, a “selective equalisation” of subsamples was one of the requirements agreed with the client. It was initially decided that some random equalisation of samples would be enough for the application, given the time. However, based on client suggestions, “selective equalisation” would be the better option, as more judicious and accurate samples could be compared. For selective equalisation, let us assume that the user has selected a subsample population (of 1000 records) split on gender. The query returns 400 records for MALE and 600 records for FEMALE. The user may then be interested in equalising these samples (as there is a size imbalance) before setting up exact samples from which to draw conclusions about biases. Equalising them would set up samples of equal size for males and females (here at most 400 each, the size of the smaller group). There are two approaches to perform this equalisation: random and selective equalisation. In selective equalisation, we capture the lower sample size (say lowerSize) and iterate through lowerSize to pick records from the larger sample (say largerSize) based on a third and/or fourth factor. Let’s continue with the example of gender. Here lowerSizeMale=400 and largerSizeFemale=600. The user may be interested in restricting the age range of both genders to between 35-50 years. Hence, the user is interested in selecting 1 female aged between 35-50 years (out of the 600 females) for each of the 400 males in the sample population, giving 400 females for 400 males, i.e. equal samples. The end result is individual counts of 400 males and 400 females aged between 35-50 years. This will provide the user with a better understanding of biases. The SQL query used for selecting such equal samples is:
---------------------------------------------------------------------
"SELECT * FROM mygrist_samples.subsample_population WHERE gen_gender='FEMALE' AND gen_age>35 AND gen_age<50 ORDER BY RAND() LIMIT x";
---------------------------------------------------------------------
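The equalisation query above can be assembled programmatically once the two group sizes are known. The following is a minimal sketch, assuming the group counts have already been obtained from the COUNT(*) queries; the class EqualisationSketch and its method names are hypothetical, and the gender value is assumed to come from a fixed list of answer values rather than free text (for free text, a PreparedStatement would be the safer choice).

```java
final class EqualisationSketch {
    /** The size both groups are cut down to: the size of the smaller group. */
    static int equalisedSize(int firstGroupSize, int secondGroupSize) {
        return Math.min(firstGroupSize, secondGroupSize);
    }

    /** Builds the random-sampling query for one group, restricted by an age range. */
    static String sampleQuery(String gender, int minAge, int maxAge, int limit) {
        return "SELECT * FROM mygrist_samples.subsample_population"
                + " WHERE gen_gender='" + gender + "'"
                + " AND gen_age>" + minAge + " AND gen_age<" + maxAge
                + " ORDER BY RAND() LIMIT " + limit;
    }
}
```

For the example in the text, equalisedSize(400, 600) gives 400, which would then be passed as the limit when sampling the larger (female) group.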

The limit is the parameter which determines the number of samples returned for a particular factor against another. Given the time, I would work towards implementing this selective equalisation to obtain a better understanding of how judicious samples can be created for comparison and to apply the knowledge attained from the statistical background research performed for this project.

iii) Reporting results to the user: Another undelivered requirement for the application was to provide the user with some statistical figures such as mean, median, mode, standard deviation, quartile ranges, etc. and a graphical visualisation of the data/records retrieved from the database. Visualising data graphically in the form of pie charts, bar graphs and line graphs aids a better understanding of the data than deciphering numerals and literals. Due to the iterative waterfall model used and the time constraint imposed at the penultimate stage, this requirement couldn’t be achieved, as it required some prerequisite knowledge of the Java graph library JFreeChart. JFreeChart is an open source framework which supports the creation of graphs in Java. The JDBCCategoryDataset is used to categorise values obtained from databases after establishing a connection. For pie charts, two numeric parameters are accepted and plotted in the chart. As discussed in the chapter on Design (section 4.5), the question types would be of paramount importance in plotting the various types of graphs. This is because the nominal questions have fewer answer values (“equal to”, “unequal to”) than ordinal questions (“0-10”, “null”, “DK”). Hence, the nominal values are best represented by pie charts (2 parameters), ordinal values are best represented as line graphs (multiple values, which implies multiple bars) and a combination of both can be represented by bar graphs.

Figure 26. Possible representations of bar graphs and pie charts for nominal and ordinal question nodes
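Some of the summary figures mentioned above (mean, median and standard deviation) are simple to compute once the answer values of a sample have been read into memory. The sketch below is illustrative only: the class SummaryStatistics is a hypothetical name, the input is assumed to be a non-empty array of integer answer values, and the standard deviation shown is the population form (dividing by n).

```java
import java.util.Arrays;

final class SummaryStatistics {
    /** Arithmetic mean of the values. */
    static double mean(int[] values) {
        double sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    /** Middle value of the sorted data, or the average of the two middle values. */
    static double median(int[] values) {
        int[] sorted = values.clone();
        Arrays.sort(sorted);
        int middle = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[middle]
                : (sorted[middle - 1] + sorted[middle]) / 2.0;
    }

    /** Population standard deviation (divides by n, not n - 1). */
    static double standardDeviation(int[] values) {
        double mean = mean(values);
        double squaredDifferences = 0;
        for (int value : values) {
            squaredDifferences += (value - mean) * (value - mean);
        }
        return Math.sqrt(squaredDifferences / values.length);
    }
}
```

For ordinal answers on a 0-10 scale, for instance, these three figures could be shown next to the record counts before any chart is drawn.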

Some other limitations which would also be rectified in future include:

i) The positioning of ANDs and ORs is sequential, i.e. every element is compared only with the element preceding it. If the user were to select the nodes in the following order: gen-gender= ‘MALE’, gen-sad>=3, gen-gender= ‘FEMALE’, then it is evident that the query generated should OR the genders. However, since gen-gender= ‘FEMALE’ is compared with gen-sad, an “AND” is positioned between gen-sad and gen-gender (for FEMALE). This will effectively return zero rows (conflicting genders: a record cannot be male and female simultaneously).

ii) CSV files exported through the application contain the column headers (as the tables are exported with headers). Hence, even if the final query returned zero rows from the table, the exported CSV file would contain column headers (consuming disk space) with no data, which would be of little use.
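One possible way of addressing the first limitation is to group the conditions by node-code and relationship before joining them, instead of comparing only neighbouring elements. The sketch below is an assumption about how this could be done, not part of the delivered application; the class GroupedWhereSketch is a hypothetical name and each condition is represented as a {column, operator, value} triple.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupedWhereSketch {
    /** Builds a WHERE clause body from {column, operator, value} triples. */
    static String buildWhere(List<String[]> conditions) {
        // Group conditions sharing a column and an operator, keeping first-seen order.
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] condition : conditions) {
            String key = condition[0] + " " + condition[1];
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(key + " " + condition[2]);
        }
        // OR the members of each group (in parentheses), then AND the groups together.
        List<String> parts = new ArrayList<>();
        for (List<String> group : groups.values()) {
            parts.add(group.size() == 1
                    ? group.get(0)
                    : "(" + String.join(" OR ", group) + ")");
        }
        return String.join(" AND ", parts);
    }
}
```

With the selection order from the example above, the two gender conditions end up in the same parenthesised OR group regardless of where gen-sad was chosen, while an “IN BETWEEN” pair (same column, different operators) is still joined by AND.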

6.3.2 Alternative Tools and Techniques
From a computer science perspective, several techniques could be incorporated to develop a more efficient application. Some of these tools are discussed below:

i) Java vs. JavaScript & PHP: Various options were explored for developing the application while drafting the Project Definition Form, including PHP and JavaScript. Since I was more familiar with Java, I chose to develop the application in that environment. However, I believe that PHP and JavaScript are equally powerful tools for this kind of development. PHP, being a very popular language for server-side applications, could have yielded better results in terms of database connectivity, as in my experience the database connection in Java tended to fluctuate. JavaScript, which despite its name is a separate language from Java, has in recent times become a powerful tool with robust libraries. Bearing these advantages in mind, it would have been a better decision to implement the application on the web, using JavaScript for the front end and UI, supported by PHP on the server side.

ii) Database interactions: There are various methods of communicating with databases. One of them, as discussed, would be PHP. There are also methods such as XML-based interactions, which can aid quicker database connection and interaction. However, Java was on a par with these technologies, and my familiarity with the environment also influenced me to work with JDBC for the application, although PHP and XML-based interactions offer comparable advantages. Another technique for connecting to the database is to use JNDI, by registering a data source and establishing pooled connections. This technique would probably be a more efficient method of handling databases, as it supports connection pooling and distributed transactions. It is possible that multiple users may access a database resource in the application simultaneously. Pooling connections would let each user borrow an already open connection when needed rather than opening a new one.

Some other techniques which could help improve the quality of the application would be to introduce non-functional aspects such as flexibility, along with some exception handlers. The future work and scope of the project are discussed in the next chapter.


Chapter 7. CONCLUSION
This section explores the conclusions drawn and the lessons learnt from implementing the application. The project has been an immense learning experience in working with various components of Java, Swing and JDBC. The scope for future work, and the compatibility of the design with that work, are also explored.

7.1 Techniques Learnt
This project focused heavily on two components: the UI developed in Swing and the database connectivity implemented in Java (JDBC). It has directed me towards a more logical perspective on Java and towards producing user-friendly designs. The various techniques learnt include:

• Java Database Connectivity (JDBC): The rationale of the project centres on mining the database for biases. Since the database was established in MySQL and accessed through Java, the JDBC driver was used for establishing connections. Through the project, a number of pros and cons of JDBC were identified. The pros: using JDBC, SQL statements can be processed and directed to any relational database (MySQL, SQL Server, etc.) to query databases and tables. It acts as a powerful interface between Java and a database management system to help set up databases and tables. Any query which can be performed in the RDBMS console/workbench can be realised through Java using JDBC or the JDBC-ODBC Bridge. It also provides interfaces which aid in maintaining a secure connection. The PreparedStatement interface is designed to combat SQL injection; however, it has not been used in the application. The Statement interface, although susceptible to SQL injection, is used instead, because the application is stand-alone and the chances of SQL injection are minimal. Also, JDBC drivers are built into several Integrated Development Environments such as NetBeans and Eclipse. Given these advantages, it was decided that JDBC would be used for establishing the connection to the database and for moving data to and from it. The cons include SQL injection, fluctuating connections and the need to deploy the correct driver for each database type. These cons seemed to affect the application minimally.

Using JDBC, some aspects of SQL syntax were made clearer. The queries which could be performed easily with JDBC include simple SELECT statements, COUNT(*), and import and export queries. Another major advantage of using JDBC was that it supported the use of OpenCSV, which helped create the desired CSV files in the end. The ResultSet interface provided a vast array of methods for accessing the data returned by queries on the database and its tables. The ResultSetMetaData interface provided the metadata about the data stored in database tables, which was immensely useful in the application. I also learnt techniques for working with array data, experimenting with the getArray() method to move ResultSet data into array structures. Hence, this simple yet dynamic conversion between Java and RDBMS objects makes JDBC well suited to the application. It is also easy to alter tables in databases.

• Swing: Swing is an older Java UI toolkit which helps develop powerful UIs integrated with Java. For new applications it is gradually giving way to web front ends written in JavaScript. Although the toolkit is dated, I believe that it can be highly useful for building interactive native applications in Java, even though it may be tedious. Through the application development, I have worked extensively with Swing, observing its mechanisms and operations, from things as trivial as causing an object to appear on screen to displaying complex SQL queries to the user. Although learning Swing at a late stage in the project did prevent some requirements from being completed, it was a learning experience, and mastery comes with experience. I have progressed from placing buttons on panels and panels on frames to implementing pop-ups containing JTrees. Although only simple components have been added to the panels, they have been quite tedious to code and realise in terms of low-level design. Reflecting on the options available, I chose Swing only to gain more expertise with the toolkit. However, in future, I would attempt to build a more powerful application on the web using JavaScript rather than opting for a stand-alone application with a Swing GUI. One interesting Swing component is JOptionPane, which helps deliver simple messages to the user. It is easy to create and use, and its dialogs are disposed of automatically when a button is clicked. Other components worth exploring include JTable and the integration of graphs with JPanels. In future, I would work towards establishing adequate knowledge of these components. Swing is certainly powerful enough to develop native Java applications and games, and it suits this particular application, which is stand-alone and realised in Java. The various event listeners were also powerful for implementing button click actions, such as instantiating database classes and making labels appear on the panels at run time. I would also explore how timers can be incorporated into the application, along with running some threads for processing. Swing is, however, slightly difficult to debug.

• Java: Finally, I have polished some of the core Java concepts of arrays and objects. I have learnt to implement my own ActionListener classes, which access getters and setters in other classes. The concepts of ArrayList, interfaces and enumerations have also become clearer through developing this application.
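The JDBC point above about PreparedStatement can be illustrated with a short sketch of how a parameterised query differs from concatenating values into a Statement. This is a sketch under stated assumptions rather than the application's code: the class ParentPopulationCounter is hypothetical, the table name grist_parent_population is taken from the user manual, and gen-sad is assumed to be a column of that table. It uses try-with-resources, so it needs Java 7 or later.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

/** Hypothetical helper (not part of the application): counts rows with a bound parameter. */
final class ParentPopulationCounter {
    private ParentPopulationCounter() {}

    // The '?' placeholder is bound by the driver, so the user's value never becomes part of the SQL text.
    // Assumption: `gen-sad` is a column of grist_parent_population.
    static final String COUNT_SQL =
            "SELECT COUNT(*) FROM grist_parent_population WHERE `gen-sad` >= ?";

    static int countRows(Connection connection, int minimumScore) throws SQLException {
        try (PreparedStatement ps = connection.prepareStatement(COUNT_SQL)) {
            ps.setInt(1, minimumScore); // JDBC parameter indexes start at 1
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : 0;
            }
        }
    }
}
```

Only values can be bound in this way; node (column) names and operators would still have to come from a fixed list offered by the UI rather than from free text.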

Overall, these three areas of learning have been of paramount importance in terms of implementation. In other spheres, such as requirements elicitation, the predominant technique learnt is creating a Mind Map. The idea of a linear model which closely aligns itself with UML principles is appealing as a way of putting one's thoughts on paper. Brainstorming ideas with the client was an enormous learning experience. Having previously worked with few design models other than UML class diagrams and use cases, I found it a fresh approach to eliciting requirements, and one which is more closely tailored to the user. In the Design phase, some class diagrams were also drafted, which was another technique learnt. The sequential model of software development, although slower in delivering the complete application, provided adequate space for me to learn various techniques in each phase, compare them and carefully apply them to develop the end product.

7.2 Future Work
As mentioned in the previous chapter, the application has some limitations. These limitations primarily arose from two factors, viz. time and the iterative waterfall methodology adopted. As with all applications, there is scope for future work. Firstly, some non-functional aspects such as reliability, robustness and flexibility would need to be added. Good methods of implementing these would be:

• The 2 Rs (Reliability & Robustness): These aspects can be achieved by coupling the code with exception handlers. So far, only the database connectivity code contains try and catch blocks, along with some throws clauses. In order to make the application robust and reliable, more exception handling would be added around the array list of conditions, checking whether the user has inserted any elements into the list and displaying appropriate messages to direct users to perform the necessary actions.

• Flexibility: As mentioned in the previous chapter (section 6.3.1, point (i)), the application could offer more flexibility in terms of resampling data and setting up more than one sample population for comparison. As part of future work, the suggestion of providing users with a UI to create their own views to store database queries would be looked into significantly earlier than the other suggestions. Some flexibility can also be exercised in the interface design, which can accommodate more components to give users a larger array of options for comparing samples for biases.
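The check described under Reliability & Robustness (has the user added any conditions to the list?) might look like the sketch below. The names InvalidConditionException and ConditionValidator are hypothetical, and representing each condition as a String is an assumption made to keep the example short.

```java
import java.util.List;

/** Hypothetical checked exception for problems the user can correct. */
final class InvalidConditionException extends Exception {
    private static final long serialVersionUID = 1L;

    InvalidConditionException(String message) {
        super(message);
    }
}

/** Hypothetical validator, run before the query is built. */
final class ConditionValidator {
    private ConditionValidator() {}

    static void validate(List<String> conditions) throws InvalidConditionException {
        if (conditions == null || conditions.isEmpty()) {
            throw new InvalidConditionException(
                    "Please select at least one node condition before submitting the query.");
        }
        for (int i = 0; i < conditions.size(); i++) {
            String condition = conditions.get(i);
            if (condition == null || condition.trim().isEmpty()) {
                throw new InvalidConditionException(
                        "Condition " + (i + 1) + " is empty. Please select a node and a value.");
            }
        }
    }
}
```

In a Swing button handler, the exception could be caught and its message passed to JOptionPane.showMessageDialog, so that the user is told what to do next instead of seeing a stack trace.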

Immediate work in the future would be to implement all the suggestions offered for the requirements which have not been met by the application, i.e. creating different subsamples and reporting results through graph visualisation (in that order of priority). Work also needs to be done on another limitation: the positioning of the AND and OR operators needs to be more flexible. The application currently compares the (i+1)th element of the array list of user conditions with its preceding element. As a result, if the user were to provide a gender condition for male, then an age condition equal to 25 (say), and then a gender condition for female, ANDs would be inserted between gender = MALE, age = 25 and gender = FEMALE, indicating that gender is both male and female simultaneously. This query would return zero rows. Hence, a robust comparison technique needs to be devised to compare all elements of the array list before inserting the AND and OR operators. Some more future work lies in supporting the “IN BETWEEN” operator. This operator is applicable to all the ordinal nodes; for example, gen-age is IN BETWEEN 30 AND 35. Currently, this is handled by the application as a two-node entry: one node specifies the greater than (or greater than or equal to) value and the other the less than (or less than or equal to) value. In future, as per some client feedback, the “IN BETWEEN” operator could couple these two nodes into one, so that the user enters fewer conditions. A third piece of future work is to represent the data graphically to the user, following some research on JFreeChart and on developing charts using values from databases. This is essential to the application in terms of flexibility (as the user can decide whether the samples are devoid of biases before he/she exports them as CSV for statistical analysis) and reliability of the samples generated. Finally, some test classes would also be coded to render the application more usable and accurate.
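The “IN BETWEEN” idea described above corresponds to SQL's BETWEEN operator, which includes both end points. The sketch below shows how a single range entry could be rendered, alongside the two-condition form the application currently needs. The class RangeCondition is hypothetical, and the backtick quoting again assumes that node names such as gen-age are MySQL column names.

```java
/** Hypothetical helper (not part of the application): renders one ordinal range as SQL. */
final class RangeCondition {
    private RangeCondition() {}

    /** Single-entry form: both bounds are inclusive, as with SQL's BETWEEN. */
    static String between(String node, int lower, int upper) {
        if (lower > upper) {
            throw new IllegalArgumentException(
                    "Lower bound " + lower + " exceeds upper bound " + upper);
        }
        return "`" + node + "` BETWEEN " + lower + " AND " + upper;
    }

    /** Equivalent pair of conditions, as currently entered through two nodes. */
    static String asTwoConditions(String node, int lower, int upper) {
        return "`" + node + "` >= " + lower + " AND `" + node + "` <= " + upper;
    }
}
```

Because BETWEEN is inclusive, it can only replace a pair of “greater than or equal to” and “less than or equal to” conditions; strict comparisons would still need the two-condition form.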




APPENDIX: USER MANUAL

Features of the Application: The application helps automate the process of creating samples which can be observed for biases. It has the following features:

• Java Swing UI with tool tips.
• Installation on any Windows machine with the Java Runtime Environment and MySQL 5 or higher.
• Easy tree navigation, developed in Swing, for selecting node conditions.

System Requirements:

• 25.7 MB of hard disk space.
• MySQL Workbench (version 5 or higher; the desired version is 5.2 CE).
• Eclipse IDE (Juno or Java EE IDE) for importing the classes to view the source code.
• Java Runtime Environment (JRE 6 or higher).
• Java Development Kit (JDK 1.6).

Instructions to use: The user is required to import the grist_assessments SQL file into a MySQL workbench (5.2 CE is the desired version). For importing a file in MySQL 5, open the workbench editor and in the Home tab, select the “Manage Import / Export” link. In the right hand panel under the Data Export / Restore section, select “Data Import/Restore” option. Choose a file from either the mysql dump (i.e. default location) or specified location (where the user has stored the dump file). Once the filepath is set, select the “Start Import” button and allow the database to be imported. After this procedure is successful, the grist_parent_population which is a part of the “mygrist_samples” database should be visible. This setup needs to be established before running the application. It is advisable to execute the application using Eclipse IDE.


Once these requirements are established, the GUI can be initiated to execute the application as follows:

1. The user starts by entering a hypothesis for the sample creation.

2. The user then clicks on the “Select Node” button. A dialog box appears, from which the user selects the appropriate node, called ‘gen-sad’.

3. The user selects the appropriate relationship and answer values for the registered node and confirms them by clicking on the “OK” button in the dialog box.

4. The user then clicks on the “Submit” button to visualise the query, and then on the “Submit Query” button to set up the parent population. The number of rows retrieved is displayed. This is the starting point of the application, handed over by the parent framework.

5. On clicking the “Submit Query” button, a new frame appears which helps the user record a split. In this case, the split is gender. The tree is traversed in exactly the same way as for the parent population. This time, the user selects the gender as male and female to create subsamples.

6. When the user clicks the “Submit” button, the query can be visualised and the count for each condition is shown.

7. The user clicks the “Export Samples” button and is offered the option of naming the file. The user names the file and a success message is displayed.


In stages 4 and 5, database connections are instantiated to query the database and retrieve the relevant number of rows. The user is required to possess the main assessment table named “grist_parent_population” to be able to derive tables such as “parent_population” and “subsample_population”. The user is also required to change the path of the CSV file exported in the last stage in order to save it to his/her own file directory (such as D:).

EXAMPLE CODE: The following code shows the ExportSubsamples class, with the write procedure used to export data into a CSV file:

package nodeChooser.gristDatabaseUtilities;

import java.io.FileWriter;
import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;

// OpenCSV 2.x package name; later releases of the library use com.opencsv.
import au.com.bytecode.opencsv.CSVWriter;

public class ExportSubsamples {
    private Connection connection = null;
    private Statement stmt = null;
    // the filename is set at run-time
    private String filename;

    public ExportSubsamples(String filename) {
        this.filename = filename;
        // connect to the database
        try {
            Class.forName("com.mysql.jdbc.Driver").newInstance();
            connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/mygrist_samples", "root", "sharadha_1992");
        } catch (Exception e) {
            e.printStackTrace();
            connection = null;
        }
    }

    /*
     * Write to the CSV file using OpenCSV.
     * Query used: SELECT * FROM mygrist_samples.subsample_population
     * Step 1: get the column names and write them into the CSV as the header row
     *         (as includeHeaders was not a good option).
     * Step 2: write all the other data from the query into the CSV.
     */
    public void exportCSVData() throws SQLException, IOException {
        // the constructor leaves the connection null if connecting failed
        if (connection == null) {
            throw new SQLException("No database connection is available.");
        }
        // define the file the writer writes to
        CSVWriter writer = new CSVWriter(new FileWriter("D:/Aston docs/Aston Final Year/Final Year Project/GRiST/Subsamples Data/" + filename + ".csv"), ',');
        ResultSet rs = null;
        try {
            String exportData = "SELECT * FROM mygrist_samples.subsample_population";
            stmt = connection.createStatement();
            rs = stmt.executeQuery(exportData);
            // get meta data about the query, such as the column names
            ResultSetMetaData md = rs.getMetaData();
            int col = md.getColumnCount();
            // one array slot per column, so the writer can write the header row in one call
            String[] arrs = new String[col];
            System.out.println("Number of Column : " + col);
            System.out.println("Columns Name: ");
            // JDBC columns are counted from 1, array indexes from 0
            for (int i = 1; i <= col; i++) {
                String col_name = md.getColumnName(i);
                System.out.println(col_name);
                arrs[i - 1] = col_name;
            }
            // write the column names as the first line of the CSV
            writer.writeNext(arrs);
            // write all the other values in the ResultSet into the CSV file
            writer.writeAll(rs, false);
        } finally {
            // release the file and database resources even if the export fails
            writer.close();
            if (rs != null) {
                rs.close();
            }
            if (stmt != null) {
                stmt.close();
            }
        }
    }

    // retrieve the filename
    public String getFilename() {
        return filename;
    }
}

For the source code (and JAR files) of the parent interface as well as the functional application, please contact me at

