by the Stat Team: Mehrbod Sharifi, Jing Yang
The Stat Project, guided by Professor Eric Nyberg and Anthony Tomasic
Feb. 25, 2009
Introduction to STAT
In this chapter, we introduce the Stat project, its motivation and scope, and define the target audience and stakeholders. We begin by discussing why we believe such a framework will be useful for software engineers and computer science researchers; later chapters provide more detail and evidence.
Stat is an open source machine learning framework in Java for text analysis. Originally, the name Stat abbreviated Semi-Supervised Text Analysis Toolkit, which referred to the implementation of some semi-supervised algorithms in this package. Later, however, we shifted toward defining a framework rather than a particular implementation, so the first S can now be interpreted as “Simple” or “Statistical”.
Applying machine learning approaches to extract information and uncover patterns from textual data has become extremely popular in recent years. Accordingly, many software packages have been developed to enable people to use machine learning for text analytics and to automate the process. Users, however, find many of these packages difficult to use, even for a simple experiment: they have to spend considerable time learning the software, only to find that they still need to write their own programs to preprocess data before the target software will run. We observe that much of this work can be simplified. A new software framework should be developed to ease the process of doing text analytics; we believe researchers and engineers using our framework for textual data analysis will find the process convenient, comfortable, and perhaps even enjoyable.
Existing software for machine-learning-based linguistic analysis has tremendously helped researchers and engineers make new discoveries from textual data, which is unarguably one of the most common forms of data in the real world. As a result, many more researchers, engineers, and possibly students are increasingly interested in using machine learning approaches in their text analytics. These people, some of them experienced users, find that existing software packages are generally not easy to learn or convenient to use.
In the next section, we outline our design goals and summarize how they differentiate Stat from the existing software packages. We also define the scope of our work and our audience in the sections that follow.
Here is the outline of our design goals for the new framework. Most of these points are clarified in the upcoming chapters, but we state them here with a brief introduction:
• Simplicity: This is the most important consideration. Essentially, we reduce the complexity of the API by limiting the hierarchy and number of domain objects and their interactions. We achieve this by defining a clear separation of responsibilities, and we evaluate our success by how quickly someone completely unfamiliar with text analysis and machine learning can understand the toolkit and start using it. This is explained further in the next sections and chapters.
• Extensibility: We focus on how to facilitate extending the package, in other words, implementing within our framework. Combined with simplicity, we hope this will encourage more people to contribute and enable the kind of proven success seen in, for example, Matlab or R.
• Performance: As is widely known, dealing with text is computationally intensive, and we take this into consideration from the ground up (e.g., using Java primitives instead of objects).
• Features: Given the emphasis on extensibility, we give lower priority to implementing many features in this package. Instead, we demonstrate how the package generalizes the approaches of many other packages by “wrapping” those tools so that they can be used in the simplified manner. This also implicitly provides some training for users who would rather continue by moving to one of those packages. As stated previously, we will provide implementations of unsupervised and semi-supervised methods, which are lacking in this domain.
These objectives show how Stat differs from the existing software packages in this domain.
For example, although Weka has a comprehensive suite of machine learning algorithms, it is not designed for text analysis and lacks natural support for representing and processing linguistic concepts. MinorThird, on the other hand, though designed specifically as a package for text analysis, turns out to be rather complicated and difficult to learn. It also does not support semi-supervised and unsupervised learning, which are becoming increasingly important machine learning approaches. Another problem with many existing packages is that they often adopt their own specific input and output formats. Real-world textual data, however, generally comes in other formats that are not readily understood by those packages. Researchers and engineers who want to make use of those packages often find themselves spending much time seeking or writing ad hoc format conversion code. This ad hoc code, which could have been reusable, is often written over and over again by different users.
Researchers and engineers, when presented with common text analysis tasks, usually want a text-specific, lightweight, reusable, understandable, and easy-to-learn package that helps them get their work done efficiently and straightforwardly. Stat is designed to meet these requirements. Motivated by the needs of users who want to simplify their work and experiments related to learning from textual data, we initiated the Stat project, dedicated to providing them with suitable toolkits to facilitate their analytics tasks on textual data.
In a nutshell, Stat is an open source framework aimed at providing researchers and engineers with an integrated set of simplified, reusable, and convenient toolkits for textual data analysis. Based on this framework, researchers can carry out their machine learning experiments on textual data conveniently, and engineers can build their own small applications for text analytics or use the classes designed by others.
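To make the performance goal stated among the design goals more concrete (Java primitives instead of objects), the following is a minimal sketch of a sparse feature vector stored in parallel primitive arrays rather than in a map of boxed Integer and Double objects. The class name SparseVector and all of its methods are hypothetical and chosen only for illustration; they are not taken from the Stat API.

```java
import java.util.Arrays;

/** Hypothetical sparse vector backed by primitive arrays (illustration only). */
final class SparseVector {
    private int[] indices = new int[4];       // feature ids, kept sorted ascending
    private double[] values = new double[4];  // values[i] belongs to indices[i]
    private int size = 0;                     // number of stored entries

    /** Adds delta to the value of featureId, inserting the feature if absent. */
    void add(int featureId, double delta) {
        int pos = Arrays.binarySearch(indices, 0, size, featureId);
        if (pos >= 0) {
            values[pos] += delta;
            return;
        }
        int insertAt = -(pos + 1);
        if (size == indices.length) {
            // Grow both arrays together so they stay parallel.
            indices = Arrays.copyOf(indices, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        // Shift the tail one slot to the right to keep the ids sorted.
        System.arraycopy(indices, insertAt, indices, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        indices[insertAt] = featureId;
        values[insertAt] = delta;
        size++;
    }

    /** Returns the stored value, or 0.0 for a feature that was never added. */
    double get(int featureId) {
        int pos = Arrays.binarySearch(indices, 0, size, featureId);
        return pos >= 0 ? values[pos] : 0.0;
    }

    int size() {
        return size;
    }

    /** Dot product computed by merging the two sorted index arrays. */
    double dot(SparseVector other) {
        double sum = 0.0;
        int i = 0;
        int j = 0;
        while (i < size && j < other.size) {
            if (indices[i] == other.indices[j]) {
                sum += values[i++] * other.values[j++];
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }
}
```

The trade-off in this sketch is that insertion costs a shift of the array tail, in exchange for avoiding one boxed object per entry; whether that is the right balance for Stat would depend on measurements on real corpora.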
The previous section may give the impression of an impossible task. In this section, we clearly state what is and is not included in this project. The main deliverable for this project is a set of specifications that defines a simplified framework for text analysis based on NLP and machine learning. We explain how succinctly the framework can be used and how easily it can be extended. We also provide introductory implementations of the framework, including tools and packages serving as foundation classes of the framework. They are:
• Dataset and framework object adaptors: A set of classes that allow reading and writing files in various formats, supporting importing and exporting datasets as well as loading and saving framework objects.
• Linguistic and machine learning package wrappers: A set of classes that integrate existing tools for NLP and machine learning so that they can be used within the framework. These wrappers hide the implementation details and variations of these packages to provide a set of simplified and unified interfaces to framework users.
• Semi-supervised algorithms: Implementations of certain semi-supervised learning algorithms that are not available in the existing packages.
The goal is NOT to design the most comprehensive machine learning package, nor to compete with or correct the previous packages. We will focus on the goals stated above to create our framework from a different perspective.
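The wrapper idea described above can be sketched as follows: callers program against one small interface, and each wrapper hides how a particular underlying tool works. The interface name Tokenizer and both implementing classes are hypothetical and are not part of the Stat specification. To keep the sketch self-contained, it wraps two JDK facilities (java.text.BreakIterator and a regular expression) in place of external NLP packages.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical unified interface that framework users would program against. */
interface Tokenizer {
    List<String> tokenize(String text);
}

/** Wrapper around the JDK's locale-aware word BreakIterator. */
final class BreakIteratorTokenizer implements Tokenizer {
    private final Locale locale;

    BreakIteratorTokenizer(Locale locale) {
        this.locale = locale;
    }

    @Override
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String candidate = text.substring(start, end);
            // Keep only segments that begin with a letter or digit,
            // which drops whitespace and punctuation segments.
            if (Character.isLetterOrDigit(candidate.codePointAt(0))) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }
}

/** Wrapper around a simple regular-expression split. */
final class RegexTokenizer implements Tokenizer {
    @Override
    public List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("\\W+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }
}
```

Because both classes implement the same interface, code that uses a Tokenizer does not change when one implementation is swapped for the other, which is the property the wrappers are meant to provide for external packages.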
Below is the list of stakeholders and how this project will affect them:
• Researchers, particularly in language technology but also in other fields, will be able to save time by focusing on their experiments instead of dealing with the various input/output formats that text processing routinely requires. They can also easily switch between the available tools and even contribute to STAT so that others can save time by using their adaptors and algorithms.
• Software engineers who are not familiar with machine learning can start using the package in their programs after a very short learning phase. STAT can help them develop a clear understanding of machine learning concepts quickly. They can easily build their applications using the functionality provided by STAT and achieve a high level of performance.
• Developers of learning packages can provide plug-ins for STAT to ease the integration of their packages. They can also delegate some of their interoperability needs to this framework (some of which may be more time-consuming to address within their own package).
• Beginners in text processing and mining want fundamental, easy-to-learn capabilities for discovering patterns in text. They will benefit from this project because it saves them time, facilitates their learning process, and sparks their interest in the area of language technology.
Survey Analysis
2.1 Existing Related Software Packages
In this chapter, we analyze a few main competitors of our project. We focus on two academic toolkits: Weka and MinorThird. We comment on their strengths, explore their limitations, and discuss why and how we can do better than these competitors.
Weka is a comprehensive collection of machine learning algorithms for solving data mining problems, implemented in Java and released as open source under the GPL.
Strengths of Weka
Weka is a very popular software package for machine learning, owing to its main strengths:
• Comprehensive machine learning algorithms. Weka supports most current machine learning approaches for classification, clustering, regression, and association rules.
• Coverage of most aspects of a full data mining process. In addition to learning, Weka supports common data preprocessing methods, feature selection, and visualization.
• Freely available. Weka is open source, released under the GNU General Public License.
• Cross-platform. Weka is fully implemented in Java and therefore runs across platforms.
Because of its comprehensive support for machine learning algorithms, Weka is often used for analytics on many forms of data, including textual data.
Limitations of using Weka for text analysis
However, Weka is not designed specifically for textual data analysis. The most critical drawback of using Weka for processing text is that it does not provide “built-in” constructs for the natural representation of linguistic concepts. (Although there are classes in Weka that support basic natural language processing, they are viewed as auxiliary utilities. They make basic textual data processing with Weka possible, but not convenient or straightforward.) Users interested in using Weka for text analysis often find that they need to write ad hoc programs for text preprocessing and conversion to the Weka representation.
• Poor support for varied text formats. Weka works well with its standard .arff format, which is not a convenient way of representing text. Users have to worry about how to convert textual data from its original format, such as raw plain text, XML, HTML, CSV, Excel, PDF, MS Word, or Open Office documents, into something Weka can understand. As a result, they need to spend time seeking or writing external tools to complete this task before performing their actual analysis.
• Unnecessary data type conversion. Weka is strong at processing nominal (i.e., categorical) and numerical attributes, but not string attributes. In Weka, non-numerical attributes are imported as nominal attributes by default, which is usually not a desirable type for text (imagine treating different chunks of text as different values of a categorical attribute). Users have to explicitly apply filters to do the conversion, which could have been done automatically if Weka knew that text was being imported.
• Lack of specialized support for linguistic preprocessing. Linguistic preprocessing is a very important aspect of textual data analysis, but it is not a primary concern of Weka, which has no component dedicated to it. Weka has a StringToWordVector class that performs all-in-one basic linguistic preprocessing, including tokenization, stemming, stopword removal, and tf-idf transformation. However, it is not very flexible and lacks other techniques (such as part-of-speech tagging and n-gram processing) for users who want fine-grained and advanced linguistic control.
• Unnatural representation of concepts for learning from textual data. Weka is designed for general-purpose machine learning tasks and therefore has to protect against too many variations. As a result, domain concepts in Weka are abstract and high-level, the package hierarchy is deep, and the number of classes explodes. For example, we have to use Instance rather than Document, and Instances rather than Corpus. Weka concepts such as Attribute are obscure in meaning for text processing. Constructing a dataset by first adding many Attribute objects to a cryptic FastVector, which is then passed to an Instances object, appears very awkward to users processing text. Categorizing filters first by attribute/instance and then by supervised/unsupervised confuses non-expert users and makes it hard for them to find the right filter. Many users may feel uncomfortable using Weka programmatically to carry out their text-related experiments.
In summary, users who want an enjoyable experience when performing text analysis need built-in capabilities that naturally support representing and processing text. They need specialized and convenient tools that help them finish the most common text analysis tasks straightforwardly and efficiently. Weka cannot offer this because of its general-purpose nature, despite its comprehensive tools.
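As a contrast to the all-in-one preprocessing criticized above, the following minimal sketch shows what fine-grained, composable preprocessing steps could look like when each step is a plain function over a list of tokens. The class Preprocessing and its methods are hypothetical; they are neither Weka's API nor a committed part of the Stat design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical building blocks for a token-level preprocessing pipeline. */
final class Preprocessing {
    private Preprocessing() {
    }

    /** Lower-cases every token. */
    static Function<List<String>, List<String>> lowercase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                out.add(token.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    /** Drops every token contained in the given stopword set. */
    static Function<List<String>, List<String>> removeStopwords(Set<String> stopwords) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String token : tokens) {
                if (!stopwords.contains(token)) {
                    out.add(token);
                }
            }
            return out;
        };
    }

    /** Replaces the token list with its space-joined word n-grams. */
    static Function<List<String>, List<String>> ngrams(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (int i = 0; i + n <= tokens.size(); i++) {
                out.add(String.join(" ", tokens.subList(i, i + n)));
            }
            return out;
        };
    }
}
```

Steps are chained with Function.andThen, so a user can include, omit, or reorder them individually. For example, lowercase().andThen(removeStopwords(stopwords)).andThen(ngrams(2)) applied to the tokens The, quick, brown, fox with the single stopword "the" yields the bigrams "quick brown" and "brown fox".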
[Figure: Partial UML Domain Model of Weka (Preliminary)]
Note: when ClassA "contains" a number of ClassB objects, Weka probably implements this as ClassA maintaining a "FastVector" whose elements are instances of ClassB.
Figure 3.1: Partial domain model for Weka for basic text analysis
Requirements specifications
Here we first explain in detail the major features of our framework.
• Simplified. APIs are clear, consistent, and straightforward. Users with reasonable Java programming knowledge and a basic grasp of machine learning concepts can learn our package without much effort, understand its logical flow quickly, get started within a short time, and finish the most common tasks with a few lines of code. Since our framework is not designed for general purposes or to include comprehensive features, there is room for us to simplify the APIs and optimize for the most typical and frequent operations.
• Reusable. Built-in modular support is provided for the core routines across the various phases of text analysis, including text format transformation, linguistic processing, machine learning, and experimental evaluation. Additional functionality can easily be built on top of the core framework, and user-defined specifications are pluggable. Existing code can be used across environments and can interoperate with related external packages, such as Weka, MinorThird, and OpenNLP. (Author note: I use "reusable" instead of "extendable" because it covers a higher-level concept that we might also need and be able to follow. What is your opinion?)
• To be added
In this section, we define the most common use cases of our framework and describe them at the level of detail of a casual use case. The “functional requirements” of this project are that users can use the libraries provided by our framework to complete these use cases more easily and comfortably than they could without it.
Actors
Since our framework assumes that all users of interest program against our APIs, there is only one human actor role, namely the programmer. This human actor is always the primary actor. There are some possible secondary and system actors, namely the external packages our framework integrates, depending on which specific use case the primary actor is performing.
Casual Use Cases
Here we present some typical use cases of our framework in a casual format. For better understanding and separation of responsibilities, the use cases are divided into several categories, where each category defines a typical step of doing text analysis.
• Dataset importing and exporting. In this category of use cases, a user wants to read files from different kinds of sources and in different formats into specific in-memory data structures that represent a dataset for further processing, or to write a dataset to files in another format. Some important sample use cases are:
  1. Use case 1. Read a list of raw text files placed in a specified directory of the local file system into a RawCorpus in which each RawDocument represents one text file.
  2. Use case 2. Read a list of HTML files placed in a specified directory of the local file system, strip the tags, and store the result in a RawCorpus in which each RawDocument represents one HTML file.
  3. Use case 3. Read an XML file with a non-Unicode encoding from the Web, specified by a URL, into a RawDocument with its fields appropriately populated.
• Object persistence. In this category of use cases, a user wants to persist objects of our framework to disk in our internal format so that they can be loaded later.
  1. Use case 1.
  2. Use case 2.
  3. Use case 3.
• Structured information extraction.
  1. Use case 1.
  2. Use case 2.
  3. Use case 3.
• Linguistic preprocessing.
  1. Use case 1.
  2. Use case 2.
  3. Use case 3.
• Machine learning.
  1. Use case 1.
  2. Use case 2.
  3. Use case 3.
• Experiment and evaluation.
  1. Use case 1.
  2. Use case 2.
  3. Use case 3.
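The first two dataset importing use cases above (reading a directory of raw text or HTML files into a RawCorpus of RawDocument objects) can be sketched as follows. The class names RawCorpus and RawDocument come from the use cases, but every field, constructor, and method shown here, including fromDirectory and its stripTags flag, is a hypothetical illustration rather than the framework's actual API. The sketch assumes UTF-8 input and strips tags with a naive regular expression, which is not a substitute for a real HTML parser.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical raw document: a file name plus its unprocessed text. */
final class RawDocument {
    final String name;
    final String text;

    RawDocument(String name, String text) {
        this.name = name;
        this.text = text;
    }
}

/** Hypothetical raw corpus: an ordered collection of raw documents. */
final class RawCorpus {
    private final List<RawDocument> documents = new ArrayList<>();

    void add(RawDocument document) {
        documents.add(document);
    }

    List<RawDocument> documents() {
        return Collections.unmodifiableList(documents);
    }

    /**
     * Reads every regular file directly inside dir (no recursion) as UTF-8.
     * When stripTags is true, anything that looks like a markup tag is removed.
     */
    static RawCorpus fromDirectory(Path dir, boolean stripTags) throws IOException {
        List<Path> files = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path path : stream) {
                if (Files.isRegularFile(path)) {
                    files.add(path);
                }
            }
        }
        Collections.sort(files); // deterministic document order

        RawCorpus corpus = new RawCorpus();
        for (Path path : files) {
            String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
            if (stripTags) {
                text = text.replaceAll("<[^>]*>", ""); // naive tag removal
            }
            corpus.add(new RawDocument(path.getFileName().toString(), text));
        }
        return corpus;
    }
}
```

Under these assumptions, use case 1 corresponds to RawCorpus.fromDirectory(dir, false) and use case 2 to RawCorpus.fromDirectory(dir, true). Use case 3 (a non-Unicode XML file fetched from a URL) would additionally need encoding detection and XML parsing, which this sketch does not attempt.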
• Open source. The framework should be made available for public collaboration, allowing users to use, change, improve, and redistribute the software.
• Portability. It should install, configure, and run consistently across platforms, given that it is designed and implemented on the Java runtime environment.
• Documentation. Its code should be readable and self-explanatory, with clear and unambiguous documentation for critical or tricky parts. It should include an introductory guide for users to get started and, preferably, sample datasets, tutorials, and demos so that users can run examples out of the box.
• Performance. It should respond to the user within a reasonable amount of time given a limited amount of data (unclear; needs to be specified). Preferably, it can estimate the running time needed to perform a task and notify the user before the task is actually executed (is this the responsibility of the framework designers?).
• Dependency. This is an open issue. The package integrates other external packages and has many dependencies. How do we resolve this issue? How do we distribute our package?