TEXT MINING: THE SEARCH FOR NOVELTY IN TEXT
(A report submitted in partial fulfillment of the requirements of the Ph.D. Comprehensive Examination in the Department of Computer Science)
by Aditya Kumar Sehgal
Advisor: Prof. Padmini Srinivasan

This paper has been written with a two-fold aim. The first is to provide a survey of text mining research. The second is to identify, at particular points, our current areas of investigation as well as potential directions for future research towards my dissertation.

1

Text Mining

Text mining, also known as Text Data Mining (TDM)[23] and Knowledge Discovery in Textual Databases (KDT)[15], can be described as the process of identifying novel information from a collection of texts (also known as a corpus)[23]. By novel information we mean associations, hypotheses or trends that are not explicitly present in the text source being analyzed. This definition is by no means standard. In [36] Nahm and Mooney describe text mining as “looking for patterns in unstructured text”; in [13] Dörre et al. say “text mining applies the same analytical functions of data mining to the domain of textual information, relying on sophisticated text analysis techniques that distill information from free-text documents”; and in [62] Tan describes text mining as “the process of extracting interesting and nontrivial patterns or knowledge from text documents”. None of them, however, explicitly emphasizes the importance of novelty in the mined information.

In [23] Hearst makes one of the first attempts to clearly establish what constitutes text mining and to distinguish it from information retrieval and data mining. In this highly cited paper, she metaphorically describes text mining as the process of mining precious nuggets of ore from a mountain of otherwise worthless rock. She calls text mining the process of discovering heretofore unknown information from a text source. For example, suppose that one document establishes a relationship between topics A and B and another document establishes a relationship between topics B and C. These two documents jointly establish the possibility of a relationship between A and C, which is novel because no document explicitly relates A and C. Hearst also argues that tasks such as text categorization, text clustering and co-citation analysis cannot be classified as text mining because they do not produce anything novel. She is also ambivalent about terming metadata mining ‘real’ text mining. Using another metaphor, she describes information retrieval, the process of finding documents that contain information of interest to the user, as “finding needles in a haystack”. Since the required information is already present in the text, no new ‘nugget’ of information is revealed.

Text mining is also often confused with data mining, with some people describing text mining as a simple extension of data mining applied to unstructured databases. Hearst disputes this by saying that data mining is not ‘mining’ at all but simply a (semi)automated discovery of patterns/trends across large databases that help in making decisions, and that no new facts are established during this discovery process.

In [28] Kroeze et al. dispute some of Hearst’s definitions. They argue that novelty defined as ‘nuggets of ore in rock’ is a contradiction in her own terms because the nuggets are already present in the rock and so nothing new is being extracted. Although Hearst’s metaphor may not be completely appropriate, we recognize that her key emphasis is on novelty. Kroeze et al. expand on Hearst’s ‘novel’/‘non-novel’ classification by introducing a new class called ‘semi-novel’. They classify data/information retrieval as non-novel and knowledge discovery (standard data mining, metadata mining, and standard text mining) as semi-novel, and they introduce a new type of investigation process called ‘intelligent text mining’, which they classify as truly ‘novel’. They define intelligent text mining as the process of automatic knowledge creation. They stress that artificial intelligence techniques, which help simulate human intelligence, are critical to achieving intelligent text mining.

People also tend to confuse text mining with Information Extraction (IE). Information extraction deals with the extraction of facts about prespecified entities, events or relationships from unrestricted text sources. One can think of information extraction as the creation of a structured representation of select information drawn from text[21]. For example, tagging all the gene names in a collection of text is an information extraction task. There is no notion of novelty involved because only information that is already present is extracted, and therefore IE cannot be classified as text mining. However, information extraction is directly involved in the text mining process. For example, one approach to text mining from semi-structured documents on the web is to use extraction techniques to convert documents into a collection of structured data and then apply data mining techniques for analysis.

In our research we agree with Hearst’s view that novelty with respect to the text collection is a requirement in text mining. However, unlike Hearst, we adopt a more flexible definition of what constitutes ‘novelty’. Additionally, we see a subjective dimension in what is or is not perceived to be novel. For example, some consider summarization and clustering outputs to be novel whereas others do not.

Text mining represents a significant step forward from text retrieval. It is a relatively new and vibrant research area that is changing the emphasis in text-based information technologies from low-level ‘retrieval’ and ‘extraction’ to higher-level ‘analysis’ and ‘exploration’ capabilities. Given the large amount of data available today in the form of text, tools that automatically find interesting relationships, hypotheses or ideas, or that assist the user in finding these, would be extremely useful.

2

Related Areas

Text mining is an inter-disciplinary field using techniques from the fields of information retrieval, natural language processing, machine learning, visualization[29], clustering and summarization[37], among others. We give an overview of some of the ways in which techniques from the first two fields are applied to text mining. We focus on the first two because in our current research we primarily rely on these fields for text mining.

2.1

Information Retrieval

The goal of information retrieval is to retrieve, accurately and quickly, those documents (termed relevant documents) that contain some piece of information a user is interested in. At the same time the emphasis is also on retrieving as few irrelevant documents as possible[63]. The number of relevant documents desired is user-dependent, varying from just one document to all that exist in the collection. The retrieved documents are generally ranked according to some criterion (popularly the cosine similarity score between the document vector and the query vector). Although IR functions are the basis of many interesting applications such as question answering and information filtering[40], our focus here is on their impact on discovering novel information.

In [45] Shatkay et al. explore functional relationships between genes in DNA microarray experiments by searching the biomedical literature and establishing relationships based on the content of retrieved abstracts. These in turn are used to establish functional connections between genes. In [47] Srinivasan describes MeSHMap, a prototypical text mining system which uses metadata associated with each MEDLINE[39] record for creating ‘profiles’ (called MeSH profiles) for input biomedical topics. The cosine similarity between the metadata profiles of two topics then gives some indication of the strength of the relationship between the two concepts. A higher similarity score would indicate that a meaningful relationship between them may exist. In [51, 42] we use this approach to explore relationships between genes. In [50] this approach was the basis of research in which we postulate a beneficial role for turmeric in retinal diseases. This is an example of text mining research leading to hypothesis generation. Other applications and evaluations of MeSHMap are described in [52, 48].

Generally we conclude, along with others[5], that using manually assigned keywords/metadata for text mining proves to be better than using words from free-text[48]. However, we observe, both from our experiments and from others’ research, that the choice between using free-text and metadata for text mining is not clear cut. Most text collections (such as most of the web) do not have any metadata descriptions. Even in those collections that have metadata, the quality is not necessarily as good as with MEDLINE metadata, which is assigned manually by trained indexers at the National Library of Medicine. Even in MEDLINE we observe that the interesting aspects and concepts in the records are not necessarily completely represented in the metadata. Thus, we suggest using a balanced approach where both metadata and carefully extracted phrases from free-text are used. We are currently pursuing experiments exploring strategies based on this approach.
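The cosine similarity score mentioned in this section, both for ranking documents against a query and for comparing the metadata profiles of two topics, can be illustrated with a small sketch. This is not the MeSHMap implementation: the `build_profile` helper, the use of raw term counts as weights, and the toy term labels `t1` to `t6` are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_profile(records):
    """Count how often each metadata term appears across a topic's records.

    Raw counts are an assumption made for brevity; a real system would
    normally apply a term weighting scheme instead.
    """
    return Counter(term for record in records for term in record)


def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two sparse term-weight vectors."""
    shared_terms = set(profile_a) & set(profile_b)
    dot = sum(profile_a[t] * profile_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
    norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as unrelated to everything
    return dot / (norm_a * norm_b)


# Toy records: each record is the list of metadata terms assigned to it.
topic_x = build_profile([["t1", "t2"], ["t2", "t3"]])
topic_y = build_profile([["t2", "t3"], ["t3", "t4"]])
topic_z = build_profile([["t5"], ["t6"]])

print(cosine_similarity(topic_x, topic_y))  # topics sharing terms t2 and t3
print(cosine_similarity(topic_x, topic_z))  # topics with no shared terms
```

Under these assumptions, topics whose records share many metadata terms score close to 1 and topics with disjoint metadata score 0, which is the sense in which a higher score hints at a possible relationship.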

2.2

Natural Language Processing

The ultimate goal of natural language processing (NLP) is to create software that will enable computers to understand and generate language used naturally by humans. Although this goal remains unachieved, a significant positive outcome of NLP research is in the area of information extraction. As mentioned before, this is part of the foundation for text mining, especially when working with free-text collections. There is extensive NLP research on key problems such as word sense disambiguation, part-of-speech tagging, phrase identification and extraction of relations. For example, in [43] Sekimizu et al. use natural language processing techniques to extract relationships between gene products and proteins from MEDLINE by identifying subject and object terms for frequently seen verbs such as activate and interact. Machine learning techniques have also been used, for example, to learn information extraction rules for semi-structured and unstructured text[46] and to learn hidden Markov structures for keyphrase extraction[44].

In our opinion NLP is a critical part of text mining. The use of NLP techniques enables text mining tools to get closer to the semantics of a text source. This is important, especially as text mining systems begin to address the goal of ‘explaining’ the mined information. In our research we are currently exploring NLP for information extraction from MEDLINE. In particular we are exploring methods to retrieve the ‘correct’ set of documents from MEDLINE even when the gene name is ambiguous. Once that is completed we will focus on extracting the relevant ‘gene’ sentences from the text and so on. These methods will also be of value as we move towards mining free-text web pages.

3

Approaches

Text mining provides various approaches for identifying novel information. In the previous section we discussed the contributions of IR and NLP to text mining. Here we describe some specific approaches.

3.1

Association Rule Mining

In [1] Agrawal et al. introduce the notion of mining transaction data for association rules between sets of items in large databases, with a specified confidence level. An association rule is defined as an implication of the form X ⇒ Y_i, where X is a set of items and Y_i is an item not present in the set X. The items in the set X are termed the antecedents and Y_i is called the consequent. The support s of an association rule is defined as the percentage of the total number of transactions that contain both X and Y_i. The confidence c of an association rule is defined as the percentage of the transactions containing X that also contain Y_i. Association rules allow us to view implicit relationships between different entities, and the confidence factor associated with each rule allows us to rank them. Thus one can execute queries such as “Give me the top 10 association rules for item A”, where A is the antecedent.

Association rules are mined in two steps. In the first step those itemsets that have support above a specified minimum support level are identified. Such an itemset is usually called a frequent itemset. This step can be computationally very intensive. In the second step, association rules are formed by finding all non-empty proper subsets a_i of each frequent itemset f and generating rules of the form a_i ⇒ (f − a_i) if the ratio of the support of f to the support of a_i (that is, the confidence of the rule) is greater than a threshold. The Apriori algorithm[2] and the Direct Hashing and Pruning algorithm[41] are two well known algorithms used to mine association rules.

In the context of text mining, association rules have been used to discover potentially interesting relationships between concepts that co-occur in documents[14, 5]. Association rules have also been used to establish relationships between documents that do not share any terms. For example[24], consider an association rule of the form B ⇒ C, where B and C are words, whose confidence level is above the threshold. Then the document set retrieved for C can be expanded to include B’s document set. This allows us to find those documents that do not contain C but are still related to it because they contain the related term B. Latent Semantic Indexing (LSI)[12], a technique based on singular value decomposition, maps documents to a lower dimensional space where documents are considered close to each other if they share a sufficient number of term-based associations. This allows the original document set for a query to be expanded to include those documents that are closely located in the lower dimensional space.

The benefits of working with association rules are limited by the fact that mining association rules from text databases is a computationally intensive process. This is mainly because of the high dimensionality of the feature space: the number of items (more realistically words) that need to be considered when creating frequent itemsets is orders of magnitude larger than the number of items in a set of transactions in a business setting. These factors reduce the effectiveness of the Apriori and Direct Hashing and Pruning algorithms in text contexts. In [24] Holt and Chung propose two new algorithms, viz. the Multipass-Apriori and Multipass-Direct Hashing and Pruning algorithms, that effectively mine association rules for text databases.

In addition to the generation of association rules, co-occurrence forms the basis of a number of other text mining projects, especially in the biomedical domain. In [53] Stapley et al. propose a co-occurrence based gene network in which the relationship between two genes is assessed on the basis of its bibliographic distance. Graphical tools such as PubGene[25] depict co-occurrence based links, derived from MEDLINE, between 13,712 human genes.
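The two-step procedure for mining association rules can be sketched on a toy collection in which each document is treated as a transaction and each word as an item. The sketch below enumerates candidate itemsets by brute force, so it is not the Apriori or Direct Hashing and Pruning algorithm; the function names, the `max_size` cap and the toy documents are assumptions made for illustration.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Step 1 (brute force): keep every itemset whose support is high enough."""
    items = sorted(set().union(*transactions))
    found = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                found[frozenset(candidate)] = s
    return found


def rules(frequent, min_confidence):
    """Step 2: emit a => (f - a) when support(f) / support(a) is high enough."""
    out = []
    for f, supp_f in frequent.items():
        for size in range(1, len(f)):  # non-empty proper subsets of f
            for a in combinations(sorted(f), size):
                a = frozenset(a)
                # Every subset of a frequent itemset is itself frequent,
                # so its support was already recorded in step 1.
                confidence = supp_f / frequent[a]
                if confidence >= min_confidence:
                    out.append((a, f - a, supp_f, confidence))
    return out


# Toy "documents", each reduced to its set of words.
transactions = [
    {"gene", "protein"},
    {"gene", "protein"},
    {"gene", "protein", "cell"},
    {"gene"},
    {"cell"},
]

frequent = frequent_itemsets(transactions, min_support=0.3)
for antecedent, consequent, s, c in rules(frequent, min_confidence=0.9):
    print(sorted(antecedent), "=>", sorted(consequent), s, c)
```

With this toy data, {gene, protein} appears in 3 of 5 documents and gene alone in 4 of 5, so gene ⇒ protein has confidence 0.75 while protein ⇒ gene has confidence 1.0; only the latter survives a 0.9 threshold. The brute-force enumeration also makes the cost problem in text visible: the number of candidate itemsets grows combinatorially with vocabulary size.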

3.2

Open and Closed Discovery

In [55] Swanson describes an approach that discovers relationships between concepts that are logically related but not bibliographically related, i.e., do not co-occur in any document. This approach forms the basis of the ARROWSMITH[59] system. The general idea is that two concepts A and C might be related if A co-occurs in some document with some intermediate concept B, and B co-occurs in some document with C. This implication-based discovery process was successfully used by the authors to discover several novel relationships, such as connections between Raynaud’s disease and fish oils[55], and migraine and magnesium[56], among others[57, 58, 60].

Swanson and Smalheiser essentially designed two kinds of discovery processes that were later named ‘open’ and ‘closed’ discovery processes[64]. The input to the open discovery process is a single concept (A) and the goal is to find related concepts (C) that do not co-occur with A in any document in the collection, i.e., the relationship between A and C has not been explored yet. Their process begins with a literature search for A, and interesting phrases (B concepts) are extracted from the titles of the documents retrieved for A. These B concepts are then used to initiate another round of literature search. Interesting phrases (C concepts) in the documents retrieved for the B concepts are then extracted. By analyzing (reading) the two sets of documents one can establish which B concepts connect the A and C concepts in a potentially interesting and novel way. In the closed discovery process one starts with both the A and C concepts and the goal is to establish potentially interesting B concepts that overlap with both A and C and connect them in a novel way.

A major strength of both the open and closed discovery processes is that they have been successfully used to suggest novel connections between concepts. These processes, as implemented by Swanson[59] and others[64], are however only semi-automatic, as the B and C concepts have to be manually selected and the connections between the concepts have to be manually inspected by a domain expert.
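The A, B, C logic of the open and closed discovery processes can be sketched with plain set operations over a toy collection. This is not the ARROWSMITH implementation: documents are assumed to be already reduced to sets of concepts, the manual selection and expert inspection steps are replaced by simple co-occurrence, and the function names and concept labels are hypothetical.

```python
def cooccurring(concept, docs):
    """All concepts that appear in at least one document together with `concept`."""
    found = set()
    for doc in docs:
        if concept in doc:
            found |= doc
    found.discard(concept)
    return found


def open_discovery(a, docs):
    """Map each candidate C to the B concepts linking it to A.

    A candidate C co-occurs with some B that co-occurs with A,
    but never co-occurs with A itself.
    """
    b_concepts = cooccurring(a, docs)
    links = {}
    for b in b_concepts:
        for c in cooccurring(b, docs) - b_concepts - {a}:
            links.setdefault(c, set()).add(b)
    return links


def closed_discovery(a, c, docs):
    """B concepts that co-occur with both A and C."""
    return cooccurring(a, docs) & cooccurring(c, docs)


# Toy collection: each document is the set of concepts it mentions.
docs = [
    {"A", "B1"},
    {"A", "B2"},
    {"B1", "C1"},
    {"B2", "C1"},
    {"B2", "C2"},
    {"A", "C3"},   # C3 already co-occurs with A, so it is not novel
    {"B1", "C3"},
]

print(open_discovery("A", docs))
print(closed_discovery("A", "C1", docs))
```

Here C1 is reached through both B1 and B2 and C2 through B2 only, while C3 is excluded because one document already relates it to A. In practice the candidate lists are far too large to use unfiltered, which is why the processes described above depend on manual selection and expert review.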


3.3

Metadata Mining

Metadata is defined as data that describes data. Instead of using free-text one may choose to use metadata, where available, for text mining. It is much easier to work with metadata as it is more structured than free text. In some cases the metadata for a text collection are manually created, as in MEDLINE, whereas in other cases metadata are automatically generated. Feature selection is an important part of text mining as it has a profound effect on the data model produced. With metadata a significant amount of feature selection is implicitly present. Using metadata can also significantly reduce the size of the feature space required to model the text collection[5], which makes the approach more suitable for large text collections.

There are several approaches in text mining that are built on metadata mining. In [64] the authors replicate Swanson and Smalheiser’s experiments on Raynaud’s disease and fish oils by limiting the interesting phrases extracted (Bs and Cs in the ‘open’ discovery process) to metadata terms associated with the MEDLINE records. These metadata are known as MeSH (Medical Subject Heading) terms. In [47] Srinivasan replicates Swanson and Smalheiser’s experiments using MeSH profiles for topics and also using IR-based term weighting schemes to identify interesting MeSH term connections between the topics. In [14] Feldman and Hirsh use the co-occurrence frequencies of metadata terms, which are taken from a hierarchically arranged vocabulary, to mine a text collection. As mentioned earlier, we were able to postulate a beneficial role for turmeric in retinal diseases[50]. We were also able to postulate beneficial roles in the context of Crohn’s disease and problems related to the spinal cord[49]. In both of these papers the open discovery approach was used.


4

Text Mining applied to Specific Domains

There are many domain-specific text collections available electronically. Besides MEDLINE[39], we have, for example, Reuters newswire data, 10-K filings of companies, archives of mailing lists dealing with specific subject areas, and collections of customer emails, product reviews, etc. that are generally maintained by companies. Since these corpora are domain specific, they motivate the design of customized text mining algorithms that can use prior domain knowledge and thus work better than generic text mining algorithms. In this section we describe some specialized domains where text mining techniques are being used.

4.1

Bioinformatics

The biomedical research literature is a very promising target for text mining. Given the extensive presence of biomedical papers in digital form, as well as their formal and technical vocabulary, these papers offer a profitable area for automatic text mining. Moreover, the high level of interest in biotechnology has made this area one of the most active application domains for text mining. In fact a recent paper in Nature coins the term ‘conceptual biology’ for the science of text mining in biology and describes its value in fueling progress in bioinformatics[35].

Most of the text mining research in this domain has been done in the context of MEDLINE. MEDLINE records consist of a title, an abstract, a set of manually assigned metadata terms (known as MeSH terms), and several other fields. The huge and growing size of MEDLINE makes it almost impossible for someone to keep abreast of all the literature in their domain. Also, given the inter-disciplinary nature of research, apart from one’s own field one also needs to keep track of related fields. This further underlines the challenge in biomedical research. Therefore, tools that filter through the literature and help in discovering new relationships and suggesting hypotheses are highly valued. Various text mining approaches, such as co-occurrence based mining[5], IR-based metadata profiling[47] and speculative sentence annotation[30], have been proposed almost exclusively for MEDLINE data. Some of these approaches have been previously discussed in this paper.

A particular sub-problem in bioinformatics that has received a fair amount of attention from text mining researchers is gene/protein analysis. This is partly due to the large amount of literature on genes and proteins and partly due to the high level of interest in genomic research. Automatic annotation (identification in text) of gene and protein names[65] is an important part of this research. A key motivation is that once these entities are annotated in texts, it will become easier for scientists to connect the information available in MEDLINE with that in allied databases such as LocusLink[17], OMIM[18] and SwissProt[38]. What makes this task challenging is the inherent ambiguity associated with gene/protein nomenclature. Dealing with synonymy and homonymy with respect to gene and protein names is part of this challenge. Strictly speaking, the gene/protein annotation problem is an example of information extraction. However, we intentionally refer to this research as it is a fundamental problem that seriously impacts higher level text mining capabilities in this domain. Besides genes and proteins, researchers also wish to extract several other entities including organs, cells, methods and, more broadly, biological pathways.

The annotation methods used are quite varied. In [67, 22] machine learning approaches have been used to disambiguate gene and protein names and assign appropriate class labels to them. Hidden Markov Model based approaches[32, 34] have also been used for the same problem. Operating on top of such annotation efforts, we observe mining of gene and protein functions[3], functional relationships between genes[45, 26, 54], protein-protein interactions[33, 6], and interactions between genes and gene products[43]. All of these define active areas of research. There have also been efforts in visualization for this sub-domain. PubGene (mentioned before) and Gennav[7] are examples in this regard.

4.2

Business Intelligence

One of the major concerns of any business is to minimize the amount of guesswork involved in decision making and thereby reduce risk. Most data mining techniques, such as association rule mining and data warehousing, were originally created to help remove or alleviate this uncertainty, so that decision making could be more sound. The problem with data mining is that it can help only up to a certain point, since the majority of data available with a company (reports, memos, emails, planning documents, etc.) is in the form of text. Since text is not structured enough for data mining techniques to apply, text mining holds plenty of promise. For example, text mining techniques, built by combining methods for feature selection, clustering and summarization, allow business professionals to extract important words/patterns from documents, group related documents together, read only summaries and drill down to the full documents as necessary, thereby saving precious time and energy. Data mining and text mining techniques can also complement each other. For example, data mining techniques may be used to reveal the occurrence of a particular event while text mining techniques may be used to look for an explanation of the event. Text mining can also be used to identify implicit connections, and it is here that its largely untapped potential value for businesses lies.

Research in the application of text mining techniques to this area is encouraging. In [4] Bernstein et al. analyze co-occurrence based association rules that relate different companies. Their analysis is done on over 22,000 business news stories. Initially they use information extraction software (ClearForest[11]) to extract the set of company names from the text. They then use disambiguation techniques on this set to identify all the unique company names. For example, H.P. and Hewlett Packard are names for the same company. A graph structure is used to visualize the model they generate. Each node in the graph represents a company and an edge represents a co-occurrence based association between two companies. To eliminate random associations they link two companies only if the strength of their association is above a minimum support threshold. From this graph they were able to identify hubs, which represent dominant companies in different industries. They also used the vector space model (from IR) to represent companies as weighted link vectors. They use the cosine similarity score between a company vector and the average industry vector as an estimate of the relatedness of a company to its industry. Additionally, the similarity between different average industry vectors gives a measure of how closely related the industries are to each other. For example, they found that the computer software industry and the computer hardware industry were fairly closely related. Although this research did not reveal any new knowledge, as acknowledged by the authors, we can use these techniques to explore relationships other than co-occurrence, such as those between sales, customers, etc., which can hopefully lead to interesting conclusions.

In [27] Gerdes describes EDGAR-Analyzer, a text mining tool that analyzes the free-text portion of records in the EDGAR database, maintained by the Securities and Exchange Commission (SEC). EDGAR consists of financial and operational disclosures of public companies, and contains over 650 GB of data. This tool allows the user to specify subject areas of interest, which it then uses to extract relevant concepts from the text, backed up by the actual text passages that contain those concepts. This kind of analysis can help in monitoring key company characteristics, which may then be used by investors for making investment decisions. Of interest to us is a case study in the paper wherein Gerdes uses his methods to explore, via company filings, the different extents to which companies were prepared for the Y2K problem at the end of the last century.
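The company co-occurrence graph described earlier in this section for the work of Bernstein et al. can be sketched as follows. This is an illustrative simplification rather than their method: the association strength is assumed to be a raw count of stories in which two companies co-occur, company names are assumed to be already extracted and disambiguated, and the company names and function names are invented.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_graph(stories, min_support):
    """Link two companies if they co-occur in at least `min_support` stories."""
    pair_counts = Counter()
    for companies in stories:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(companies), 2):
            pair_counts[pair] += 1
    return {pair: n for pair, n in pair_counts.items() if n >= min_support}


def degrees(edges):
    """Number of retained links per company; high-degree nodes act as hubs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


# Toy stories: each story is the set of (fictional) companies it mentions.
stories = [
    {"Acme", "Bolt"},
    {"Acme", "Bolt", "Cogs"},
    {"Acme", "Cogs"},
    {"Acme", "Cogs"},
    {"Bolt", "Dyne"},
]

edges = cooccurrence_graph(stories, min_support=2)
print(edges)
print(degrees(edges).most_common(1))
```

With a threshold of 2, the one-off pairs (Bolt, Cogs) and (Bolt, Dyne) are dropped as possibly random associations, and Acme emerges as the only node with two links.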

4.3

Terrorism

The use of text mining techniques in counter-terrorism efforts is very recent. Government agencies are investing considerable resources in the surveillance of all kinds of communication, such as email. Since time is critical and given the scale of the problem, it is infeasible to monitor email manually. Thus automatic text mining techniques offer considerable promise in this area.

In [61] Swanson et al. apply text mining techniques to existing medical literature to identify those viruses that can potentially be used as biological weapons and where such capability is not yet recognized. They essentially partition the literature on viruses in MEDLINE into two parts. The first part consists of documents that talk about the genetic aspects of virulence, and the second part consists of documents that talk about the transmission of viral diseases. A virus that can be used as a biological weapon would have both these properties. They then created a list of virus terms common to these two sets. They found that most of the viruses that had already been identified as potential biological weapons were present in this list. They hypothesize that, since the other viruses on the list share important properties with the known biological agents, they too are potential biological weapons.

There is far more emphasis on the prevention of terrorism now than there was in the past. It is imperative for security agencies to be able to analyze large amounts of text quickly and accurately, and also to understand the implicit connections between various sources of information. An example of a text mining system in practical use is the COPLINK system[10]. Developed at the University of Arizona in Tucson and currently being used by local police there and in several states, this system can extract concepts from multiple text databases, maintained by different agencies, and discover hidden links between them. We feel that the drive to solve problems in this area will provide a key motivation for the development of future text mining techniques.
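The reasoning behind the two-part literature partition described above for [61] reduces to a set intersection followed by a comparison against an already known list. The sketch below shows only that logic on placeholder data; the function names and the labels `v1` to `v4` are hypothetical, and documents are assumed to be already reduced to sets of terms.

```python
def terms_in(documents, vocabulary):
    """Vocabulary terms mentioned in at least one of the documents."""
    return {term for doc in documents for term in doc if term in vocabulary}


def shared_candidates(part_one, part_two, vocabulary, already_known):
    """Split the terms common to both partitions into known and new ones."""
    shared = terms_in(part_one, vocabulary) & terms_in(part_two, vocabulary)
    return shared & already_known, shared - already_known


# Placeholder data: v1..v4 stand for vocabulary terms, x and y for other words.
vocabulary = {"v1", "v2", "v3", "v4"}
part_one = [{"v1", "x"}, {"v2"}, {"v3"}]   # documents on the first property
part_two = [{"v1"}, {"v3", "y"}, {"v4"}]   # documents on the second property
already_known = {"v1"}

recovered, new_candidates = shared_candidates(
    part_one, part_two, vocabulary, already_known
)
print(recovered, new_candidates)
```

The share of already known terms that are recovered gives a rough check on the partitioning, while the remaining shared terms are the ones put forward as hypotheses for expert review.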

5

Web Mining

There is no source of information today that can compare to the world wide web in terms of sheer size and diversity. Consisting of billions of web pages created by millions of people, the web is a mammoth repository of both explicit and implicit knowledge. Although the web consists of different kinds media such as text, sound,


video, etc., an overwhelming majority of the web is in the form of text. This offers a tremendous incentive for the development of text mining algorithms that can work on such a gargantuan scale. The web is one of the greatest expressions of democracy on the planet. Anyone is free to post whatever they want, and no group of people controls the content of the web. This brings new kinds of challenges to the design of mining algorithms, such as verifying extracted facts and assessing the reliability of any discovered novel information. Given its ‘one-web-for-all’ nature, the likelihood of false positive relations being identified is greater than when mining a specialized corpus like MEDLINE. Given the absence of any sense of control on the web, text mining will depend, to a great degree, on filtering techniques[66] that can eliminate low quality information. The web is a mix of semi-structured and free-text data. Moreover, metadata is seldom present, and where present it is used inconsistently. Natural language processing techniques that parse text and attach semantics to words, by named entity tagging, thus have an important role in web text mining.

Despite these challenges there is tremendous interest in using mining techniques on the web. An excellent, although slightly outdated, survey on this subject has been written by Chakrabarti[9]. In this survey, he explains some of the problems and challenges associated with mining data on the web. He gives an overview of supervised, semi-supervised and unsupervised learning techniques that can be applied to the semi-structured and free-text data available on the web. He also briefly describes an analysis of the web as a social network. In [31] Liu et al. describe techniques that mine the web for definitions and related sub-topics for a given topic of interest. Literature-based discovery techniques, in particular the ‘open’ discovery model pioneered by Swanson and discussed earlier, have been shown to hold promise for establishing novel hypotheses that are based on hidden connections in the web[20].

Database techniques are also being studied in the context of web mining. These techniques primarily attempt to organize the semi-structured data available on the web into more structured collections. One can then use a querying mechanism to analyze the data. A survey in this area can be found in [16].

Since the web can also be viewed as an extremely large graph, graph theoretic methods have become quite popular for analyzing the structure of the web. Individual web pages represent the nodes of the graph and hyperlinks between web pages are the edges. This link structure of the web has been used in many interesting ways to improve web searching[8]. But for text mining, the following question may be asked: Can this link structure, along with evidence from the nodes, be used for discovering new information? For example, links alone have been used to automatically identify communities on the web[19]. What is the effect of adding content-based criteria to such algorithms?

The web contains a vast amount of information that is still untapped. In our opinion this offers a significant incentive for the development of efficient text mining algorithms. Existing techniques are neither easily transferable to web-based mining problems nor efficient for them. Research with the goal of mining for ‘novel’ information from the web is still at an early stage. We believe that this area will need aggressive research involving strategies from information retrieval, machine learning, and natural language processing. Discovery of novel information from the web is an extremely challenging task but, as is the case with all challenging tasks, the potential rewards are great. Hence we are currently outlining specific text mining problems and strategies for the web.
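
The graph view of the web described in this section can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: it stores a hypothetical toy web as a dictionary mapping each page to the pages it links to, and scores pages with a simplified PageRank-style power iteration in the spirit of the link analysis used for web search[8]. The page names, the damping value of 0.85 and the iteration count are assumptions chosen for the example; they are not taken from any of the surveyed systems.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Simplified PageRank by power iteration over a dict-of-lists graph."""
    # Nodes are every page that appears as a link source or a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[page] for page in pages if not links.get(page))
        new_rank = {page: (1.0 - damping) / n + damping * dangling / n
                    for page in pages}
        # Each page passes an equal share of its rank along every hyperlink.
        for source, targets in links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: nodes are pages, edges are hyperlinks.
toy_web = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}
scores = pagerank(toy_web)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

In this toy graph the page with the most incoming links receives the highest score. A content-aware variant, of the kind the question above asks about, would additionally weight each edge by some measure of textual similarity between the linked pages; that extension is not shown here.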

6

Conclusion

In this paper we have presented an overview of text mining and surveyed some of the techniques used to discover knowledge from text databases. We have described some of the issues concerning the application of text mining techniques to specialized corpora and also to the web. Despite the absence of a commonly agreed definition for text mining, we observe this to be a highly active research area involving diverse methods and seeking different kinds of novel information. Most of the research emphasis appears to be in bioinformatics. However, we observe increasing attention being given to the more general web content. Development of efficient text mining tools is critical if current and future information needs are to be satisfied. The vastness and diversity of the text data available, as well as its semi-structured or unstructured nature, make research in this field both challenging and exciting.

References

[1] Rakesh Agrawal, Tomasz Imielinski, and Arun N. Swami. Mining association rules between sets of items in large databases. In Peter Buneman and Sushil Jajodia, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216, Washington, D.C., May 1993.
[2] Rakesh Agrawal and Ramakrishnan Srikant. Fast algorithms for mining association rules. In Jorge B. Bocca, Matthias Jarke, and Carlo Zaniolo, editors, Proceedings of the 20th Int. Conf. Very Large Data Bases, VLDB, pages 487–499. Morgan Kaufmann, 1994.
[3] Miguel A. Andrade and Alfonso Valencia. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics, 14(7):600–607, 1998.
[4] A. Bernstein, S. Clearwater, S. Hill, C. Perlich, and F. Provost. Discovering knowledge from relational data extracted from business news. In Sašo Džeroski, Luc De Raedt, and Stefan Wrobel, editors, MRDM02, pages 7–20. University of Alberta, Edmonton, Canada, July 2002.
[5] Catherine Blake and Wanda Pratt. Better rules, few features: A semantic approach to selecting features from text. In ICDM, pages 59–66, 2001.
[6] Christian Blaschke, Miguel A. Andrade, Christos Ouzounis, and Alfonso Valencia. Automatic extraction of biological information from scientific text: Protein-protein interactions. In Proceedings of the International Conference on Intelligent Systems for Molecular Biology, Heidelberg, pages 60–67, 1999.
[7] Olivier Bodenreider. Gennav, May 2003. (Online) Available: http://etbsun2.nlm.nih.gov:8000/perl/gennav.pl.
[8] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7):107–117, 1998.
[9] Soumen Chakrabarti. Data mining for hypertext: A tutorial survey. SIGKDD: SIGKDD Explorations: Newsletter of the Special Interest Group (SIG) on Knowledge Discovery & Data Mining, ACM, 1, 2000.
[10] Hsinchun Chen. COPLINK, April 2004. (Online) Available: http://www.coplink.net/.
[11] ClearForest Corp. ClearForest :: From Information To Action, April 2004. (Online) Available: http://www.clearforest.com.
[12] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W. Furnas, and Richard A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.
[13] Jochen Dörre, Peter Gerstl, and Roland Seiffert. Text Mining: Finding Nuggets in Mountains of Textual Data. In Knowledge Discovery and Data Mining, pages 398–401, 1999.
[14] R. Feldman and H. Hirsh. Mining Text Using Keyword Distributions. Intelligent Information Systems, 10(3):281–300, 1998.


[15] Ronen Feldman and Ido Dagan. Knowledge discovery in textual databases (KDT). In Knowledge Discovery and Data Mining, pages 112–117, 1995.
[16] Daniela Florescu, Alon Y. Levy, and Alberto O. Mendelzon. Database techniques for the world-wide web: A survey. SIGMOD Record, 27(3):59–74, 1998.
[17] National Center for Biotechnology Information. Locuslink introduction, May 2003. (Online) Available: http://www.ncbi.nlm.nih.gov/LocusLink/.
[18] National Center for Biotechnology Information. Online mendelian inheritance in man, May 2003. (Online) Available: http://www.ncbi.nlm.nih.gov/omim/.
[19] David Gibson, Jon M. Kleinberg, and Prabhakar Raghavan. Inferring web communities from link topology. In UK Conference on Hypertext, pages 225–234, 1998.
[20] Michael Gordon, Robert K. Lindsay, and Weiguo Fan. Literature-based discovery on the world wide web. ACM Transactions on Internet Technology (TOIT), 2(4):261–275, 2002.
[21] Ralph Grishman. Information extraction: Techniques and challenges. In SCIE, pages 10–27, 1997.
[22] Vasileios Hatzivassiloglou, Pablo A. Duboué, and Andrey Rzhetsky. Disambiguating proteins, genes, and rna in text: a machine learning approach. Bioinformatics, 17(1):S97–S106, 2001.
[23] M. Hearst. Untangling text data mining. In Proceedings of ACL’99: the 37th Annual Meeting of the Association for Computational Linguistics, 1999.
[24] John D. Holt and Soon M. Chung. Efficient Mining of Association Rules in Text Databases. In Proceedings of the eighth international conference on information and knowledge management, pages 234–242, 1999.
[25] T.K. Jenssen, A. Lægreid, J. Komorowski, and E. Hovig. A literature network of human genes for high-throughput analysis of gene expression, May 2001. (Online) Available: http://www.pubgene.org/.
[26] Tor-Kristian Jenssen, Lisa M.J. Öberg, Magnus L. Andersson, and Jan Komorowski. Methods for Large-Scale Mining of Networks of Human Genes. In Proceedings of The First SIAM Conference on Datamining, Chicago, April 2001.


[27] John Gerdes Jr. Edgar-analyzer: automating the analyses of corporate data contained in the sec’s edgar database. Decision Support Systems, 35(1):7–29, 2003.
[28] Jan H. Kroeze, Machdel C. Matthee, and Theo J. D. Bothma. Differentiating data- and text-mining terminology. In Proceedings of the 2003 annual research conference of the South African institute of computer scientists and information technologists on Enablement through technology, pages 93–101, 2003.
[29] David Landau, Ronen Feldman, Yonatan Aumann, Moshe Fresko, Yehuda Lindell, Orly Liphstat, and Oren Zamir. Textvis: An integrated visual environment for text mining. In Principles of Data Mining and Knowledge Discovery, pages 56–64, 1998.
[30] M. Light, X.Y. Qiu, and P. Srinivasan. The language of bioscience: Facts, speculations, and statements in between. In HLT Biolink (To appear), May 2004.
[31] Bing Liu, Chee Wee Chin, and Hwee Tou Ng. Mining topic-specific concepts and definitions on the web. In Proceedings of the twelfth international World Wide Web conference (WWW-2003), Budapest, 2003.
[32] W.H. Majoros, G.M. Subramanian, and M.D. Yandell. Identification of key concepts in biomedical literature using a modified Markov heuristic. Bioinformatics, 19(3):402–407, 2003.
[33] E.M. Marcotte, I. Xenarios, and D. Eisenberg. Mining literature for protein-protein interactions. Bioinformatics, 17(4):359–363, April 2001.
[34] Alex Morgan, Lynette Hirschman, Alexander Yeh, and Marc Colosimo. Gene name extraction using FlyBase resources. In Sophia Ananiadou and Jun’ichi Tsujii, editors, Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 1–8, 2003.
[35] M.V. Blagosklonny and A.B. Pardee. Conceptual biology: unearthing the gems. Nature, 416(6879):373–374, June 2002.
[36] Un Yong Nahm and Raymond J. Mooney. Text mining with information extraction.
[37] J. Neto, A. Santos, C. Kaestner, and A. Freitas. Document clustering and text summarization. In Proceedings, 4th International Conference on Practical Applications of Knowledge Discovery and Data Mining (PADD-2000), London, pages 41–55, 2000.
[38] Swiss Institute of Bioinformatics (SIB). Expasy - swiss-prot and trembl, April 2004. (Online) Available: http://us.expasy.org/sprot/.


[39] National Library of Medicine. Entrez-pubmed, May 2003. (Online) Available: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed.
[40] National Institute of Standards and Technology. Text REtrieval Conference (TREC) Home Page, April 2004. (Online) Available: http://www.trec.nist.gov.
[41] Jong Soo Park, Ming-Syan Chen, and Philip S. Yu. Using a hash-based method with transaction trimming for mining association rules. Knowledge and Data Engineering, 9(5):813–825, 1997.
[42] A.K. Sehgal, X.Y. Qiu, and P. Srinivasan. Mining medline metadata to explore genes and their connections. In Proceedings of the SIGIR 2003 Workshop on Text Analysis and Search for Bioinformatics, July 2003.
[43] T. Sekimizu, H.S. Park, and J. Tsujii. Identifying the interactions between genes and gene products based on frequently seen verbs in medline abstracts. In S. Miyano and T. Takagi, editors, Genome Informatics (GIW’98), pages 62–71. Universal Academy Press, Tokyo, Japan, 1998.
[44] Kristie Seymore, Andrew McCallum, and Roni Rosenfeld. Learning hidden Markov model structure for information extraction. In AAAI 99 Workshop on Machine Learning for Information Extraction, 1999.
[45] H. Shatkay, S. Edwards, W. J. Wilbur, and M. Boguski. Genes, themes and microarrays: Using information retrieval for large-scale gene analysis. In Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology, pages 317–328, La Jolla, California, USA, August 2000.
[46] Stephen Soderland. Learning to extract text-based information from the world wide web. In Knowledge Discovery and Data Mining, pages 251–254, 1997.
[47] P. Srinivasan. Meshmap: A text mining tool for medline. In Proceedings of the American Medical Informatics Annual Symposium, pages 642–646, 2001.
[48] P. Srinivasan. Text mining: Generating hypotheses from medline. Journal of the American Society for Information Science And Technology, 55(5):396–413, 2004.
[49] P. Srinivasan and B. Libbus. Mining MEDLINE for Implicit Links between Dietary Substances and Diseases. Bioinformatics, To appear, 2004.


[50] P. Srinivasan, B. Libbus, and A.K. Sehgal. Mining medline: Postulating a beneficial role for curcumin longa in retinal diseases. In HLT Biolink (To appear), May 2004.
[51] P. Srinivasan and A.K. Sehgal. Mining medline for similar genes and similar drugs, 2003. Technical Report: Department of Computer Science, The University of Iowa.
[52] P. Srinivasan and M. Wedemeyer. Mining concept profiles with the vector model or where on earth are diseases being studied? In Proceedings of the Text Mining Workshop. Third SIAM International Conference on Data Mining, San Francisco, May 2003.
[53] B. Stapley and G. Benoit. Biobibliometrics: Information retrieval and visualization from co-occurrences of gene names in medline abstracts. In Proceedings of the Pacific Symposium on Biocomputing, pages 529–540, 2000.
[54] M. Stephens, M. Palakal, S. Mukhopadhyay, R. Raje, and J. Mostafa. Detecting gene relations from medline abstracts. In Proceedings of the Sixth Annual Pacific Symposium on Biocomputing (PSB 001), 2001.
[55] Don R. Swanson. Fish oil, raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30:7–18, 1986.
[56] Don R. Swanson. Migraine and magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31:526–557, 1988.
[57] Don R. Swanson and N. R. Smalheiser. Indomethacin and alzheimer’s disease. Neurology, 46:583, 1996.
[58] Don R. Swanson and N. R. Smalheiser. Linking estrogen to alzheimer’s disease: An informatics approach. Neurology, 47:809–810, 1996.
[59] Don R. Swanson and N. R. Smalheiser. An interactive system for finding complementary literatures: a stimulus to scientific discovery. Artificial Intelligence, 91:183–203, 1997.
[60] Don R. Swanson and N. R. Smalheiser. Calcium-independent phospholipase a2 and schizophrenia. Archives of General Psychiatry, 55(8):752–753, 1998.
[61] Don R. Swanson, Neil R. Smalheiser, and A. Bookstein. Information discovery from complementary literatures: Categorizing viruses as potential weapons. Journal of the American Society for Information Science And Technology, 52(10):797–812, 2001.


[62] A. Tan. Text mining: The state of the art and the challenges. In Proceedings of the Pacific Asia Conf on Knowledge Discovery and Data Mining PAKDD’99 workshop on Knowledge Discovery from Advanced Databases, pages 65–70, 1999.
[63] C. J. Van Rijsbergen. Information Retrieval, 2nd edition. Dept. of Computer Science, University of Glasgow, 1979.
[64] Marc Weeber, Henny Klein, Lolkje T.W. de Jong-van den Berg, and Rein Vos. Using Concepts in Literature-Based Discovery: Simulating Swanson’s Raynaud-Fish Oil and Migraine-Magnesium Discoveries. Journal of the American Society for Information Science And Technology, 52(7):548–557, 2001.
[65] Marc Weeber, Bob J.A. Schijvenaars, Erik M. van Mulligen, Barend Mons, Rob Jelier, Christiaan van der Eijk, and Jan A. Kors. Ambiguity of human gene symbols in locuslink and medline: Creating an inventory and a disambiguation test collection. In Proceedings of AMIA Symposium, pages 704–708, 2003.
[66] Lan Yi, Bing Liu, and Xiaoli Li. Eliminating noisy information in web pages for data mining. In Knowledge Discovery and Data Mining, pages 296–305, 2003.
[67] Hong Yu and Eugene Agichtein. Extracting synonymous gene and protein terms from biological literature. Bioinformatics, pages 340–349, 2003.

