Data Mining PhD Dissertation Sample


Data.PhDresearchon.com

SAMPLE

PHD DISSERTATION ON DATA MINING

Banks record large amounts of data every day. For each client, they keep account information, transactions per account, credit commitments, and demographic data. These data are recorded in the transactional databases needed to run the current business. Transaction databases generally perform three functions: keeping records of business events, generating the documents required in business, and reporting on the state of the business process (Parker & Case, 1998). Experience is the best teacher: about fifteen years ago, first scientists and later managers realized that transaction databases were a rich source of knowledge that could improve a company's business. It became clear that companies had a lot of data, little information, and very little knowledge about many aspects of their business.

However, transactional databases are huge. Suppose the bank's management wants to determine the characteristics of clients that were illiquid in the past. Such information can be requested from the company's IT staff, who must spend a lot of time, on top of their regular work, producing the requested report. By the time the report reaches the manager's desk, it may be too late to make a decision.

A method that can increase the success of using transactional databases to improve enterprise performance is called data mining. There are many definitions of data mining; we highlight the following two. Data mining is the search for valuable information in large amounts of data. Data mining is the exploration and analysis of large amounts of data using automatic or semi-automatic methods to detect meaningful regularities. Until several years ago, the method was developed primarily in scientific circles. It has only recently emerged in companies, once it became clear that data mining is necessary for gaining a competitive advantage.
It is important to point out that data mining is more art than science. There is no recipe for successful data mining that is certain to uncover valuable information. However, the likelihood of success increases if the steps of the data mining process are followed (Baragoin et al., 2001). In the first step, the business problem is defined. The second step is data preparation, which includes determining the required data, transforming and sampling, and evaluating the data. Modeling is the third step; it involves selecting the mining method and designing and evaluating the model. The fourth step is implementation, which involves interpreting and using the results.

The data mining process is iterative, which means that it is possible to return to one of the previous steps at any time. For example, while selecting a mining technique, we may find that we did not select the required data well, so we go back to the second step and start over. Such a "jump back" is more the rule than the exception, because the most important parts of data mining are defining the problem well and choosing and preparing the data well, and these are difficult to get right the first time. On the other hand, our knowledge of the business problem and the data increases during the data mining process, and a business problem "revised" in this way is often better defined than the original one. Below we detail the steps of data mining.

Data mining is a new methodology, and companies often take the attitude that there is valuable knowledge in the databases waiting to be discovered, without specifying which data would serve which purpose. Returning to the digging analogy mentioned at the beginning of this work, we can compare this attempt to searching for gold at a random place. The probability of finding gold is much greater if we use modern tools, but we still need to know where to dig.

The first step in the data mining process is to define the business problem and express it as questions that can be answered at the end of the process. The best approach to defining a business problem is to study areas where data mining has already been used successfully. Once we are well acquainted with successful data mining applications, we can choose the area most critical for our enterprise. For example, the business goal may be to increase life insurance sales. It is possible to build a model that predicts which bank customers will buy life insurance. The bank's marketing efforts can then focus on those customers only, which reduces costs and increases the effectiveness of the marketing campaign. In the next chapter, we describe in detail the typical applications of data mining in banking.

This step also determines which people will take part in the mining project. The team will typically include a data mining specialist, an IT specialist who knows the bank's databases and warehouses well, and a banker who knows the potential application well. It is important that the team is headed by a key member of management, who does not have to work directly on the project but should support it and help resolve any difficulties.
There is still resistance to the use of new technologies in some banks, although the situation has recently improved, as this research has shown. The role of the manager on the team is precisely to help overcome this resistance.

Data preparation involves determining the required data, transforming and sampling, and evaluating the data. This phase is time-consuming and takes 60-90% of the time required for data mining (Pyle, 1999). The data to be mined can be stored in different forms, most commonly in relational databases or data warehouses. Data from operational systems such as POS terminals, ATMs, telephone conversations, web servers and the like can be used. Data collected through market research or from external sources may also be used, for example registries of loan commitments, which contain information on the credit burden of citizens. A data mining specialist, an IT specialist, and a banking expert together determine which data will be needed for model design.

Data preparation, the second step in data mining, begins with determining the data that will be used to create the model. Data typically used for data mining is stored in a transaction database and a client database. The transaction database records each transaction, typically with the following content: client code, account number, and the type, amount and date of the transaction. The client database typically contains the client code, household code, account number, customer's name, address, telephone number, demographic data, products and services, past offers and segmentation. This step also determines which variables will be included in the analysis and which will be excluded, and which variable is the target (dependent) variable. For example, in a credit risk analysis, the target variable is the one that describes whether or not the client repaid the loan. The final result of determining the required data is a list of variables that will be used in model design.
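As an illustration of what such a variable list might look like in practice, the minimal sketch below separates the input variables from the target variable for one client record. The variable names (`age`, `monthly_income`, `number_of_accounts`, `loan_repaid`) and the sample record are hypothetical and are not taken from any real bank schema.

```python
from typing import Any, Dict, Tuple

# Hypothetical variable list for a credit risk model; the names are
# illustrative assumptions, not fields from an actual bank database.
INCLUDED_VARIABLES = ["age", "monthly_income", "number_of_accounts"]
TARGET_VARIABLE = "loan_repaid"


def split_record(record: Dict[str, Any]) -> Tuple[Dict[str, Any], Any]:
    """Return (input variables, target value) for one client record.

    Fields that are not on the variable list (e.g. name) are left out.
    """
    inputs = {name: record[name] for name in INCLUDED_VARIABLES}
    return inputs, record[TARGET_VARIABLE]


# One made-up client record combining identifying and analytical fields.
client = {
    "client_code": "C001",
    "name": "A. Example",
    "age": 42,
    "monthly_income": 3200.0,
    "number_of_accounts": 2,
    "loan_repaid": True,
}

inputs, target = split_record(client)
```

Identifying fields such as the client code and name are kept out of the model inputs here; whether a given field belongs on the list remains a decision for the project team.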
In data transformation, variables from the available databases are converted into a form suitable for data mining. The data must be in tabular form, where the columns contain the variables (characteristics) and the rows contain the observations. Each row must describe an entity relevant to the company (e.g. a customer or a product). Data from the transaction database often must be summarized to be useful; the total and the average monthly transaction amount across all of a client's accounts are commonly used. Variables specified by a banking expert are also calculated from the variables available in the databases. Examples of such variables are the difference between the current balance of the current account and the approved overdraft, and the time elapsed between the day the account was opened and the day of the first transaction.

Transaction databases and customer databases contain large amounts of data. Building a model does not require that much data, so sampling is used to select the smaller amount needed. The question that often arises here is: how much data is enough? There is no single answer, because the amount of data required depends on the algorithm. Two or three thousand records are enough to build decision trees, but training neural networks needs many more. The sample is most often chosen by random selection.

It often happens that the share of the analyzed event in the sample is very small. For example, if we want to build a model that predicts the probability that a customer will buy a product, we need a database with comparable data from the past. In such a database of, for example, 100,000 clients, there may be only 4,000 customers who purchased the product. The model will be built on their characteristics. Modeling does not need all 100,000 records; far fewer, for example 10,000, are enough. However, if those records are selected at random, the sample will contain far fewer than 4,000 customers who bought the product: around 400, since buyers make up only 4% of the clients.
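Two points from this passage can be illustrated with a short sketch: summarizing transactions into one row per client, and the expected number of rare events in a simple random sample. The function names, the tuple layout of a transaction, and the sample data are assumptions made for illustration. The average is taken over the months in which a client had at least one transaction, which is only one possible reading of "average monthly amount".

```python
from collections import defaultdict
from datetime import date


def summarize_transactions(transactions):
    """Collapse (client_code, date, amount) rows into one row per client.

    Amounts are summed over all of a client's accounts, because the
    account number is not part of the grouping key.
    """
    totals = defaultdict(float)
    months = defaultdict(set)
    for client_code, tx_date, amount in transactions:
        totals[client_code] += amount
        months[client_code].add((tx_date.year, tx_date.month))
    return {
        code: {
            "total_amount": totals[code],
            # Assumption: average over months with at least one transaction.
            "avg_monthly_amount": totals[code] / len(months[code]),
        }
        for code in totals
    }


def expected_events_in_sample(population, events, sample_size):
    """Expected number of event cases in a simple random sample."""
    return sample_size * events / population


# Made-up transactions for two clients.
transactions = [
    ("C001", date(2023, 1, 5), 100.0),
    ("C001", date(2023, 1, 20), 50.0),
    ("C001", date(2023, 2, 3), 150.0),
    ("C002", date(2023, 1, 9), 80.0),
]
summary = summarize_transactions(transactions)

# Figures from the example in the text: 4,000 buyers among 100,000 clients,
# and a random sample of 10,000 records.
expected_buyers = expected_events_in_sample(100_000, 4_000, 10_000)
```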
It is therefore recommended that in such cases the sample of 10,000 include all 4,000 customers who bought the product, with the other 6,000 clients selected at random. Such an approach has been shown to give more reliable results (Scott & Wild, 1986). Once the sample for model design has been chosen, it must be further divided into two parts: data for building the model and data for testing it. This approach is typical of data mining, because it checks the performance of the model on data that was not used to build it.
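The sampling and splitting described above could be sketched as follows. This is a minimal illustration, not the method of the cited authors: the customer records are synthetic, the function names are hypothetical, the 6,000 remaining records are assumed to be drawn from customers who did not buy, and the 7,000/3,000 split between model data and test data is an assumed proportion that the text does not specify.

```python
import random


def choice_based_sample(records, is_event, sample_size, rng):
    """Keep every event case and fill the rest with random non-events."""
    events = [r for r in records if is_event(r)]
    non_events = [r for r in records if not is_event(r)]
    n_fill = sample_size - len(events)
    if n_fill < 0 or n_fill > len(non_events):
        raise ValueError("sample_size is incompatible with the data")
    return events + rng.sample(non_events, n_fill)


def split_sample(sample, train_size, rng):
    """Shuffle a copy of the sample and cut it into model and test parts."""
    shuffled = list(sample)
    rng.shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic customer base: 100,000 clients, of whom 4,000 bought the product.
customers = [{"id": i, "bought": i < 4_000} for i in range(100_000)]

sample = choice_based_sample(customers, lambda c: c["bought"], 10_000, rng)

# Assumed split: 7,000 records for building the model, 3,000 for testing.
train, test = split_sample(sample, 7_000, rng)
```

Because such a sample over-represents buyers relative to the full customer base, probabilities predicted by a model built on it generally need to be adjusted back to the population rate; that correction is not shown here.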

REFERENCES

Parker, C. & Case, T. (1998). Management Information Systems: Strategy and Action. New York: McGraw Hill.

Baragoin, C., Andersen, C. M., Bayerl, S., Bent, G., Lee, J. & Schommer, C. (2001). Mining Your Own Business in Banking Using DB2 Intelligent Miner for Data. Available at: http://www.redbooks.ibm.com/

Pyle, D. (1999). Data Preparation for Data Mining. San Francisco: Morgan Kaufmann.

Scott, A. J. & Wild, C. J. (1986). Fitting Logistic Models under Case-control or Choice Based Sampling. Journal of the Royal Statistical Society, Series B, 48(2), 170-182.

