Pasquale J. Festa
INF384C – Organizing and Providing Access to Information
Prof. Efron
4.5.2007

Automatic Classification Assignment

1. Rule: Classify all articles with 4 or more authors as medical.

Looking at the histogram, it is clear that medical and non-medical articles begin to mix in the 4-to-5-author range. While my rule will allow a number of non-medical articles to creep into the medical class, this is necessary because one cannot have a fraction of an author (the histogram makes it appear otherwise, but that is because the numbers we see are fabricated and most likely averages). Judging from where documents lie in the histogram, a small proportion of medical articles sit just below the 5-author mark. If I made 5 my cutoff point, I would misclassify those medical documents as non-medical. By making my cutoff 4, I instead misclassify a number of non-medical documents as medical. Since my job is to index medical documents, the lesser of the two evils is to misclassify non-medical documents as medical: having inapplicable information would do less harm to my company than missing much-needed medical information would.

2.
Medical documents: classified correctly 100% (100 out of 100)
Non-medical documents: classified correctly 83% (83 out of 100)
Overall percentage of correct classification: 91.5% (183 out of 200)
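The rule from Question 1 and the trade-off it makes can be sketched in a few lines of Python. The sample articles below are hypothetical placeholders, not the assignment's actual data set.

```python
# A minimal sketch of the rule from Question 1: classify any article with
# 4 or more authors as medical. The (count, label) pairs are hypothetical.

def classify(author_count, threshold=4):
    """Label an article 'medical' if it has `threshold` or more authors."""
    return "medical" if author_count >= threshold else "non-medical"

# Hypothetical pairs illustrating the trade-off: the rule catches every
# medical article but lets a 4-author non-medical article leak in.
articles = [(6, "medical"), (5, "medical"), (4, "non-medical"), (2, "non-medical")]
correct = sum(classify(n) == label for n, label in articles)
accuracy = correct / len(articles)  # 3 of 4 correct on these placeholder data
```

Raising the threshold to 5 would reverse the error: the 4-author non-medical article would be classified correctly, but any 4-author medical article would be missed.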
3. If we assume that the distribution of our current data is an unbiased estimator of unseen data, then we are assuming that every new set of 200 documents will follow this same pattern. However, because there is a region where the documents overlap (the 4-to-6-author range), our model runs the risk of adding too many non-medical documents to the medical class. In the next group of 200, 17 more non-medical documents would be misclassified, and we would then have 34 non-medical documents in our medical class. With 34 of the 234 documents in the medical class being, in actuality, non-medical, we start to run into trouble. Despite our model being completely accurate at classifying medical documents, it adds unnecessary information to our document set: the 34 non-medical documents would make up 14.5% of the total number of documents placed in the medical class.
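The arithmetic behind that extrapolation can be checked directly, using the figures from the text (100 medical documents and 17 misclassified non-medical documents per batch of 200, over two batches):

```python
# A quick arithmetic check of the extrapolation above, using the text's figures.
misclassified_per_batch = 17
batches = 2
true_medical = 100 * batches                 # 200 correctly classified medical docs
strays = misclassified_per_batch * batches   # 34 non-medical docs in the class
class_size = true_medical + strays           # 234 documents in the medical class
stray_fraction = strays / class_size         # about 0.145, i.e. roughly 14.5%
```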
4. I positioned my line so that the entire group of non-medical articles is completely separated from the entire group of medical articles. By bisecting the data this way, I created a firm boundary between the two classes, so that no document from one group falls into the other group's territory. Classifying the data this way gives us 100% accuracy on our sample set.

5. Percentage of data correctly classified: 100%

6.
In this particular case, adding more dimensions to the data set made our classification problem easier to solve and allowed us a greater degree of accuracy. However, this is not always the case. First and foremost, when adding a dimension it is imperative that the dimension be pertinent to classification: if the added dimension gives us no way of significantly differentiating between the two classes, it does us no good. For example, if 50% of 100 men have blue eyes and 50% have brown eyes, and 50% of 100 women have blue eyes and 50% have brown eyes, then the dimension of eye color would do little to help us differentiate between men and women; the two classes would completely overlap and there would be no substantial split. In addition, while adding more dimensions is helpful in building classification models, it creates the added burden of data collection. As stated in class, for each column we add to a data table we need to at least double or triple the number of rows we have. To build a sample large enough to give our statistical analysis a decent level of accuracy, we would need to collect a vast amount of data. Sometimes this burden of data collection can kill an attempt at modeling before it begins: the amount of data may be too large for efficient computation, or the needed quantity of data may simply not be obtainable.
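The eye-color example can be made concrete with a short sketch. The 50/50 splits below are the hypothetical figures from the text, not real data:

```python
# A sketch of the eye-color example: a dimension distributed identically in
# both classes carries no information for separating them. The 50/50 splits
# are the hypothetical figures from the text.
from collections import Counter

men = ["blue"] * 50 + ["brown"] * 50
women = ["blue"] * 50 + ["brown"] * 50

men_dist = Counter(men)
women_dist = Counter(women)

# Seeing a blue-eyed person gives even odds of either class, the same as the
# 50/50 prior, so the feature provides no discriminating power.
p_man_given_blue = men_dist["blue"] / (men_dist["blue"] + women_dist["blue"])
```

Because the class-conditional distributions are identical, observing the feature leaves the posterior equal to the prior, which is exactly what it means for a dimension to be useless for classification.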