http://www.tracingisdrawing.com/Docs/004 by passy hearst

Rosangela Briscese
Pasquale J. Festa
Jennifer Noble
Sarah Ticer
INF384C – Organizing and Providing Access to Information
Prof. Efron
3.29.2007

Crossword Puzzle Classification Final Model

Method and Motivation

Because repeatability across different categories of users was integral to our classification model for crossword puzzles, our group set out to create a scheme with as little ambiguity as possible in its goals, its directions, and the knowledge it demands of users. To form our classification system, we started by collecting as much data as possible about the sample puzzles given to us. We examined a number of puzzle features (all features considered may be found at the end of this essay). Features that showed statistically significant variance between groups of days (e.g. "Monday, Tuesday, Wednesday or Thursday" versus "Friday or Saturday") or from one day to the next (e.g. Monday versus Tuesday) were given precedence as candidates for prime classification features (all features considered important to the design of our model are highlighted in the included data set). After whittling down the number of features to examine, we chose to create a model that acts as a filtering system: a puzzle is classified as a specific day by running it through a series of simple tests that require only basic arithmetic and adherence to clear, simple directions. We present the model as a flowchart to appeal to both verbal and visual learners.

Key Puzzle Features

Total Number of Black Squares

The number of black squares in a puzzle grid was found to play a significant role in classification and was used to divide the week into two sectors: "Monday, Tuesday, Wednesday or Thursday" and "Friday or Saturday". We chose this feature as our first criterion because puzzles with 34 or fewer black squares tended to be "Friday or Saturday" puzzles, while most puzzles with more than 34 black squares were "Monday, Tuesday, Wednesday or Thursday" puzzles. Splitting the full set into two smaller groups made it easier to work toward classifying puzzles as specific days of the week.

Criterion 1:

If total # of black squares ≤ 34, predict "Friday or Saturday"
If total # of black squares > 34, predict "Monday, Tuesday, Wednesday or Thursday"
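Criterion 1 can be sketched in a few lines of Python. The grid encoding (a list of strings in which '#' marks a black square) and the function names are assumptions made for illustration; the essay does not say how the puzzles were recorded. Only the threshold of 34 comes from the model itself.

```python
def count_black_squares(grid):
    """Count black squares in a grid given as a list of equal-length strings.

    Assumption: '#' marks a black square; any other character is a white cell.
    """
    return sum(row.count("#") for row in grid)


def criterion_1(black_squares):
    """Apply Criterion 1 using the threshold of 34 stated in the essay."""
    if black_squares <= 34:
        return "Friday or Saturday"
    return "Monday, Tuesday, Wednesday or Thursday"


# Hypothetical 3x3 toy grid, far smaller than a real puzzle.
toy_grid = ["A#B", "CDE", "F#G"]
print(count_black_squares(toy_grid))  # 2
print(criterion_1(34))  # Friday or Saturday
print(criterion_1(35))  # Monday, Tuesday, Wednesday or Thursday
```

Keeping the counting step separate from the threshold test mirrors the flowchart: a classifier first tallies a feature by hand, then compares the tally against a fixed number.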

Total Number of Answers Longer than 5 Letters in Length

Because our first test split the week into one set of 4 days ("Monday, Tuesday, Wednesday or Thursday") and one set of 2 days ("Friday or Saturday"), we next focused on a feature that would separate "Friday" puzzles from "Saturday" puzzles, so that two definitive classes could be settled. Our data showed a significant split between "Friday" and "Saturday" puzzles in the total number of answers longer than 5 letters. In our sample set, most "Saturday" puzzles contained 35 or more answers longer than 5 letters, while the majority of "Friday" puzzles had fewer than 35. We chose to classify "Friday" and "Saturday" by this quantitative criterion.

Criterion 2a:

If total # of answers longer than 5 letters in length < 35, predict "Friday"
If total # of answers longer than 5 letters in length ≥ 35, predict "Saturday"
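Tallying answer lengths is the most laborious count in the model, so a sketch of how it might be automated may help. The code below assumes the same hypothetical grid encoding ('#' for a black square) and treats every run of two or more white cells, across or down, as one answer; both choices are illustrative assumptions rather than part of the essay's method.

```python
def run_lengths(lines):
    """Return the lengths of maximal runs of white cells in each line."""
    lengths = []
    for line in lines:
        for run in line.split("#"):
            if run:
                lengths.append(len(run))
    return lengths


def answer_lengths(grid):
    """Lengths of all across and down answers in the grid.

    Assumption: an answer is any run of two or more white cells.
    """
    rows = list(grid)
    columns = ["".join(cells) for cells in zip(*grid)]
    return [n for n in run_lengths(rows) + run_lengths(columns) if n >= 2]


def count_answers_longer_than(grid, letters=5):
    """Count answers with more than `letters` letters."""
    return sum(1 for n in answer_lengths(grid) if n > letters)


def criterion_2a(long_answers):
    """Apply Criterion 2a using the threshold of 35 stated in the essay."""
    if long_answers < 35:
        return "Friday"
    return "Saturday"


# Hypothetical toy grid: two 7-letter across answers, two 3-letter down answers.
toy_grid = ["ABCDEFG", "H#####I", "JKLMNOP"]
print(sorted(answer_lengths(toy_grid)))     # [3, 3, 7, 7]
print(count_answers_longer_than(toy_grid))  # 2
```

The same list of lengths could be filtered for answers of exactly 4 letters, the feature used later in Criterion 3a.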

Total Number of Clues Greater than 5 Words in Length

With "Friday" and "Saturday" out of the picture, we needed to divide up the larger portion of the week ("Monday, Tuesday, Wednesday or Thursday"). Rather than attempting a direct day classification as we did with "Friday" and "Saturday", we opted to subdivide this group of 4 days into two subgroups that would each later be split into exclusive classes, which seemed more manageable than separating 4 days in a single step. Our data showed that the number of clues longer than 5 words marked a significant difference between "Monday or Tuesday" puzzles and "Wednesday or Thursday" puzzles. We classified puzzles with 6 or more such clues as "Wednesday or Thursday" and puzzles with fewer than 6 as "Monday or Tuesday".

Criterion 2b:

If total # of clues greater than 5 words in length < 6, predict "Monday or Tuesday"
If total # of clues greater than 5 words in length ≥ 6, predict "Wednesday or Thursday"
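Counting long clues depends on what counts as a word. The sketch below assumes words are separated by whitespace, so a hyphenated term counts once; that convention, the function names, and the sample clues are assumptions for illustration, not details given in the essay.

```python
def count_long_clues(clues, max_words=5):
    """Count clues containing more than `max_words` whitespace-separated words."""
    return sum(1 for clue in clues if len(clue.split()) > max_words)


def criterion_2b(long_clues):
    """Apply Criterion 2b using the threshold of 6 stated in the essay."""
    if long_clues < 6:
        return "Monday or Tuesday"
    return "Wednesday or Thursday"


# Hypothetical clues; only the last has more than 5 words.
sample_clues = [
    "Feline pet",
    "Capital of France",
    "Word that can follow both sea and sun",
]
print(count_long_clues(sample_clues))  # 1
print(criterion_2b(5))  # Monday or Tuesday
print(criterion_2b(6))  # Wednesday or Thursday
```

Whatever word-counting convention is chosen, it should be fixed in the written directions so that different classifiers arrive at the same tally.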

Total Number of Answers 4 Letters in Length

With our group of 4 days now split into 2 groups of 2 days, the remaining task was to split each 2-day group into distinct day-of-the-week classes. Among the puzzles now classified as "Wednesday or Thursday", we found a noticeable difference in the number of answers that were 4 letters in length. Puzzles with 27 or more such answers tended to be "Wednesday" puzzles, and puzzles with fewer than 27 tended to be "Thursday" puzzles. We decided this was a key feature for these 2 days and used it as the criterion for differentiating "Wednesday" from "Thursday".

Criterion 3a:

If total # of answers 4 letters in length < 27, predict "Thursday"
If total # of answers 4 letters in length ≥ 27, predict "Wednesday"

Total Number of Answers Containing Multiple Words

To complete our model and classify our last 2 days, we looked for a key feature that would distinguish "Monday" puzzles from "Tuesday" puzzles. The data showed that the number of answers made up of multiple words was a distinctive feature, and we split this group on that criterion. In our sample set, "Tuesday" puzzles tended to have 8 or more multiple-word answers, while "Monday" puzzles had fewer than 8. With this final step we had a complete model that set out directions for classifying a puzzle as each day of the week.

Criterion 3b:

If total number of answers containing multiple words < 8, predict "Monday"
If total number of answers containing multiple words ≥ 8, predict "Tuesday"
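Taken together, Criteria 1 through 3b form a small decision tree. The following sketch strings them together in one function; the thresholds are those stated above, while the function name and the dictionary keys are hypothetical labels chosen for this example.

```python
def classify_puzzle(features):
    """Predict the day of the week from a dict of feature counts.

    The keys are hypothetical names for the five counts used by the model.
    Only the features on the path actually taken are looked up.
    """
    # Criterion 1: total number of black squares.
    if features["black_squares"] <= 34:
        # Criterion 2a: answers longer than 5 letters.
        if features["answers_over_5_letters"] < 35:
            return "Friday"
        return "Saturday"
    # Criterion 2b: clues longer than 5 words.
    if features["clues_over_5_words"] < 6:
        # Criterion 3b: answers containing multiple words.
        if features["multiword_answers"] < 8:
            return "Monday"
        return "Tuesday"
    # Criterion 3a: answers exactly 4 letters long.
    if features["answers_4_letters"] < 27:
        return "Thursday"
    return "Wednesday"


# Hypothetical counts, not taken from the sample set.
example = {"black_squares": 38, "clues_over_5_words": 3, "multiword_answers": 9}
print(classify_puzzle(example))  # Tuesday
```

As in the flowchart, a classifier never needs more than three counts for any one puzzle, and a "Friday" or "Saturday" prediction needs only two.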

Subjective versus Objective Features

In our model we opted to do away with subjective features of the puzzle, which accomplished a number of things. First, by taking a quantitative approach to classifying the puzzles, we removed any chance of there being classification discrepancies based on a classifier's level of factual knowledge. One goal of the model was a system that would allow virtually anyone to classify puzzles correctly, so an individual's ability to solve a puzzle should have little to do with his or her ability to classify it. Because puzzle answers draw on many facets of knowledge (e.g. film, literature, history, popular culture), our model was designed to remove any variance that might arise from a classifier's unfamiliarity with the subject matter.

In addition, keeping our model quantitative and objective made it possible to create a system with clear, definitive directions that can be carried out through relatively simple procedures. To use the model, a person need only understand written directions in plain English and perform simple counting and arithmetic. If a verbal description of the model's structure proved difficult for a particular classifier to follow, the visual form in which the model is expressed (i.e. the flowchart) would help explain how the classification system works and how to use it.
By doing away with all subjective facets of the puzzle, we not only removed the possibility of subjective classification, which could produce a greater degree of error, but also simplified the system so that a variety of individuals could use it successfully, thereby building in a level of repeatability.

Accuracy

After completing the model, we ran the sample set of puzzles through it to check the accuracy of our method. Following our guidelines and procedure, only 4 of 24 puzzles were misclassified, giving 83% accuracy on the sample set. With such a small sample to work with, we concluded that this was an acceptable level of accuracy. As for how close the misclassified puzzles came to being classified properly: one "Thursday" puzzle was classified as "Friday", one "Saturday" puzzle as "Thursday", one "Tuesday" puzzle as "Wednesday", and one "Wednesday" puzzle as "Thursday". With the exception of the "Saturday" puzzle classified as "Thursday", every erroneous classification was off by only one degree (i.e. one day). Despite these errors, then, the model placed most puzzles correctly, and with 75% of its errors off by just one day, we found our model to be satisfactory.
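The accuracy figures can be checked with a short calculation. The sketch below uses only the numbers reported above (24 puzzles and the four listed misclassifications); the variable names and the Monday-to-Saturday ordering used to measure distance in days are assumptions of this example.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]

TOTAL_PUZZLES = 24

# (actual day, predicted day) for each misclassified puzzle reported above.
errors = [
    ("Thursday", "Friday"),
    ("Saturday", "Thursday"),
    ("Tuesday", "Wednesday"),
    ("Wednesday", "Thursday"),
]

accuracy = (TOTAL_PUZZLES - len(errors)) / TOTAL_PUZZLES

# Distance in days between the actual and the predicted class.
distances = [abs(DAYS.index(actual) - DAYS.index(predicted))
             for actual, predicted in errors]
off_by_one_share = distances.count(1) / len(distances)

print(f"{accuracy:.1%}")          # 83.3%
print(distances)                  # [1, 2, 1, 1]
print(f"{off_by_one_share:.0%}")  # 75%
```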