

Approximate/Fuzzy String Matching using Mutation Probability Matrices

We consider the approximate/fuzzy string matching problem in the Malayalam language and propose a log-odds scoring matrix for score-based alignment. We report a pilot study designed and conducted to collect statistics about what we have termed "accepted mutation probabilities" of characters in Malayalam, as they naturally occur. Based on the statistics, we show how a scoring matrix can be produced for Malayalam which can be used effectively in numeric scoring for approximate/fuzzy string matching. Such a scoring matrix would enable search engines to widen the search operation in Malayalam. This being a unique and first attempt, we point out a large number of areas in which further research and consequent improvement are required. We limit ourselves to a chosen set of consonant characters, and the matrix we report is a prototype for further improvement.

Authors:

Dr. Achuthsankar S Nair, Hon. Director, Centre for Bioinformatics, University of Kerala
Sajilal Divakaran, FTMS School of Computing, Kuala Lumpur

Linguistic computing issues in non-English languages are generally addressed with less depth and breadth, especially for languages which have a small user base. Malayalam, one such language, is one of the four major Dravidian languages, with a rich literary tradition. The native language of the South Indian state of Kerala and the Lakshadweep Islands off the west coast of India, Malayalam is spoken by 4% of India's population. While Malayalam is integrated fairly well with computers, its user base may not generate huge market interest, and so such fine issues of language computing for Malayalam remain unaddressed and unattended.

If we were to search Google for information on the senior author of this paper, Achuthsankar, and gave the query as Achutsankar or Achudhsankar, in both cases Google would land us correctly on the official web page of the author. This "Did you mean" feature of Google is managed by the Google-diff-match-patch [4]. The match part of the algorithm uses a technique known as approximate string matching or fuzzy pattern matching [10]. The close/fuzzy match to any query that is received by the search engine is routine and obvious to the English language user. However, when a non-English language such as Malayalam is used to query Google, the same facility is not seen in action. When the word for the number (Pathinaayiram, the Malayalam word for ten thousand) is used as a query in Google Malayalam search, we are directed to documents that contain a similar word (Payinaayiaram, a common mispronunciation of the original word) but not the word. This is because approximate/fuzzy string matching has not been addressed in Malayalam. In this paper we make preliminary attempts toward addressing this very special issue of approximate/fuzzy string matching in Malayalam.

Approximate/Fuzzy String Matching

The field described as approximate or fuzzy string matching in computer science has been firmly established since the 1980s. Hall and Dowling [5] define the approximate string matching problem as follows: Given a string s drawn from some set S of possible strings (the set of all strings composed of symbols drawn from some alphabet A), find a string t which approximately matches this string, where t is in a subset T of S. The task is either to find all those strings in T that are "sufficiently like" s, or the N strings in T that are "most like" s. One of the important requirements to analyze similarity is to have a scientifically derived measure of similarity. The Soundex system of Odell and Russell [13] is perhaps one of the earliest attempts to use such a measure. It uses a Soundex code of one letter and three digits. These have been used successfully in hospital databases and airline reservation systems [8]. The Damerau-Levenshtein metric [2] proposed a measure: the smallest number of operations (insertions, deletions, substitutions, or reversals) needed to change one string into another. This metric can be used with standard optimization techniques [14] to derive the optimal score for each string match and thereby choose matches in the order of closeness.

Approximate or fuzzy string matching is in vogue not only in natural languages but also in artificial languages. In fact, approximate string matching has been developed into a fine art in computational sciences such as bioinformatics. Bioinformatics deals mainly with bio sequences derived from DNA, RNA, and amino acid sequences [9]. Dynamic programming algorithms (the Needleman–Wunsch and Smith–Waterman algorithms) [11], which enable fast approximate string matching using carefully crafted scoring matrices, are in great use in bioinformatics. The equivalent of Google for the modern biologist is the basic local alignment search tool (BLAST) [1], which uses scoring matrices such as point accepted mutation matrices (PAM) [3] and BLOcks of Amino Acid SUbstitution Matrix (BLOSUM) [6]. To the best of the knowledge of the authors, such a scoring system is not in existence for any natural language including English.

Recently an attempt has been made in this direction for the English language [7]. The statistics for accepted mutation in English were cleverly derived based on already designed Google searches. In the case of Malayalam, statistics of character mutations are not easily derivable from any corpus or any existing search engines or other language computing tools. Hence, data for this needs to be generated to go ahead with the development of a scoring matrix system. We will now describe the generation of primary data of natural mutation in Malayalam.

Occurrence and Mutation Probabilities

Malayalam has a set of 51 characters, and basic statistics of its occurrence and mutation are required for developing a scoring matrix. The occurrence probabilities are available, derived from a corpus of considerable size in 1971 and again in 2003 [12]. We describe here only a subset of characters in view of economy of space. In Table 1, we give the probabilities of one set of consonants, which we have extracted from a small test corpus of Malayalam text derived from periodicals.

We then designed and conducted a study to extract the character mutation probabilities. We selected 150 words that cover all the chosen consonant characters. A dictation was administered among a small group of school children (N=30). The observed mistakes (natural mutations) are tabulated in Table 2 as probabilities. It is noted that the sample size of N=30 is inadequate for a linguistic study of this kind. However, as already highlighted, this paper reports a pilot study to demonstrate proof of the concept. Moreover, the sample size can be made larger once the research community vets the approach put forward by us.

Log-odds Scoring Matrix

It is possible to use Table 2 itself for scoring string matches. However, it might be unwieldy in practice. For long strings we will need to multiply probabilities, which might result in numeric underflow. Hence, we will use a logarithmic transformation. Another step that we will use is to convert from probability to odds. The odds can be defined as the ratio of the probability of occurrence of an event to the probability that it does not occur. If the probability of an event is p, then the odds is p/(1-p). We will however not use this formula directly, but define odds for any given match i-j as:

$$S_{ij} = 10 \log \frac{p_{ij}}{p_j}$$

In the above equation, $p_{ij}$ is the probability that character i mutates to character j and $p_j$ is the probability of natural occurrence of character j. Thus the negative score for a mutation of a less frequently occurring character will be more in this scheme. The multiplier 10 is used just to bring the scores to a convenient range. Table 3 shows the log-odds scores thus derived using the occurrence probabilities and mutation probabilities given in Tables 1 and 2. These can be used to score approximate matches and select the most similar one.

Results, Discussions, and Conclusion

The prototype scoring matrix we have designed above can be demonstrated to be capable of scoring approximate matches and can therefore be a means of selecting the closest match. We will demonstrate this with an example of scoring four approximate matches for the word k. Table 4 lists the scores for the four different matches, and the exact match scores best. The next best match as per the new scoring scheme is കക.

Our demonstration has been on a chosen set of consonant characters, but it can be expanded to cover all Malayalam characters. For demonstrating more general words, a scoring matrix for vowels is essential. We have computed the same and will be reporting it in a forthcoming publication. During our studies, we also noticed that the grouping of characters as done conventionally may not suit our studies. For example, we found that the character is a possible mutation for , very rarely, even though they are not grouped together conventionally. A regrouping based on natural mutations is a work we see as requiring attention.

To the best of our knowledge, our work is a unique proposition for the Malayalam language, which can be incorporated into Malayalam search engines. We would like to reiterate that our work is in the prototype stage. Neither the size of the corpus nor the number of subjects in the survey is substantial. The authors hope to expand the work with a sizable database from which statistics are extracted, so that the scoring matrix can be made more reliable. We also propose to validate the scoring approach with sample trials involving language experts.

References

[1] Altschul, S F, et al. (1990). "Basic local alignment search tool", Journal of Molecular Biology, 215(3), 403-410.
[2] Damerau, F J (1964). "A technique for computer detection and correction of spelling errors", Communications of the ACM, 7(3), 171-176.
[3] Dayhoff, M O, et al. (1978). "A model of Evolutionary Change in Proteins", Atlas of Protein Sequence and Structure, 5(3), 345-358.
[4] Google-diff-match-patch, [Online]. Available: http://code.google.com/p/google-diff-match-patch/, Accessed on 20 Jan. 2012.
[5] Hall, P A V and Dowling, G R (1980). "Approximate String Matching", ACM Computing Surveys, 12(4), 381-402.
[6] Henikoff, S and Henikoff, J G (1992). "Amino Acid Substitution Matrices from Protein Blocks", Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915-10919.
[7] Kanitha, D (2011). "A scoring matrix for English", MPhil Dissertation in Computational Linguistics, Dept. of Linguistics, University of Kerala.
[8] Leon, D (1962). "Retrieval of misspelled names in an airlines passenger record system", Communications of the ACM, 5, 169-171.
[9] Nair, A S (2007). "Computational Biology & Bioinformatics: A Gentle Overview", Communications of the Computer Society of India, 31(1), 1-13.
[10] Navarro, G (2001). "A Guided Tour to Approximate String Matching", ACM Computing Surveys, 33(1), 31-88.
[11] Needleman, S B and Wunsch, C D (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins", Journal of Molecular Biology, 48(3), 443-453.
[12] Prema, S (2004). "Report of Study on Malayalam Frequency Count", Dept. of Linguistics, University of Kerala.
[13] Soundex, [Online]. Available: http://en.wikipedia.org/wiki/Soundex, Accessed on 2 Dec. 2011.
[14] Wagner, R A and Fischer, M J (1974). "The String-to-String Correction Problem", Journal of the ACM, 21(1), 168-178.

This article was published in CSI MAY 2012 and reused here with author's permission.

CLEAR Sep.2012
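As an illustration of the log-odds scoring described in the Log-odds Scoring Matrix section of the article above, the following is a minimal sketch and not the authors' implementation. It assumes a base-10 logarithm (the article only writes "log"), divides by the occurrence probability of the target character j as the text defines it, and uses made-up Latin placeholder characters with hypothetical probabilities in place of the values in Tables 1 and 2. The function names `log_odds_matrix` and `score_match`, and the fixed penalty for unobserved pairs, are likewise assumptions made for the example.

```python
import math


def log_odds_matrix(mutation_prob, occurrence_prob, scale=10.0):
    """Build S[i][j] = scale * log10(p_ij / p_j) for every observed pair.

    mutation_prob[i][j] is the probability that character i is written as j;
    occurrence_prob[j] is the natural occurrence probability of j.
    """
    scores = {}
    for i, row in mutation_prob.items():
        scores[i] = {}
        for j, p_ij in row.items():
            scores[i][j] = scale * math.log10(p_ij / occurrence_prob[j])
    return scores


def score_match(query, candidate, scores, unseen=-20.0):
    """Sum per-position scores; adding logs avoids multiplying probabilities."""
    if len(query) != len(candidate):
        raise ValueError("this sketch only scores equal-length strings")
    # Pairs never observed in the mutation data get a fixed penalty (assumption).
    return sum(scores.get(a, {}).get(b, unseen) for a, b in zip(query, candidate))


# Hypothetical occurrence and mutation probabilities for three placeholder characters.
occurrence = {"k": 0.5, "g": 0.25, "t": 0.25}
mutation = {
    "k": {"k": 0.8, "g": 0.2},
    "g": {"g": 0.9, "k": 0.1},
    "t": {"t": 1.0},
}

S = log_odds_matrix(mutation, occurrence)
candidates = ["kkt", "ggt", "kgt"]
ranked = sorted(candidates, key=lambda c: score_match("kgt", c, S), reverse=True)
print(ranked)
```

With these placeholder numbers the exact match ranks first, which mirrors the behaviour the article reports for its own Table 4. Handling insertions and deletions would need an alignment step such as the dynamic programming algorithms the article cites, which this sketch leaves out.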


INDIAN SEMANTICS AND NATURAL LANGUAGE PROCESSING

Author:

M. Jathavedan, Emeritus Professor, Department of Computer Applications, CUSAT, Cochin
mjvedan@cusat.ac.in

The history of modern linguistics is chronologically divided into two as BC (Before Chomsky) and AD (After Dissertation). Here dissertation means the thesis which Chomsky submitted to the University of Pennsylvania for his doctorate degree. His ideas are considered epoch making, comparable to Darwin's theory of evolution, and like Darwin's they took time to get recognition. Therefore Chomsky himself published it as 'Syntactic Structures'. Paninian grammar was introduced to modern linguistics as a forerunner of Chomsky's generative grammar introduced in the above book. 'Many linguists, foreign and Indian, joined the bandwagon and posed as experts in Paninian grammar in Chomskian terms' (Joshy S.D.). The renewed interest had influenced the interpretation of Paninian grammar itself as generative grammar: the idea that grammar consists of modules in a hierarchy or levels. The first contribution in this direction was due to Kiparsky and Staal (1969), who proposed a hierarchy of four levels of representation. This was criticized by Hauben (2002) as they did not permit semantic factors. Other important contributions are due to Caradona (1976).

Joshy continues: 'Somewhat later Chomsky had drastically reversed his ideas and after the enthusiasm for Chomsky subsided, it became clear that the idea of transformation is alien to Panini. Now a new type of linguistics has come up, called Sanskrit Computational Linguistics with three capital letters. Although Chomsky is out, Panini is still there ready to be acclaimed as the forerunner of SCL.' But SCL was identified as a branch of study in 2007 only, and there were other factors that led to its formation.

In a paper entitled 'Knowledge representation in Sanskrit and Artificial Intelligence', a NASA scientist, Rick Briggs, drew the attention of computer scientists to the works on semantics in Sanskrit literature instead of Paninium. The important fact to note is that he was referring to the 'Vaiyakarana Siddhanta Laghu Manjusha' of Bhatta-Nagesa (1730-1810), perhaps the last Sanskrit scholar in the Indian tradition. This paper, rightly or wrongly, aroused great enthusiasm among Sanskrit scholars. Some of them went even to the extent of claiming that the future direction of research in artificial language would be decided by Sanskrit. The immediate result was the 'First Seminar on Knowledge Representation and Samskritam' (1986) held at Bangalore, in which Briggs presented a paper entitled 'Sastric Sanskrit: An Inter-lingua for Machine Translation'.

Thus computational Sanskrit emerged as a new branch of research. Apart from computer assisted teaching and research of Sanskrit (like any other subject), automated reconstruction of Sanskrit texts and machine aided translation (MAT), designing a working system of Paninian grammatical framework for machine translation, especially for Indian languages, and its possible applications in cognitive science and AI are some areas of active research in Sanskrit departments of many universities and computer science departments of many institutes.

It is a surprising fact that we are not able to locate any more contributions of Briggs in this field. Further, comments are pouring in on the internet for and against the arguments put forward by Briggs. Another point to be noted is that the authority of the paper is Briggs in person and not NASA, as misconceived by many.


A question that naturally arose was the role of Sanskrit as a dedicated programming language, which meant the development of a compiler for use of Sanskrit instructions. C-DAC, Bangalore had initiated some work in this direction in the early 1990s itself. It was claimed that Astadhyayi (Paninium) was useful in this matter, i.e., the meta-rule, meta-language and linguistic marker system of Panini could be used to draw up the specification and requirements of such a processor. To what extent the search has been successful after twenty years is a question.

The International Symposiums on Sanskrit Computational Linguistics (SCL) were the results of the attempt to provide a common platform for the traditional Sanskrit scholars and the computational linguists. It was a culmination of the World Sanskrit Conferences, especially the thirteenth one held at Edinburgh, and the First National Symposium on Modeling and Shallow Parsing of Indian Languages in Mumbai, both held in the year 2006. The first Symposium was held in France in 2007 and the last one at Jawaharlal Nehru University, New Delhi (2010).

LINGUISTICS AND PHILOSOPHY

Linguistics is considered as a part of philosophy in India. It is often said that 'the grammatical method of Panini is as fundamental to the Indian thought as is the geometrical method of Euclid for the western thought.'

Semantics in Sanskrit was never a well-defined domain of a separate discipline (Hauben, ). Rather, it remained the battle field for exegetes, logicians and grammarians with various backgrounds and philosophical commitments. It was only a few centuries after Bhartrhari (4th century A.D.) that a sophisticated specialized language and terminology were developed for discussing semantic problems and theories of verbal understanding. Thus during the period from the thirteenth to the sixteenth centuries, semantic issues were seriously taken up for discussion between different philosophical schools, not only focussing on language but also from a religious point of view.

The formal categories in their discussions were mainly those established in Paninium and investigated semantically and philosophically by Bhartrhari. We will consider two or three of them.

As an example we consider the sentence:

'Rama cooks rice'

In the subdivision of a sentence into words, the grammarians take the verb as important. Other words are related to this meaning-bearing word in one way or other. Kriya is the action of the verb in the sentence. The other words which are "factors in the action" of the verb are called karakas. Panini has defined six karakas.

For the sentence in our example the grammarians may give the following analytical description:

It is the activity of cooking, taking place in the present time, having an agent which is identical with Rama, having an object identical with rice.

Thus the sentence is split into elements such as stem, root, affix and ending, with the attribution of well-defined meaning to each linguistic element. The central element in this analysis is the meaning expressed by the verb 'cooks', or to be more precise, the meaning of the verb root 'to cook' (pac). The verbal form (in Sanskrit the verbal ending ti in pa(ca)ti) indicates that the activity takes place in the present time. The agent of the action is expressed by the grammatical subject, Rama; the object of the action is the grammatical object, rice.
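The grammarians' analysis of 'Rama cooks rice' can be pictured as a small data structure. The following is only an illustrative sketch: the class `KarakaAnalysis`, its field names and the function `grammarian_description` are hypothetical, and only two of Panini's six karakas (agent and object) are modelled because those are the only ones the example sentence uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KarakaAnalysis:
    """A verb-centred reading of a sentence (hypothetical representation)."""
    root: str    # verbal root naming the kriya, e.g. "pac"
    gloss: str   # English gloss of the activity, e.g. "cooking"
    tense: str   # time indicated by the verbal ending, e.g. "present"
    agent: str   # karaka: agent of the action
    obj: str     # karaka: object of the action


def grammarian_description(a: KarakaAnalysis) -> str:
    """Render the analysis with the activity (the verb) as the central element."""
    return (
        f"It is the activity of {a.gloss}, taking place in the {a.tense} time, "
        f"having an agent which is identical with {a.agent}, "
        f"having an object identical with {a.obj}."
    )


rama_cooks_rice = KarakaAnalysis(
    root="pac", gloss="cooking", tense="present", agent="Rama", obj="rice"
)
print(grammarian_description(rama_cooks_rice))
```

The point of the sketch is that the verb root sits at the centre of the record and the nouns are attached to it by their role in the action, which is how the description quoted above is organised.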


For the Mimamsa thinkers also the verb is the central element in a sentence. While grammarians take the verbal root and the activity expressed by it as more important than the verbal ending and its meaning, the latter are more important for Mimamsakas. According to them the basic meaning of all verbs is a creative urge which stimulates action. This basic urge is expressed (transmitted to the listener) by the verbal ending, not by the verbal root, which merely qualifies this creative urge. Thus according to them the sentence in our example can be given the following structural description:

"It is the creative urge which is conducive to cooking, taking place in the present time, having the same substratum as the agent residing in Rama, having as object rice."

Now for the Nyaya school, it is not the verb which is the central element in the sentence but, generally, the noun in the first ending (nominative). Thus the structure of the verbal knowledge in our example according to them is:

"It is Rama who possesses the volitional effort conducive to cooking which produces the softening and moistening which is based in rice."

Underlying all these descriptions is the presupposition that the main structural relation in the sentence is that between the qualifier and the thing to be qualified (visesana/visesya). Unlike grammarians and Mimamsakas, for whom the visesya is the verb, for Nyaya thinkers the visesya is the noun in the first ending.

SANSKRIT COMPUTATIONAL LINGUISTICS

I have already quoted S.D.Joshy. The sentences were from his paper 'Background of the Astadhyayi' read in the third International Symposium on Sanskrit Computational Linguistics held in 2009 at Hyderabad. He continues: 'Contrary to some western misconceptions the starting point of Panini's analysis is not meaning or the intention of the speaker, but words from elements. Panini starts from morphology to arrive at a finished word.' But 'he developed a number of theoretical concepts which can be applied to other languages also.' Coming back to Briggs, we note that in contrast to other works his paper for the first time drew the attention of computer scientists to the semantic theories available in Sanskrit. Since it is meaning that is important in a sentence, syntax is developed to tackle the semantic problem.

But centuries elapsed after Panini (4th century B.C.) before Bhartrhari (4th century A.D.) developed his sphota theory. Again centuries elapsed before Bhatta Nagesa gave completion to sphota theory in the eighteenth century. The later development of linguistics can be considered as a continuation of this.

There are four factors involved in a proper cognition: expectancy, mutual compatibility, proximity and intention of the speaker. It is difficult to include the last one in any syntactic solution. According to Bhartrhari a speaker can seldom communicate through words all that he intended to, and the hearer understands more or at times less than what he hears!

Thus there is mutual dependency of Indian theories of syntax and semantics. It is said that the Indian linguists of the fifth century B.C. knew more of the subject than western linguists of the nineteenth century A.D. Further, if there is any area where the ancient Sanskrit scholars have been much ahead of modern developments, it is in the field of semantics and systems of knowledge representation.

REFERENCES:
1. Briggs, Rick, 1985, Knowledge representation in Sanskrit and artificial intelligence, The AI Magazine.
2. Briggs, Rick, 1986, Shastric Sanskrit: an interlingua for machine translation, First National Conference on Knowledge Representation, Bangalore.
3. Chomsky, N, 1957, Syntactic Structures, The Hague, Mouton.
4. Caradona, George, 1976, Panini: A survey of Research, The Hague, Mouton.
5. Kiparsky, Paul and Staal J.F., 1969, Syntactic and semantic relations in Panini, FL 5.
6. Hauben, E.M, 2002, Semantic in the Sanskrit tradition on the eve of colonialism, Project report, Leiden University.
7. Joshy, S.D., 2009, Background of the Astadhyayi, Third International Symposium on Sanskrit Linguistics, Hyderabad.


Overview of Question Answering System

Interaction between humans and computers is one of the most important active areas of research in this modern world. In particular, interaction in natural language is becoming more popular. Natural Language Processing is a computational technique for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis to achieve human-like language processing for a wide range of applications. One of the most powerful applications of NLP is the Question Answering System. The need for automated question answering systems becomes more urgent due to the enormous growth of digital information in text form. A QA system involves analysis of both questions and answers. In this overview, we focus on Question Type Classification, Question Generation, and Answer Generation for both closed and open domains.

Authors

K.M. Arivuchelvan, Research Scholar, Periyar Maniammai University.
K. Lakshmi, Professor, Periyar Maniammai University.

Introduction

Research in Natural Language Processing [1] has been going on for several decades, dating back to the late 1940s. The goal of NLP is to accomplish human-like language processing. The disciplines that contribute to the practice of NLP are: Linguistics, which focuses on formal, structural models of language and the discovery of language universals (in fact the field of NLP was originally referred to as Computational Linguistics); Computer Science, which is concerned with developing internal representations of data and efficient processing of these structures; and Cognitive Psychology, which looks at language usage as a window into human cognitive processes and has the goal of modelling the use of language in a psychologically plausible way.

The most explanatory method for presenting what actually happens within a Natural Language Processing system is by means of the 'levels of language' approach. Phonology concerns how words are related to the sounds that realize them. Morphology concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language. The syntax level concerns how words can be put together to form correct sentences, and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. The semantic level concerns what words mean and how these meanings combine in sentences to form sentence meanings. The pragmatic level concerns how sentences are used in different situations and how use affects the interpretation of the sentence. The discourse level concerns how the immediately preceding sentences affect the interpretation of the next sentence. World knowledge includes the general knowledge about the structure of the world that language users must have in order to maintain a conversation.

Natural language processing is used for a wide range of applications. The most frequent applications utilizing NLP include Information Retrieval (IR), Information Extraction (IE), Question-Answering, Summarization, Machine Translation and Dialogue Systems. In this paper we focus on Question-Answering.

Question answering can be performed in two domains: closed and open. Closed-domain question answering [4] deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge frequently formalized in ontologies. Alternatively, closed-domain might refer to a situation where only a limited type of questions are accepted, such as questions asking for descriptive rather than procedural information. Open-domain question answering [4] deals with questions about nearly anything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.

Question Answering [5] is a specialized form of information retrieval. Given a collection of documents, a Question Answering system attempts to retrieve correct answers to questions posed in natural language. Open-domain question answering requires question answering systems to be able to answer questions about any conceivable topic. Such systems cannot, therefore, rely on hand crafted domain specific knowledge to find and extract the correct answers.

Question Classification

Question Classification [2] is an important task in Question-Answering. The most well known question taxonomy was one proposed by Graesser and Person (1994), based on their two studies of human tutors' and students' questions during tutoring sessions in a college research methods course and a middle school algebra course. Six trained human judges coded the questions in the transcripts obtained from the tutoring sessions on four dimensions: Question Identification, Degree Specification (e.g. High Degree means questions contain more words that refer to the elements of desired information), Question-content Category, and Question Generation mechanism (the reasons for generating questions include knowledge deficit in the learner's own knowledge base, common ground between dialogue participants, social actions among dialogue participants, and conversation control). They defined the following 18 question categories according to the content of information sought rather than the interrogative words (i.e. why, how, where, etc.).

1. Interpretation: What does X mean?
2. Causal antecedent: Why/how did X occur?
3. Causal consequence: What next? What if?
4. Goal orientation: Why did an agent do X?
5. Instrumental/procedural: How did an agent do X?
6. Enablement: What enabled X to occur?
7. Expectation: Why didn't X occur?
8. Judgmental: What do you think of X?
9. Assertion:
10. Request/Directive
11. Verification: invites a yes or no answer.
12. Disjunctive: Is X, Y, or Z the case?
13. Concept completion: Who? What? When? Where?
14. Example: What is an example of X?
15. Feature specification: What are the properties of X?
16. Quantification: How much? How many?
17. Instrumental/procedural: How did an agent do X?
18. Comparison: How is X similar to Y?

After analyzing 5,117 questions in the research methods sample and 3,174 questions in the algebra sample, they found four frequent question categories: verification, instrumental-procedural, concept completion, and quantification questions.

Question Generation (QG)

For the first time in history [], a person can ask a question on the web and receive answers in a few seconds. Twenty years ago it would take hours or weeks to receive answers to the same questions as a person hunted through documents in a library. In the future, electronic textbooks and information sources will be mainstream and they will be accompanied by sophisticated question asking and answering facilities. Applications of automated QG facilities are endless and far reaching. Below are listed a small sample, some of which are addressed in this report:

1. Suggested good questions that learners might ask while reading documents and other media.
2. Questions that human and computer tutors might ask to promote and assess deeper learning.
3. Suggested questions for patients and caretakers in medicine.
4. Suggested questions that might be asked in legal contexts by litigants or in security contexts by interrogators.
5. Questions automatically generated from information repositories as candidates for Frequently Asked Question (FAQ) facilities.

The time is ripe for a coordinated effort to tackle QG in the field of computational linguistics and to launch a multi-year campaign of shared tasks in Question Generation (QG). We can build on the disciplinary and interdisciplinary work on QG that has been evolving in the fields of education, the social sciences and computer science. The QG system operates directly on the input text, executes implemented QG algorithms, and consults relevant information sources. Very often there are specific goals that constrain the QG system.

Question Answering

Today's question answering [7] is not limited by the type of document or data repository: it can address both traditional databases and more advanced ones that contain text, images, audio and video. Structured and unstructured data collections can be considered as information sources in question answering. Unstructured data allows querying of raw features (for example, words in a body of text), extracting information with clear semantics attached. Related to this distinction between structured and unstructured data, there is a traditional distinction between restricted domain question answering (RDQA) and open domain question answering (ODQA). RDQA systems are designed to answer questions posed by users in a specific domain of competence, and usually rely on manually constructed data or knowledge sources. They often target a category of users who know and use the domain-specific terminology in their query formulation, as, for example, in the medical domain. ODQA focuses on answering questions regardless of the subject domain. Extracting answers from a large corpus of textual documents is a typical example of an ODQA system. Recently, we have witnessed an approach to question answering involving semi-structured data. These data often comprise text documents in which the structure of the document or certain extracted information is expressed by a markup. Such markups can be attributed manually (e.g., the structure of a document) and/or in an automatic way, e.g., markups for identified person and company names and their relationships in newspaper articles.

Conclusion

Question answering is a complex task needing effective improvements in different research areas including question generation, question ranking, question classification, information retrieval, natural language processing, database technologies, Semantic Web technologies, human computer interaction, speech processing and computer vision.

REFERENCES

1. Liddy, E. D. In Encyclopaedia of Library and Information Science, 2nd Ed. Marcel Decker, Inc.
2. Ming Liu, Rafael A. Calvo, "G-Asks: An Intelligent Automatic Question Generation System for Academic Writing Support", Dialogue and Discourse 3(2) (2012) 101–124.
3. Mark Andrew Greenwood, "Open-Domain Question Answering", September 2005.
4. http://en.wikipedia.org/wiki/Question_answering
5. Andrew Lampert, "A Quick Introduction to Question Answering", December 2004.
6. Workshop Report, "The Question Generation Shared Task and Evaluation Challenge", sponsored by the National Science Foundation.
7. Oleksandr Kolomiyets, Marie-Francine Moens, "A survey on question answering technology from an information retrieval perspective", Information Sciences 181 (2011) 5412–5434.



I-Search... Future of Search Engines

Author: Manu Madhavan, M. Tech Computational Linguistics, Govt. Engg. College, Sreekrishnapuram, mmnamboodiry@gmail.com

In this web age, searching, or more precisely surfing the web, may be a casual phrase in day to day business. Netizens continuously enrich the web vocabulary with words like "Googling". What this tells us is how important search engines are in this digital era. A web search engine is designed to search for information on the World Wide Web.

Today's search engines come in two types. Directory-based engines, like Yahoo, are still built manually. What that means is that you decide what your directory categories are going to be (Business, Health, Entertainment) and then you put a person in charge of each category, and that person builds up an index of relevant links. Crawler-based engines, like Google, employ a software program, called a crawler, that goes out and follows links, grabs the relevant information, and brings it back to build your index. Then you have an index engine that allows you to retrieve the information in some order, and an interface that allows you to see it. It's all done automatically.

As the Web continues to grow, however, and to be more and more important for commerce, communication, and research, information-retrieval problems become a more serious handicap. The percentage of Web content that shows up on search engines continues to wane. And as search engines struggle to add more and more content, the information they provide may be increasingly out-of-date.

Recent advances in intelligent search suggest that these limitations can be partially overcome by providing search engines with more intelligence and with the user's underlying knowledge. That is where natural language processing comes in. The engine might also have to understand what the user needs, even when he doesn't say it, and that requires some knowledge of the user. These ideas led to the birth of a new generation of web technologies, popularly known as the Semantic Web.

Semantic Search

A semantic search engine attempts to make sense of search results based on context. It automatically identifies the concepts structuring the texts. For instance, if you search for "election", a semantic search engine might retrieve documents containing the words "vote", "campaigning" and "ballot", even if the word "election" is not found in the source document. Semantic Search systems consider various points including context of search, location, intent, variation of words, synonyms, generalized and specialized queries, concept matching and natural language queries to provide relevant search results. Major search engines like Google and Bing incorporate some elements of Semantic Search. The objective of this article is to discuss the recent advances in the area of Semantic Search.

Google's Knowledge Graph:

Google usually returns the search result for any query based on the text and the content. To put it plainly, it does not understand the exact meaning of the words. It matches the keywords of the query with those of the sites and returns pages that have a significant authority on those words.

Amit Singhal, Google's senior VP of engineering, said [1]: "The introduction of Knowledge Graph enables Google to understand whether a search for 'Mars' refers to the planet or the confectionery manufacturer. 'Search is a lot about discovery' – the basic human need to learn and broaden your horizons".
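To make the contrast between plain keyword matching and concept-based retrieval concrete, here is a minimal Python sketch. It is not how Google, Bing or any real engine works; the `CONCEPTS` table, the three sample documents and all function names are assumptions invented for illustration. It builds a tiny inverted index and shows that a literal search for "election" finds nothing, while expanding the query to related words retrieves the documents about voting.

```python
from collections import defaultdict

# Hypothetical concept table (an assumption): concept word -> related words.
CONCEPTS = {
    "election": {"election", "vote", "campaigning", "ballot"},
}

PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lowercase the text and split it into words, stripping punctuation."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Build an inverted index: word -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents that contain every query word literally."""
    results = None
    for word in tokenize(query):
        hits = set(index.get(word, set()))
        results = hits if results is None else results & hits
    return results or set()


def concept_search(index, query):
    """Like keyword_search, but each query word also matches its related words."""
    results = None
    for word in tokenize(query):
        hits = set()
        for related in CONCEPTS.get(word, {word}):
            hits |= index.get(related, set())
        results = hits if results is None else results & hits
    return results or set()


# Illustrative documents (assumptions, not taken from any real corpus).
DOCS = {
    1: "Citizens cast their vote after weeks of campaigning.",
    2: "The ballot boxes were sealed at noon.",
    3: "The new phone has a better camera.",
}
INDEX = build_index(DOCS)

print(keyword_search(INDEX, "election"))  # no document contains the literal word
print(concept_search(INDEX, "election"))  # documents 1 and 2 match related words
```

A hand-written synonym table is the simplest possible stand-in for the "concepts" the article mentions; real systems derive such relations from much larger linguistic resources.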


Bing's Semantic Search

Microsoft specifically brands Bing as a "decision engine," and not as a general purpose search engine, even though it provides that functionality as well, in order to differentiate it from Google Search. Bing's search is based on semantic technology from Powerset, which was acquired by Microsoft in 2008. Notable changes include the listing of search suggestions as queries are entered and a list of related searches (called "Explore pane"). Bing features semantic capabilities like presenting more readable captions based on linguistic analysis of content. The concept of semantic entity extraction is leveraged in Bing, providing knowledge on phrases and what they uniquely refer to. [2]

Bing's new product Adaptive Search strives to capitalize on semantic search technology. Adaptive Search will take into consideration your user behaviour, then tailor your Bing results to be most appropriate. So if you've searched for a word and then clicked on a specific site previously, Bing will predict that what you're searching for is likely to fall into the context of that site, and thus it can provide you with results that are more tailored. [5]

Powerset

Powerset is a Microsoft-owned company building a transformative consumer search engine based on natural language processing. Their unique innovations in search are rooted in breakthrough technologies that take advantage of the structure and nuances of natural language. Using these advanced techniques, Powerset is building a large-scale search engine that breaks the confines of keyword search. By making search more natural and intuitive, Powerset is fundamentally changing how we search the web, and delivering higher quality results. [3]

Hakia:

Hakia is a general purpose semantic search engine that searches structured corpora (text) like Wikipedia. For some queries (typically popular queries and queries where there is little ambiguity), Hakia produces resumes. These are portals to all kinds of information on the subject. Every resume has an index of links to the information presented on the page for quick reference. Often, Hakia will propose related queries, which is also great for research. [3]

Cognition

Cognition has a search business based on a semantic map, built over the past 24 years, which the company claims is the most comprehensive and complete map of the English language available today. It is used in support of business analytics, machine translation, document search, context search, and much more. [3]

Swoogle:

Swoogle, the Semantic Web search engine, is a research project carried out by the ebiquity research group in the Computer Science and Electrical Engineering Department at the University of Maryland, Baltimore County. It's an engine tailored towards finding documents on the semantic web.



Swoogle is capable of searching over 10,000 ontologies and indexes more than 1.3 million web documents. It also computes the importance of a Semantic Web document. The techniques used for indexing are Google-type page ranking and mining the documents for the interrelationships that are the basis of the semantic web. [4]
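The "Google-type page ranking" mentioned above can be illustrated with a generic PageRank power iteration. This is a minimal sketch of the textbook algorithm, not Swoogle's actual ranking method, which the article does not specify; the link graph, the document names and the damping factor of 0.85 are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each document to the list of documents it links to.
    Returns a dict mapping every document to its score (scores sum to 1).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)

    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every document receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this document's rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling document: spread its rank evenly over all documents.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph between three Semantic Web documents.
LINKS = {
    "ontology_a": ["ontology_core"],
    "ontology_b": ["ontology_core"],
    "ontology_core": ["ontology_a"],
}
RANKS = pagerank(LINKS)

for name, score in sorted(RANKS.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

In this toy graph the document that two others point to ends up with the highest score, which is the intuition behind ranking a document by how often other documents refer to it.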

Conclusion

NLP is a complex area of research, requiring a solid understanding of grammars (not just grammar), and a good grounding in computational linguistics (in order to apply the techniques to machines, which is not always easy). Understanding the techniques used in NLP allows us to provide the best format and patterns for the search engine. Seeing as NLP seeks to mimic human language understanding, using common sense is a good idea. But before any broader, more sophisticated sort of intelligence can be placed into a machine, we humans will have to get a better grasp of just what intelligence is.

References:

1. http://mashable.com
2. http://semanticweb.com
3. http://thenextweb.com
4. http://web2innovations.com
5. http://blogs.wsj.com

PyLucene

PyLucene is a GCJ-compiled version of Java Lucene integrated with Python. Its goal is to allow you to use Lucene's text indexing and searching capabilities from Python.

Google synonyms and natural language processing

Google just blogged about synonyms as they relate to searcher intent. They provide several examples of how a concept as simple as a synonym complicates natural language processing. This also brings up some important recommendations for site owners with respect to SEO. Prospective customers type in all kinds of variations on your most obvious keywords (hence the need for keyword research). Often they make use of synonyms, some common, some not. These variations often represent less competitive opportunities for high search engine rankings if you can incorporate those synonyms into your website. In particular:

• Use common variations within your existing copy rather than using the same phrase repeatedly. (This also tends to make long blocks of text more readable.)
• Develop pages that specifically focus on each of the most common and valuable synonyms.
• If there are enough synonyms and industry-specific terms, consider developing a glossary of terms.
• Find opportunities to talk about the synonyms, such as a blog post or article that discusses how synonyms may actually be somewhat different or whose similarity is up for debate (e.g. SEM vs. Search Engine Advertising).

http://www.web1marketing.com



Remolding Professional sectors: the SaaS way

Author: Dr. Sudheer S Marar, MCA MBA PhD, Associate Professor and HOD, Department of MCA, Nehru College of Engineering and Research Centre

SaaS: Purpose and Functions

The cost and time-to-market benefits of outsourcing business services like payroll, storage space, Customer Relationship Management (CRM) applications, and company websites have been proven for many businesses. The term most recently used for these types of outsourced services is Software as a Service (SaaS).

Introducing new technology is an expensive undertaking, usually requiring high capital outlays, and it can take many months of training, installation and integration before service can be delivered in the network. Outsourcing these services to organizations that are experts in the technology lowers costs, increases uptime, accelerates revenue realization and provides increased flexibility and functionality.

Due to these results, hosting for these critical business functions continues to grow and many companies are looking for similar opportunities in other operational areas.

Effects of Downturn

As stated in the Movius Corporation annual report, the economic downturn has globally forced many companies to reduce spending across the board. This has put companies that are in highly competitive and innovation-driven industries, say telecommunications, in an exigent balancing act. While they need to try to control expenses, if they are not also continuing to introduce the latest applications and services, they will quickly begin to lose their market share.

The ideal situation for a carrier would be to introduce new services almost immediately without risking precious in-hand capital. Under the best possible scenario the carrier could begin generating revenue in a matter of weeks after making the decision to launch a new service. If the service could be introduced without the need to add additional staff, the solution is essentially risk free.

Some applications are an immediate success in the market while others take time or, in the worst case, never get a toehold in a given market. The ideal introduction scenario for a carrier would be that they could try a new service in a particular market without having to make a significant investment, all the while gaining key market data.

Therefore, companies today are faced with the challenges of controlling equipment and operating costs, protecting their current investments and having the ability to deploy new applications quickly. To add to the challenge, many carriers are faced with older application platforms that are limited in capability and potentially approaching end of life. These companies need cost-effective solutions that permit conversion from the legacy networks to IP infrastructure without major changes to network infrastructure or application design.



Enterprise-level applications

As extracted from a lead article of an IDC-SAP initiated paper, "..Professional service firms focus their business management energy on optimizing the utilization of an expert's or a consultant's time. They attempt to develop service offerings or skill sets that clients will find compelling. Ultimately, they focus on properly charging and receiving payment from clients. Larger firms tend to broaden their offerings to ensure a greater wallet share. Meanwhile smaller firms tend toward key-field focusing and deep industry expertise, hoping to foster continuing relationships with a small number of clients."

In short, all firms balance developing a talent pipeline with maximizing utilization rates. Client satisfaction and trusting relationships drive both repeat business and referrals in most professional services segments. Therefore, firms seek to ensure deliverables of the highest possible quality and strive to fully meet client expectations throughout the engagement process. Firms increasingly use technology to support all parts of their business: finance and scheduling software are common, knowledge management and data warehouse capability help improve service quality, and client management and engagement management software are increasingly used to monitor and maximize customer satisfaction. The increased use of technology has both aided and hindered professional services firms' efforts to improve their key value propositions.

Benefits of SaaS

The cost of a complex business management software implementation is often the starting point for a discussion and often the point where the discussion meets a quick end. In their research, IDC has identified several areas where SaaS system delivery costs differ from on-premise delivery costs. Primarily, they are the following:

• License fees, both initial and maintenance.
• Hardware costs.
• IT infrastructure costs.
• Test environment maintenance and development costs.
• IT personnel/support costs.
• Security, backups, and disaster recovery.

The Future

Clearly SaaS applications are maturing. The number of companies that either are using SaaS applications or plan to use SaaS applications in the next year has grown considerably over the past few years, suggesting that the barriers to adoption, either real or perceived, are being overcome. We see a bright future for SaaS across a broad range of application areas and for large and small professional services firms.

SaaS is not without its problems, however. Functionality and security concerns linger, and while these concerns are more perception than reality, it is important when considering applications from a SaaS vendor that appropriate due diligence be applied to ensure that the functionality meets critical business needs. It is important for any corporate client to choose its SaaS vendor well, since not all are created equal. As this domain is a maturing capability, one should make sure to select a vendor that brings experience, financial stability, and a good reputation for working effectively with professional services companies, thereby assuring the client of business benefits, scalable growth, and business continuity.



Apple's SIRI

Author: Robert Jesuraj K, M. Tech Computational Linguistics, Govt. Engg College Sreekrishnapuram, rajaroberjesuraj@gmail.com

What is Siri?

Siri (Speech Interpretation and Recognition Interface) is an intelligent personal assistant and knowledge navigator which works as an application for Apple's iOS. The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.

Siri was originally introduced as an iOS application available in the App Store by Siri, Inc. Siri, Inc. was acquired by Apple on April 28, 2010. Siri, Inc. had announced that their software would be available for BlackBerry and for Android-powered phones, but all development efforts for non-Apple platforms were cancelled after the acquisition by Apple. Siri is now an integral part of iOS 5, and available only on the iPhone 4S, launched on October 14, 2011. On November 8, 2011, Apple publicly announced that it had no plans to support Siri on any of its older devices.

Using Siri

The app transcribes spoken text and then takes these commands and routes them to the right web services. If you try to book a table at a Thai restaurant ("get me a table at a good Thai restaurant nearby"), for example, Siri will check where you are, query Yelp for reviews of nearby Thai restaurants, show you the options and then pre-populate a reservation form on OpenTable with your information. All you have to do is to confirm Siri's selection.

The software is surprisingly good at translating voice queries into text. The application works so well because it is able to recognize the context of your queries. This kind of semantic analysis is a very computing-intensive problem, so most of the actual number crunching happens on Siri's servers. Siri outsources the voice recognition to Nuance, and if you are not comfortable with speaking into your phone, you can always use a regular text query as well.

Obviously, Siri won't be able to answer every query, and sadly the app doesn't use Wolfram Alpha to give you answers to factual questions (yet). When it cannot answer, Siri will just route your query to a search engine and display the search results. As the Siri team told us, however, users tend to learn which queries work best pretty quickly (just like we learned how to structure effective queries for Google). To use the iPhone app, you just have to say aloud a command like "Book a table for six at 7pm at McDonalds" (I'm sure you're classier than that, but let's stick with it for now), and then, using speech-recognition technology and the iPhone's GPS capabilities, your command is translated and processed by the app, which responds with confirmation of the booking or lack of availability.
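The routing step described under "Using Siri" (take the transcribed command, hand it to the right web service, and fall back to a web search when nothing fits) can be sketched in a few lines of Python. This is only an illustrative keyword router under stated assumptions: the service names, the trigger keywords in `ROUTES` and the `route` function are all hypothetical, and Siri's real semantic analysis is far more sophisticated and runs on its servers.

```python
import re

# Hypothetical routing table (an assumption): service name -> trigger keywords.
ROUTES = [
    ("restaurant_booking", ("table", "restaurant", "reservation")),
    ("movie_tickets", ("movie", "ticket")),
    ("taxi", ("taxi", "cab")),
]

FALLBACK_SERVICE = "web_search"


def route(command):
    """Return the service whose trigger keywords best match the command.

    The command is the already-transcribed text of what the user said.
    If no keyword matches, the query falls back to a plain web search.
    """
    words = set(re.findall(r"[a-z]+", command.lower()))
    best_service, best_hits = FALLBACK_SERVICE, 0
    for service, keywords in ROUTES:
        hits = sum(1 for keyword in keywords if keyword in words)
        if hits > best_hits:
            best_service, best_hits = service, hits
    return best_service


for spoken in (
    "Get me a table at a good Thai restaurant nearby",
    "Call me a taxi to the airport",
    "Who wrote Hamlet",
):
    print(f"{spoken!r} -> {route(spoken)}")
```

Counting keyword hits keeps the sketch short; it also shows why users quickly learn which phrasings work, since a command that avoids every trigger word simply ends up as a web search.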



Siri, which has ties with Stanford Research Institute and DARPA, has collaborated with OpenTable, MovieTickets, StubHub, CitySearch and TaxiMagic to help with bookings and information, which pretty much wipes out the reason why you'd want to download any of those services' apps individually.

Siri is all this and something that could only be held to the definition of true synergy, i.e. "Two or more things functioning together to produce a result not independently obtainable". None of the individual parts are "new" but the combination Siri created has never really been seen before. It has been the Holy Grail of computer researchers to one day create a device that could become conversational and intelligent in such a way that it would appear that the dialog is human generated.

DARPA Helps Invent The Internet And Helps Invent Siri

With Siri, Apple is using the results of over 40 years of research funded by DARPA (http://www.darpa.mil/) via SRI International's Artificial Intelligence Center (http://www.ai.sri.com/; Siri Inc. was a spin-off of SRI International) through the Personalized Assistant That Learns Program (PAL, https://pal.sri.com) and the Cognitive Agent that Learns and Organizes Program (CALO). This includes the combined work of research teams from Carnegie Mellon University, the University of Massachusetts, the University of Rochester, the Institute for Human and Machine Cognition, Oregon State University, the University of Southern California, and Stanford University. This technology has come a very long way with dialog and natural language understanding, machine learning, evidential and probabilistic reasoning, ontology and knowledge representation, planning, reasoning and service delegation.

Apple Siri can speak Hindi now

When Siri was announced with the iPhone 4S, everyone thought the device would never understand the Indian accent, let alone be able to speak Hindi. We were however left bewildered when we found a video online where Siri responds to users' queries in Hindi! Siri's support for Hindi comes to us courtesy of Kunal Kaul. The hack connects Siri to Kunal's Google API server and interacts in Hindi.

Another interesting aspect of the video is that the questions are asked in English and the responses given by Siri are in Hindi, with the Devanagari script appearing on screen. The fact that the questions are asked in English has led us to believe that Siri does not understand questions asked in Hindi.

Similar applications for hand-held devices

1) S Voice is an intelligent personal assistant and knowledge navigator which works as an application for Samsung's Android smartphones, similar to Apple Inc.'s Siri on the iPhone. It first appeared on the Samsung Galaxy S III on May 3, 2012. The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.

2) Assistant is the codename of a rumored upcoming Google application that will integrate voice recognition and a virtual assistant into Android. It is expected to launch in Q4 of 2012. Before March 2, 2012, the project was known as "Google Majel", a name derived from Majel Barrett-Roddenberry, the actress best known as the voice of the Federation Computer from Star Trek.



The software is an evolution of Google's Voice Actions, currently available on most Android phones, with natural language processing added. Where Voice Actions required users to issue specific commands like "send text to…" or "navigate to…", "Assistant" will allow users to perform actions in their natural language. According to search engineer Mike Cohen, the "Assistant" project has three parts: "getting the world's knowledge into a format a computer can understand; creating a personalization layer (experiments like Google +1 and Google+ are Google's way of gathering data on precisely how people interact with content); building a mobile, voice-centred "Do engine" ('Assistant') that's less about returning search results and more about accomplishing real-life goals".

3) Iris is a personal assistant application for Android. The application uses natural language processing to answer questions based on user voice requests. Iris currently supports Call, Text, Contact Lookup, and Web Search actions, including playing videos and looking for lyrics, movie reviews, recipes, news, weather, places and others. It was developed in 8 hours by Narayan Babu and his team at Dexetra Software Solutions Private Limited, a Kochi (India) based firm. The name is actually Siri, the original application for the same use built by Apple Inc., spelled backwards.

With the app, an Android user can just "ask" Iris instead of "Google-searching" for information. The developers claim Iris can talk on topics ranging from philosophy, culture, history and science to general conversation. However, Android users need to have "Voice Search" and "TTS library" installed on their phones for Iris to work. Among its features are voice actions including calling, texting, searching on the web, and looking for a contact.

About Whoosh

Whoosh is a fast, featureful full-text indexing and searching library implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part of how Whoosh works can be extended or replaced to meet your needs exactly. Some of Whoosh's features include:

• Pythonic API.
• Pure-Python. No compilation or binary packages needed, no mysterious crashes.
• Fielded indexing and search.
• Fast indexing and retrieval; according to its documentation, faster than any other pure-Python search solution its author knows of.
• Pluggable scoring algorithm (including BM25F), text analysis, storage, posting format, etc.
• Powerful query language.
• Pure Python spell-checker (as far as its author knows, the only one).

http://packages.python.org/Whoosh/quickstart.html#a-quick-introduction



Inviting Articles for CLEAR Dec 2012

We cordially invite thought-provoking articles, interesting dialogues and healthy debates on multi-faceted aspects of Computational Linguistics for the second issue of CLEAR, to be published in December 2012. The topics of the articles should preferably be related to the areas of Natural Language Processing, Computational Linguistics and Information Retrieval. Authors are requested to send their articles in doc/odt format to the Editor before 15th November 2012, by email to simplequest.in@gmail.com.

-Editor

Thanks To: Principal, Govt. Engg. College Sreekrishnapuram; Staff and Students, Dept. of CSE, Govt. Engg. College Sreekrishnapuram; Authors of CLEAR Sep 2012: Dr. Achuthsankar, Prof. Jathavedan M, Dr. Sudheer S Marar, Mr. Sajilal D, Dr. Lakshi K, Mr. Arivuchelvan



CLEAR