Approximate/Fuzzy String Matching using Mutation Probability Matrices
Dr. Achuthsankar S Nair, Hon. Director, Centre for Bioinformatics, University of Kerala
Sajilal Divakaran, FTMS School of Computing, Kuala Lumpur

We consider the approximate/fuzzy string matching problem in the Malayalam language and propose a log-odds scoring matrix for score-based alignment. We report a pilot study designed and conducted to collect statistics about what we have termed "accepted mutations" of characters in Malayalam, as they naturally occur. Based on the statistics, we show how a scoring matrix can be produced for Malayalam which can be used effectively in numeric scoring for approximate/fuzzy string matching. Such a scoring matrix would enable search engines to widen the search operation in Malayalam. This being a unique and first attempt, we point out a large number of areas in which further research and consequent improvement are required. We limit ourselves to a chosen set of consonant characters, and the matrix is a prototype.

Computing issues in non-English languages are generally addressed with less depth and breadth, especially for languages which have a small user base. Malayalam, one such language, is one of the four major Dravidian languages, with a rich literary tradition. The native language of the South Indian state of Kerala and the Lakshadweep Islands off the west coast of India, Malayalam is spoken by 4% of India's population. While Malayalam is integrated fairly well with computers, its user base may not generate huge market interest, and so such fine issues of language computing for Malayalam remain unaddressed.

If we were to search Google for information on the senior author of this paper, Achuthsankar, and gave the query as Achutsankar or Achudhsankar, in both cases Google would land us correctly on the official web page of the author. This "Did you mean" feature of Google is managed by the Google-diff-match-patch. The match part of the algorithm uses a technique known as approximate string matching or fuzzy pattern matching. The close/fuzzy match to any query received by the search engine is routine and obvious to the English language user. However, when a non-English language such as Malayalam is used to query Google, the same facility is not seen in action. When the word for ten thousand [...] (Pathinaayiram) is given as a query in Google Malayalam search, we are directed to documents that contain a similar word [...] (a common mispronunciation of the original word) but not the word [...]. This is because approximate/fuzzy string matching has not been addressed in Malayalam. In this paper we make preliminary attempts toward addressing this very special issue of approximate/fuzzy string matching in Malayalam.

Approximate/Fuzzy String Matching
The field described as approximate or fuzzy string matching in computer science has been firmly established since the 1980s. Patrick & Geoff define the approximate string matching problem as follows: Given a string s drawn from some set S of possible strings (the set of all strings composed of symbols drawn from some alphabet A), find a string t which approximately matches this string, where t is in a subset T of S. The task is either to find all those strings in T that are "sufficiently like" s, or the N strings in T that are "most like" s. One of the important requirements for analyzing similarity is a scientifically derived measure of similarity. The Soundex system of Odell and Russell is perhaps one of the earliest attempts to use such a measure. It uses a Soundex code of one letter and three digits. These have been used successfully in hospital databases and airline reservation systems. The Damerau-Levenshtein metric counts the minimum number of operations (insertions, deletions, substitutions, or reversals) needed to change one string into another. This metric can be used with standard optimization techniques to derive the optimal score for each string match and thereby choose matches in the order of closeness.

Approximate or fuzzy string matching is in vogue not only in natural languages but also in artificial languages. In fact, approximate string matching has been developed into a fine art in computational sciences such as bioinformatics. Bioinformatics deals with DNA, RNA, and amino acid sequences. Dynamic programming algorithms which enable fast approximate string matching using carefully crafted scoring matrices are in great use in bioinformatics. The equivalent of Google for modern biology is the basic local alignment search tool (BLAST), which uses scoring matrices such as point accepted mutation matrices (PAM) and BLOcks of Amino Acid SUbstitution Matrix (BLOSUM). To the best of the knowledge of the authors, such a scoring system is not in existence for any natural language including English. Recently, however, an attempt has been made in this direction for the English language. The statistics for accepted mutations in English were cleverly derived from already designed Google searches. In the case of Malayalam, statistics of character mutations are not easily derivable from any corpus, existing search engine, or other language computing tool. Hence, data for this needs to be generated before a scoring system can be developed. We will now describe the generation of primary data on natural mutation in Malayalam.

Occurrence and Mutation Probabilities
Malayalam has a set of 51 characters, and basic statistics of its occurrence and mutation are required for developing a scoring matrix. Occurrence statistics were derived from corpora of considerable size in 1971 and again in 2003. We describe here only a subset of characters in view of economy of space. In Table 1, we give the probabilities of one set of consonants, which we have extracted from a small test corpus of Malayalam text derived from periodicals.

We then designed and conducted a study to extract the character mutation probabilities. We selected 150 words that cover all the chosen [...] among a small group of school children (N=30). The observed mistakes (natural mutations) are tabulated in Table 2 as probabilities. It is noted that the sample size of N=30 is inadequate for a linguistic study of this kind. However, as already highlighted, this paper reports a pilot study meant as a proof of concept. Moreover, the sample size can be made larger once the research community vets the approach put forward by us.

Log-odds Scoring Matrix
It is possible to use Table 2 itself for scoring string matches. However, it might be unwieldy in practice. For long strings we will need to multiply probabilities, which might result in numeric underflow. This can be avoided by a logarithmic transformation, which turns products into sums. Another effect that we will use is the conversion from probability to odds. The odds can be defined as the ratio of the probability that an event occurs to the probability that it does not. If the probability of an event is p, then the odds are p/(1-p). We will however not use this formula directly, but define the score for any given match i-j as:

$S_{ij} = 10 \log (P_{ij} / P_j)$

where $P_{ij}$ is the probability that character i mutates to character j and $P_j$ is the probability of natural occurrence of character j. Thus [...] the negative score for a mutation of a less frequently [...] more in this scheme. The multiplier 10 is used just to bring the scores to a convenient range. Table 3 shows the log-odds scores thus derived using the occurrence probabilities and mutation probabilities given in Tables 1 and 2. These can be used to score approximate matches and select the most similar one.

Results, Discussions, and Conclusion
The prototype scoring matrix we have designed above can be demonstrated to be capable of scoring approximate matches and can therefore be a means of selecting the closest match. We will demonstrate this with an example of scoring four approximate matches for the word k. Table 4 lists the scores for the four different matches, and the exact match scores best. The next best match as per the new scoring scheme is കക.

Our demonstration has been on a chosen set of consonant characters, but it can be expanded to cover all Malayalam characters. For demonstrating more general words, a scoring matrix for vowels is essential. We have computed the same and will be reporting it in a forthcoming publication. During our studies, we also noticed that the grouping of characters as done conventionally may not suit our studies. For example, we found that the character [...] though [...] is a possible mutation for [...], very rarely, even not [...]. Regrouping based on natural mutations is a work we see as requiring attention.

To the best of our knowledge, our work is a unique proposition for the Malayalam language, which can be incorporated into Malayalam search engines. We would like to reiterate that our work is at the prototype stage. Neither the size of the corpus nor the number of subjects in the survey is substantial. The authors hope to expand the work with a sizable database from which statistics can be extracted, so that the scoring matrix can be made more reliable. We also propose to validate the scoring approach with sample trials involving language experts.

References
Altschul, S F, et al. (1990). "Basic local alignment search tool", Journal of Molecular Biology, 215(3), 403-410.
Damerau, F J (1964). "A technique for computer detection and correction of spelling errors", ACM Communications, 7(3), 171-176.
Dayhoff, M O, et al. (1978). "A model of Evolutionary Change in Proteins", Atlas of protein sequence and structure, 5(3), 345-358.
Google-diff-match-patch, [Online]. Available: http://code.google.com/p/google-diff-match-patch/, Accessed on 20 Jan. 2012.
Hall, P A V and Dowling, G R (1980). "Approximate String Matching", ACM Computing Surveys, 12(4), 381-402.
Henikoff, S and Henikoff, J G (1992). "Amino Acid Substitution Matrices from Protein Blocks", Proceedings of the National Academy of Sciences of the United States of America, 89(22), 10915-10919.
Kanitha, D (2011). "A scoring matrix for English", MPhil Dissertation in Computational Linguistics, Dept. of Linguistics, University of Kerala.
Leon, D (1962). "Retrieval of misspelled names in an airlines passenger record system", ACM Communications, 5, 169-171.
Nair, A S (2007). "Computational Biology & Bioinformatics: A Gentle Overview", Communications of the Computer Society of India, 31(1), 1-13.
Navarro, G (2001). "A Guided Tour to Approximate String Matching", ACM Computing Surveys, 33(1), 31-88.
Needleman, S B and Wunsch, C D (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins", Journal of Molecular Biology, 48(3), 443-453.
Prema, S (2004). "Report of Study on Malayalam Frequency Count", Dept. of Linguistics, University of Kerala.
Soundex, [Online]. Available: http://en.wikipedia.org/wiki/Soundex, Accessed on 2 Dec. 2011.
Wagner, R A and Fischer, M J (1974). "The String-to-String Correction Problem", Journal of the ACM, 21(1), 168-178.

This article was published in CSI MAY 2012 and reused here with author's permission.
CLEAR Sep.2012
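
The log-odds scoring described in this article can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the three placeholder characters and all probability values are invented (they are not the values of Tables 1 and 2), the logarithm is assumed to be base 10, the denominator is taken as the occurrence probability of the target character j as in the textual definition, and only equal-length strings are compared position by position. The function names are hypothetical.

```python
import math

# Hypothetical occurrence probabilities P_j for three placeholder characters.
# Illustrative values only; NOT the values of Table 1.
occurrence = {"a": 0.5, "b": 0.3, "c": 0.2}

# Hypothetical mutation probabilities P_ij (character i written as character j).
# Illustrative values only; NOT the values of Table 2. Each row sums to 1.
mutation = {
    "a": {"a": 0.90, "b": 0.08, "c": 0.02},
    "b": {"a": 0.10, "b": 0.85, "c": 0.05},
    "c": {"a": 0.05, "b": 0.05, "c": 0.90},
}


def log_odds_matrix(mutation, occurrence, multiplier=10.0):
    """Return S[i][j] = multiplier * log10(P_ij / P_j).

    Assumes every P_ij and P_j is strictly positive; a zero probability
    would need smoothing before taking the logarithm.
    """
    return {
        i: {j: multiplier * math.log10(p_ij / occurrence[j])
            for j, p_ij in row.items()}
        for i, row in mutation.items()
    }


def score(query, candidate, matrix):
    """Add up per-position log-odds scores for two equal-length strings."""
    if len(query) != len(candidate):
        raise ValueError("this sketch only scores equal-length strings")
    return sum(matrix[q][c] for q, c in zip(query, candidate))


def rank(query, candidates, matrix):
    """Order candidates from most to least similar to the query."""
    return sorted(candidates, key=lambda c: score(query, c, matrix), reverse=True)


matrix = log_odds_matrix(mutation, occurrence)
print(rank("abc", ["aba", "abc", "abb"], matrix))
```

Because the scores are logarithms, they are added rather than multiplied along the string, which is what avoids the numeric underflow mentioned in the text.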
INDIAN SEMANTICS AND NATURAL LANGUAGE PROCESSING
M. Jathavedan, Emeritus Professor, Department of Computer Applications, CUSAT, Cochin firstname.lastname@example.org

The history of modern linguistics is chronologically divided into two as BC (Before Chomsky) and AD (After Dissertation). Here dissertation means the thesis which Chomsky submitted to Pennsylvania University for his doctorate degree. His ideas are considered epoch-making, comparable to Darwin's theory of evolution. [...] to get recognition [...] 'Syntactic Structures'. Paninian grammar was introduced to modern linguistics as a forerunner of Chomsky's generative grammar introduced in the above book. 'Many linguists, foreign and Indian, joined the bandwagon and posed as experts in Paninian grammar in Chomskian terms' (Joshy S.D.). The renewed interest influenced the interpretation of Paninian grammar itself as generative grammar, the idea that grammar consists of modules in a hierarchy or levels. The first contribution in this direction was due to Kiparsky and Staal (1969), who proposed a hierarchy of four levels of representation. This was criticized by Hauben (2002) as they did not permit semantic factors. Other important contributions are due to Caradona (1976).
Joshy continues: 'Somewhat later Chomsky had drastically reversed his ideas and after the enthusiasm for Chomsky subsided, it became clear that the idea of transformation is alien to Panini. Now a new type of linguistics has come up, called Sanskrit Computational Linguistics with three capital letters. Although Chomsky is out, Panini is still there ready to be acclaimed as the forerunner of SCL.' But SCL was identified as a branch of study in 2007 only, and there were other factors that led to its formation.

In a paper entitled 'Knowledge representation in Sanskrit and artificial intelligence', Briggs (1985) drew the attention of computer scientists to the works on semantics in Sanskrit literature instead of Paninium. The important fact to note is that he was referring [...] Laghu Manjusha' of Bhatta-Nagesa (1730-1810), perhaps the last Sanskrit scholar in the Indian tradition. This paper, rightly or wrongly, aroused great enthusiasm among Sanskrit scholars. Some of them went even to the extent of claiming that the future direction of research in artificial language would be decided by Sanskrit. The immediate result was the 'First Seminar on Knowledge Representation and Samskritam' (1986) held at Bangalore, in which Briggs [...].

It is a surprising fact that we are not able to locate any more contribution of Briggs in [...] pouring in the internet for and against the arguments put forward by Briggs. Another point to be noted is that the authority of the paper is Briggs in person and not NASA, as misconceived by many.

Thus computational Sanskrit emerged as a new branch of research. Apart from computer assisted teaching and research [...], automated reconstruction of Sanskrit texts and machine aided translation [...], designing a working system of Paninian [...] Languages, its possible applications in cognitive science, AI are some areas of active research in Sanskrit departments of many universities and computer science departments of many institutes.
A question that naturally arose was the role of Sanskrit as a [...] development of a compiler for use of Sanskrit instructions. C-DAC, Bangalore had initiated some work in this direction in the early 1990s itself. It was claimed that Astadhyayi (Paninium) was useful in this matter, i.e., the meta-rule, meta-language and linguistic marker system of Panini could be used to draw up the specification and requirements of such a processor. To what extent the search has been successful after twenty years is a question.

The International Symposiums on Sanskrit Computational Linguistics (SCL) were the results of the attempt to provide a common platform for the traditional Sanskrit scholars and the computational linguists. It was a culmination of the World Sanskrit Conferences, especially the thirteenth one held at Edinburgh, and the First National Symposium on Modeling and Shallow Parsing of Indian Languages in Mumbai, both held in the year 2006. The first Symposium was held in France in 2007 and the last one at Jawaharlal Nehru University, New Delhi (2010).

LINGUISTICS AND PHILOSOPHY
Linguistics is considered as a part of philosophy in India. It is often said that 'the grammatical method of Panini is as fundamental to the Indian thought as is the geometrical method of Euclid for the western thought.'

Semantics in Sanskrit was never a well-defined domain of a separate discipline (Hauben). Rather, it remained the battlefield for exegetes, logicians and grammarians with various backgrounds and philosophical commitments. It was only a few centuries after Bhartrhari (4th century A.D.) that a sophisticated specialized language and terminology were developed for discussing semantic problems and theories of verbal understanding. Thus during the period from the thirteenth to the sixteenth centuries semantic issues were seriously taken up for discussion between different philosophical schools, not only focussing on language but also from a religious point of view.

The formal categories in their discussions were mainly those established in Paninium [...] philosophically by Bhartruhary. We will consider two or three of them.

As an example we consider the sentence:
'Rama cooks rice'
In the subdivision of a sentence into words, the grammarians take the verb as important. Other words are related to this meaning-bearing word in one way or other. Kriya is the action of the verb in the sentence. The other words, which are "factors in the action" of the verb, are called karakas. Panini has defined six karakas.

For the sentence in our example the grammarians give the following analytical description:
It is the activity of cooking, taking place in the present time, having an agent which is identical with Rama, having an object identical with rice.

Thus the sentence is split into elements such as stem, root, affix and ending, with the attribution of well-defined meaning to [...]. The central element in this analysis is the meaning expressed by the verb 'cooks', or to be more precise, the meaning of the verb root 'to cook'. In Sanskrit, the verbal ending (ti in pa(ca)ti) indicates that the activity takes place in the present time. The agent of the action is expressed by the grammatical subject, Rama; the object of the action is the grammatical object, rice.
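
The grammarians' analysis of 'Rama cooks rice' described above can be pictured as a small data structure. The following Python sketch is only an illustration: the class name, the field names and the paraphrase function are hypothetical and are not taken from any Sanskrit processing system; the paraphrase simply reproduces the analytical description quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KarakaFrame:
    """Hypothetical frame for a simple sentence, centred on the verb."""
    kriya: str   # the activity expressed by the verb root
    tense: str   # time of the activity, signalled by the verbal ending
    agent: str   # the factor of the action expressed by the grammatical subject
    obj: str     # the factor of the action expressed by the grammatical object


def grammarian_paraphrase(frame: KarakaFrame) -> str:
    """Render the frame in the form of the grammarians' analytical description."""
    return (
        f"It is the activity of {frame.kriya}, "
        f"taking place in the {frame.tense} time, "
        f"having an agent which is identical with {frame.agent}, "
        f"having an object identical with {frame.obj}."
    )


rama_cooks_rice = KarakaFrame(kriya="cooking", tense="present", agent="Rama", obj="rice")
print(grammarian_paraphrase(rama_cooks_rice))
```

In this sketch the activity is the head of the frame and the agent and object hang off it, which mirrors the grammarians' view that the verb is the central element.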
For the Mimamsa thinkers also the verb is the central element in a sentence. While grammarians take the verbal root and the activity expressed by it as more important than the verbal ending and its meaning, the latter are more important for Mimamsakas. According to them the basic meaning of all verbs is a creative urge which stimulates action. This basic urge is expressed, that is, transmitted to the listener, by the verbal ending, not by the verbal root, which merely qualifies this creative urge. Thus according to them the sentence in our example can be given the following structural description:
"It is the creative urge which is conducive to cooking, taking place in the present time, having the same substratum as the agent residing in Rama, having as object rice."

Now for the Nyaya school, it is not the verb which is the central element in the sentence but, generally, the noun in the first ending (nominative). Thus the structure of the verbal knowledge in our example according to them is:
"It is Rama who possesses the volitional effort conducive to cooking which produces the softening and moistening which is based in rice."

Underlying all these descriptions is the presupposition that the main structural relation in the sentence is that between the qualifier and the thing to be qualified (visesana/visesya). Unlike grammarians and Mimamsakas, for whom the visesya is the verb, for Nyaya thinkers the visesya is the noun in the first ending.

Bhartrhari (4th century AD) developed his sphota theory after Panini (4th century [...]). [...] Bhatta Nagesa gave completion to sphota theory in the eighteenth century. The later [...] considered as a continuation of this. Again centuries elapsed before [...]

There are four factors involved in a proper cognition: expectancy, mutual compatibility, proximity and intention of the speaker. It is difficult to include the last one in any syntactic solution. According [...] communicate through words all that he intended to, and the hearer understands more or at times less than what he hears!

Thus there is mutual dependency of Indian theories of syntax and semantics. It is said that the Indian linguists of the fifth century B.C. knew more of the subject than western linguists of the nineteenth century A.D. Further, if there is any area where the ancient Sanskrit scholars have been [...]

SANSKRIT COMPUTATIONAL LINGUISTICS
I have already quoted S.D. Joshy. The sentences were from his paper 'Background of the Astadhyayi' read in the third International Symposium on Sanskrit Computational Linguistics held in 2009 at Hyderabad. He continues: 'Contrary to some western misconceptions the starting point of Panini's analysis is not meaning or the intention of the speaker, but words from elements. [...] morphology to arrive at a finished word.' But 'he developed a number of theoretical concepts which can be applied to other languages also.'

Coming back to Briggs, we note that in contrast to other works his paper for the first time drew the attention of computer scientists to the semantic theories available in Sanskrit. Since it is meaning that is important in a sentence, syntax is developed to tackle the semantic problem.
REFERENCES:
1. Briggs, Rick, 1985, Knowledge representation in Sanskrit and artificial intelligence, The AI Magazine.
2. Briggs, Rick, 1986, Shastric Sanskrit: an interlingua for machine translation, First National Conference on Knowledge Representation, Bangalore.
3. Chomsky, N, 1957, Syntactic Structures, The Hague, Mouton.
4. Caradona, George, 1976, Panini: A survey of Research, The Hague, Mouton.
5. Kiparsky, Paul and Staal J.F., 1969, Syntactic and semantic relations in Panini, FL 5.
6. Hauben, E.M, 2002, Semantic in the Sanskrit tradition on the eve of colonialism, Project report, Leiden University.
7. Joshy, S.D., 2009, Background of the Astadhyayi, Third International Symposium on Sanskrit Linguistics, Hyderabad.
Overview of Question Answering System
K.M. Arivuchelvan, Research Scholar, Periyar Maniammai University.
K. Lakshmi, Professor, Periyar Maniammai University.

Interaction between humans and computers is one of the most important active areas of research in this modern world. In particular, interaction in natural language is becoming more popular. Natural Language Processing is a computational technique for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis to achieve human-like language processing for a wide range of applications. One of the most powerful applications of NLP is the Question Answering System. The need for automated question answering systems becomes more urgent due to the enormous growth of digital information in text form. A QA system involves analysis of both questions and answers. In this overview, we focus on Question Type Classification, Question Generation, and Answer Generation for both closed and open domains.

Research in Natural Language Processing has been going on for several decades, dating back to the late 1940s. The goal of NLP is to accomplish human-like language processing. The discipline and practice of [...]
Linguistics - [...] structural models of language and the discovery of language universals - in fact the field of NLP was originally referred to as Computational Linguistics;
Computer Science - is concerned with developing [...]
Psychology - looks at language usage as a window into human cognitive processes, and has the goal of modelling the use of language in a psychologically [...]

The most explanatory method for presenting what actually happens within a Natural Language Processing system is by means of the 'levels of language' approach. Phonology concerns how words are [...]. Morphology concerns how words are constructed from more basic meaning units called morphemes. A morpheme is the primitive unit of meaning in a language. The syntax level concerns how words can be put together to form sentences, and determines what structural role each word plays in the sentence and what phrases are subparts of what other phrases. The semantic level concerns what words mean and how these meanings combine in sentences to form sentence meanings. The pragmatic level concerns how sentences are used in different situations and how use affects the interpretation of the sentence. The discourse level concerns how the preceding sentences affect the interpretation of the next sentence. World knowledge includes the general knowledge about the structure of the world that language users must have in order to maintain a conversation.

Natural language processing is used for a wide range of applications, [...] Translation, Dialogue Systems. In this paper we discuss more towards Question-Answering.

Question answering can be performed in two domains: closed and open. Closed-domain question answering deals with questions under a specific domain (for example, medicine or automotive maintenance), and can be seen as an easier task because NLP systems can exploit domain-specific knowledge. Alternatively, closed-domain might refer to a situation where only a limited type of questions are accepted, such as questions asking for descriptive rather than procedural information. Open-domain question answering deals with questions about nearly anything, and can only rely on general ontologies and world knowledge. On the other hand, these systems usually have much more data available from which to extract the answer.
Question Answering is a specialized form of information retrieval. Given a collection of documents, a Question Answering system attempts to retrieve correct answers to questions posed in natural language. Open-domain question answering requires question answering systems to be able to answer questions about any conceivable topic. Such systems cannot, therefore, rely on hand-crafted domain-specific knowledge to find and extract the answers.

Question Classification is an important task in Question-Answering. The most well-known question taxonomy was the one proposed by Graesser and Person (1994), based on their two studies of tutoring sessions in a college research methods course and a middle school algebra course. Six trained human judges coded the questions in the transcripts obtained from the tutoring sessions on four dimensions: Question Identification, Degree of Specification (e.g. High Degree means questions contain more words that refer to the elements of desired information), Question-content Category, and Question Generation mechanism (the reasons for generating questions include knowledge deficit [...] conversation control). They defined the following 18 question categories according to the content of information sought rather than on the interrogative words (i.e. why, how, where, etc.).

Interpretation: What does X mean?
Causal antecedent: Why/how did X occur?
Goal orientation: Why did an agent do X?
Instrumental/procedural: How did an agent do X?
Expectation: Why didn't X occur?
Judgmental: What do you think of X?
11. Verification: invites a yes or no answer.
12. Disjunctive: Is X, Y, or Z the case?
13. Concept completion: Who? What? When?
14. Example: What is an example of X?
15. [...] properties of X?
16. Quantification: How much? How many?
17. [...] agent do X?
18. Comparison: How is X similar to Y?

After analyzing 5,117 questions in the research methods [...] and 3,174 questions in the algebra [...] categories: verification, instrumental-procedural, concept completion, and quantification questions.

Question Generation (QG)
For the first time in history, a person can ask a question on the web and receive answers in a few seconds. Twenty years ago it would take hours or weeks to receive answers to the same questions as a person hunted through documents in a library. In the future, electronic textbooks and information sources will be mainstream and they will be accompanied by sophisticated question asking and answering facilities. Applications are endless and far reaching. Below are listed a small sample, some of which are addressed in this [...]:
1. Suggested good questions that learners might ask while reading documents and other media.
2. Questions that human tutors might ask to promote and assess deeper learning.
3. Suggested questions for patients and caretakers in medicine.
4. Suggested questions that might be asked in legal contexts by litigants or in security contexts by interrogators.
5. [...] from information repositories as candidates for Frequently Asked Question (FAQ) facilities.

The time is ripe for a coordinated effort to tackle QG in the field of computational linguistics and to launch a multi-year campaign of shared tasks in Question Generation (QG). We can build on the disciplinary and interdisciplinary work on QG that has been evolving in the fields of education, the social sciences and computer science. The QG system [...] implemented QG algorithms, and consults relevant information sources. Very often there are specific goals that constrain the QG system.

Question Answering
Today's question answering is not limited by the type of document or data repository; it can address both traditional databases and more advanced ones that [...]. Structured and unstructured data collections can be considered [...] answering. Unstructured data allows querying of raw features (for example, words in a body of text), extracting [...] attached. [...] structured information [...] Related and [...] traditional distinction between restricted domain question answering (RDQA) and open domain question answering (ODQA). RDQA systems are designed to answer questions posed by users in a specific domain of competence, and usually rely on manually constructed data or knowledge sources. They often target a category of users [...] terminology in their query formulation, as, for example, in the medical domain. ODQA focuses on answering [...] domain. Extracting answers from a large corpus of textual documents is a typical example of an ODQA system. Recently, we have witnessed an approach of question answering involving semi-structured data. These data often comprise text documents in which the structure of the document or certain extracted information is expressed by a markup. Such markups can be attributed manually (e.g., the structure of a document) and/or in an automatic way, e.g., markups for identified [...] relationships in newspaper articles.

Question answering is a complex task needing effective improvements in different research areas including question generation, question [...] retrieval, natural language processing, database [...], human computer interaction, speech processing and computer vision.
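
As a rough illustration of question classification, the sketch below assigns a few of the content categories listed earlier using surface patterns only. This is a deliberate simplification and not the coding procedure of Graesser and Person, whose categories are defined by the content sought rather than by interrogative words. The pattern list, the label strings and the function name are assumptions made for this example, and categories such as "disjunctive" are left out.

```python
import re

# Hypothetical surface patterns; the first matching pattern wins, so more
# specific patterns are placed before the generic "why" and "how" ones.
PATTERNS = [
    ("expectation", re.compile(r"^why\s+(didn't|did\s+not|doesn't|does\s+not)\b", re.I)),
    ("quantification", re.compile(r"^how\s+(much|many)\b", re.I)),
    ("comparison", re.compile(r"\b(similar\s+to|different\s+from)\b", re.I)),
    ("example", re.compile(r"\bexample\s+of\b", re.I)),
    ("interpretation", re.compile(r"^what\s+does\b.*\bmean\b", re.I)),
    ("judgmental", re.compile(r"^what\s+do\s+you\s+think\b", re.I)),
    ("causal antecedent or goal orientation", re.compile(r"^why\b", re.I)),
    ("instrumental/procedural", re.compile(r"^how\b", re.I)),
    ("concept completion", re.compile(r"^(who|what|when|where)\b", re.I)),
    ("verification", re.compile(r"^(is|are|was|were|do|does|did|can|will|has|have)\b", re.I)),
]


def classify(question: str) -> str:
    """Return the label of the first pattern that matches the question."""
    text = question.strip()
    for label, pattern in PATTERNS:
        if pattern.search(text):
            return label
    return "unclassified"


for q in ["How many planets are there?",
          "Why didn't the engine start?",
          "Is the server running?"]:
    print(q, "->", classify(q))
```

A surface classifier of this kind cannot separate categories that share an interrogative word (for example causal antecedent and goal orientation), which is one reason trained human judges or learned models are used in practice.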
REFERENCES
1. Liddy, E. D. In Encyclopaedia of Library and Information Science, 2nd Ed. Marcel Dekker, Inc.
2. Ming [...] System for Academic Writing Support" Dialogue and Discourse 3(2) (2012) 101-124.
3. Mark [...] Question Answering" September 2005.
4. http://en.wikipedia.org/wiki/Question_answering.
5. Andrew Lampert "A Quick Introduction to Question Answering" December 2004.
6. Workshop Report "The Question Generation Shared [...] Sponsored by the National Science Foundation.
7. Oleksandr Kolomiyets, Marie-Francine Moens "A survey on question answering technology from [...] Information Sciences 181 (2011) 5412-5434.
I-Search.... Future of Search Engines Author
M. Tech Computational Linguistics Govt. Engg. College, Sreekrishnapuram email@example.com
surfing the web may be a casual phrase in day to day business. The netizens continuously enrich the web-vocabulary by words like ―Googling‖. What this
speaks is how search engines are important in this digital era. A web search engine is designed to
A semantics search engine attempts to make sense of search results based on context. It
search for information on the World Wide Web.
automatically identifies the concepts structuring Today‘s
Directory-based engines, like Yahoo, are still built manually. What that means is that you decide what your directory categories are going to be Business, and Health, and Entertainment and then you put a person in charge of each category, and that person builds up an index of relevant links. Crawler-based engines, like Google, employ a software program — called a crawler — hat goes out and follows links, grabs the relevant information, and brings it back to build your index. Then you have an index engine that allows you to retrieve the information in some order, and an interface that allows you to see it. It‘s
For instance, a
retrieve documents containing the words ―vote‖, ―campaigning‖ and ―ballot‖, even if the word ―election‖ is not found in the source document. Semantic
points including context of search, location, intent,
generalized and specialized queries, concept matching
provide relevant search results. Major search engines like Google and Bing incorporate some elements of Semantic Search. The objective of this article is to discuss the recent advances in
all done automatically.
area of Semantic Search. As the Web continues to grow, however, and to be more
Google's Knowledge Graph:
communication, and research, information-retrieval
Google usually returns the search result for any
problems become a more serious handicap. The
query based on the text and the content. To put
percentage of Web content that shows up on search
it right, it does not understand the exact
engines continues to wane. And as search engines
meaning of the words. It matches the keywords
of the query with those of the sites and returns
information they provide may be increasingly out-of-
pages that have a significant authority on those
Recent advances in intelligent search suggest that
Amit Singhal, Google‘s senior VP of engineering,
these limitations can be partially overcome by
said : “The introduction of Knowledge Graph
providing search engines with more intelligence and
enables Google to understand whether a search
with the user‘s underlying knowledge. That is called
natural language processing. It might also have to
understand what the user need, even when he
about discovery' – the basic human need to
doesn‘t say it. And that requires some knowledge of
learn and broaden your horizons”.
'Search is a lot
the user. These ideas lead to the birth of a new generation of web technologies, popularly known as Semantic Web.
“The introduction of Knowledge Graph enables Google to understand whether a search for 'Mars' refers to the planet or the confectionery manufacturer.
'Search is a lot about
discovery' – the basic human need to learn and broaden your horizons”. Amit Singhal
Bing's Semantics Search
By making search more natural and intuitive,
Microsoft specifically brands Bing as a "decision
Powerset is fundamentally changing how we
engine," and not as a general purpose search
search the web, and delivering higher quality
engine--even though it provides that functionality
as well--in order to differentiate it from Google Search. Bing's search is based on semantic technology from Powerset that was acquired by Microsoft in 2008. Notable changes include the listing of search suggestions as queries are entered and a list of related searches (called "Explore
knowledge on phrases and what they uniquely refer to. 
Hakia: Hakia is a general-purpose semantic search engine that searches structured corpora (text) like Wikipedia. For some queries (typically popular queries and queries where there is little
These are portals to all kinds of information on the subject. Every resume has an index of links to the information presented on the page for quick reference. Often, Hakia will propose related
research.  Bing‘s new product Adaptive Search strives to capitalize
Adaptive Search will take into consideration your
Cognition has a search business based on a
user behaviour, then tailor your Bing results to be
semantic map, built over the past 24 years,
most appropriate. So if you‘ve searched for a
word then clicked on a specific site previously,
Bing will predict that it‘s likely that what you‘re
English language available today. It is used in
searching for falls into the context of that site,
thus it can provide you with results that are more
translation, document search, context search,
and much more. 
Powerset is a Microsoft owned Company building
Swoogle, the Semantic web search engine, is a
a transformative consumer search engine based
research project carried out by the ubiquity
on natural language processing. Their unique
research group in the Computer Science and
innovations in search are rooted in breakthrough
technologies that take advantage of the structure
University of Maryland. It‘s an engine tailored
and nuances of natural language. Using these
towards finding documents on the semantic
advanced techniques; Powerset is building a
confines of keyword search.
Swoogle is capable of searching over 10,000 ontologies and indexes more than 1.3 million web documents. It also computes the importance of a Semantic Web document. The techniques used for indexing are Google-style page ranking and mining the documents for the interrelationships that are the basis of the semantic web.
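To give a feel for what "Google-style page ranking" means, here is a minimal sketch of the classic PageRank power iteration. It is not Swoogle's actual ranking code; the document names, the damping value of 0.85 and the fixed iteration count are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Plain PageRank by power iteration.

    links maps each document to the list of documents it links to;
    every link target must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "teleport" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            # A page with no outgoing links spreads its rank over all pages.
            targets = outgoing if outgoing else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph between Semantic Web documents.
graph = {
    "ontology_a": ["ontology_b"],
    "ontology_b": ["ontology_c"],
    "ontology_c": ["ontology_a"],
    "doc_1": ["ontology_c"],
}
for name, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

A document that many others point to (here `ontology_c`) ends up with the highest score, while a document nobody links to (`doc_1`) keeps only the baseline share.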
PyLucene PyLucene is a GCJ-compiled version
integrated with Python. Its goal is to allow you to use
Lucene's text indexing and
NLP is a complex area of research, requiring a solid understanding of grammars (not just grammar), and a good grounding in computational
linguistics (in order to apply the techniques to machines, which is not always easy). Understanding the techniques used in NLP allows us to provide the best format and patterns for the search engine. Since NLP seeks to mimic human language understanding, using common sense is a good idea. But before any broader, more sophisticated sort of intelligence can be placed into a machine, we humans will have to get a better grasp of just what intelligence is.
References: 1. http://mashable.com 2. http://semanticweb.com 3. http://thenextweb.com 4. http://web2innovations.com 5. http://blogs.wsj.com
Google synonyms and natural language processing
Google just blogged about synonyms as they relate to searcher intent. They provide several examples of how a concept as simple as a synonym complicates natural language processing. This also brings up some important recommendations for site owners with respect to SEO. Prospective customers type in all kinds of variations on your most obvious keywords (hence the need for keyword research). Often they make use of synonyms, some common, some not. These variations often represent less competitive opportunities for high search engine rankings if you can incorporate those synonyms into your website. In particular:
• Use common variations within your existing copy rather than using the same phrase repeatedly. (This also tends to make long blocks of text more readable.)
• Develop pages that specifically focus on each of the most common and valuable synonyms.
• If there are enough synonyms and industry-specific terms, consider developing a glossary of terms.
• Find opportunities to talk about the synonyms, such as a blog post or article that discusses how synonyms may actually be somewhat different or whose similarity is up for debate (e.g. SEM vs. Search Engine Advertising).
http://www.web1marketing.com
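How synonyms widen a search can be sketched in a few lines. This is only an illustration of query expansion with a hand-made synonym table; the `SYNONYMS` entries are invented for the example, and nothing here claims to reflect how Google actually handles synonyms.

```python
# Hypothetical hand-made synonym table (an assumption for this sketch).
SYNONYMS = {
    "photo": {"picture", "image"},
    "cheap": {"affordable", "inexpensive"},
}

def expand_query(query):
    """Turn each query word into a group holding the word and its synonyms."""
    return [{word} | SYNONYMS.get(word, set()) for word in query.lower().split()]

def matches(document, expanded_query):
    """A document matches if it contains at least one word from every group."""
    words = set(document.lower().split())
    return all(group & words for group in expanded_query)

expanded = expand_query("cheap photo")
print(matches("affordable picture frames", expanded))  # matches via synonyms
print(matches("cheap frames", expanded))               # no word for "photo"
```

A page that says "affordable picture frames" is found for the query "cheap photo" even though it shares no literal word with it, which is why covering common variations in site copy can pay off.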
Remolding Professional sectors: the SaaS way
SaaS: Purpose and Functions
The costs and time-to-market benefits of outsourcing business
services like payroll, Storage space, Customer Relationship
Dr. Sudheer S Marar
Management (CRM) applications, and company websites have been proven for many businesses. These types of outsourced services have most recently come to be known as Software as a Service (SaaS).
MCA MBA PhD Associate Professor and HOD, Department of MCA Nehru College of Engineering and Research Centre
market while others take time or in the worst case never get toehold in a given market. The ideal introduction scenario for a Introducing new technology is an expensive undertaking,
carrier would be that they could
usually requiring high capital outlays and can take many
try a new service in a particular
months of training, installation and integration before service
market without having to make
can be delivered in network. Outsourcing these services to
a significant investment all the
organizations that are experts in the technology lowers costs,
while gaining key market data.
increases uptime, accelerates revenue realization and provides increased flexibility & functionality.
Therefore, companies today are faced with the challenges of
Due to these results, hosting for these critical business
functions continues to grow and many companies are looking
operating costs, protecting their
for similar opportunities in other operational areas.
current investments and having the
Effects of Downturn
applications quickly. To add to
As stated in the Movius Corporation annual report, the economic
the challenge, many carriers
downturn has globally forced many companies to reduce
are faced with older application
spending across the board. This has put companies that are in
platforms that are limited in
highly competitive and innovation driven industries, say
telecommunications, in an exigent balancing act. While they
approaching end of life. These
need to try to control expenses, if they are not also continuing
companies need cost-effective
to introduce the latest applications and services, they will
quickly begin to lose their market share.
networks to IP infrastructure
The ideal situation for a carrier would be to introduce new services almost immediately, without risking precious capital in hand. Under the best possible scenario the carrier could begin generating revenue in a matter of weeks after making the decision to launch a new service. If the service could be introduced without the need to add staff, the solution is essentially risk free.
As extracted from a lead article of IDC-SAP initiated
Clearly SaaS applications are maturing. The
paper, "...Professional service firms focus their business
number of companies that either are using
management energy on optimizing the utilization of an
expert's or a consultant's time. They attempt to develop
applications in the next year has grown
service offerings or skill sets that clients will find
compelling. Ultimately, they focus on properly charging
suggesting that the barriers to adoption —
and receiving payment from clients. Larger firms tend to
broaden their offerings to ensure a greater wallet share.
overcome. We see a bright future for SaaS
Meanwhile, smaller firms tend toward key-field focus
across a broad range of application areas
and deep industry expertise, hoping to foster continuing
and for large and small professional services
relationships with a small number of clients."
In short, all firms balance developing a talent pipeline
SaaS is not without its problems, however.
with maximizing utilization rates. Client satisfaction and
Functionality and security concerns hang
trusting relationships drive both repeat business and
back, and while these concerns are more a
applications or plan
perception than reality, it is important when
Therefore, firms seek to ensure deliverables of the
considering applications from a SaaS vendor
highest possible quality and strive to fully meet client
that appropriate due diligence be applied to
expectations throughout the engagement process. Firms
ensure that the functionality meets critical
increasingly use technology to support all parts of their
business: Finance and scheduling software are common,
corporate client to have a good choice on its
knowledge management and data warehouse capability
SaaS vendor, not all are created equal. As
help improve service quality, and client management and
this domain is a maturing capability, one
engagement management software are increasingly used
should make it sure to select a vendor that
to monitor and maximize customer satisfaction. The
brings experience, financial stability, and a
increased use of technology has both aided and hindered
good reputation for working effectively with
professional services firm-constrains to improve their key
thereby ensuring the client on its business benefits,
Benefits of SaaS
The cost of a complex business management software implementation is often the starting point for a discussion, and often the point where the discussion meets a quick end. In their research, IDC has identified several areas where SaaS system delivery costs differ from on-premise delivery costs. Primarily, they are the following:
• License fees, both initial and maintenance.
• IT infrastructure costs.
• Test environment maintenance and development costs.
• IT personnel/support costs.
• Security, backups, and disaster recovery.
Apple's SIRI
Robert Jesuraj K
M. Tech Computational Linguistics Govt. Engg College Sreekrishnapuram
What is Siri?
Siri (Speech Interpretation and Recognition Interface) is an intelligent personal assistant and knowledge navigator which works as an application for Apple's iOS. The application uses a natural language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Web services.
Siri was originally introduced as an iOS application available in the App Store by Siri, Inc. Siri, Inc. was acquired by Apple on April 28, 2010. Siri, Inc. had announced that their software would be available for BlackBerry and for Android-powered phones, but all development efforts for non-Apple platforms were cancelled after the acquisition by Apple. Siri is now an integral part of iOS 5, and available only on the iPhone 4S, launched on October 14, 2011.
announced that it had no plans to support Siri on any of its older devices Obviously, Siri won't be able to answer every
Using Siri The app transcribes spoken text and then takes these commands and routes them to the right web services. If you try to book a table at a Thai restaurant ("get me a table at a good
restaurant nearby"), for example, Siri will check where you are, query Yelp for reviews of nearby Thai restaurants, show you the options and then pre-populate a reservation form on OpenTable with your information. All you have to do is to confirm Siri's selection.
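The flow described above (recognize what the user wants, check the location, consult a review service, then ask for confirmation) can be sketched as a small intent router. This is only an illustration: the keyword matching, the `current_location` stub, the `REVIEWS` data and the restaurant names are all hypothetical, and none of it reflects Siri's real implementation or any actual Yelp or OpenTable API.

```python
def current_location():
    """Hypothetical stand-in for the phone's GPS lookup."""
    return "Palakkad"

# Hypothetical review data: location -> list of (restaurant, rating).
REVIEWS = {
    "Palakkad": [("Thai Orchid", 4.5), ("Bangkok Corner", 3.9)],
}

def parse_intent(text):
    """Very rough keyword-based intent detection (an assumption of this sketch)."""
    lowered = text.lower()
    if "table" in lowered or "restaurant" in lowered:
        return "book_table"
    return "web_search"

def web_search(text):
    """Fallback: hand the query to a general search engine."""
    return {"action": "web_search", "query": text}

def book_table(text):
    """Pick the best-reviewed nearby restaurant and prepare a reservation."""
    place = current_location()
    options = sorted(REVIEWS.get(place, []), key=lambda r: r[1], reverse=True)
    if not options:
        return web_search(text)
    return {
        "action": "confirm_reservation",
        "location": place,
        "restaurant": options[0][0],
        "options": [name for name, _ in options],
    }

HANDLERS = {"book_table": book_table, "web_search": web_search}

def handle(text):
    """Route a transcribed command to the service that can fulfil it."""
    return HANDLERS[parse_intent(text)](text)

print(handle("Get me a table at a good restaurant nearby"))
```

The user only has to confirm the pre-filled result; anything the router does not recognize falls through to an ordinary web search.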
query - and sadly the app doesn't use Wolfram Alpha to give you answers to factual questions (yet). Should that happen, Siri will just route your query to a search engine and display the search results. As the Siri team told us, however, users tend to learn which queries work best pretty quickly (just like we learned how to structure effective queries for Google). To use the iPhone app, you just have to say aloud a command like "Book a table for six at
The software is surprisingly good at translating
7pm at McDonalds" (I'm sure you're classier
voice queries into text. The application works so
than that, but let's stick with it for now), and
well because it is able to recognize the context of
then using speech-recognition technology and
your queries. This kind of semantic analysis is a
the iPhone's GPS capabilities, your command is
very computing intensive problem, so most of the
actual number crunching happens on Siri's servers.
responding with confirmation of booking, or
Siri outsources the voice recognition to Nuance and
lack of availability.
if you are not comfortable with speaking into your phone, you can always use a regular text query as well.
Siri, which has ties with Stanford Research Institute
DARPA Helps Invent The Internet And
Helps Invent Siri
MovieTickets, StubHub, CitySearch and TaxiMagic to help with bookings and information, which pretty
With Siri, Apple is using the results of over 40
much wipes out the reason why you'd want to
download any of those services' apps individually.
Siri is all this and something that could only be held to the definition of true synergy, e.g.: "Two or more things functioning together to produce a result not independently obtainable". None of the individual parts are "new" but the combination Siri created has never really been seen before. It has been the Holy Grail of computer researchers to one day create a device that could become conversational and intelligent in such a way that it would appear that the dialog is human generated.
(http://www.ai.sri.com/ Siri Inc. was a spin-off of SRI International) through the Personalized Assistant
https://pal.sri.com) and Cognitive Agent that Learns and Organizes Program (CALO). This includes the combined work from research teams from Carnegie Mellon University, the University of Massachusetts, the University of Rochester,
Machine Cognition, Oregon State University, the University of Southern California, and Stanford University. This technology has come
Apple Siri can speak Hindi now
When Siri was announced with the iPhone 4S,
a very long way with dialog and natural
understand the Indian accent let alone be able to
evidential and probabilistic reasoning, ontology
speak Hindi. We were however left bewildered when
we found a video online where Siri responds to
reasoning and service delegation.
users' queries in Hindi!
Similar applications for hand-held devices
Siri's support for Hindi comes to us courtesy Kunal
1) S Voice is an intelligent personal assistant
Kaul. The hack connects Siri to Kunal‘s Google API
and knowledge navigator which works as an
server and interacts in Hindi.
Another interesting aspect of the video is that the questions are asked in English and the responses given by Siri are in Hindi, with the Devanagari script appearing on screen. The fact that the questions are asked in English has led us to believe that Siri does not understand questions asked in Hindi.
smartphones, similar to Apple Inc.'s Siri on the
iPhone. It first appeared on the Samsung Galaxy S III on May 3, 2012. The application uses a natural language user interface to answer
and perform actions by delegating requests to a set of Web services.
2) Assistant is the codename of a rumored upcoming Google application that will integrate voice recognition and a virtual assistant into Android. It is expected to launch in Q4 of 2012. Before March 2, 2012, the project was known as "Google Majel", a name derived from Majel Barrett-Roddenberry, the actress best known
Federation Computer from Star Trek.
The software is an evolution of Google's Voice
With the app, an Android user can just "ask"
Actions that is currently available on most Android
phones while adding natural language processing.
information. The developers claim Iris can talk
Where Voice Actions required the users to issue
on topics ranging from Philosophy, Culture,
"navigate to…", "Assistant" will allow the users to
However, Android users need to have "Voice
Search" and "TTS library" installed in their
According to search engineer Mike Cohen, the
phones for Iris to work. Among its features are
"Assistant" project has three parts: "getting the
world's knowledge into a format a computer can
searching on the web, and looking for a
understand, creating a personalization layer —
Experiments like Google +1 and Google+ are Google's way of gathering data on precisely how people interact with content; building a mobile, voice-centered "Do engine" ('Assistant') that's less
about returning search results and more about Whoosh
accomplishing real-life goals".
implemented in pure Python. Programmers can use it to easily add search functionality to their applications and websites. Every part
extended or replaced to meet your needs
3) Iris is a personal assistant application for Android. The application uses natural language processing to answer questions based on user voice requests. Iris currently supports Call, Text, Contact Lookup, and Web Search actions, including playing videos and looking for lyrics, movie reviews, recipes, news, weather, places and others. It was developed in 8 hours by Narayan Babu and his team
Limited, a Kochi (India) based firm. The name is actually
original application for the same use built by Apple Inc.
exactly. Some of Whoosh's features include: Pythonic API. Pure-Python. No compilation or binary
mysterious crashes. Fielded indexing and search. Fast
faster than any other pure-Python search solution I know of. See Benchmarks. Pluggable
(including BM25F), text analysis, storage, posting format, etc. Powerful query language. Pure Python spell-checker (as far as I know, the only one).
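The feature list above mentions BM25F among the pluggable scoring algorithms. As a rough idea of what such a scoring function computes, here is a pure-Python sketch of plain single-field BM25 (not the multi-field BM25F variant, and not Whoosh's own code). The parameter values `k1=1.2` and `b=0.75` are commonly used defaults, taken here as assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score every document against the query with plain BM25."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Long documents are penalized relative to the average length.
            norm = tf + k1 * (1.0 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1.0) / norm
        scores.append(score)
    return scores

docs = ["python search library", "cooking recipes for dinner", "pure python search engine"]
print(bm25_scores("python search", docs))
```

Documents that do not contain any query term score zero; among the rest, more occurrences of a term raise the score while extra length lowers it.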
Inviting Articles for CLEAR Dec 2012
We cordially invite thought-provoking articles, interesting dialogues and healthy debates on multi-faceted aspects of Computational Linguistics for the second issue of CLEAR, to be published in Dec 2012. The topics of the articles should preferably be related to the areas of Natural Language Processing, Computational Linguistics and Information Retrieval. Authors are requested to send their articles in doc/odt format to the Editor, before 15
November 2012, by email firstname.lastname@example.org. -Editor
Thanks To Principal, Govt. Engg. College Sreekrishnapuram, Staffs and Students, Dept. of CSE, Govt. Engg. College Sreekrishnapuram, Authors of CLEAR Sep 2012- Dr. Achutsankar, Prof. Jathavedan M, Dr. Sudheer S Marar, Mr. Sajilal D, Dr. Lakshi K, Mr. Arivuchelvan