
CLEAR June2013



CLEAR June 2013, Volume-2 Issue-2
CLEAR Magazine (Computational Linguistics in Engineering And Research)
M. Tech Computational Linguistics
Dept. of Computer Science and Engineering
Govt. Engineering College, Sreekrishnapuram, Palakkad 678633
simplequest.in@gmail.com

Chief Editor: Dr. P. C. Reghu Raj, Professor and Head, Dept. of Computer Science and Engineering, Govt. Engineering College, Sreekrishnapuram, Palakkad
Editors: Manu Madhavan, Robert Jesuraj. K, Athira P M, Sreejith C
Cover page and Layout: MujeebRehman. O

Editorial

SIMPLE News & Updates

A Novel Approach for Automated Question Answering
Before the internet and electronic data storage, the time to search...

Software Localization and Malayalam Computing
localize software in order to overcome cultural barriers for their ...

Extracting Precise Answers using Question Answering System
Support Vector Machine employed tree kernel with a SVM classifier for question classification ...

Human Language Processing: Biological Perspective
Scientific understanding of the role of genes in hearing is also increasing at an ...

ClearTK: Can I be a Competitor for NLTK and CoreNLP?
The ClearTK feature extraction library is highly configurable and ...

Details of M.Tech Projects

CLEAR Sep 2013 Invitation

Last word



Greetings!

We are happy to release this edition of CLEAR at the start of the new academic year with contributions on a variety of topics. Interestingly, there is an article on the biological perspective of language processing as well. This is a positive signal for the CLEAR team, as their efforts are being received by an interdisciplinary audience. We hope that this edition fuels further thoughts on the mysteries of the language phenomenon!

With Best Wishes, Dr. P. C. Reghu Raj (Chief Editor)



NEWS & UPDATES

Industrial Training at IIITM-K: The Virtual Resource Centre for Language Computing (VRC-LC) department of IIITM-K organized a short course and industrial training on Natural Language Processing exclusively for the PG students of GEC, Sreekrishnapuram. The course mainly covered Malayalam computing, with an emphasis on the need for enabling localization. It was a 10-day programme (18th May to 28th May 2013). During the course, eminent faculty members and research scholars of VRC-LC delivered sessions on various aspects of language processing.

Congratulations!!!!

Publication: The paper titled "N-gram based Algorithm for distinguishing between Hindi and Sanskrit texts", authored by Sreejith C and Indu M from M.Tech Computational Linguistics, has been accepted for presentation at the 2013 IEEE International Conference on Computing, Communication and Networking Technologies (ICCCNT)-2013. SIMPLE Groups congratulates Sreejith and Indu for their achievement!!!

Divya S from M.Tech Computational Linguistics presented a paper titled "News Summarization based on Sentence Clustering and Sentence Ranking" at ICMCMM 2013, conducted by MACFAST at Thiruvalla. SIMPLE Groups congratulates Divya S for her achievement!!!



A Novel Approach for Automated Question Answering

K. Ramya
Student, M.Tech (CSE), Periyar Maniammai University, Vallam
ramya.devi43@gmail.com

K.M Arivuchelvan
Assistant Professor (CSE), Periyar Maniammai University, Vallam
arivu@pmu.edu

Question Generation (QG) is a key challenge for systems that interact with natural languages. Using automated systems to generate questions helps to reduce the dependency on humans. In particular, this work uses anaphora resolution and Up-Keys to generate questions from an input document (i.e., a paragraph). This paper presents an approach to generating questions from a paragraph. Since the paragraph may contain complex sentences, the system first generates simple sentences. The simple sentences are transformed into interrogative sentences, and hybrid ranking is used to select the best questions. Current automatic question generation focuses on factual question generation for reading comprehension or vocabulary assessment, producing different types of questions from a paragraph (Yes-No, Who, What, Where and How questions). The system also generates the answers to the questions it frames.

I. Introduction

Question Generation (QG) is the task of generating reasonable questions from an input, which can be structured (e.g. a database) or unstructured (e.g. a text). Question Generation is believed to play a crucial role in a variety of cognitive faculties, such as comprehension and reasoning. Asking good questions is a valuable human skill, and we cannot expect it of everyone. People would therefore benefit from automated QG systems that assist them in meeting their inquiry needs. This section reports some of the research that supports our claim that human question asking is extremely limited in both quantity and quality. We consider question generation for the purpose of creating reading assessments about the factual information present in expository texts such as encyclopedia articles and textbooks.

In the field of Automatic Question Generation (AQG), most systems focus on the text-to-question task, where a set of content-related questions is generated from a given text. Usually, the answers to the generated questions are contained in the text. Question Generation can be divided into deep QG and shallow QG. Deep QG generates deep questions that involve more logical thinking (such as why, why not, what-if, what-if-not and how questions), whereas shallow QG generates shallow questions that focus more on facts (such as who, what, when, where, which, how many/much and yes/no questions).

A QG system can be helpful in the following areas:

Intelligent tutoring systems. QG can ask questions based on learning materials in order to check learners' accomplishment or help them focus on the keystones in study. QG can also help tutors to prepare questions intended for learners or prepare for potential questions from learners.

Closed-domain Question Answering (QA) systems. Some closed-domain QA systems use predefined (sometimes hand-written) question-answer pairs to provide QA services. By employing a QG approach, such systems could be ported to other domains with little or no effort.

Natural language summarization/generation systems. QG can help to generate, for instance, Frequently Asked Questions from the provided information source in order to provide a list of FAQ candidates.

The advantage of this approach is that the mapping from declarative sentence to interrogative is done on the semantic representations. In this way, we are able to use an independently developed parser and generator for the analysis and generation stages.

II. Background and Related Literature

NLP techniques have been used to develop a number of tutoring and feedback systems for academic writing support. In the field of computational linguistics, Question Generation (QG) is getting more attention from researchers [6]. Before the internet and electronic data storage, searching for the answer to a question could mean weeks of hunting for documents in the library. Electronic books and information sources will be the mainstream in the future. In the last few years, new preoccupations have appeared for automatic question generation. In ICITA'05 [1], a template-based approach was introduced to generate questions on four types of entities.

Another approach to question generation uses parse tree manipulation, named entity recognition, and Up-Keys (significant phrases in a document). That work described two methods: one for generating factoid questions, and another for generating definitional questions. It showed how the approach can generate multiple questions from a single input sentence, demonstrated through an evaluation that the factoid question generation method shows promise, and discussed plans to use question generation for question answering [8].

Numerous approaches to text compression and simplification have been proposed; see [5, 4] for reviews of various techniques. One particularly closely related method is discussed in [3]. That method extracts simple sentences from each verb in the syntactic dependency trees of complex sentences.

The task of generating a question about a given text can be decomposed into three subtasks. First, given the source text, a content selection step is necessary to select a target to ask about, such as the desired answer. Second, given a target answer, the question type is selected, i.e., the form of question to ask, such as a cloze or why question. Third, given the content and question type, the actual question is constructed in a question construction step. These steps are called Concept Selection, Question Type Determination and Question Construction [9], [10].
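The three subtasks above can be illustrated with a minimal Python sketch. It is not the authors' system: the regular expressions, the `Target` type and all function names are hypothetical, content selection is reduced to spotting a year or a capitalised name, and question construction only produces cloze (fill-in-the-blank) questions.

```python
import re
from typing import List, NamedTuple, Tuple


class Target(NamedTuple):
    sentence: str  # source sentence
    answer: str    # span selected as the desired answer
    kind: str      # "year" or "name" (toy categories, assumed for this sketch)


YEAR = re.compile(r"\b(?:1\d{3}|20\d{2})\b")
NAME = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def select_content(text: str) -> List[Target]:
    """Step 1, content selection: pick one answer target per sentence."""
    targets = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        match, kind = YEAR.search(sentence), "year"
        if match is None:
            match, kind = NAME.search(sentence), "name"
        if match is not None:
            targets.append(Target(sentence, match.group(0), kind))
    return targets


def determine_question_type(target: Target) -> str:
    """Step 2, question type determination: map the target kind to a type."""
    return "when" if target.kind == "year" else "who"


def construct_question(target: Target, qtype: str) -> str:
    """Step 3, question construction: build a cloze question by blanking the answer."""
    blanked = target.sentence.replace(target.answer, "_____", 1)
    return f"[{qtype}] {blanked}"


def generate(text: str) -> List[Tuple[str, str]]:
    """Run the three steps and return (question, answer) pairs."""
    return [
        (construct_question(t, determine_question_type(t)), t.answer)
        for t in select_content(text)
    ]


if __name__ == "__main__":
    sample = ("Lincoln won the Republican Party nomination in 1860. "
              "Abraham Lincoln led the Union.")
    for question, answer in generate(sample):
        print(question, "->", answer)
```

A real system would replace each step with parser-based analysis; the sketch only shows how the three stages hand data to one another.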


III. Methodology

A. Question Taxonomy

Questions are grouped into the following 18 categories according to the content of the information sought rather than the interrogative words (i.e. why, how, where, etc.) [11].

1. Verification: invites a yes or no answer.
2. Disjunctive: Is X, Y, or Z the case?
3. Concept completion: Who? What? When? Where?
4. Example: What is an example of X?
5. Feature specification: What are the properties of X?
6. Quantification: How much? How many?
7. Definition: What does X mean?
8. Comparison: How is X similar to Y?
9. Interpretation: What does X mean?
10. Causal antecedent: Why/how did X occur?
11. Causal consequence: What next? What if?
12. Goal orientation: Why did an agent do X?
13. Instrumental/procedural: How did an agent do X?
14. Enablement: What enabled X to occur?
15. Expectation: Why didn't X occur?
16. Judgmental: What do you think of X?
17. Assertion
18. Request/Directive

These are the types of question used by the question generation module to produce questions.

B. Input Paragraph Processing

In paragraph processing, the complex sentences are converted into simple sentences. This in turn helps to extract the important keywords from each sentence.

Figure 1: System Flow Diagram

C. Sentence Classification

The input to this module is the set of elementary sentences. A syntactic parser parses each elementary sentence, and based on the associated POS and NE tagged information, we get from each elementary sentence the subject, object, preposition and verb. This information is used to classify the sentences.

1. Human: Any subject that is the name of a person.
2. Entity: Animals, plants, mountains and any object.
3. Location: Words that represent locations, such as a country, city, school, etc.
4. Time: Any time, date or period, such as a year, Monday, 9 am, last week, etc.
5. Count: All counted elements, such as 9 men or 7 workers, and measurements like weight and size.

D. Question Generation from Paragraph

The Question Generation from Paragraphs (QGP) task has been defined such that it is application-independent. Application-independent means that questions are judged based on content analysis of the input paragraph. For this task, questions are considered important if they ask about the core idea(s) in the paragraph. Questions are considered interesting if an average person reading the paragraph would consider them so based on a quick analysis of its contents. Simple, trivial questions such as "What is X?" or generic questions such as "What is the paragraph about?" were avoided. In addition, implied questions were not allowed, as the emphasis is on questions triggered and answered by the paragraph. Questions should not be compounded, as in "What is ... and who ...?". Questions must be grammatically and semantically correct and related to the topic of the given input paragraph. The question types (who/what/why/...) generated for each paragraph should be diverse, if possible. Unique question types are preferred in the set of returned questions.

E. Question Generation with Answer Retrieval

The sentences in a target document are processed and question-answer pairs are extracted. Clicking a question displays its answer.

Example of Input Paragraph

Abraham Lincoln (February 12, 1809 – April 15, 1865), the 16th President of the United States, successfully led his country through its greatest internal crisis, the American Civil War, preserving the Union and ending slavery. As an outspoken opponent of the expansion of slavery in the United States, Lincoln won the Republican Party nomination in 1860 and was elected president later that year. His tenure in office was occupied primarily with the defeat of the secessionist Confederate States of America in the American Civil War. He introduced measures that resulted in the abolition of slavery, issuing his Emancipation Proclamation in 1863 and promoting the passage of the Thirteenth Amendment to the Constitution. As the civil war was drawing to a close, Lincoln became the first American president to be assassinated.

Examples of Questions

Who is Abraham Lincoln?
What major measures did President Lincoln introduce?
How did President Lincoln die?


When was Abraham Lincoln elected president?
When was President Lincoln assassinated?
What party did Abraham Lincoln belong to?

IV. Conclusion and Future Works

In this paper we presented an approach to question generation using Up-Keys (significant phrases in a document). We showed how our question generation approach can generate multiple questions from an input. The proposed approach automatically generates questions for a given text. We extracted elementary sentences from complex sentences using syntactic information and classified the elementary sentences. We generated questions based on the subject, verb, object and preposition using predefined interaction rules. Based on the questions, the system generates the answers. Since human-generated questions tend to have words with different meanings and senses, the system can be improved with the inclusion of semantic information and word sense disambiguation.

REFERENCES

[1] Andrenucci, A. & Sniders, E. (2005). Automated Question Answering: Review of the Main Approaches. In Proceedings of the 3rd International Conference on Information Technology and Applications (ICITA'05), Sydney, Australia.

[2] Xin Li and Dan Roth. Learning question classifiers. In Proceedings of the 19th international conference on Computational linguistics, Morristown, NJ, USA, 2002.

[3] Beigman Klebanov, B., Knight, K., Marcu, D.: Text simplification for Information Seeking applications. On the Move to Meaningful Internet Systems (2004)

[4] Clarke, J.: Global Inference for Sentence Compression: An Integer Linear Programming Approach. Ph.D. thesis, University of Edinburgh (2008)

[5] Dorr, B., Zajic, D.: Hedge Trimmer: A parse-and-trim approach to headline generation. In: Proc. of Workshop on Automatic Summarization (2003)

[6] Leung, H., Li, F. & Lau, R. Advances in Web Based Learning - ICWL 2007: 6th International Conference

[7] In V. Rus and A. Graesser, editors, The Question Generation Shared Task and Evaluation Challenge Workshop Report. The University of Memphis, 2009.

[8] Heilman, M., & Smith, N. (2009). Question generation via overgenerating transformations and ranking. Technical Report CMU-LTI-09-013, Carnegie Mellon University

[9] Kristy E. Boyer, William Lahti, Robert Phillips, Michael Wallis, Mladen Vouk and James Lester (2009). An Empirically-Derived Question Taxonomy for Task-Oriented Tutorial Dialogue. In Proceedings of The 2nd Workshop on Question Generation.

[10] Ming Liu, Rafael A. Calvo, Vasile Rus. G-Asks: An Intelligent Automatic Question Generation System for Academic Writing Support.


Software Localization and Malayalam Computing

Sreejith C
M. Tech Computational Linguistics, Govt. Engineering College, Palakkad

"Enabling computers to understand human language is one of the major challenges in the field of technology."

Extending your global reach is challenging, and when it comes to software, application quality and tight release deadlines add to the complexity. The English language is sometimes described as the lingua franca of computing. In comparison to other sciences, where Latin and Greek are the principal sources of vocabulary, Computer Science borrows more extensively from English. Due to the technical limitations of early computers, and the lack of international standards on the Internet, computer users were limited to using English and the Latin alphabet. However, this historical limitation is less present today. Most software products are localized in numerous languages, and the use of the Unicode character encoding has resolved problems with non-Latin alphabets. Some limitations have only been changed recently, such as with domain names, which previously allowed only ASCII characters.

In computing, internationalization and localization are means of adapting computer software to different languages, regional differences and technical requirements of a target market. Internationalization is the process of designing a software application so that it can be adapted to various languages and regions without engineering changes. Localization is the process of adapting internationalized software for a specific region or language by adding locale-specific components and translating text. Hence there is a steady move in the area of computing from the global English language to local languages.

Software Localization

Software localization is the process of adapting a software product to the linguistic, cultural and technical requirements of a target market. Software localization is more than the translation of a product's user interface. Companies localize software in order to overcome cultural barriers so that their products can reach a much larger target audience. Software localization is the translation and adaptation of a software or web product, including the software itself and all related product documentation. Traditional translation is typically an activity performed after the source document has been finalized.


Software localization projects, on the other hand, often run in parallel with the development of the source product to enable simultaneous shipment of all language versions. For example, the translation of software strings may often start while the software product is still in the beta phase.

Translation is only one of the activities in a localization project. Other tasks are involved, such as project management, software engineering, testing and desktop publishing.

A software product that has been localized properly has the look and feel of a product originally written and designed for the target market. In addition to the language, a number of points have to be considered in order to effectively localize a software product or website: measuring units, number formats, address formats, time and date formats (long and short), paper sizes, fonts, default font selection, case differences, character sets, sorting, word separation and hyphenation, local regulations, copyright issues, data protection, payment methods, currency conversion, taxes.

The standard localization process includes the following basic steps:

● Analysis of the material received and evaluation of the tools and resources required for localization
● Cultural, technical and linguistic assessment
● Creation and maintenance of terminology glossaries
● Translation to the target language
● Adaptation of the user interface, including resizing of forms and dialogs, as required
● Localization of graphics, scripts or other media containing visible text, symbols, etc.
● Compilation and build of the localized files for testing
● Linguistic and functional quality assurance
● Project delivery
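To make the split between internationalized code and locale-specific resources concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `LOCALES` table, its keys, the date patterns and the helper names are hypothetical, and the single Malayalam string is only a sample translation. Production software would normally use an established catalogue mechanism (for example gettext message catalogues) and locale data such as CLDR instead of a hand-written table.

```python
from datetime import date

# Hypothetical locale resources: the code below never changes, only this table grows.
LOCALES = {
    "en_US": {
        "strings": {"welcome": "Welcome", "save": "Save"},
        "date_format": "%m/%d/%Y",
        "decimal_sep": ".",
        "group_sep": ",",
        "grouping": (3, 3),   # thousands grouping: 1,234,567
    },
    "de_DE": {
        "strings": {"welcome": "Willkommen", "save": "Speichern"},
        "date_format": "%d.%m.%Y",
        "decimal_sep": ",",
        "group_sep": ".",
        "grouping": (3, 3),   # 1.234.567
    },
    "ml_IN": {
        "strings": {"welcome": "സ്വാഗതം"},   # "save" is deliberately left untranslated
        "date_format": "%d/%m/%Y",           # assumed pattern for this sketch
        "decimal_sep": ".",
        "group_sep": ",",
        "grouping": (3, 2),   # Indian grouping: 12,34,567
    },
}


def translate(key: str, locale: str, fallback: str = "en_US") -> str:
    """Look up a UI string, falling back to the source language if untranslated."""
    strings = LOCALES[locale]["strings"]
    return strings.get(key, LOCALES[fallback]["strings"][key])


def format_number(value: float, locale: str) -> str:
    """Format a non-negative number with the locale's separators and grouping."""
    res = LOCALES[locale]
    whole, frac = f"{value:.2f}".split(".")
    first, rest = res["grouping"]
    groups, size = [], first
    while len(whole) > size:
        groups.insert(0, whole[-size:])
        whole = whole[:-size]
        size = rest
    groups.insert(0, whole)
    return res["group_sep"].join(groups) + res["decimal_sep"] + frac


def format_date(day: date, locale: str) -> str:
    """Format a date using the locale's short date pattern."""
    return day.strftime(LOCALES[locale]["date_format"])


if __name__ == "__main__":
    for loc in LOCALES:
        print(loc, translate("welcome", loc), translate("save", loc),
              format_number(1234567.891, loc), format_date(date(2013, 5, 28), loc))
```

Adding a new language in this design means adding one entry to the table, which is the "without engineering changes" property that internationalization aims for.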

Malayalam as classical language

After nearly three years of deliberation at various levels, the Union Cabinet on Thursday, May 23, 2013 declared Malayalam a classical language. This is welcome news for bringing together governments, academia, research institutions, developers and industry associations on a common ground for promoting local language computing. Securing the classical language tag will bring benefits, including a flow of resources and the recognition of scholars and writers of eminence in Malayalam through awards. The benefits extended to Classical Languages include two major annual international awards for scholars of eminence and the setting up of a Centre of Excellence for Studies in Classical Languages. This will also lead to more work in the area of Malayalam computing.

Thunchath Ezhuthachan Malayalam University and the Kerala State IT Mission have embarked on a journey to give a fillip to Malayalam computing and research. The aim is to provide a common platform for existing isolated works and set a benchmark for future research, which until now was a missing link in promoting Malayalam computing. There are also several other groups and institutes working in this area, such as:

Swathanthra Malayalam Computing

Swathanthra Malayalam Computing (SMC) [3] is a free software collective engaged in the development, localization, standardization and popularization of various Free and Open Source Software in the Malayalam language. The slogan of the organization translates to "My language for/on My Computer". SMC has been active since October 2002 and has been working to provide Malayalam language tools that work on all layers of computing, including but not limited to rendering fixes, fonts, input mechanisms, translations (localization), text-to-speech engines, dictionaries, spell checkers and other tools specific to Indic-script language computing across operating systems. They are the upstream for Malayalam fonts and tools for popular GNU/Linux based operating systems such as Fedora and Debian. They also maintain localizations for popular Free Software desktops (GNOME/KDE) and popular applications such as Firefox and Libre Office.

Virtual Resource Centre For Language Computing

VRC-LC [4] is a research and project lab of IIITM-K set up to promote and strengthen local language with the support of Information Technology. The VRC-LC web portal will act as the repository of information about the research and projects related to language computing and the software tools and standards that enable localization on a computer. The work mainly concentrates on the Malayalam language, to address the linguistic barrier faced by our people when using computers and to enable the common man to use various e-Governance projects.


Simple groups Computational Linguistics Lab @ GEC Sreekrishnapuram

Students' Innovation in Morphology Phonology and Language Engineering (abbreviated as SIMPLE) [5] is the official group of M.Tech Computational Linguistics students at Govt. Engineering College, Palakkad. Computational linguistics (CL) is a discipline between linguistics and computer science which is concerned with the computational aspects of human language processing. It belongs to the cognitive sciences and overlaps with the field of artificial intelligence (AI), a branch of computer science that aims at computational models of human cognition. As the name indicates, SIMPLE is a platform for showcasing innovations, ideas and activities in the field of Computational Linguistics. The SIMPLE group is also working to promote and strengthen local language with the support of Information Technology and computationally driven statistical approaches. The Malayalam computing activities of Simple groups include Malayalam indexing, a POS tagger, a subject-object identifier, a spell checker, lemmatization, question answering systems, etc.

India is a multilingual country, with 22 official languages and 12 scripts. In India only about 5% of people know English, and the rest are deprived of the benefits of information technology and development. The benefits of information technology can reach the common man only when software tools and human-machine interface systems are available in their own languages. To enable wide proliferation in Indian languages, tools, products and resources should be freely available to the public. Thus, in a multilingual country like India, the scope of localization is enormous, and much more effort is needed in this area. Malayalam language and Malayalam computing should also be encouraged, and more work should be proposed and initiated. India is thus poised to emerge as a Multilingual Computing hub.

Reference

[1] http://malayalam.kerala.gov.in
[2] http://www.sdl.com/technology/language-technology/what-is-software-localization.html
[3] http://smc.org.in/
[4] http://www.iiitmk.ac.in/vrclc/en/index.html
[5] http://www.simplegroups.in
[6] http://tdil.mit.gov.in/


Extracting Precise Answers Using Question Answering System

K. Subalokshini
M.Tech, CSE, Periyar Maniammai University, Vallam
suba.candy@gmail.com

R. Poonguzhali
Assistant Professor (SS), CSE, Periyar Maniammai University, Vallam
rpg_pmu@rediffmail.com

Question Answering Systems provide answers to users' questions in a concise form that fulfills the expectation of the user. A question answering system is based on keyword search, which is similar to Web search. The Question Answering System should be able to provide answers to the user's questions in a user-friendly way. Judging the correctness of the answer is an important issue in the field of question answering. In this paper, question classification is one of the heuristics for answer validation. Question classification is used to determine the type of question. This paper focuses on context-based retrieval of information. It provides an efficient method for extracting exact textual answers from the documents returned by a traditional IR system over a large-scale collection of texts.

Introduction

The World Wide Web is the major source of information for everyone. All kinds of information are available on the World Wide Web in one form or another. Managing such a huge volume of data is not an easy task. Search engines like Google and Yahoo return links to documents for the user query. Most often, the web pages retrieved by these search engines do not provide precise information and may contain irrelevant information even in the top-ranked results. This makes users look for an alternative information retrieval system that can answer their queries in succinct form. Question Answering Systems, unlike other information retrieval systems, combine question classification, information retrieval, and information extraction techniques to present precise answers to user questions posed in a natural language.

Natural Language Processing (NLP) is the computerized approach to analyzing text that is based on both a set of theories and a set of technologies. Because it is a very active area of research and development, there is not a single agreed-upon definition that would satisfy everyone, but there are some aspects which would be part of any knowledgeable person's definition. The definition is: Natural Language Processing is a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of tasks or applications. The goal of NLP, as stated above, is to accomplish human-like language processing.


A typical pipeline Question Answering System consists of different phases: question classification, question processing, document processing, answer processing and answer extraction.

Background and Related Work

NLP techniques are used in applications that make queries to databases, extract information from text, retrieve relevant documents from a collection, translate from one language to another, generate text responses, or recognize spoken words and convert them into text.

A common feature of NLP systems is that they convert text input into a formal representation of meaning such as logic (first order predicate calculus), semantic networks, conceptual dependency diagrams, or frame-based representations [1]. NLP-based question answering systems (QASs) may utilize machine learning to improve their syntax rules [1], lexicon [4], semantic rules [5], or the world model [4].

Early QA systems, e.g. the 1960s' Intelligent Question-Answering Systems by Coles et al. [3], focused on how to eliminate the semantic ambiguity of questions using artificial intelligence (AI), and evolved into expert systems [2].

There are two main approaches to question classification: manual and automatic. Question Answering Systems using manual classification (Hermjakob, 2001) apply hand-crafted rules to identify expected answer types. These rules may be very accurate, but they are time consuming, tedious, and non-extendible in nature.

Automatic question classification is further divided into two main approaches, known as machine learning and language modelling. The primary machine learning algorithm used for question classification is the Support Vector Machine (SVM). One reported approach employed a tree kernel with an SVM classifier for question classification and achieved 80.2% accuracy without the use of syntactic or semantic features. Language modelling based question classification approaches try to compute the probability of the question for a given question class.

The use of context in information retrieval systems has been extensively studied. Recently, the effectiveness of spatial and temporal contextual clues [8], category labels [9], and top-ranking related sentences [6] has been explored empirically through user studies in a Web environment. Furthermore, the Interactive Track at TREC has generated interest in interface issues associated with information retrieval systems. [7] compared a single-document and a multi-document view of IR results for a question answering task.

Methodology

A. Question Classification

Question classification is used to determine the question type. The question type gives a clear view of the expected answer type. With the help of the question type, it is easier to retrieve the answer from a large collection of documents. There are different question types. Some of them are listed below.


Who/Whose/Whom Questions: Questions falling under this category usually ask about an individual or an organization.
Example: Who wrote "Thirukural"?

Why Questions: "Why" questions always ask for reasons or explanations.
Example: Why do heavier objects travel downhill faster?

How Questions: "How" questions follow two patterns. For the first pattern the expected answer is a description of some process, while the second pattern expects a number as the answer.
Example: How does data travel on the internet? How many states are there in India?

When Questions: "When" questions start with the keyword "When" and usually ask for a date or time.
Example: When was Lincoln born?

Where Questions: "Where" questions start with the keyword "Where" and usually relate to a location. It may be a mountain, a geographical boundary, a man-made location such as a temple, or some virtual or fictional place.
Example: Where is the Taj Mahal?

Which Questions: "Which" questions start with the keyword "Which", and the expected answer is usually tied to the noun phrase associated with it in the question.
Example: Which is the best laptop?

What Questions: "What" questions have several types of patterns and can ask for virtually anything. Many "What" questions are disguised in the form of "Functional Word Questions".
Example: What is Android?

Functional Word Questions: All non-Wh questions (except "How") fall under the category of Functional Word Questions. Functional word questions usually start with non-significant verb phrases.
Example: List the properties of acids.

Figure 1. System Flow Diagram

B. Keyword Extraction:

Keyword extraction keeps only the keywords of the user's question by removing the stop words and stemming the remaining words. With the keywords in hand it is much easier to extract the answers. The keywords extracted from the question are then used as the basis for extracting answers from the available online resources. Syntax parsing tools can be used to obtain the keywords.
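The question types and the keyword extraction step described in this section can be illustrated with a small sketch. It is only a rule-based approximation under stated assumptions: the stop-word list, the function names, and the "quantity"/"process" labels for the two "How" patterns are hypothetical, and stemming is left out for brevity.

```python
import re

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "were", "do", "does",
              "did", "of", "in", "on", "to", "for", "by", "and", "or"}

# Leading keyword -> question type, following the categories in the text.
WH_TYPES = {
    "who": "Who/Whose/Whom", "whose": "Who/Whose/Whom", "whom": "Who/Whose/Whom",
    "why": "Why", "how": "How", "when": "When",
    "where": "Where", "which": "Which", "what": "What",
}


def tokenize(question):
    """Lower-case the question and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", question.lower())


def classify_question(question):
    """Return the question type based on the first token of the question."""
    tokens = tokenize(question)
    if not tokens:
        return "Unknown"
    first = tokens[0]
    if first == "how":
        # Second "How" pattern: the expected answer is a number.
        if len(tokens) > 1 and tokens[1] in {"many", "much"}:
            return "How (quantity)"
        return "How (process)"
    # Anything that does not start with a Wh-word is a functional word question.
    return WH_TYPES.get(first, "Functional Word")


def extract_keywords(question):
    """Drop Wh-words and stop words; what remains are the keywords."""
    return [t for t in tokenize(question)
            if t not in WH_TYPES and t not in STOP_WORDS]
```

For instance, `classify_question("When was Lincoln born?")` gives the type "When", and `extract_keywords` on the same question keeps only "lincoln" and "born". A real system would replace the toy stop-word list with a fuller one and add a stemmer.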



C. Information Retrieval:

Information retrieval is the activity of obtaining information relevant to an information need from a collection of information resources. Here it is used to get information related to the question asked by the user. The information can be retrieved from many online resources such as Google and Wikipedia. The keywords obtained earlier help to retrieve the needed information from the available online resources.

D. Collecting Frequent Item Sets

Collecting frequent item sets helps to identify whether the keywords occur frequently in a document or not, so that it is easy to identify the most relevant documents.

E. Answer Extraction:

Answer extraction is used to get a succinct answer to the given question. From the top N relevant documents or passages returned by information retrieval, answer extraction performs a detailed analysis and pinpoints the answer to the question. Usually answer extraction produces a list of answer candidates and ranks them according to some scoring function.

IV Conclusion and Future Work

This paper summarizes the categories of QA systems and also helps us to understand the types of question classification. Question answering is one of the hot spots in natural language processing. Compared with a traditional keyword-based search engine, a question answering system allows users to ask questions in natural language.

The goal of a question answering system is to retrieve answers to questions rather than full documents or best-matching passages, as most information retrieval systems do. Our method takes advantage of the redundancy of answers across streams, which significantly reduces the number of incorrect answers presented to the user, and the question answering system gives a succinct answer to the user's question in natural language.

REFERENCES

[1] H. Feili, Natural Language Processing Projects, [PowerPoint] Sharif UT, Tehran, Iran, 2003.
[2] Robert F. Simmons, Answering English questions by computer: a survey, Communications of the ACM, Vol. 8, No. 1, pp. 53-70, Jan. 1965.
[3] L. Stephen Coles, An on-line question-answering systems with natural language and pictorial input, Proceedings of the 23rd ACM national conference, Princeton, ACM, August 1968.
[4] A. Kirschenbaum, S. Wintner, "Minimally supervised transliteration for machine translation", Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09), April 2009.
[5] E. Sneiders, Automated Question Answering: Template-Based Approach, PhD thesis, Stockholm University / KTH press, Sweden, 2002.
[6] White, R., Ruthven, I., and Jose, J. Finding relevant documents using top ranking sentences: An evaluation of two alternative schemes. In Proceedings of SIGIR 2002.



[7] Belkin, N., Keller, A., Kelly, D., Carballo, J., Sikora, C., and Sun, Y. Support for question-answering in interactive information retrieval: Rutgers' TREC-9 interactive track experience. In Proceedings of TREC-9, 2000.
[8] Park, J. and Kim, J. Effects of contextual navigation aids on browsing diverse web systems. In Proceedings of CHI 2000.
[9] Dumais, S., Cutrell, E., and Chen, H. Optimizing search by showing results in context. In Proceedings of CHI 2001.
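As an illustration of steps D and E of the methodology above, the following sketch ranks documents by how often the question keywords occur in them and then ranks candidate answer sentences by keyword overlap. It is a simplified stand-in: plain keyword counting is used instead of a true frequent-itemset miner, and the function names and the overlap score are assumptions made for this example, not the authors' implementation.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_frequency(document, keywords):
    """Total number of times any question keyword occurs in the document."""
    counts = Counter(tokenize(document))
    return sum(counts[k] for k in keywords)


def top_documents(documents, keywords, n=2):
    """Step D (simplified): keep the n documents richest in keywords."""
    ranked = sorted(documents,
                    key=lambda d: keyword_frequency(d, keywords),
                    reverse=True)
    return ranked[:n]


def rank_answer_candidates(documents, keywords):
    """Step E (simplified): score each sentence by its keyword overlap."""
    keyword_set = set(keywords)
    candidates = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            score = len(set(tokenize(sentence)) & keyword_set)
            if score:
                candidates.append((score, sentence.strip()))
    # Highest-scoring sentences first; ties keep their original order.
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    return [sentence for _, sentence in ranked]
```

With the keywords "lincoln" and "born", a document containing the sentence "Lincoln was born in 1809." would be ranked first and that sentence would head the candidate list. A fuller system would add answer-type checks based on the question type before returning the top candidate.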

“When a language dies, a way of understanding the world dies with it, a way of looking at the world.” - George Steiner



Human Language Processing: Biological Perspective
Priyesh Sankar
MBBS Student
Government Medical College Kozhikode

"Communication is truly a multisensory experience. For most individuals, the pathway from creating sound (speaking) to receiving, processing, and interpreting sound (hearing) is critical."

1. Introduction

Sound offers us a powerful means of communication. Our sense of hearing enables us to experience the world around us through sound. Because our sense of hearing allows us to gather, process, and interpret sounds continuously and without conscious effort, we may take this special sense of communication for granted. But did you know that

- Human communication is multisensory, involving visual, tactile, and sound cues?
- The range of human hearing, from just audible to painful, is over 100-trillion-fold?
- Tiny specialized cells in the inner ear, known as hair cells, are responsible for converting the vibrational waves of sound into electrical signals that can be interpreted by the brain?
- Tinnitus, commonly known as "ringing in the ears," is actually a problem that originates in the brain?

Contemporary hearing research is guided by lessons learned from sensory research, namely that specialized nerve cells respond to different forms of energy (mechanical, chemical, or electromagnetic) and convert this energy into electrochemical impulses that can be processed by the brain. The brain then works as the central processor of sensory impulses. It perceives and interprets them using a "computational" approach that involves several regions of the brain interacting all at once. This notion is different from the long-held view that the brain processes information one step at a time in a single brain region. Over the past decade, scientists have begun to understand the intricate mechanisms that enable the ear to convert the mechanical vibrations of sound to electrical energy, thereby allowing the brain to process and interpret these signals.

Scientific understanding of the role of genes in hearing is also increasing at an impressive rate. The first gene associated with hearing was isolated in 1993. By the end of 2000, more than 60 genes related to hearing had been identified. In addition, scientists have pinpointed over 100 chromosomal regions believed to harbor genes affecting the hearing pathway. Many genes were first isolated in the mouse, and from this, the human genes were identified. Completion of the Mouse and Human Genome Projects is helping scientists isolate these genes.

The rapid growth in our understanding is of more than academic interest. In a practical sense, sharing this information with young people can enable them to adopt a lifestyle that promotes the long-term health of their sense of hearing. With this in mind, this supplement will address several key issues, including

- What is the nature of sound?
- What mechanism allows us to process sounds with great precision, from the softest whisper to the roar of a jet engine, from a high-pitched whistle to a low rumble?
- What are the roles of hearing, processing, and speaking in human communication?
- What happens when the hearing mechanism is altered or damaged? How does sound processing change?
- What can be done to prevent or accommodate damage to our sense of hearing?

2. Language Processing

The language center of the brain (Wernicke's area in the dominant temporal lobe) is the "dictionary" of the brain, translating words into concepts and concepts into words. Wernicke's area has input from auditory and visual areas of the brain, which makes sense. In essence, Wernicke's area hears speech and then translates those sounds into words that have abstract meaning. Our language cortex can recognize a limited set of speech sounds, or components (called percepts). We learn these in the first four years of our life from hearing speech, and then the "language window" closes and we are limited to those sounds for the rest of our lives. Anything we hear in the context of speech after that will be sorted into one of the pre-existing percepts.

3. Major Concepts Related to Hearing and Communication

3.1 Communication is multisensory

Communication with others makes use of sound and vision.

Although some people might define communication as an interaction between two or more living creatures, it involves much more than this. For example, we are constantly receiving information from, and changing our relationship with, our environment. This communication is received through our senses of smell, taste, touch, vision, and hearing. Communication with others makes use of vision (making eye contact or assessing body language) and sound (using speech or other sounds, such as laughing and crying). When a group of people shares a need or desire to communicate, language is born. The most common human language is the language of words. Words may be communicated in various ways. Although they are usually spoken, they also may be written, finger spelled, or expressed through sign language. Words may be communicated by writing, speaking, and signing.

3.2 Language acquisition: imprinting and critical periods

Our brains have specific regions devoted to speech, hearing, and language functions.

Since the time of Plato, there has been debate over the nature of language. Some believe that language is inborn and purposeful, while others believe it to be artificial and arbitrary. Some consider language to be an evolutionary product, while others do not. It appears that words are not "built into" the brain, because language is a relatively recent evolutionary development and also because languages differ substantially from one another. Language and communication are made possible by specialized structures. We have evolved a sophisticated apparatus for both speech and hearing. Our brains have specific regions devoted to speech, hearing, and language functions. Still, the mechanisms by which children acquire language are only partially understood.

There are two concepts important to the acquisition of language. One is imprinting, which refers to the ability of some animals to learn rapidly at a very early age and during a well-defined period in their development. Imprinting generally refers to the ability of offspring to acquire the behaviors characteristic of their parents. This process, once it occurs, is not reversible.

A second concept, related to imprinting, is critical periods. A nonhuman example of a critical period is the limited time frame within which a male bird must acquire his song. For instance, a male white-crowned sparrow usually begins singing his full song between 100 and 200 days of age. Proper song acquisition is needed for mating and for marking territory. However, to learn his song, the young bird must be exposed to an adult bird's song consistently and frequently between one week and two months after hatching.

Very soon after birth, human infants learn to distinguish speech sounds from other types of sound. Within the next month or two, the infant learns to distinguish between different speech sounds. An 18-month-old toddler can recognize and use the sounds (called phonemes) of his or her language and can construct two-word phrases. A 3½-year-old child can construct nearly all of the possible sentence types. From this point on, vocabulary and language continue to expand and be refined.

3.4 Perception of sound has a biological basis

When sound, as vibrational energy, arrives at the ear, it is processed in a complex but distinct series of steps. These steps reflect the anatomical division of the ear into the outer ear, middle ear, and inner ear.

Figure: Anatomy of the human ear.

The pathway from the outer ear to the inner ear is remarkable in its ability to precisely process sounds from the very softest to the very loudest and to distinguish very small changes in the frequency of sound (pitch). Humans can discern a difference in frequency of just 0.1 percent. This means that humans can tell the difference between sounds at frequencies of 1,000 Hz and 1,001 Hz.

The outer ear. The outer ear is composed of two parts. The pinna is the outside portion of the ear and is composed of skin and cartilage. The second part is called the ear canal (also called the external auditory canal). The pinna, with its twists and folds, serves to enhance high-frequency sounds and to focus sound waves into the middle and inner portions of the ear. The pinna also helps us determine the direction from which a sound originates. However, the greatest asset in judging the location of a sound is having two ears. Because one ear is closer to the source of a sound than the other, the brain detects slight differences in the times and intensities of the arriving signals. This allows the brain to approximate the sound's location. Interestingly, the position and orientation of the pinna, at the side of the head, help reduce sounds that originate behind us. This helps us hear sounds that originate in the direction we are looking and reduces distracting background noises.

Some students (and adults) may believe that the size of the ear is an indication of the organism's hearing ability, that is, the larger the ear, the better the ability to hear. This misperception doesn't take into account the internal structures of the ear that process sound vibrations. A large pinna may serve a function that is unrelated to hearing. For example, the external ear of the African elephant is filled with small blood vessels that help the animal dissipate excess heat. The external ear may be specialized in other ways, as well. Cat owners, for example, have undoubtedly observed the rather dramatic movement of their pet's pinnae as the animal attempts to locate the source of a sound.

The cochlea is divided into an upper chamber, called the scala vestibuli or vestibular canal, and a lower chamber, called the scala tympani or tympanic canal. These are seen most easily if the cochlea is represented as uncoiled. Both the upper and lower chambers are filled with a fluid, called perilymph, which is nearly identical to spinal fluid. The stapes vibrates against the oval window, creating fluid vibrations that are transmitted as pressure waves all the way through the cochlea. As represented by the arrows in the figure, these waves move from the upper chamber to the lower chamber, to the round window. The round window allows the release of the hydraulic pressure caused by vibration of the stapes in the oval window. Additionally, the diameter of the chambers decreases from base (closest to the windows) to apex.

“All communication involves faith; indeed, some linguisticians hold that the potential obstacles to acts of verbal understanding are so many and diverse that it is a minor miracle that they take place at all.” - Terry Eagleton



ClearTK: Can I be a Competitor for NLTK and Stanford CoreNLP?
Robert Jesuraj K
M.Tech Computational Linguistics
Government Engineering College, Sreekrishnapuram

ClearTK is a toolkit for developing statistical natural language processing components in Java and is based on the Apache Unstructured Information Management Architecture (UIMA) framework for text analysis. It is developed by the Centre for Computational Language and Education Research (CLEAR) at the University of Colorado at Boulder. The overall size of ClearTK (cleartk-release-1.4.1-bin) is 177 MB.

Features:

- A common interface and wrappers for popular machine learning libraries such as SVMlight, LIBSVM, OpenNLP MaxEnt, and Mallet.
- A rich feature extraction library that can be used with any of the machine learning classifiers. Under the covers, ClearTK understands each of the native machine learning libraries and translates your features into a format appropriate to whatever model you're using.
- Infrastructure for creating NLP components for specific tasks such as part-of-speech tagging, BIO-style chunking, named entity recognition, semantic role labeling, temporal relation tagging, etc.
- Wrappers for common NLP tools such as the Snowball stemmer, the OpenNLP tools, the MaltParser dependency parser, and the Stanford CoreNLP tools.
- Corpus readers for collections like the Penn Treebank, ACE 2005, CoNLL 2003, Genia, TimeBank and TempEval.

Most of ClearTK is distributed under the BSD license. However, there are a couple of subprojects that are licensed under the GPL license because they depend on GPL-licensed third-party libraries. ClearTK can be used to achieve state-of-the-art performance on biomedical part-of-speech tagging.

UIMA provides a set of interfaces for defining components for analyzing unstructured information and provides infrastructure for creating, configuring, running, debugging, and visualizing these components. ClearTK, however, focuses on UIMA's ability to process textual data. All components are organized around a type system which defines the structure of the annotations that can be associated with each document. This information is instantiated in a data structure called the Common Analysis Structure (CAS). There is one CAS per document, which all components that act on the document can access and update. Every annotation that is created is posted to the CAS, which is then made available for other UIMA components to use and modify. Here is a short list of the most important kinds of components:

- Collection Reader: a component that reads in documents and initializes the CAS with any available annotation information.
- Analysis Engine: a component that performs analysis on the document and adds annotations to the CAS or modifies existing ones.
- CAS Consumer: a component that processes the resulting CAS data (e.g. writes annotations to a database or a file).
- Collection Processing Engine (CPE): an aggregate component that defines a pipeline that typically consists of one collection reader, a sequence of analysis engines, and one or more CAS consumers.

While UIMA provides a solid foundation for processing text, it does not directly support statistical NLP. ClearTK provides a framework for creating UIMA components that use statistical learning as the foundation for decision making and annotation creation.

Statistical NLP in ClearTK

ClearTK was designed and implemented with special attention given to creating reusable and flexible code for performing statistical NLP. As such, the library provides classes that facilitate extracting features, generating training data, building classifiers, and classifying annotations.

ClearTK introduces classifier annotators, which are analysis engines that perform feature extraction, classify the extracted features using a machine learning model, and interpret the results of the classification by e.g. labelling annotations or creating new annotations. A classifier annotator can also be run in training mode, in which it performs feature extraction and then writes out training data which is then used for building a model.

Feature Extraction

The ClearTK feature extraction library is highly configurable and easily extensible. Each feature extractor produces a feature or set of features for a given annotation (or pair or collection of annotations, as the feature extractor requires) for the purpose of characterizing the annotation in a machine learning context. A feature in ClearTK is a simple object that contains a value (i.e. a string, boolean, integer, or float value), a name, and a context that describes how the feature value was extracted. Most features are created by querying the CAS for information about existing annotations. Because features are typically many in number, short lived, and dynamic in nature (i.e. features often derive from previous classifications), they are not represented in the CAS but rather as simple Java objects.

The spanned text extractor is a very simple example of a feature extractor that takes an annotation and returns a feature corresponding to the covered text of that annotation. The type path extractor is a slightly more complicated feature extractor that extracts features based on a path that describes the location of a value defined by the type system with respect to the annotation type being examined.

For example, the figure below shows a simple hypothetical type system. A type path extractor initialized with the path headword/partOfSpeech can extract features corresponding to the part-of-speech of the head word of examined constituents.

A much more sophisticated feature extractor is the window feature extractor. It operates in conjunction with a simple feature extractor (such as the spanned text extractor or type path extractor) and extracts features over some numerically bounded and oriented range of annotations (e.g. five tokens to the left) relative to a focus annotation (e.g. a named entity annotation or syntactic constituent) that are within some window annotation (e.g. a sentence or paragraph annotation). The "featured" annotations, the focus annotation, and the window annotation are all configurable with respect to the type system. This allows the window feature extractor to be used in a wide array of contexts. The window feature extractor also handles boundary conditions such that e.g. words appearing outside the sentence that the focus annotation appears in would be considered as "out-of-bounds." This feature extractor allows one to extract features such as:

- The three part-of-speech tags to the left of a word.
- The part-of-speech tag of the head word of constituents to the right of an annotation.
- The identifiers of recognized concepts to the left of an annotation.
- The penultimate word of a named entity mention annotation.
- The last three letters of the first two words of a named entity mention annotation.
- The lengths of the previous 10 sentences.

A feature extractor is any class that generates feature objects. For example, the window extractor has a method that takes a focus annotation (e.g. a word) and a window annotation (e.g. a sentence) and produces features relative to these two annotations according to how the feature extractor was initialized. Many feature extractors implement an interface that designates them as simple feature extractors, which allows them to be used by more complicated feature extractors such as the window extractor. It is the responsibility of the classifier annotator to know how to initialize feature extractors and how to call them.

NLP Components in ClearTK

ClearTK provides a growing library of UIMA components that support a variety of NLP tasks. The library consists of three main types of components: collection readers, analysis engines, and classifier annotators, which are summarized in the table below.
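The idea behind the window feature extractor described in this article can be sketched in a few lines. The example below is a language-neutral illustration in Python, not ClearTK's Java API: the token dictionaries, the function names, the feature-name format, and the "OOB" placeholder value are all assumptions made for this sketch.

```python
OUT_OF_BOUNDS = "OOB"  # hypothetical placeholder for positions outside the window


def spanned_text(token):
    """Simple extractor: the covered text of a token."""
    return token["text"]


def part_of_speech(token):
    """Simple extractor: the part-of-speech tag of a token."""
    return token["pos"]


def window_features(tokens, focus, window, simple_extractor,
                    size, direction="left", name="feat"):
    """Apply a simple extractor to `size` tokens left or right of the focus.

    tokens  -- list of token dicts for the whole document
    focus   -- index of the focus token
    window  -- (start, end) half-open index range, e.g. the enclosing sentence
    """
    start, end = window
    features = []
    for offset in range(1, size + 1):
        i = focus - offset if direction == "left" else focus + offset
        # Positions outside the window annotation are reported as out-of-bounds,
        # even if the document itself has a token there.
        if start <= i < end:
            value = simple_extractor(tokens[i])
        else:
            value = OUT_OF_BOUNDS
        features.append((f"{name}_{direction}{offset}", value))
    return features
```

Taking the tokens of "The cat sat ." followed by the first token of the next sentence, with "sat" as the focus and the first sentence as the window, the three part-of-speech tags to the left come out as the tag of "cat", the tag of "The", and one out-of-bounds value. Passing the simple extractor as a parameter mirrors the way the window extractor is described as working in conjunction with simple feature extractors.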



Component              Description

Penn Treebank reader   Reads the Penn Treebank corpus
PropBank reader        Reads the PropBank corpus
ACE2005 reader         Reads in named entity mentions from the ACE 2004 and 2005 tasks
CoNLL2003 reader       Reads in named entity mentions from the CoNLL 2003 task
GENIA reader           Reads in the GENIA corpus
Tokenizer              Penn Treebank style tokenizer
Sentence detector      Wrapper around the OpenNLP sentence detector
Syntax parser          Wrapper around the OpenNLP syntax parser
Stemmer                Wrapper around the Snowball stemmer
Gazetteer annotator    Finds mentions of entries in a gazetteer using simple string matching
POS tagger             Performs part-of-speech tagging
BIO chunker            Performs BIO-style chunking
Predicate annotator    Identifies predicates
Argument annotator     Identifies and classifies semantic arguments of predicates
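BIO-style chunking, listed in the table above, labels each token as Beginning, Inside, or Outside a chunk. As a rough illustration of what such labels encode, the sketch below groups a labelled token sequence back into chunks. It is a minimal Python example under stated assumptions: the function name and the "B-type"/"I-type"/"O" label format are chosen for this sketch and are not taken from ClearTK's Java classes.

```python
def bio_to_chunks(tokens, labels):
    """Group tokens into (chunk_type, text) pairs from BIO labels."""
    chunks = []
    current_type, current_tokens = None, []

    for token, label in zip(tokens, labels):
        # "B-person" -> ("B", "person"); "O" -> ("O", "")
        prefix, _, chunk_type = label.partition("-")

        starts_new = prefix == "B" or (prefix == "I" and chunk_type != current_type)
        if starts_new:
            # Close any open chunk, then open a new one with this token.
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = chunk_type, [token]
        elif prefix == "I":
            # Continue the chunk that is currently open.
            current_tokens.append(token)
        else:
            # "O": the token is outside any chunk, so close the open one.
            if current_tokens:
                chunks.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    if current_tokens:
        chunks.append((current_type, " ".join(current_tokens)))
    return chunks
```

For the tokens "Abraham Lincoln visited New York City" labelled B-person, I-person, O, B-location, I-location, I-location, the function recovers one person chunk and one location chunk. Treating an "I" label whose type differs from the open chunk as the start of a new chunk is a design choice of this sketch for tolerating inconsistent label sequences.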

The collection readers of particular interest provided by ClearTK are those that read in widely used annotated corpora such as the Penn Treebank or PropBank. The Penn Treebank reader reads constituent parse trees into the CAS such that the full syntactic parse of each sentence is represented and constituents and their relations can be retrieved. The PropBank reader extends this reader by layering on the predicate/argument structure provided by the PropBank corpus. There are also collection readers for reading in the ACE 2005 corpus and the CoNLL 2003 shared task data.

The analysis engines provided by ClearTK include a pattern-based tokenizer, a gazetteer annotator, and various wrappers around other NLP libraries. The tokenizer is based on Penn Treebank tokenization rules. The gazetteer annotator finds entries from a gazetteer in text using simple string matching. Other analysis engines include wrappers around the OpenNLP part-of-speech tagger, sentence detector, and syntax parser, and a wrapper around the Snowball stemmer.

ClearTK currently provides a small handful of classifier annotators: a part-of-speech tagger, a BIO-style chunker, and a pair of classifier annotators that support semantic role labelling. The BIO chunker performs text chunking using the popular Begin, Inside, Outside labelling scheme for classifying annotations as members of some kind of "chunk." For example, in named entity recognition, labels such as "B-person" or "I-location" are used for words that begin a person mention or are inside a location mention, respectively. The BIO chunker is used for named entity recognition, shallow parsing, and tokenization. Semantic role labelling is achieved by the predicate and argument annotators. The predicate annotator decides whether constituents of a syntactic parse are predicates or not. The argument annotator runs subsequently and finds the arguments of a predicate.

References:

1. http://code.google.com/p/cleartk/
2. Philip V. Ogren, Philipp G. Wetzler, and Steven Bethard, ClearTK: A UIMA toolkit for statistical natural language processing, Towards Enhanced Interoperability for Large HLT Systems: UIMA for NLP workshop at Language Resources and Evaluation Conference (LREC), 2008.


M.Tech Computational Linguistics
Department of Computer Science and Engineering
Details of Master Research Projects

Title: Opinion Mining

Name of Student: Ancy K Sunny

Abstract: Opinion mining can be performed with various methods and in various domains. The proposed system finds information about a product on the internet, extracts the sentences that express opinions, and identifies the features that are commented on. It then calculates the polarity of the overall opinions. The first task is to identify whether a collected sentence is subjective (opinionated) or objective. This phase uses a bootstrapping method which employs high-precision (and low-recall) classifiers to extract a number of subjective sentences. The labelled sentences are then fed to an extraction pattern learner, which produces a set of extraction patterns that are statistically correlated with the subjective sentences. These patterns are then used to identify more sentences within the un-annotated texts that can be classified as subjective. The next step is to extract the object features that have been commented on in each sentence. In the last phase the system finds the polarity of each opinion and summarizes the opinions on the same features. Adverb-adjective combinations are used to find the polarity of an opinion.

Tools: Python, NLTK, SentiWordNet

Place of Work: Govt. Engineering College, Sreekrishnapuram

Title: Discourse Analysis: Clustering Approach

Name of Student: Christopher Augustine

Abstract: Discourse analysis is concerned with the coherent processing of text segments larger than the sentence and assumes that this requires something more than just the interpretation of the individual sentences. While syntax and semantics work with sentence-length units, the discourse level of NLP works with units of text longer than a sentence. Several types of discourse processing can occur at this level, two of the most common being anaphora resolution and discourse structure recognition. A discourse usually concentrates on a group of nouns. Clustering the nouns, with appropriate boundary corrections, can segment a text at the discourse level.

Tools: Python

Place of Work: Govt. Engineering College, Sreekrishnapuram



Title: Word Sense Disambiguation in Malayalam

Name of Student: Mujeeb Rehman O

Abstract: A peculiarity of any language is that it may contain many ambiguous words. Word Sense Disambiguation (WSD) is the task of determining which of a word's senses is invoked in a particular context. A standard approach to WSD is to consider the context of the word's use, in particular the words that occur in some predefined neighbouring context. Like many other languages, Malayalam also has ambiguous words, which are called Nanarthas. This project adopts the Lesk algorithm for Malayalam word sense disambiguation, in other words for resolving Nanartha words.

Tools: Python

Place of Work: Govt. Engineering College, Sreekrishnapuram
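The Lesk algorithm named in the abstract above chooses the sense whose dictionary gloss shares the most words with the context. The sketch below shows a simplified variant of that idea. It is an illustration only: the project works on Malayalam, whereas this example uses a hypothetical two-sense English inventory for "bank", and the function names and glosses are invented for the sketch.

```python
import re


def tokenize(text):
    """Return the set of lower-cased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def simplified_lesk(context_sentence, sense_glosses, stop_words=frozenset()):
    """Pick the sense whose gloss overlaps most with the context words.

    sense_glosses -- dict mapping a sense label to its gloss (definition)
    """
    context = tokenize(context_sentence) - stop_words
    best_sense, best_overlap = None, -1
    for sense, gloss in sense_glosses.items():
        overlap = len(context & (tokenize(gloss) - stop_words))
        # Strict comparison: on a tie the first listed sense is kept.
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


# Hypothetical toy sense inventory, for illustration only.
BANK_SENSES = {
    "bank_river": "sloping land beside a body of water such as a river",
    "bank_finance": "an institution that accepts deposits of money and gives loans",
}
```

Calling `simplified_lesk("He sat on the bank of the river and watched the water", BANK_SENSES)` selects the river sense because "river" and "water" appear in its gloss. For Malayalam, the glosses would have to come from a Malayalam lexical resource, which this sketch does not assume.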

Title: Dysarthric speech recognition and enhancement

Name of Student: Divya Das

Abstract: Dysarthria is a motor-neuro disorder. It impairs the functioning of the speech production system. Clinically it can be treated with medicines and speech therapy. Computer-based approaches can be used to improve the intelligibility of speech. Here the input to the computer-based system is disordered speech from dysarthric people. This paper deals with a speech enhancement method to improve the intelligibility of dysarthric speech. The Nemours database is used as the source of dysarthric speech. Mild dysarthric speech is used for the experiment. The directories BB, FB, LL, and MF of the Nemours database contain the speech of speakers with mild dysarthria. Praat is used for analyzing the dysarthric speech. The dysarthric speech is separated into voiced and unvoiced components. Speech enhancement is done on the voiced part, which carries most of the information. From this voiced speech, formant frequencies are extracted using the Burg algorithm. The extracted formant frequencies, especially F1 and F2, are then passed through a fourth-order high-pass filter. In dysarthric speech the formant frequencies show less variation than in normal speech, so applying high-pass filtering can introduce more variation into the dysarthric speech. In dysarthric speech the spectral slope is also smaller than that of normal speech. By applying high-pass filtering on the formant frequencies, the spectral slope can also be increased.

Tools: HTK, Praat, Matlab, Colea, Wavesurfer, P563, P862, Composite, fAI.

Place of Work: Amrita University, Coimbatore



Title Name of Student Abstract

Chronological News Summarization Divya S News articles are among the fastest-growing types of documents on the Internet, to the point that finding and recalling relevant news events has become a difficult task. News summarization aims to identify common information among multiple related news documents and fuse it into a coherent text to produce an abstract of a news event. The proposed system is intended to produce informative summaries that highlight the common and most relevant information found in news documents in a user-friendly manner. This will help Web users pinpoint the information they need without extensive reading. The system takes as input a cluster of news stories on the same event and produces a summary that synthesizes common information across the input stories. For a particular news event, the system collects all the related stories from a particular time stamp (the beginning of the news event) and produces the abstract using statistical approaches and natural language processing techniques. The summary is intended to contain all the relevant points of the news event from its start to date.

Tools

Python, NLTK, Hierarchical clustering, WordNet, Hadoop

Place of Work

Govt. Engineering College, Sreekrishnapuram
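
The statistical side of the Chronological News Summarization abstract can be illustrated with a small extractive sketch. It scores each sentence by how many stories in the cluster share its words, keeps the top-scoring sentences, and returns them oldest first. The scoring rule, the stopword list and all names here are assumptions for illustration; the abstract does not specify the actual method, and clustering and fusion are not shown.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not from the project).
STOPWORDS = {"the", "a", "and", "were", "was", "in", "of", "to"}


def words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(stories, max_sentences=2):
    """stories: list of (date 'YYYY-MM-DD', text) about the same event.

    Returns the highest-scoring sentences in chronological order.
    """
    # Number of stories each word occurs in: words shared across
    # stories signal information common to the cluster.
    doc_freq = Counter()
    for _, text in stories:
        doc_freq.update(set(words(text)))

    candidates = []
    for date, text in stories:
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            tokens = words(sentence)
            if tokens:
                score = sum(doc_freq[t] for t in tokens) / len(tokens)
                candidates.append((score, date, sentence))

    top = sorted(candidates, key=lambda c: -c[0])[:max_sentences]
    # ISO dates sort correctly as strings, so this yields oldest first.
    return [sentence for _, _, sentence in sorted(top, key=lambda c: c[1])]
```

Averaging the score over the sentence length keeps long sentences from winning merely by containing more words.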

Title

Ontology-based Domain-specific Natural Language Question Answering System

Name of Student Abstract

Athira PM A question answering (QA) system aims at retrieving precise information from a large collection of documents. This paper describes the architecture of a Natural Language Question Answering (NLQA) system for a specific domain, based on ontological information. The proposed system comprises four basic modules suitable for enhancing current QA capabilities with the possibility of processing complex questions. The first module, question processing, analyses and classifies the question and also reformulates the user query. The second module retrieves the relevant documents. The third module processes the retrieved documents, and the last module performs the extraction and generation of the response. Ontology and domain knowledge are used for reformulating queries and identifying relations. The aim of the system is to generate short and specific answers to questions asked in natural language in a specific domain.

Tools

Python, NLTK, Stanford CoreNLP, VerbNet, Protégé

Place of Work

Govt. Engineering College, Sreekrishnapuram



Title

Scalable Natural Language Report Management using Distributed IE and NLG from Ontology

Name of Student Abstract

Manu Madhavan Automatic text analysis and the creation of a knowledge base from natural language reports are key ideas in the field of the semantic web. In the age of information explosion, performing these tasks on big data becomes tedious and impractical. MapReduce, a programming paradigm proposed by Google, gives us a new approach to solving problems related to big-data analysis by making use of the power of multiple machines. This project makes use of Hadoop, an open-source implementation of MapReduce, to model a Scalable Natural Language Report Management system using distributed information extraction from large-scale natural language reports (in a specific domain). In this project, knowledge is imparted to the machine in the form of an ontology. The persistent storage of the ontology is done using the open-source graph database Neo4j. The project also uses Natural Language Generation (NLG) techniques for querying and analysing the knowledge base. ANTLR, an open-source parser generator, is used with a domain-specific grammar for rule-based information extraction.

Tools

Hadoop, Antlr, Jena, DOM Parser, SPARQL, Pellet, NaturalOWL, Postgres

Place of Work

Centre for Artificial Intelligence and Robotics(CAIR), DRDO, Bangalore
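
The MapReduce style of distributed information extraction described in the Scalable Natural Language Report Management abstract can be sketched in plain Python. This is only an in-process illustration of the map, shuffle and reduce steps under stated assumptions: the regular expression is a hypothetical stand-in for the ANTLR grammar, and the function names are invented here. It does not use the Hadoop API.

```python
import re
from collections import defaultdict

# Hypothetical extraction rule standing in for a domain-specific grammar:
# it matches "<subject> visited <object>" or "<subject> inspected <object>".
PATTERN = re.compile(r"(\w+) (visited|inspected) (\w+)")


def map_report(report):
    """Map step: emit (subject, (relation, object)) pairs from one report."""
    for subject, relation, obj in PATTERN.findall(report):
        yield subject, (relation, obj)


def shuffle(pairs):
    """Group values by key, as a MapReduce framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_facts(subject, facts):
    """Reduce step: merge the facts about one subject, without duplicates."""
    return subject, sorted(set(facts))


def run(reports):
    """Run map, shuffle and reduce over all reports in a single process."""
    pairs = (pair for report in reports for pair in map_report(report))
    return dict(reduce_facts(key, values) for key, values in shuffle(pairs).items())
```

Because each report is mapped independently and each subject is reduced independently, both steps could be spread across machines, which is the property the abstract relies on.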

Title

Question Answering in Domain Specific Malayalam Documents

Name of Student Abstract

Pragisha K This work attempts to find answers to Malayalam factual questions using a repository of Malayalam documents. It uses Information Retrieval and Natural Language Processing in Malayalam to extract appropriate responses. The proposed system is designed with three modules. The first, question analysis, identifies the question word(s) and query words, and also generates answer templates. The next module performs text retrieval and answer snippet extraction: an IR module interacts with the document repository to obtain the documents for answer selection, and these documents are analysed to extract answer snippets. The third module is responsible for answer identification using a scoring method. The system uses language resources such as a stemmer, a POS tagger, a named entity recognition system and a WordNet for Natural Language Processing in Malayalam.

Tools

Python, NLTK

Place of Work

Govt. Engineering College, Sreekrishnapuram
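
The question analysis and snippet scoring steps of the Malayalam question answering abstract can be sketched as follows. The example uses English stand-ins, and the question-word table, the stopword list and the overlap-based score are all assumptions for illustration; the abstract does not state the actual scoring method.

```python
# Hypothetical mapping from question words to expected answer types.
QUESTION_TYPES = {"who": "PERSON", "where": "PLACE", "when": "DATE"}
STOPWORDS = {"is", "was", "the", "of", "a", "in", "did"}


def analyse_question(question):
    """Return (expected answer type, set of query words)."""
    tokens = [t.strip("?.,").lower() for t in question.split()]
    q_word = next((t for t in tokens if t in QUESTION_TYPES), None)
    query_words = {t for t in tokens
                   if t and t != q_word and t not in STOPWORDS}
    return QUESTION_TYPES.get(q_word), query_words


def score_snippet(snippet, query_words):
    """Fraction of the query words that occur in the snippet."""
    tokens = {t.strip("?.,").lower() for t in snippet.split()}
    return len(tokens & query_words) / len(query_words) if query_words else 0.0


def best_snippet(question, snippets):
    """Return the expected answer type and the highest-scoring snippet."""
    answer_type, query_words = analyse_question(question)
    best = max(snippets, key=lambda s: score_snippet(s, query_words))
    return answer_type, best
```

A fuller system would then use the expected answer type, together with named entity recognition, to pick the answer phrase inside the chosen snippet.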



Title

HMM-based Malayalam Text to Speech Synthesis

Name of Student

Rechitha C R

Abstract

Since speech is one of the most important ways for humans to communicate, there have been a great number of efforts to incorporate speech into human-computer communication environments. The function of a Text-to-Speech (TTS) system is to convert text in some language into its spoken equivalent through a series of modules. This involves the integration of speech technology and language technology. The task of a TTS system is thus a complex one that involves mimicking what human readers do. A TTS synthesis system contains components supporting front-end processing of the input text, language modelling, and speech synthesis using its signal processing module. The proposed work involves the design of a TTS synthesis system for Malayalam.

Tools

Hidden Markov Model Toolkit (HTK), The Festival Speech Synthesis System, Speech Signal Processing Toolkit (SPTK), HMM-based Speech Synthesis System (HTS)

Place of Work

Amrita University, Coimbatore

Title

Interlingua for Malayalam

Name of Student

Sibi S

Abstract

Automatic translation between human languages ('Machine Translation') is a science fiction staple and a long-term scientific dream of enormous social, political, and scientific importance. Machine Translation (MT) has many applications in multilingual countries like India. A good MT system can help an individual read and write in any language, even as a novice in that language. It is therefore necessary to implement a system that translates from one language to another. This paper proposes a method for generating an intermediate form from Malayalam for Machine Translation. The intermediate form contains words with the necessary information, such as subject-object roles, gender, person, number, case, and tense. Thus, the target language can be easily constructed from the intermediate form.

Tools

SVM, TnT

Place of Work

Virtual Language Resource Centre, IIITM-Kerala



Title Name of Student Abstract

Malayalam WordNet Renuka Babu T Malayalam is one of the 22 official languages of India, spoken by nearly 33 million people. WordNets are being built for about thirteen of these official languages at different institutions. Hindi WordNet, developed at IIT Bombay, is the first WordNet developed for an Indian language. Malayalam is a morphologically very rich, free word order Indian language for which very little computational work has been reported. This paper is one of the efforts towards building a Malayalam WordNet. Malayalam WordNet is a database of Malayalam word forms (words and collocations) that are grouped together in the form of synsets. The synsets are connected to other synsets via a number of lexical and semantic relations such as hypernymy and hyponymy (the is-a relation), meronymy and holonymy (the part-of relation), antonymy, etc. The lexical relations hold between semantically related word forms, and the semantic relations hold between related word definitions.

Tools

Python, NLTK

Place of Work

Govt. Engineering College, Sreekrishnapuram
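
The synset structure described in the Malayalam WordNet abstract can be sketched as a small data structure. The three synsets below, their identifiers and the helper functions are assumptions chosen for illustration and are not taken from the actual Malayalam WordNet; only the is-a (hypernym and hyponym) relation is shown.

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    name: str                  # identifier, hypothetical
    words: list                # word forms sharing this sense
    gloss: str                 # definition
    hypernyms: list = field(default_factory=list)  # names of is-a parents


# Tiny illustrative inventory (not real Malayalam WordNet entries).
SYNSETS = {
    "animal": Synset("animal", ["മൃഗം"], "a living creature"),
    "dog": Synset("dog", ["നായ", "പട്ടി"], "a domesticated animal that barks", ["animal"]),
    "cat": Synset("cat", ["പൂച്ച"], "a small domesticated animal that mews", ["animal"]),
}


def hypernym_chain(name, synsets=SYNSETS):
    """Follow the first is-a parent upwards until a root synset is reached."""
    chain = []
    current = synsets[name]
    while current.hypernyms:
        parent = current.hypernyms[0]
        chain.append(parent)
        current = synsets[parent]
    return chain


def hyponyms(name, synsets=SYNSETS):
    """Names of all synsets that list `name` as a direct hypernym."""
    return sorted(s.name for s in synsets.values() if name in s.hypernyms)
```

Storing only the hypernym links and deriving hyponyms on demand keeps the two directions of the is-a relation consistent.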

Title

Morphological Analyzer for Malayalam

Name of Student Abstract

Rinju O R Enabling computers to understand human language is one of the major challenges in the field of computing. A morphological analyzer is a very important part of many NLP applications. An NLP system starts by analyzing its input, so without a morphological analyzer of reasonably good accuracy, the accuracy of the whole system is affected. This paper proposes a morphological analyzer for Malayalam, as part of ongoing research on NLP applications in Malayalam. The morphological analyzer takes a word as input and returns its morphemes along with grammatical information, depending on the word category. For nouns, the tool provides gender, number, and case information. For verbs, it provides tense and aspect. A Malayalam morphological analyzer would help in automatic spelling and grammar checking, natural language understanding, machine translation, speech recognition, speech synthesis, part-of-speech tagging and parsing applications.

Tools

Python 3.3

Place of Work

Virtual Language Resource Centre, IIITM-Kerala
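
The input and output of the morphological analyzer described in the abstract above can be illustrated with a toy suffix-stripping sketch for nouns. Everything here is an assumption for illustration: the words are romanized, the suffix table covers only one plural and one case marker, sandhi changes are ignored, and gender is not handled. The abstract does not describe the actual analysis method.

```python
# Toy suffix table (romanized, hypothetical), checked from the outermost
# suffix inwards: case marker first, then plural marker.
NOUN_SUFFIXES = [
    ("ude", "case", "genitive"),
    ("kal", "number", "plural"),
]


def analyse_noun(word):
    """Split a romanized noun into stem and suffixes with basic features."""
    features = {"number": "singular", "case": "nominative"}
    suffixes = []
    stem = word
    for suffix, feature, value in NOUN_SUFFIXES:
        # Require a non-empty stem to remain after stripping.
        if stem.endswith(suffix) and len(stem) > len(suffix):
            stem = stem[: -len(suffix)]
            suffixes.insert(0, suffix)   # keep suffixes in surface order
            features[feature] = value
    return {"stem": stem, "suffixes": suffixes, **features}
```

For the romanized form `kuttikalude` ("of the children"), this returns the stem `kutti` with the suffixes `kal` and `ude`, marked as plural and genitive.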



Title

Semantic Framework for Natural Language Report Management using Distributed Information Extraction and Scalable Ontology Processing

Name of Student Abstract

Robert Jesuraj The project aims to develop an intelligent report management system using distributed ontology processing. The system analyzes the documents (corpus) in a specific domain and builds an ontology, which is later used to query the system. The system will have an option to generate an intelligent report based on the query. Hadoop, a free and open-source framework for the Map/Reduce paradigm, will help the system process data faster. At present, computer science is focusing on how to process big data (the big data problem) and on ways to make computers speak and understand human languages. To solve these kinds of problems, knowledge has to be imparted to the system, and some level of Artificial Intelligence (AI) is needed. The system is said to be intelligent if the output it produces closely resembles what a human would produce. To build such a system, an ontology is required for the particular domain, so the creation of domain-specific ontologies is an important research area in AI and computer science. Even if such a system is developed, solutions will be generated very slowly, as text processing by computers is very slow. To generate them at a faster rate, the processing has to be distributed across different nodes. Jena is used to build the ontology, and the Pellet reasoner is used to draw inferences.

Tools

ANTLR, Hadoop, Neo4j, Swoop, Protégé, Jena, Pellet, PostgreSQL

Place of Work

Centre for Artificial Intelligence and Robotics(CAIR), DRDO, Bangalore

Title

Information Extraction from Advertisements using Classification Approach

Name of Student Abstract

Saani H Advertisements on websites such as Craigslist are largely unstructured text, even though individuals would naturally want to perform structured searches over certain attributes of interest for purposes such as purchasing a car or a book, or searching for a job. My aim is to build a system that performs information extraction over unstructured job advertisements using various natural language processing techniques, including machine learning, Bayesian classification and named entity recognition, along with rule-based approaches. The information extracted from these advertisements can be used to search over the attributes of interest.

Tools Place of Work

Govt. Engineering College, Sreekrishnapuram
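
The rule-based part of the advertisement extraction approach described above can be sketched with a few regular expressions that turn a free-text job advertisement into a structured record. The field names and patterns are assumptions for illustration; the classification-based components mentioned in the abstract are not shown.

```python
import re

# Hypothetical extraction rules: one pattern per attribute of interest.
RULES = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "salary": re.compile(r"\$\d[\d,]*"),
    "experience_years": re.compile(r"(\d+)\+? years"),
}


def extract(ad_text):
    """Return a dict with the first match of every rule found in the text."""
    record = {}
    for name, pattern in RULES.items():
        match = pattern.search(ad_text)
        if match:
            # Use the capturing group when the rule defines one,
            # otherwise the whole match.
            record[name] = match.group(1) if pattern.groups else match.group(0)
    return record
```

Once advertisements are reduced to such records, a structured search (for example, all jobs requiring at most three years of experience) becomes a simple filter over the extracted fields.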



Title

HMM Based Malayalam Speech Recognition

Name of Student

Sumi S Nair

Abstract

Speech is the primary means of communication between people. People are comfortable with speech and also wish to interact with computers via speech. The goal of a speech recognition system is to translate acoustic signals into a sequence of words. The recognizer makes use of phone-based continuous-density Hidden Markov Models (HMMs) for acoustic modelling and n-gram statistics estimated on text material for language modelling. To deal with phonological variability, alternate pronunciations are included in the lexicon. Given an acoustic observation O, the goal is to find the corresponding word sequence W that has the maximum posterior probability P(W|O). The proposed work is to design an Automatic Speech Recognition system for Malayalam.

Tools

HTK, Audacity, Praat, wavesurfer

Place of Work

Amrita University, Coimbatore
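
The maximum posterior criterion stated in the HMM Based Malayalam Speech Recognition abstract can be expanded with Bayes' rule, which shows where the acoustic model and the n-gram statistics enter. The symbol \hat{W} for the recognized word sequence is notation introduced here for illustration.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \arg\max_{W} P(O \mid W)\, P(W)
```

The denominator P(O) can be dropped because it does not depend on W. The term P(O|W) is supplied by the HMM acoustic model, and P(W) by the n-gram language model.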

Title

Extraction of Semantic Relation from Medical Records

Name of Student

Ayisha Noori V K

Abstract

Biomedical natural language processing deals with the application of text mining techniques to clinical documents and to scientific publications in the areas of biology and medicine. A crucial area of Natural Language Processing is semantic analysis, the study of the meaning of linguistic utterances. Natural language processing of biomedical text benefits from the ability to recognize broad semantic classes in different clinical notes. This thesis proposes a method that extracts semantics from patient medical records using statistical machine learning techniques. In particular, it is concerned with identifying relationships between different diseases and listing the medical tests (ECG, CT scan, etc.) required for a patient. For example, if a patient has pneumonia, the method is intended to identify other diseases that the patient may develop and to list the tests that the patient has to undergo.

Tools

Python, NLTK

Place of Work

Government Engineering College, Sreekrishnapuram



Title Name of Student Abstract

Conceptual Indexing and Semantic Searching for Malayalam documents Radhika K. T Conceptual search, i.e., search based on meaning rather than just character strings, has been the motivation for a large body of research in the IR field. This work proposes a system for indexing Malayalam documents using index terms based on concept information, not merely word strings. The system then searches for documents based on those conceptual index terms and finally produces two sets of ranked files. One set contains the documents that are exactly relevant to the query concept, and the second set contains documents related to the query concept.

Tools

Python3

Place of Work

Govt. Engineering College, Sreekrishnapuram
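
The conceptual indexing and two-set retrieval described in the abstract above can be sketched as an inverted index keyed by concepts instead of word strings. The word-to-concept table, the concept relations and the count-based ranking are assumptions for illustration, shown with English stand-in words; the abstract does not specify how concepts are derived or ranked.

```python
from collections import defaultdict

# Hypothetical concept resources (English stand-ins for Malayalam words).
WORD_TO_CONCEPT = {
    "car": "VEHICLE", "automobile": "VEHICLE", "bus": "VEHICLE",
    "road": "ROAD", "highway": "ROAD",
}
RELATED = {"VEHICLE": {"ROAD"}, "ROAD": {"VEHICLE"}}


def build_index(docs):
    """Map each concept to {doc_id: number of occurrences of that concept}."""
    index = defaultdict(lambda: defaultdict(int))
    for doc_id, text in docs.items():
        for word in text.lower().split():
            concept = WORD_TO_CONCEPT.get(word.strip(".,"))
            if concept:
                index[concept][doc_id] += 1
    return index


def ranked(postings):
    """Document ids ordered by descending count, then by id."""
    return sorted(postings, key=lambda d: (-postings[d], d))


def search(query_word, index):
    """Return (documents on the query concept, documents on related concepts)."""
    concept = WORD_TO_CONCEPT.get(query_word.lower())
    if concept is None:
        return [], []
    exact = ranked(index.get(concept, {}))
    related_counts = defaultdict(int)
    for related_concept in RELATED.get(concept, ()):
        for doc_id, count in index.get(related_concept, {}).items():
            if doc_id not in exact:
                related_counts[doc_id] += count
    return exact, ranked(related_counts)
```

Because documents are indexed by concept, a query for "car" also retrieves a document that only mentions "automobile", which plain string matching would miss.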



M.Tech Computational Linguistics Dept. of Computer Science and Engg, Govt. Engg. College, Sreekrishnapuram Palakkad www.simplegroups.in simplequest.in@gmail.com

SIMPLE Groups Students Innovations in Morphology Phonology and Language Engineering

Article Invitation for CLEAR- Sep-2013 We invite thought-provoking articles, interesting dialogues and healthy debates on multifaceted aspects of Computational Linguistics for the forthcoming issue of CLEAR (Computational Linguistics in Engineering And Research) magazine, to be published in Sep 2013. The suggested areas of discussion are:

The articles may be sent to the Editor on or before 10th Sep, 2013 through the email simplequest.in@gmail.com. For more details visit: http://simplegec.blogspot.in

Editor, CLEAR Magazine

Representative, SIMPLE Groups



Hello World, As we bring out the fourth issue of CLEAR, the recent recognition of Malayalam as a Classical language makes this episode more special. Languages are not only a medium of communication, but also strong symbols of our culture and tradition. The knowledge encoded in each language is incredible and invaluable.

Of course, fast-growing technology and globalization have dismantled all such symbols. Hence, the civilizations that evolved with local languages are under threat. “Natural Selection” is an evergreen truth. By standing away from technology, no language can survive this digital era. To fit into the technological landscape, it is the combined responsibility of technocrats and linguists to develop computational resources for their own languages.

We expect that the long-awaited recognition for Malayalam will benefit Malayalam computing and related projects. SIMPLE Groups extends open hands to language enthusiasts for their volunteer work.

Wish you all the best.

Manu Madhavan.
