
Interfacing with Big Data Repositories
Boris Katz
MIT Computer Science and Artificial Intelligence Laboratory
July 18, 2013


Our Claim

As we develop storage capacity, compute platforms and algorithms for scaling to big data, we will need to create new ways to access and interact with massive scale data.

MIT


The Big Picture

[Diagram: Big Data (unstructured, semi-structured, structured) and Analysis. But what about access? Three access modes are shown: Language, Visualization, and Interaction. Labels in the diagram: terms, relationships, descriptions; imagery, feature spaces, graphs, diagrams; a Q/A dialog.]


We can view Big Data through…

… language-colored glasses

Query: “What diseases present with fever and a rash?”
Answer: Scarlet fever is an illness with a characteristic rash that is caused by a strep infection. Chickenpox, Fifth Disease and Systemic Lupus Erythematosus Roseola – this is one of the most common causes of fever and rash in infants and young children. It starts out with three days of moderate to ...

… visualization-colored glasses

Language can help manipulate visualization
Query: “Rule out patients under 25.”


From Big Data to Manageable Data by understanding structure

[Diagram: Big Data becomes Manageable Data through: Parse into T-expressions; Apply S-Rules; OPV Model; Annotate; Decomposition]

•  Language focuses our attention on what is important in data and helps make data more manageable


START: Natural language tools

•  Providing Machines with New Knowledge: NL text → semantic representation
•  Explaining Computer Actions or Describing its Knowledge: semantic representation → NL text
•  Testing Computer Understanding by Answering Questions: NL queries → semantic representation → NL responses, computer actions


Building blocks in the START system

•  Syntactic Analysis: parse trees
•  Semantic Representation: ternary expressions
•  Matching and Transformational Rules
•  Language Generation
•  Replying
•  Object–Property–Value Data Model
•  Question Decomposition


Syntactic Analysis: parse trees

“Because the average sea level in the Northeast is expected to rise due to climate change, flooding will likely become more frequent and cause property damage in coastal areas to increase.”


Semantic Representation: From Parse Trees to Ternary Expressions

“Because the average sea level in the Northeast is expected to rise due to climate change, flooding will likely become more frequent and cause property damage in coastal areas to increase.”

[subject relation object]

[become because expect]
[flooding become frequent]
[become has_modifier likely]
[somebody expect rise]
[level related_to sea]
[level is average]
[level in Northeast]
[frequent has_quantity more]
[level rise null]
[rise due_to change]
[change related_to climate]
[flooding cause increase]
[damage increase null]
[damage related_to property]
[damage in areas]
[areas is coastal]
…
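A minimal sketch of how such [subject relation object] triples might be held in memory. The `TExp` type and the `about` helper are hypothetical names chosen for illustration, not START's own data structures, and the triples are entered by hand rather than produced by a parser.

```python
from typing import NamedTuple, Optional


class TExp(NamedTuple):
    """One ternary expression: [subject relation object]."""
    subject: str
    relation: str
    obj: Optional[str]  # None stands in for the slide's "null"


# A few of the T-expressions listed above, typed in by hand.
kb = [
    TExp("flooding", "become", "frequent"),
    TExp("level", "related_to", "sea"),
    TExp("level", "in", "Northeast"),
    TExp("level", "rise", None),
    TExp("rise", "due_to", "change"),
    TExp("change", "related_to", "climate"),
    TExp("damage", "in", "areas"),
    TExp("areas", "is", "coastal"),
]


def about(subject, expressions):
    """Return every expression whose subject is the given term."""
    return [t for t in expressions if t.subject == subject]


for t in about("level", kb):
    print(list(t))
```

Keeping each expression as a small immutable tuple makes it hashable, so collections of them can be compared or indexed with ordinary sets and dictionaries.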



Ternary expression representation

•  a versatile syntax-driven representation of language
•  highlights significant semantic relations
•  very efficient for indexing, matching and retrieval


Three types of Ternary Expressions

“A young man’s friend was visiting Taiwan”

•  Related to the syntactic structure of the sentence:
   [friend visit Taiwan] [friend related_to man] [man has_property young]
•  Related to syntactic features that change from sentence to sentence:
   [visit has_tense past] [visit is_progressive yes] [man has_det indefinite]
•  Related to lexical features of words that don’t change from sentence to sentence:
   [Taiwan is_proper yes] [man has_number singular]


Creating semantic representations



Matching T-Expressions

Assertion: “Average sea level in the Northeast is expected to rise higher due to climate change.”
Query: “What sea levels are expected to rise?”

T-Expressions from Query      Matcher      T-Expressions from Assertion
[somebody expect rise]                     [somebody expect rise]
[level related_to sea]                     [level related_to sea]
                                           [level is average]
                                           [level in Northeast]
[level rise null]                          [level rise null]
                                           [rise due_to change]
                                           [change related_to climate]
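The exact-match case on this slide can be sketched as a subset test: the query matches when every one of its T-expressions also appears among the assertion's. This is only an illustration under that assumption; the function names are hypothetical, and this sketch does not cover the synonym, hyponym, or S-rule matching that START also performs.

```python
def matches(query, assertion):
    """True if every query T-expression also appears in the assertion."""
    return set(query) <= set(assertion)


def extra_detail(query, assertion):
    """Assertion T-expressions the query did not mention (candidate answer detail)."""
    asked = set(query)
    return [t for t in assertion if t not in asked]


# T-expressions copied from the slide; None stands in for "null".
assertion = [
    ("somebody", "expect", "rise"),
    ("level", "related_to", "sea"),
    ("level", "is", "average"),
    ("level", "in", "Northeast"),
    ("level", "rise", None),
    ("rise", "due_to", "change"),
    ("change", "related_to", "climate"),
]
query = [
    ("somebody", "expect", "rise"),
    ("level", "related_to", "sea"),
    ("level", "rise", None),
]

if matches(query, assertion):
    for t in extra_detail(query, assertion):
        print(t)
```

The unmatched assertion expressions (average, Northeast, climate change) are exactly the detail a reply would add to the question.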


Matching in START

[Diagram: T-Exps from Questions are matched against T-Exps from Assertions]

•  term matching:
   -  lexical match
   -  synonym match
   -  hyponym match
•  structure matching:
   -  exact match
   -  match via transformational S-rules
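The three kinds of term matching could be sketched as follows. The tiny lexicon is invented for illustration, and the direction of the hyponym test (an assertion term may be a more specific kind of the query term) is an assumption rather than something the slide states.

```python
# Hypothetical hand-made lexicon; a real system would use a much larger resource.
SYNONYMS = {"physician": {"doctor"}, "doctor": {"physician"}}
HYPERNYMS = {"surgeon": "doctor", "poodle": "dog", "dog": "animal"}  # child -> parent


def term_match(query_term, assertion_term):
    """Return "lexical", "synonym", "hyponym", or None if the terms do not match."""
    if query_term == assertion_term:
        return "lexical"
    if assertion_term in SYNONYMS.get(query_term, set()):
        return "synonym"
    # Hyponym match: walk up from the assertion term looking for the query term.
    seen = set()
    parent = HYPERNYMS.get(assertion_term)
    while parent is not None and parent not in seen:
        if parent == query_term:
            return "hyponym"
        seen.add(parent)
        parent = HYPERNYMS.get(parent)
    return None


print(term_match("dog", "dog"))
print(term_match("doctor", "physician"))
print(term_match("animal", "poodle"))
print(term_match("poodle", "animal"))
```

Under this assumption a question about animals can be answered by an assertion about poodles, but not the other way around.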



Verb argument alternations and paraphrases

“Surprise”:
“The patient surprised the doctor with his fast recovery.”
“The patient’s fast recovery surprised the doctor.”

“Load”:
“The crane loaded the ship with containers.”
“The crane loaded containers onto the ship.”

“Provide”:
“Did Iran provide Syria with weapons?”
“Did Iran provide weapons to Syria?”


Verb classes and S-Rules

“Surprise”:
“The patient surprised the doctor with his fast recovery.”
“The patient’s fast recovery surprised the doctor.”

“Confuse”:
“The patient confused the doctor with his slow recovery.”
“The patient’s slow recovery confused the doctor.”
…

Emotional Reaction Verbs (semantic class): anger, confuse, disappoint, embarrass, frighten, impress, please, surprise, threaten, …

S-Rule:
If:        [[subject verb object1] with object2]
Then:      [object2 verb object1] [object2 related_to subject]
Provided:  verb ∈ emotional reaction class
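The S-rule above can be sketched as a function over nested tuples. The tuple encoding and the name `apply_s_rule` are assumptions made for illustration; the verb set contains only the verbs the slide lists, not the full class.

```python
# Only the verbs named on the slide; the real class is larger.
EMOTIONAL_REACTION_VERBS = {
    "anger", "confuse", "disappoint", "embarrass", "frighten",
    "impress", "please", "surprise", "threaten",
}


def apply_s_rule(texp):
    """Rewrite [[subject verb object1] with object2] for emotional-reaction verbs.

    Returns the two derived T-expressions, or None if the rule does not apply.
    """
    inner, relation, object2 = texp
    if relation != "with" or not isinstance(inner, tuple) or len(inner) != 3:
        return None
    subject, verb, object1 = inner
    if verb not in EMOTIONAL_REACTION_VERBS:  # the "Provided" condition
        return None
    return [(object2, verb, object1), (object2, "related_to", subject)]


# "The patient surprised the doctor with his fast recovery."
print(apply_s_rule((("patient", "surprise", "doctor"), "with", "recovery")))
# "The crane loaded the ship with containers." (not an emotional-reaction verb)
print(apply_s_rule((("crane", "load", "ship"), "with", "containers")))
```

The first call yields the T-expressions for “The patient’s fast recovery surprised the doctor”, so both paraphrases can match the same stored assertion; the second returns None because "load" falls outside the verb class.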


Language generation

As intelligent systems become more mature, they will be expected to...

•  Explain their actions
•  Answer complex questions
•  Keep track of conversation history and state
•  Engage in mixed-initiative dialog
•  Offer related information of potential interest to the user
•  Help users correct and refine their questions
•  Indicate incomplete understanding of questions and offer partial responses


START's generator

Input:
•  structural ternary expressions
•  ternary expressions for syntactic features
•  ternary expressions for lexical features
•  user/machine-provided task specification

Generator:
•  linguistic constraints
•  syntactic rules
•  morphological rules
•  lexical knowledge
•  anaphoric reference
•  heuristic defaults

Output: natural language sentence


Generator in action



Replying to a question after a match

Generate a sentence from semantic representation

Query: “How are the glucose molecules converted into pyruvate molecules?”

[Diagram: semantic graph with the labels chain, reactions, converts, molecule, glucose, each, molecules, pyruvate, two, smaller, related-to, quantifier, quantity, into, is]

Generated sentence: “A chain of reactions converts each molecule of glucose into two smaller molecules of pyruvate.”

Execute a procedure to obtain an answer from the data source

Query: “Who directed Gone with the Wind?”

Annotation, then Script:
•  get http://us.imdb.com/Details?0031381
•  match regexp...

Reply: Gone with the Wind (1939) was directed by George Cukor, Victor Fleming, and Sam Wood. Source: The Internet Movie Database


START in action



Google in action



The Object–Property–Value data model

The object–property–value (OPV) model applies to:

•  structured data:

   Record   Category   Manufacturer   Units in stock   Retail price
   25387    keyboard   Dell           56               32.25
   53289    mouse      Apple          72               39.99

•  heterogeneous semi-structured information sources:
   -  countries and their capitals, areas, populations, …
   -  individuals and their biographies, birthdates, spouses, …
   -  cities and their weather reports, maps, elevations, …

The OPV Model makes it possible to view and use large segments of the Web as a database
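A minimal sketch of the OPV idea: every source, whatever its internal layout, is wrapped so that it answers the same question, "what is the value of this property for this object?". The names `SOURCES`, `table_source`, and `get_value` are hypothetical; only the table rows come from the slide.

```python
# The structured table from the slide, keyed by record number (the object).
catalog = {
    "25387": {"category": "keyboard", "manufacturer": "Dell",
              "units in stock": 56, "retail price": 32.25},
    "53289": {"category": "mouse", "manufacturer": "Apple",
              "units in stock": 72, "retail price": 39.99},
}


def table_source(obj, prop):
    """Look up one property of one object in the table; None if absent."""
    return catalog.get(obj, {}).get(prop)


# Each source is a function (object, property) -> value. A semi-structured
# source such as a Web page would get its own wrapper with the same signature.
SOURCES = {"catalog": table_source}


def get_value(source_name, obj, prop):
    """Uniform entry point: ask any registered source for an object's property."""
    return SOURCES[source_name](obj, prop)


print(get_value("catalog", "25387", "manufacturer"))
print(get_value("catalog", "53289", "retail price"))
```

Because callers see only the (source, object, property) triple, adding a new source means registering one more wrapper function rather than changing the code that asks questions.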


Implementing the OPV Model: START and Omnibase

Omnibase supports START by providing access to structured and semi-structured information in databases, on the Web, etc.

[Diagram: User Questions go to START; START sends a structured query to Omnibase; Omnibase consults the Data Resources]

START:
1.  What does the question mean?
2.  Where can the answer be found?
3.  What are the object and property?

Omnibase:
1.  Go to the specific data source or Web page containing the answer.
2.  Extract the answer from the data source.

Data Resources: World Factbook, Wikipedia, IMDb, Internet Public Library, NASA, Big Data… etc.


Answering complex questions

“How many people live in the capital of the 8th richest Asian country?”

•  Syntactically decompose a complex question into a set of nested ternary expressions
•  Successively resolve groups of ternary expressions containing variables
   -  Answer sub-questions by replacing variables with obtained values

Sub-questions:
   What is the 8th richest Asian country?
   What is its capital?
   How many people live there?
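The resolution steps above can be sketched as a recursive lookup over a nested (object, property) query: the innermost sub-question is answered first and its value is substituted into the enclosing one. The fact table holds placeholder values invented purely for illustration (no real country, city, or population is implied), and `resolve` is a hypothetical name.

```python
# Placeholder facts: the names and the number are made up for this sketch.
FACTS = {
    ("Asian countries", "8th richest"): "Country-X",
    ("Country-X", "capital"): "City-Y",
    ("City-Y", "population"): 1_000_000,
}


def resolve(query):
    """Resolve a nested (object, property) query, innermost sub-question first."""
    obj, prop = query
    if isinstance(obj, tuple):
        # The object is itself a sub-question: answer it, then substitute the value.
        obj = resolve(obj)
    return FACTS[(obj, prop)]


# "How many people live in the capital of the 8th richest Asian country?"
question = ((("Asian countries", "8th richest"), "capital"), "population")
print(resolve(question))
```

Each level of nesting corresponds to one of the three sub-questions, and each could in principle be answered from a different data source.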


Replying: syntactic decomposition



START Question Answering System



START: linguistically-motivated representations and approaches

•  Ternary expressions representation
•  OPV Data Model: Uniform access to heterogeneous resources
•  Natural language annotations
•  Decomposition of complex questions
•  Same representation for sentence analysis, sentence generation, and question answering


Contributions

•  The START system pioneered language-based services on the Web. The public START server handles millions of questions from users all over the world.
•  START provides high-precision “one-stop shopping” for information from diverse sources: structured, semi-structured, and unstructured.
•  System responses can fuse information from multiple sources and multiple formats.
•  Natural language interaction is a flexible and convenient way to access massive scale data.




Recent Successes of Artificial Intelligence Applications

•  Google’s Goggles
•  Microsoft’s Kinect
•  IBM’s Watson
•  Apple’s Siri

… but are these systems truly intelligent?

These systems don’t have any knowledge or understanding about the world outside of their narrow area of expertise.


The challenge of creating a truly intelligent machine

Meeting this challenge will require moving across:

•  modalities – language, vision, robotics, reasoning, …
•  disciplines – AI, linguistics, cognitive science, neuroscience, …

Much work remains to be done!
