
Interfacing with Big Data Repositories
Boris Katz
MIT Computer Science and Artificial Intelligence Laboratory
July 18, 2013

Our Claim

As we develop storage capacity, compute platforms and algorithms for scaling to big data, we will need to create new ways to access and interact with massive scale data.


The Big Picture

[Figure: a Q/A dialog over unstructured Big Data. Caption: “But, what about access?” Surviving labels: Language, Visualization, Interaction; imagery, feature spaces, graphs, diagrams; terms, relationships, descriptions.]

We can view Big Data through…

Big Data

… language-colored glasses Query: “What diseases present with fever and a rash?”

Query: What diseases present with fever and a rash?
Answer:
Scarlet fever is an illness with a characteristic rash that is caused by a strep infection.
Chickenpox, Fifth Disease and Systemic Lupus Erythematosus
Roseola – this is one of the most common causes of fever and rash in infants and young children. It starts out with three days of moderate to ...

… visualization-colored glasses

Language can help manipulate visualization
Query: “Rule out patients under 25.”

From Big Data to Manageable Data by understanding structure

Big Data → Parse into T-expressions → Apply S-Rules → OPV Model → Manageable Data



•  Language focuses our attention on what is important in data and helps make data more manageable


START: Natural language tools

Providing Machines with New Knowledge:
NL text → semantic representation

Explaining Computer Actions or Describing its Knowledge:
semantic representation → NL text

Testing Computer Understanding by Answering Questions:
NL queries → semantic representation → NL responses, computer actions


Building blocks in the START system

Syntactic Analysis: parse trees


Semantic Representation: ternary expressions


Matching and Transformational Rules


Language Generation




Object–Property–Value Data Model


Question Decomposition

Syntactic Analysis: parse trees

“Because the average sea level in the Northeast is expected to rise due to climate change, flooding will likely become more frequent and cause property damage in coastal areas to increase.”


Semantic Representation: From Parse Trees to Ternary Expressions

“Because the average sea level in the Northeast is expected to rise due to climate change, flooding will likely become more frequent and cause property damage in coastal areas to increase.”

Format: [subject relation object]

[become because expect]
[flooding become frequent]
[become has_modifier likely]
[somebody expect rise]
[level related_to sea]
[level is average]
[level in Northeast]
[frequent has_quantity more]
[level rise null]
[rise due_to change]
[change related_to climate]
[flooding cause increase]
[damage increase null]
[damage related_to property]
[damage in areas]
[areas is coastal]
…


Ternary expression representation


a versatile syntax-driven representation of language


highlights significant semantic relations


very efficient for indexing, matching and retrieval
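The indexing claim can be illustrated with a minimal sketch. It assumes each ternary expression is stored as a plain (subject, relation, object) tuple and that `None` stands in for the slide's `null`; the names `TExp` and `build_index` are hypothetical and are not START's actual data structures.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class TExp(NamedTuple):
    """One ternary expression: [subject relation object]."""
    subject: str
    relation: str
    obj: Optional[str]  # None stands in for "null" on the slide


# A few expressions copied from the sea-level example.
ASSERTION = [
    TExp("flooding", "become", "frequent"),
    TExp("level", "related_to", "sea"),
    TExp("level", "is", "average"),
    TExp("level", "in", "Northeast"),
    TExp("level", "rise", None),
    TExp("rise", "due_to", "change"),
    TExp("change", "related_to", "climate"),
]


def build_index(texps):
    """Map every term, in any position, to the expressions that mention it."""
    index = defaultdict(set)
    for texp in texps:
        for term in texp:
            if term is not None:
                index[term].add(texp)
    return index


index = build_index(ASSERTION)
# Everything the assertion says about "level", found with one lookup.
level_facts = sorted(index["level"], key=lambda t: t.relation)
```

Because every expression has the same three-slot shape, one inverted index of this kind covers subjects, relations, and objects alike.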


Three types of Ternary Expressions

“A young man’s friend was visiting Taiwan”

•  Related to the syntactic structure of the sentence:
   [friend visit Taiwan] [friend related_to man] [man has_property young]

•  Related to syntactic features that change from sentence to sentence:
   [visit has_tense past] [visit is_progressive yes] [man has_det indefinite]

•  Related to lexical features of words that don’t change from sentence to sentence:
   [Taiwan is_proper yes] [man has_number singular]


Creating semantic representations


Matching T-Expressions

Assertion: “Average sea level in the Northeast is expected to rise higher due to climate change.”
Query: “What sea levels are expected to rise?”

T-Expressions from Query:
[somebody expect rise]
[level related_to sea]
[level rise null]

T-Expressions from Assertion:
[somebody expect rise]
[level related_to sea] [level is average] [level in Northeast]
[level rise null] [rise due_to change] [change related_to climate]
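The comparison above can be sketched as a subset test: the query matches when each of its T-expressions also appears among the assertion's. This is a simplified illustration using exact tuple equality only; the function names `matches` and `extra_detail` are hypothetical, and START's real matcher also handles synonyms, hyponyms, and S-rules.

```python
QUERY = [
    ("somebody", "expect", "rise"),
    ("level", "related_to", "sea"),
    ("level", "rise", None),  # None stands in for "null"
]

ASSERTION = [
    ("somebody", "expect", "rise"),
    ("level", "related_to", "sea"),
    ("level", "is", "average"),
    ("level", "in", "Northeast"),
    ("level", "rise", None),
    ("rise", "due_to", "change"),
    ("change", "related_to", "climate"),
]


def matches(query, assertion):
    """True if every query T-expression also appears in the assertion."""
    return set(query) <= set(assertion)


def extra_detail(query, assertion):
    """Assertion expressions the query did not mention, in original order."""
    asked = set(query)
    return [texp for texp in assertion if texp not in asked]


if matches(QUERY, ASSERTION):
    for texp in extra_detail(QUERY, ASSERTION):
        print(texp)
```

The expressions left over after the match (average, Northeast, due to climate change) are the details that make the retrieved assertion an informative answer.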

Matching in START

T-Exps from Questions are matched against T-Exps from Assertions.

term matching:
•  lexical match
•  synonym match
•  hyponym match

structure matching:
•  exact match
•  match via transformational S-rules
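The three kinds of term matching can be sketched as follows. The tiny lexicon here (`SYNONYMS`, `HYPERNYMS`) is invented for illustration and is not START's lexicon; the sketch also assumes that a hyponym match means the assertion term is a more specific kind of the query term.

```python
# Hypothetical toy lexicon, for illustration only.
SYNONYMS = {"physician": {"doctor"}, "doctor": {"physician"}}
HYPERNYMS = {"measles": "disease", "chickenpox": "disease", "disease": "condition"}


def is_hyponym(term, ancestor):
    """True if `ancestor` is reachable from `term` by following HYPERNYMS links."""
    seen = set()
    while term in HYPERNYMS and term not in seen:
        seen.add(term)
        term = HYPERNYMS[term]
        if term == ancestor:
            return True
    return False


def term_match(query_term, assertion_term):
    """Return the kind of match between two terms, or None if they do not match."""
    if query_term == assertion_term:
        return "lexical"
    if assertion_term in SYNONYMS.get(query_term, set()):
        return "synonym"
    if is_hyponym(assertion_term, query_term):
        return "hyponym"
    return None


print(term_match("disease", "measles"))
```

Under these assumptions, a question about a "disease" can match an assertion about "measles", but a question about "measles" does not match an assertion about diseases in general.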


Verb argument alternations and paraphrases

“Surprise”:
“The patient surprised the doctor with his fast recovery.”
“The patient’s fast recovery surprised the doctor.”

“Load”:
“The crane loaded the ship with containers.”
“The crane loaded containers onto the ship.”

“Provide”:
“Did Iran provide Syria with weapons?”
“Did Iran provide weapons to Syria?”


Verb classes and S-Rules

“Surprise”:
“The patient surprised the doctor with his fast recovery.”
“The patient’s fast recovery surprised the doctor.”

“Confuse”:
“The patient confused the doctor with his slow recovery.”
“The patient’s slow recovery confused the doctor.”
…

Emotional Reaction Verbs (semantic class): anger, confuse, disappoint, embarrass, frighten, impress, please, surprise, threaten, …

S-Rule:
If: [[subject verb object1] with object2]
Then: [object2 verb object1] [object2 related_to subject]
where verb ∈ emotional reaction class
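The S-rule above can be sketched as a function over nested tuples. It assumes the nested expression [[subject verb object1] with object2] is encoded as `((subject, verb, object1), "with", object2)`; the function name `apply_s_rule` and this encoding are illustrative choices, not START's implementation.

```python
# Verb class as listed on the slide (the slide's list is open-ended).
EMOTIONAL_REACTION_VERBS = {
    "anger", "confuse", "disappoint", "embarrass", "frighten",
    "impress", "please", "surprise", "threaten",
}


def apply_s_rule(texp):
    """Rewrite [[subject verb object1] with object2] for emotional-reaction verbs.

    Returns the two derived expressions, or None if the rule does not apply.
    """
    inner, relation, object2 = texp
    if relation != "with" or not isinstance(inner, tuple):
        return None
    subject, verb, object1 = inner
    if verb not in EMOTIONAL_REACTION_VERBS:
        return None
    return [(object2, verb, object1), (object2, "related_to", subject)]


# "The patient surprised the doctor with his fast recovery."
derived = apply_s_rule((("patient", "surprise", "doctor"), "with", "recovery"))
print(derived)
```

Restricting the rule to a verb class keeps it from firing on verbs such as "load", whose "with" alternation has a different meaning.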

Language generation

As intelligent systems become more mature, they will be expected to...
•  Explain their actions
•  Answer complex questions
•  Keep track of conversation history and state
•  Engage in mixed-initiative dialog
•  Offer related information of potential interest to the user
•  Help users correct and refine their questions
•  Indicate incomplete understanding of questions and offer partial responses


START's generator

Input:
•  structural ternary expressions
•  ternary expressions for syntactic features
•  ternary expressions for lexical features
•  user/machine-provided task specification

Generator:
•  linguistic constraints
•  syntactic rules
•  morphological rules
•  lexical knowledge
•  anaphoric reference
•  heuristic defaults

Output: natural language sentence

Generator in action




Replying to a question after a match

Generate a sentence from semantic representation

Query: “How are the glucose molecules converted into pyruvate molecules?”

[Figure: semantic network. Surviving node and edge labels: related-to, pyruvate, quantity, two, converts, chain, quantifier, glucose, reactions, molecule.]

“A chain of reactions converts each molecule of glucose into two smaller molecules of pyruvate.”


Execute a procedure to obtain an answer from the data source

Query: “Who directed Gone with the Wind?”

Script:
•  get Details?0031381
•  match regexp...

Gone with the Wind (1939) was directed by George Cukor, Victor Fleming, and Sam Wood.
Source: The Internet Movie Database


START in action


Google in action


The Object–Property–Value data model

The object–property–value (OPV) model applies to:

structured data:
[Table figure; surviving labels: Record, Units in stock, Retail price]

heterogeneous semi-structured information sources:
•  countries and their capitals, areas, populations, …
•  individuals and their biographies, birthdates, spouses, …
•  cities and their weather reports, maps, elevations, …

The OPV Model makes it possible to view and use large segments of the Web as a database.
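A minimal sketch of the OPV idea: every source, whatever its format, is wrapped so that it answers the same kind of request, "give me the value of this property for this object". The class `OPVStore` and its methods are hypothetical and are not Omnibase's interface; in-memory dictionaries stand in for real databases and Web pages.

```python
class OPVStore:
    """Uniform object-property-value access over heterogeneous sources."""

    def __init__(self):
        self._sources = {}  # property name -> callable(object) -> value

    def register(self, prop, fetch):
        """Attach a source-specific lookup function to a property name."""
        self._sources[prop] = fetch

    def get(self, obj, prop):
        """Return the value of `prop` for `obj`, or None if unknown."""
        fetch = self._sources.get(prop)
        return None if fetch is None else fetch(obj)


# Stand-ins for two unrelated sources (a country table and a film database).
CAPITALS = {"France": "Paris", "Japan": "Tokyo"}
DIRECTORS = {"Gone with the Wind": ["George Cukor", "Victor Fleming", "Sam Wood"]}

store = OPVStore()
store.register("capital", CAPITALS.get)
store.register("director", DIRECTORS.get)

print(store.get("France", "capital"))
print(store.get("Gone with the Wind", "director"))
```

The caller never needs to know which source holds a property; adding a new source means registering one more lookup function.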

Implementing the OPV Model: START and Omnibase

Omnibase supports START by providing access to structured and semi-structured information in databases, on the Web, etc.

User Questions → START:
1.  What does the question mean?
2.  Where can the answer be found?
3.  What are the object and property?

START → structured query → Omnibase:
1.  Go to the specific data source or Web page containing the answer.
2.  Extract the answer from the data source.

Data Resources: World Factbook, Wikipedia, IMDb, Internet Public Library, NASA, Big Data… etc.

Answering complex questions

“How many people live in the capital of the 8th richest Asian country?”

•  Syntactically decompose a complex question into a set of nested ternary expressions
•  Successively resolve groups of ternary expressions containing variables
•  Answer sub-questions by replacing variables with obtained values

Sub-questions:
1.  What is the 8th richest Asian country?
2.  What is its capital?
3.  How many people live there?
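The resolution steps above can be sketched as a chain of lookups in which each answer is bound to a variable and substituted into the next sub-question. Everything here is an assumption made for illustration: the `(variable, object, property)` encoding, the function `resolve`, and the knowledge base, whose values are invented placeholder tokens rather than real data.

```python
# Placeholder knowledge base: every value is an invented token, not real data.
KB = {
    ("8th-richest-Asian-country", "is"): "COUNTRY_X",
    ("COUNTRY_X", "capital"): "CITY_Y",
    ("CITY_Y", "population"): "POPULATION_Z",
}

# Sub-questions in resolution order: (variable to bind, object, property).
SUBQUESTIONS = [
    ("?country", "8th-richest-Asian-country", "is"),
    ("?capital", "?country", "capital"),
    ("?answer", "?capital", "population"),
]


def resolve(subquestions, kb):
    """Answer each sub-question in turn, substituting earlier answers for variables."""
    bindings = {}
    for var, obj, prop in subquestions:
        obj = bindings.get(obj, obj)  # replace a variable with its obtained value
        bindings[var] = kb[(obj, prop)]
    return bindings


bindings = resolve(SUBQUESTIONS, KB)
print(bindings["?answer"])
```

Each lookup is an ordinary object-property request, so the same access layer that answers simple questions can answer the nested one.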

Replying: syntactic decomposition


START Question Answering System


START: linguistically-motivated representations and approaches

Ternary expressions representation


OPV Data Model: Uniform access to heterogeneous resources


Natural language annotations


Decomposition of complex questions


Same representation for sentence analysis, sentence generation, and question answering

Contributions

The START system pioneered language-based services on the Web. The public START server handles millions of questions from users all over the world.


START provides high-precision “one-stop shopping” for information from diverse sources: structured, semi-structured, and unstructured.


System responses can fuse information from multiple sources and multiple formats.


Natural language interaction is a flexible and convenient way to access massive scale data.


Recent Successes of Artificial Intelligence Applications

Google’s Goggles


Microsoft’s Kinect


IBM’s Watson


Apple’s Siri

… but are these systems truly intelligent?

These systems don’t have any knowledge or understanding of the world outside of their narrow area of expertise.

The challenge of creating a truly intelligent machine

Meeting this challenge will require moving across:

modalities – language, vision, robotics, reasoning, …


disciplines – AI, linguistics, cognitive science, neuroscience, …

Much work remains to be done!