Interfacing with Big Data Repositories Boris Katz MIT Computer Science and Artificial Intelligence Laboratory July 18, 2013
Our Claim As we develop storage capacity, compute platforms and algorithms for scaling to big data, we will need to create new ways to access and interact with massive scale data
The Big Picture: a dialogue of alternating questions (Q) and answers (A)
Data But, what about access? Language Visualization Interaction imagery feature spaces graphs diagrams
terms relationships descriptions
Big Data unstructured
We can view Big Data through…
… language-colored glasses Query: “What diseases present with fever and a rash?”
Query: What diseases present with fever and a rash? Answer: Scarlet fever is an illness with a characteristic rash that is caused by a strep infection. Chickenpox, Fifth Disease and Systemic Lupus Erythematosus Roseola – this is one of the most common causes of fever and rash in infants and young children. It starts out with three days of moderate to ...
… visualization-colored glasses
Language can help manipulate visualization Query: “Rule out patients under 25.” MIT
From Big Data to Manageable Data by understanding structure Parse into T-expressions
OPV Model Manageable Data
• Language focuses our attention on what is important in data and helps make data more manageable
START: Natural language tools
Providing Machines with New Knowledge: NL text
Explaining Computer Actions or Describing its Knowledge: semantic representation
Testing Computer Understanding by Answering Questions: NL queries
NL responses computer actions
Building blocks in the START system
Syntactic Analysis: parse trees
Semantic Representation: ternary expressions
Matching and Transformational Rules
Object–Property–Value Data Model
Question Decomposition
Syntactic Analysis: parse trees “Because the average sea level in the Northeast is expected to rise due to climate change, flooding will likely become more frequent and cause property damage in coastal areas to increase.”
Semantic Representation: From Parse Trees to Ternary Expressions “Because the average sea level in the Northeast is expected to rise due to climate change, flooding will likely become more frequent and cause property damage in coastal areas to increase.” [subject relation object] [become because expect] [flooding become frequent] [become has_modifier likely] [somebody expect rise] [level related_to sea] [level is average] [level in Northeast] [frequent has_quantity more] [level rise null]
[rise due_to change] [change related_to climate] [flooding cause increase] [damage increase null] [damage related_to property] [damage in areas] [areas is coastal] …
Ternary expression representation
a versatile syntax-driven representation of language
highlights significant semantic relations
very efficient for indexing, matching and retrieval
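The indexing claim can be illustrated with a small sketch. It assumes a simple in-memory inverted index keyed by slot; the names `TExp` and `TExpIndex` are hypothetical and are not part of START, whose actual indexing scheme is not described in these slides. The sample expressions are taken from the sea-level sentence above.

```python
from collections import defaultdict
from typing import NamedTuple


class TExp(NamedTuple):
    """One ternary expression: [subject relation object]."""
    subject: str
    relation: str
    obj: str  # the literal string "null" mirrors the slides' empty object


class TExpIndex:
    """Toy inverted index over the three slots of a T-expression (hypothetical)."""

    def __init__(self):
        self._by_slot = defaultdict(set)

    def add(self, texp):
        # File the expression under each (slot name, slot value) pair.
        for slot, value in zip(TExp._fields, texp):
            self._by_slot[(slot, value)].add(texp)

    def find(self, **slots):
        """Return every stored expression whose named slots equal the given values."""
        hits = [self._by_slot.get(item, set()) for item in slots.items()]
        return set.intersection(*hits) if hits else set()


index = TExpIndex()
for triple in [
    ("flooding", "become", "frequent"),
    ("level", "related_to", "sea"),
    ("level", "is", "average"),
    ("level", "in", "Northeast"),
    ("level", "rise", "null"),
    ("rise", "due_to", "change"),
]:
    index.add(TExp(*triple))

# Retrieve everything said about "level", then narrow by relation.
about_level = index.find(subject="level")
where_level = index.find(subject="level", relation="in")
```

Because each slot is indexed separately, a lookup touches only the expressions that share a slot value with the request, which is one plausible reading of why a fixed three-slot form is cheap to index and retrieve.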
Three types of Ternary Expressions “A young man’s friend was visiting Taiwan”
Related to the syntactic structure of the sentence [friend visit Taiwan] [friend related_to man] [man has_property young]
Related to syntactic features that change from sentence to sentence [visit has_tense past] [visit is_progressive yes] [man has_det indefinite]
Related to lexical features of words that don’t change from sentence to sentence [Taiwan is_proper yes] [man has_number singular]
Creating semantic representations
Matching T-Expressions
Assertion: “Average sea level in the Northeast is expected to rise higher due to climate change.”
Query: “What sea levels are expected to rise?”
T-Expressions from Query:
[somebody expect rise]
[level related_to sea]
[level rise null]
T-Expressions from Assertion:
[somebody expect rise]
[level related_to sea] [level is average] [level in Northeast]
[level rise null] [rise due_to change] [change related_to climate]
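The match above can be sketched as a containment test: the query matches when each of its T-expressions also appears among the assertion's. This is a minimal sketch under that assumption only; the function names are hypothetical, and START's real matcher also handles the term and structure variations that exact tuple equality ignores.

```python
# T-expressions copied from the slide, written as (subject, relation, object).
QUERY = [
    ("somebody", "expect", "rise"),
    ("level", "related_to", "sea"),
    ("level", "rise", "null"),
]
ASSERTION = [
    ("somebody", "expect", "rise"),
    ("level", "related_to", "sea"),
    ("level", "is", "average"),
    ("level", "in", "Northeast"),
    ("level", "rise", "null"),
    ("rise", "due_to", "change"),
    ("change", "related_to", "climate"),
]


def unmatched(query, assertion):
    """List the query expressions that have no exact counterpart in the assertion."""
    known = set(assertion)
    return [texp for texp in query if texp not in known]


def matches(query, assertion):
    """A query matches when nothing in it is left unmatched."""
    return not unmatched(query, assertion)


result = matches(QUERY, ASSERTION)
```

The extra assertion expressions ([level is average], [level in Northeast], and so on) do not block the match; they are the additional detail the answer can report back.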
Matching in START: T-Exps from Questions against T-Exps from Assertions
term matching:
• lexical match
• synonym match
• hyponym match
structure matching:
• exact match
• match via transformational S-rules
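The three kinds of term matching can be sketched with a toy lexicon. Everything in the tables below is a hypothetical stand-in (START's lexicon is not shown in the slides), and the sketch assumes that a hyponym match means the assertion's term is a more specific kind of the query's term.

```python
# Hypothetical lexicon, for illustration only.
SYNONYMS = [{"rise", "increase"}, {"cause", "produce"}]
HYPERNYM = {  # maps a term to its more general parent
    "chickenpox": "disease",
    "roseola": "disease",
    "disease": "condition",
}


def term_match(query_term, assertion_term):
    """Return the kind of match between two terms, or None if they do not match."""
    if query_term == assertion_term:
        return "lexical"
    if any(query_term in group and assertion_term in group for group in SYNONYMS):
        return "synonym"
    # Walk up from the assertion term; reaching the query term means the
    # assertion term is a hyponym (a more specific kind) of the query term.
    current, seen = assertion_term, set()
    while current in HYPERNYM and current not in seen:
        seen.add(current)
        current = HYPERNYM[current]
        if current == query_term:
            return "hyponym"
    return None


kinds = [
    term_match("rise", "rise"),
    term_match("rise", "increase"),
    term_match("disease", "chickenpox"),
]
```

Under these assumptions a question about a "disease" can be answered by an assertion about "chickenpox", but not the other way around.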
Verb argument alternations and paraphrases “Surprise”: “The patient surprised the doctor with his fast recovery.” “The patient’s fast recovery surprised the doctor.” “Load”: “The crane loaded the ship with containers.” “The crane loaded containers onto the ship.” “Provide”: “Did Iran provide Syria with weapons?” “Did Iran provide weapons to Syria?”
Verb classes and S-Rules “Surprise”: “The patient surprised the doctor with his fast recovery.” “The patient’s fast recovery surprised the doctor.” “Confuse”: “The patient confused the doctor with his slow recovery.” “The patient’s slow recovery confused the doctor.” … Emotional Reaction Verbs (semantic class): anger, confuse, disappoint, embarrass, frighten, impress, please, surprise, threaten, …
S-Rule:
If: [[subject verb object1] with object2]
Then: [object2 verb object1] [object2 related_to subject]
where verb ∈ emotional reaction class
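A minimal sketch of applying this S-rule follows. It assumes one reading of the slide, in which [[subject verb object1] with object2] is the If side and both [object2 verb object1] and [object2 related_to subject] belong to the Then side. Nested tuples stand in for nested T-expressions, and the function name is hypothetical.

```python
# Verb class as listed on the slide (the slide's list is open-ended).
EMOTIONAL_REACTION_VERBS = {
    "anger", "confuse", "disappoint", "embarrass", "frighten",
    "impress", "please", "surprise", "threaten",
}


def apply_emotional_reaction_rule(texp):
    """Rewrite [[subject verb object1] with object2]; return None if the rule does not apply."""
    if len(texp) != 3:
        return None
    inner, relation, object2 = texp
    # The If side requires a nested expression joined to object2 by "with".
    if relation != "with" or not isinstance(inner, tuple) or len(inner) != 3:
        return None
    subject, verb, object1 = inner
    # The rule is licensed only for verbs in the semantic class.
    if verb not in EMOTIONAL_REACTION_VERBS:
        return None
    return [(object2, verb, object1), (object2, "related_to", subject)]


# "The patient surprised the doctor with his fast recovery."
rewritten = apply_emotional_reaction_rule(
    (("patient", "surprise", "doctor"), "with", "recovery")
)
```

The class check is what keeps the rule from firing on "The crane loaded the ship with containers": "load" has the same surface pattern but belongs to a different verb class with a different alternation.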
Language generation
As intelligent systems become more mature, they will be expected to...
• Explain their actions
• Answer complex questions
• Keep track of conversation history and state
• Engage in mixed-initiative dialog
• Offer related information of potential interest to the user
• Help users correct and refine their questions
• Indicate incomplete understanding of questions and offer partial responses
START's generator
Input: structural ternary expressions; ternary expressions for syntactic features; ternary expressions for lexical features; user/machine-provided task specification
Generator: • linguistic constraints • syntactic rules • morphological rules • lexical knowledge • anaphoric reference • heuristic defaults
Output: natural language sentence
Generator in action
Replying to a question after a match: generate a sentence from the semantic representation
Query: “How are the glucose molecules converted into pyruvate molecules?”
[Diagram of the matched semantic representation; node labels: related-to, chain, quantifier, glucose]
Reply: “A chain of reactions converts each molecule of glucose into two smaller molecules of pyruvate.”
Execute a procedure to obtain an answer from the data source
Query: “Who directed Gone with the Wind?”
Script:
• get http://us.imdb.com/Details?0031381
• match regexp...
Reply: Gone with the Wind (1939) was directed by George Cukor, Victor Fleming, and Sam Wood. Source: The Internet Movie Database
START in action
Google in action
The Object–Property–Value data model
The object–property–value (OPV) model applies to:
• structured data (slide example: Record, Units in stock)
• heterogeneous semi-structured information sources:
  • countries and their capitals, areas, populations, …
  • individuals and their biographies, birthdates, spouses, …
  • cities and their weather reports, maps, elevations, …
The OPV Model makes it possible to view and use large segments of the Web as a database
Implementing the OPV Model: START and Omnibase
Omnibase supports START by providing access to structured and semi-structured information in databases, on the Web, etc.
START:
1. What does the question mean?
2. Where can the answer be found?
3. What are the object and property?
Omnibase:
1. Go to the specific data source or Web page containing the answer.
2. Extract the answer from the data source.
Data Resources: Wikipedia, IMDb, Internet Public Library, NASA, Big Data…, etc.
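The two retrieval steps (go to the source, extract the answer) can be sketched as an object–property–value lookup. This is an illustration under stated assumptions, not Omnibase's interface: the registry layout, the `get` function, and the page markup are all hypothetical, and an in-memory string stands in for a fetched Web page. Only the film title and director names come from the slides.

```python
import re

# Hypothetical stand-in for a fetched page; real IMDb markup differs.
SAMPLE_PAGE = "<b>Directed by</b> George Cukor, Victor Fleming, Sam Wood<br>"


def directors_from_page(page):
    """Extract a comma-separated director list with a regular expression."""
    found = re.search(r"Directed by</b>\s*([^<]+)", page)
    if not found:
        return []
    return [name.strip() for name in found.group(1).split(",")]


# Each source maps objects to pages and properties to extraction scripts.
SOURCES = {
    "movies": {
        "pages": {"Gone with the Wind": SAMPLE_PAGE},
        "properties": {"director": directors_from_page},
    },
}


def get(source, obj, prop):
    """Return the value of `prop` for `obj` in `source`, or None if unavailable."""
    entry = SOURCES[source]
    page = entry["pages"].get(obj)             # step 1: go to the page for the object
    extractor = entry["properties"].get(prop)  # step 2: pick the script for the property
    if page is None or extractor is None:
        return None
    return extractor(page)


directors = get("movies", "Gone with the Wind", "director")
```

Adding another source in this sketch means registering its pages and per-property extractors; the caller's `get(source, object, property)` call stays the same, which is the uniform-access idea behind the OPV model.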
Answering complex questions
“How many people live in the capital of the 8th richest Asian country?”
• Syntactically decompose a complex question into a set of nested ternary expressions
• Successively resolve groups of ternary expressions containing variables
  • Answer sub-questions by replacing variables with obtained values
Sub-questions:
• What is the 8th richest Asian country?
• What is its capital?
• How many people live there?
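The resolution loop can be sketched as follows. The syntactic decomposition itself is skipped: the plan below is written by hand, one step per sub-question, and every fact in `FACTS` is an obviously fictitious placeholder rather than real data. All names in the block are hypothetical.

```python
# Hand-written plan: (object or variable, property, variable to bind).
PLAN = [
    ("Asian countries", "8th richest", "?country"),
    ("?country", "capital", "?city"),
    ("?city", "population", "?answer"),
]

# Placeholder facts, invented purely for illustration.
FACTS = {
    ("Asian countries", "8th richest"): "Country-X",
    ("Country-X", "capital"): "City-Y",
    ("City-Y", "population"): 123456,
}


def resolve(plan, lookup):
    """Answer sub-questions in order, substituting each value into later steps."""
    bindings = {}
    for obj, prop, variable in plan:
        obj = bindings.get(obj, obj)  # replace a variable with its obtained value
        value = lookup(obj, prop)
        if value is None:
            raise LookupError(f"no value for {prop!r} of {obj!r}")
        bindings[variable] = value
    return bindings


bindings = resolve(PLAN, lambda obj, prop: FACTS.get((obj, prop)))
```

Each step depends only on values bound by earlier steps, so the innermost sub-question is answered first and its result feeds the next one outward.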
Replying: syntactic decomposition
START Question Answering System
START: linguistically-motivated representations and approaches
Ternary expressions representation
OPV Data Model: Uniform access to heterogeneous resources
Natural language annotations
Decomposition of complex questions
Same representation for sentence analysis, sentence generation, and question answering
The START system pioneered language-based services on the Web. The public START server handles millions of questions from users all over the world.
START provides high-precision “one-stop shopping” for information from diverse sources: structured, semi-structured, and unstructured.
System responses can fuse information from multiple sources and multiple formats.
Natural language interaction is a flexible and convenient way to access massive scale data.
Recent Successes of Artificial Intelligence Applications
… but are these systems truly intelligent? These systems don’t have any knowledge or understanding about the world outside of their narrow area of expertise.
The challenge of creating a truly intelligent machine
Meeting this challenge will require moving across:
modalities – language, vision, robotics, reasoning, …
disciplines – AI, linguistics, cognitive science, neuroscience, …
Much work remains to be done!