1) Explain in detail about words and their components?
Words and their components
One of the fundamental aspects of NLP is dealing with words and their components.
In this context, words are the building blocks of language and carry semantic meaning.
Understanding the components of words is crucial for various NLP tasks,
Let's explore the components of words commonly analyzed in NLP:
(1) Tokens:
Definition:
In NLP, a token refers to an individual unit of text separated by whitespace or other delimiters, such as punctuation.
Tokenization Process:
Tokenization is the process of breaking down a text into individual tokens based on certain rules.
The specific rules may vary depending on the language, context, and requirements of the NLP task.
Common tokenization techniques include:
Word Tokenization:
The most common form of tokenization is breaking a text into individual words.
For example, the sentence "I love cats!" can be tokenized into ["I", "love", "cats", "!"] (punctuation may be kept as separate tokens or discarded, depending on the rules).
Sentence Tokenization:
In some cases, the goal is to split a text into sentences rather than words.
Sentence tokenization breaks the text (paragraph) into individual sentences.
For example, the paragraph "I love cats. They are so cute!" can be tokenized into ["I love cats.", "They are so cute!"].
Subword Tokenization:
Subword tokenization breaks a text into smaller units, such as character n-grams or subword units.
This approach is commonly used in neural machine translation and modern language models, where it helps handle rare and unknown words.
For example, the word "unhappiness" might be tokenized into ["un", "happi", "ness"].
Importance of Tokens:
Tokens serve as the fundamental units for various NLP tasks; they enable computers to process and analyze text efficiently.
Some key reasons why tokens are important in NLP include:
a. Text Pre-processing
b. Vocabulary Building
c. Text Analysis
d. Feature Extraction
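The word and sentence tokenization described above can be sketched with Python's standard `re` module alone. The regex rules here are illustrative assumptions, not a complete tokenizer; real tokenizers handle abbreviations, contractions, URLs, and many other cases:

```python
import re

def word_tokenize(text):
    # Keep runs of word characters as tokens; punctuation becomes its own token
    return re.findall(r"\w+|[^\w\s]", text)

def sentence_tokenize(text):
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

print(word_tokenize("I love cats!"))                    # ['I', 'love', 'cats', '!']
print(sentence_tokenize("I love cats. They are so cute!"))
```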
(2) Lexemes:
Definition:
A lexeme is the abstract representation of a word, which includes the base form of the word and all its inflected or derived forms.
It represents the shared meaning of related words within a lexical category.
Base Form:
The base form of a word, also known as the lemma, is the canonical or dictionary form. It is the form that we typically find in dictionaries.
For example, the base form of the verb "run" is "run," and the base form of the noun "cats" is "cat."
Inflected Forms:
Inflected forms are variations of a word that are derived through grammatical processes such as tense, number, gender, case, or conjugation.
Examples of inflected forms include:
o Verb Inflections: "run" (base form) vs. "runs" (3rd person singular present tense) vs. "ran" (past tense) vs. "running" (present participle).
o Noun Inflections: "cat" (base form) vs. "cats" (plural form) vs. "cat's" (possessive form).
o Adjective Inflections: "happy" (base form) vs. "happier" (comparative form) vs. "happiest" (superlative form).
Importance of Lexemes:
Lexemes are an important concept in linguistics and natural language processing (NLP).
Some key reasons why lexemes are important in NLP include:
a. Meaning Representation
b. Morphological Analysis
c. Word Sense Disambiguation
d. Information Retrieval and Search
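Mapping inflected forms back to their lexeme's base form (lemmatization) can be sketched as an exception table consulted before a few suffix-stripping rules. The table and rules below are a tiny illustrative sample, not a complete English lemmatizer:

```python
# Irregular form -> lemma (illustrative sample only)
IRREGULAR = {"ran": "run", "mice": "mouse", "geese": "goose", "better": "good"}

def lemmatize(word):
    w = word.lower()
    if w in IRREGULAR:            # exceptions take priority
        return IRREGULAR[w]
    if w.endswith("ies"):
        return w[:-3] + "y"       # "studies" -> "study"
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]             # "cats" -> "cat"
    return w
```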
(3) Morphemes:
Definition:
Morphemes are the smallest units of meaning in language.
A word may be composed of one or more morphemes.
They are the building blocks of words and carry semantic or grammatical information.
Free Morphemes:
Free morphemes are standalone words that can function independently and carry meaning on their own.
Examples of free morphemes include nouns (e.g., "cat"), verbs (e.g., "run"), adjectives (e.g., "happy"), and adverbs (e.g., "quickly").
Bound Morphemes:
Bound morphemes are units that cannot function independently as separate words but must be attached to other morphemes.
Bound morphemes can be further classified into prefixes, suffixes, and infixes.
Prefixes:
These are bound morphemes added to the beginning of a word.
Examples include "un-" (e.g., "undo"), "re-" (e.g., "redo"), and "pre-" (e.g., "preheat").
Suffixes:
These are bound morphemes added to the end of a word.
Examples include "-s" (e.g., "cats"), "-ed" (e.g., "walked"), and "-er" (e.g., "faster").
Infixes:
Infixes are inserted within a word and are used for various purposes. They are rare in English but common in some languages: Tagalog, for example, uses the infix "-um-" (e.g., "sulat" → "sumulat"). English shows infix-like insertion only in emphatic expressions such as "abso-bloomin'-lutely."
Common example:
The word "unhappiness" can be broken down into three morphemes: "un-", "happy", "-ness".
Importance of Morphemes:
Some key reasons why morphemes are important in NLP include:
a. Meaningful Units
b. Word Formation
c. Grammar and Syntax
d. Language Acquisition
e. Language Processing
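The morpheme analysis above, splitting a word into prefix, stem, and suffix, can be sketched with small hand-picked affix lists. The lists below are illustrative assumptions; real morphological analyzers rely on full lexicons and rules:

```python
# Hand-picked affix lists (illustrative; not exhaustive)
PREFIXES = ["un", "re", "pre"]
SUFFIXES = ["ness", "ing", "ed", "er", "s"]

def segment(word):
    morphemes = []
    for p in PREFIXES:                      # strip at most one prefix
        if word.startswith(p) and len(word) > len(p) + 2:
            morphemes.append(p + "-")
            word = word[len(p):]
            break
    suffix = None
    for s in SUFFIXES:                      # strip at most one suffix
        if word.endswith(s) and len(word) > len(s) + 2:
            suffix = "-" + s
            word = word[:-len(s)]
            break
    morphemes.append(word)                  # what remains is the stem
    if suffix:
        morphemes.append(suffix)
    return morphemes
```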
(4) Typology:
Definition:
Typology refers to the classification of languages based on their structural features and patterns.
Typological analysis helps us understand the diversity of languages and how they differ from one another.
Typological Features:
Word Order
Morphological Typology
Phonological Features
Nominal and Verbal Categories
Syntactic Structures
Semantic Systems
Practical Applications:
Typology has implications for various areas, such as:
Language documentation
Language acquisition
Machine translation
Development of NLP systems
Explain the issues and challenges in finding the structure of words?
Finding the structure of words is an important task in NLP. By understanding the structure of words, we can better understand the meaning of text and the relationships between words.
This information can be used to improve a wide range of NLP tasks.
Finding the structure of words in language poses several issues and challenges due to the inherent characteristics of human language.
Irregularity
Irregularity is a deviation from the norm or from what is expected. In the context of language, irregularity refers to words that do not follow the regular patterns of inflection.
For example, the plural of "mouse" is "mice," but the plural of "goose" is "geese." This is because "mouse" and "goose" are irregular nouns.
There are many reasons why words may be irregular.
Some words are irregular because they were borrowed from other languages.
o For example, the plural of "analysis" is "analyses," following the borrowed Greek pattern rather than the regular English "-s" plural.
Other words are irregular because they have changed over time.
o For example, the word "man" used to be spelled "mann," but the spelling changed over time.
Irregularity makes it difficult to learn a language. However, it is important to remember that not all words are irregular. Most words in a language follow regular patterns.
Some examples of irregularity in English include:
Nouns:
o mouse - mice
o goose - geese
o woman - women
Verbs:
o sing - sang - sung
o go - went - gone
o come - came - come
Adjectives:
o good - better - best
o big - bigger - biggest
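In NLP systems, irregular forms like those above are usually handled by an exception list consulted before the regular rule. A minimal sketch (the exception table is a tiny illustrative sample):

```python
# Exception table checked before the regular "-ed" rule (tiny sample)
IRREGULAR_PAST = {"sing": "sang", "go": "went", "come": "came"}

def past_tense(verb):
    if verb in IRREGULAR_PAST:
        return IRREGULAR_PAST[verb]
    if verb.endswith("e"):
        return verb + "d"     # "love" -> "loved"
    return verb + "ed"        # "walk" -> "walked"
```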
Ambiguity
Ambiguity in language refers to the phenomenon where a word, phrase, or sentence can be interpreted in multiple ways, leading to uncertainty in its meaning.
It arises when there is more than one possible interpretation or when context is insufficient to determine the intended meaning.
There are many different types of ambiguity. Some of the most common types include:
1. Lexical Ambiguity:
Lexical ambiguity arises from words or phrases that have multiple meanings.
Example 1: the word "bank" can refer to a financial institution or the edge of a river.
Example 2: the word "bat" can refer to a flying mammal or to a piece of sports equipment.
2. Syntactic Ambiguity:
Syntactic ambiguity arises when the structure or arrangement of words in a sentence allows for multiple interpretations.
Example: consider the sentence "Visiting relatives can be a nuisance." Here, "visiting relatives" can be understood as either the subject (relatives who visit) or the object (the act of visiting relatives).
3. Semantic Ambiguity:
Semantic ambiguity occurs when a word or phrase can be understood in different ways based on its meaning or sense.
Example: the sentence "She saw a man with binoculars" can be interpreted as either "She used binoculars to see a man" or "She saw a man who was holding binoculars."
4. Pragmatic ambiguity:
This is when a sentence can be interpreted in more than one way due to the context in which it is used.
Example: the sentence "I saw John yesterday" can be interpreted to mean that the speaker saw John in person or that the speaker saw John on television.
Productivity
Productivity in language refers to the capacity to generate and understand new and meaningful linguistic expressions using existing linguistic elements.
It showcases the creative and flexible nature of human language
Key aspects of productivity in language include:
1. Word Formation:
2. Phrase and Sentence Construction
3. Figurative Language
4. Neologisms
For example: in English, the suffix "-er" can be added to verbs to create nouns denoting a person or thing associated with the action (e.g., "teacher," "baker," "singer").
How can ambiguity be resolved in NLP?
Context:
One of the most common ways to eliminate ambiguity is to use context.
For example, the word "bank" can have multiple meanings, such as "a financial institution" or "the side of a river". However, if we know that the word "bank" is being used in the context of finance, then we can narrow down the possible meanings to just one.
Part-of-speech tagging:
Part-of-speech tagging is the task of assigning a part-of-speech tag to each word in a sentence.
For example, the word "bank" can be tagged as a noun or a verb depending on its context.
Morphological analysis:
Morphological analysis is the task of determining the morphological structure of a word.
For example, the word "unhappy" can be analyzed as the negative form of the word "happy".
Machine learning:
ML techniques can be used to learn the relationships between words and their meanings; this can be used to eliminate ambiguity.
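The context-based approach can be sketched as a toy Lesk-style overlap count between a sentence's words and hand-written "clue" words for each sense. The sense inventory below is invented for illustration; real systems learn these associations from data:

```python
# Invented sense inventory: each sense of "bank" gets a set of clue words
SENSES = {
    "bank": {
        "financial institution": {"money", "loan", "deposit", "account"},
        "river edge": {"river", "water", "shore", "fishing"},
    }
}

def disambiguate(word, sentence):
    # Pick the sense whose clue words overlap the context the most
    context = {w.lower().strip(".,!?") for w in sentence.split()}
    best, best_score = None, -1
    for sense, clues in SENSES.get(word, {}).items():
        score = len(clues & context)
        if score > best_score:
            best, best_score = sense, score
    return best
```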
Morphological Models:
Morphological models are highly useful in finding the structure of words by providing systematic frameworks for analyzing and understanding word forms.
There are many possible approaches to designing and implementing morphological models.
Dictionary Lookup
Dictionary Lookup is an approach used in morphological modeling to associate word forms of a language with their corresponding linguistic descriptions.
In this approach, the associations are established by enumerating them case by case, typically in a dictionary or word list.
The main idea behind Dictionary Lookup is to use a pre-existing database or resource that contains a list of word forms along with their corresponding morphological analyses.
These word forms can include inflected forms, derivations, compounds, and other linguistic variations.
When a word form needs to be analyzed or processed, the morphological model performs a lookup operation in the dictionary to retrieve the relevant information.
The dictionary used in this approach can be implemented in various data structures such as lists, binary search trees, tries, hash tables, or other efficient lookup mechanisms.
Dictionary Lookup is relatively straightforward and efficient because the lookup operations are simple and quick.
Once the word form is found in the dictionary, the corresponding linguistic information can be obtained directly.
This approach is particularly useful for handling exceptions and irregularities in a language.
Advantages
fast and efficient method
precise morphological analysis
Limitations
cannot analyze out-of-vocabulary (OOV) words, since it offers no means of generalization
maintaining and updating the dictionary is challenging as new words or linguistic variations emerge over time
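In the simplest case, dictionary lookup reduces to a single hash-table query, as in this sketch (the mini-lexicon is illustrative; real dictionaries contain far richer entries):

```python
# Illustrative mini-lexicon mapping word forms to (lemma, POS, features)
LEXICON = {
    "cats": [("cat", "noun", "plural")],
    "ran":  [("run", "verb", "past")],
    "bank": [("bank", "noun", "singular"), ("bank", "verb", "base")],
}

def analyze(word):
    # A single hash-table lookup; unknown (OOV) words get no analysis
    return LEXICON.get(word.lower(), [])
```

Note that ambiguous forms like "bank" simply list all their analyses, while OOV words expose the approach's main limitation by returning nothing.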
Finite-State Morphology
Finite-State Morphology is an approach to morphological modeling that utilizes finite-state transducers to analyze and generate word forms in a language.
It is based on the concept of finite-state automata and extends it to handle linguistic phenomena related to morphology.
In Finite-State Morphology, word forms are represented as sequences of symbols referred to as surface strings.
The goal is to associate these surface strings with their corresponding linguistic analyses, referred to as lexical strings.
Finite-state transducers are computational devices used in this approach. They consist of a finite set of nodes, also known as states, connected by directed edges called arcs. Each arc is labeled with a pair of input and output symbols.
By traversing the transducer from an initial state to a final state along the arcs, it is possible to read the input surface string and generate the corresponding lexical string as output.
The two most popular tools supporting this approach are XFST (the Xerox Finite-State Tool) and Lextools.
We can then consider a relation R as a function mapping an input string into a set of output strings, with the type signature R : [Σ] → 2^[Σ], where [Σ] equals String.
Advantages
computationally efficient
allows for fast processing of word forms
enables quick lookup
can handle both inflectional and derivational processes of a language
suitable for modeling languages with complex morphology.
Limitations
Difficulty in capturing reduplication (the process of repeating parts of a word)
Maintenance and scalability of large hand-built transducers
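A finite-state transducer can be sketched as a list of arcs traversed recursively, collecting every lexical string the machine accepts for a surface string. This toy transducer analyzes only "cat"/"cats"; the states, arcs, and tag names are invented for illustration:

```python
# Arcs: (state, input symbol, output string, next state); "" = epsilon input
ARCS = [
    (0, "c", "c", 1), (1, "a", "a", 2), (2, "t", "t", 3),
    (3, "", "+N+Sg", 9),   # end of "cat": emit singular-noun tag
    (3, "s", "+N+Pl", 9),  # read the plural "-s": emit plural-noun tag
]
FINAL = {9}

def transduce(surface, state=0, out=""):
    # Depth-first traversal collecting every accepted lexical string
    results = []
    if not surface and state in FINAL:
        results.append(out)
    for s, inp, o, nxt in ARCS:
        if s != state:
            continue
        if inp == "":                        # epsilon arc: consume no input
            results += transduce(surface, nxt, out + o)
        elif surface and surface[0] == inp:  # consume one matching symbol
            results += transduce(surface[1:], nxt, out + o)
    return results
```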
Unification-Based Morphology
Unification-Based Morphology is an approach to morphological modeling that utilizes the principles of unification and feature structures to represent and analyze the structure of words in a language.
It is based on techniques derived from logic programming and computational linguistics.

In Unification-Based Morphology, word forms are treated as structured objects consisting of features and their associated values.
Features represent grammatical or morphological properties, such as tense, number, gender, or case, while values represent the specific values of those features (for example, "past" for tense or "plural" for number).
The main idea behind Unification-Based Morphology is the concept of unification, which is a process of merging or combining feature structures to resolve conflicts and generate a unified structure.
Unification allows representing linguistic information in a more structured and compositional manner.
Nodes in the feature structure represent attributes, and their values can be atomic (such as "singular") or complex (nested feature structures).
Unification can succeed when the two feature structures are compatible and can be merged without conflicts. However, it can also fail when there are conflicting attribute-value pairs.
Morphological parsing P thus associates a linear form φ with a set of alternative structured analyses ψ: P : φ → {ψ₁, ψ₂, …}.
Advantages
Ability to handle exceptions and irregularities in word forms
Implemented for various languages, including Russian, Czech, Slovene, Persian, Hebrew, Arabic, and others
Limitations
lead to increased complexity
coverage is limited
struggle with ambiguity
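The core unification operation can be sketched as a recursive merge of nested dictionaries that succeeds when the structures are compatible and fails on conflicting atomic values:

```python
def unify(fs1, fs2):
    # Merge two feature structures; return None when atomic values clash
    result = dict(fs1)
    for key, val in fs2.items():
        if key not in result:
            result[key] = val
        elif isinstance(result[key], dict) and isinstance(val, dict):
            sub = unify(result[key], val)    # recurse into nested structures
            if sub is None:
                return None
            result[key] = sub
        elif result[key] != val:
            return None                      # conflict: unification fails
    return result
```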
Functional morphology
Functional morphology utilizes principles of functional programming and type theory to define its models.
It treats morphological operations and processes as pure mathematical functions.
Linguistic elements are organized into distinct types of values and type classes.
Functional morphology is particularly useful for modeling fusional morphologies but not limited to specific types of human languages.
Linguistic notions such as paradigms, rules, exceptions, grammatical categories, parameters, lexemes, morphemes, and morphs can be represented in this approach.
Functional morphology implementations are designed as reusable programming libraries.
Functional morphology can also be used for tasks such as morphological parsing, generation, and lexicon browsing.
We can describe inflection I, derivation D, and lookup L as functions with generic type signatures: roughly, I maps a lexeme to its inflected word forms, D maps a lexeme to derived lexemes, and L maps a word form back to candidate lexemes.
Examples of functional morphology implementations include:
o the Zen toolkit for Sanskrit morphology, written in OCaml
o the Functional Morphology framework in Haskell, used to implement morphologies of languages such as Latin, Swedish, Spanish, and Urdu
Advantages
provides a high level of expressiveness
Modularity and Reusability
Integration with General-Purpose Programming Languages
enables high levels of abstraction
Limitations
Dependency on Programming Language Ecosystem
Maintenance and Extensibility
Limited Linguistic Coverage
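Although real implementations use typed functional languages such as OCaml and Haskell, the core idea, a paradigm as a pure function from a lemma to its forms, with exceptions layered as overrides, can be sketched in Python (the pluralization rule is a simplified assumption):

```python
def regular_noun(lemma):
    # A paradigm as a pure function: lemma -> {grammatical number: form}
    sibilant = lemma.endswith(("s", "x", "z", "ch", "sh"))
    return {"sg": lemma, "pl": lemma + ("es" if sibilant else "s")}

def with_exceptions(paradigm, overrides):
    # Exceptions are expressed as overrides layered on a regular paradigm
    forms = dict(paradigm)
    forms.update(overrides)
    return forms
```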
Morphology induction
Morphology induction, also known as unsupervised or data-driven morphology learning, is a computational approach that aims to automatically discover the morphological structure and patterns of a language from unannotated or minimally annotated text corpora.
It is a subfield of NLP and computational linguistics that focuses on extracting morphological units, such as morphemes or subword units, and their relationships from raw textual data.
The main goal of morphology induction is to uncover the underlying morphological rules, morpheme boundaries, and inflectional patterns.
Approaches to Morphology Induction:
Statistical Models: Common statistical models used in morphology induction are Hidden Markov models (HMMs), n-grams, and sequence alignment algorithms.
Rule-Based Models
Neural Network Models
Steps in Morphology Induction:
1. Corpus Preparation
2. Subword Segmentation
3. Clustering or Grouping
4. Pattern Extraction
5. Evaluation and Refinement
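The subword segmentation step above can be sketched by counting frequent word-final substrings as candidate suffixes. This is a deliberately naive statistic; real systems use richer models (e.g., Morfessor or byte-pair encoding):

```python
from collections import Counter

def induce_suffixes(words, max_len=4, min_count=2):
    # Count word-final substrings; frequent ones become candidate suffixes
    counts = Counter()
    for w in words:
        for i in range(1, min(max_len, len(w) - 1) + 1):
            counts[w[-i:]] += 1
    return {s for s, c in counts.items() if c >= min_count}
```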
Advantages
Data-driven
Language independence
Discovery of hidden patterns
Scalability
Limitations
Ambiguity
Lack of context
Evaluation difficulty
Computational complexity
How are morphological models useful in finding the structure of words?
Morphological models are highly useful in finding the structure of words by providing systematic frameworks for analyzing and understanding word forms.
Here are several ways in which morphological models contribute to the identification and analysis of word structures:
1. Decomposition of Word Forms:
Morphological models break complex word forms down into their smallest meaningful units (morphemes), which helps identify the root and the affixes attached to it.
For example, in the word "unhappiness," a morphological model would identify "happy" as the root, "un-" as the prefix, and "-ness" as the suffix.
2. Inflectional Analysis:
Morphological models help analyze inflectional patterns, i.e., identify how words change their forms.
For example, they relate "walks," "walked," and "walking" to the base form "walk."
3. Derivational Analysis:
Morphological models analyze the process of deriving new words from existing ones.
For example: "friend" and "friendly," or "nation" and "national." Other contributions include:
4. Identification of Irregularities
5. Language Learning and Processing
6. Part-of-speech tagging, etc.
What are the challenging issues of morphological models?
1. Morphological Ambiguity.
2. Out-of-Vocabulary Words.
3. Morphological Variation.
4. Data Sparsity.
5. Morphological Productivity.
6. Morphological Segmentation.
7. Language-Specific Challenges.
What Is NLP?
NLP stands for Natural Language Processing, which is a part of Computer Science, Human language, and Artificial Intelligence.



Humans communicate with each other using words and text. The way that humans convey information to each other is called natural language. Every day, humans share a large quantity of information with each other in various languages as speech or text.
However, computers cannot interpret this data, which is in natural language, as they communicate in 1s and 0s. The data produced is precious and can offer valuable insights. Hence, you need computers to be able to understand, emulate and respond intelligently to human speech.
NLP refers to the branch of AI that gives the machines the ability to read, understand and derive meaning from human languages.
Components of NLP
Natural Language Understanding (NLU):
NLU involves transforming human language into a machine-readable format.
It helps the machine understand and analyze human language by extracting meaning and structure from text.
Natural Language Generation (NLG):
NLG acts as a translator that converts computerized data into natural language.
It mainly involves text planning, sentence planning, and text realization.
NLU is generally harder than NLG.
Steps/Phases in NLP:
1. Lexical Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Applications of NLP:
Machine translation, sentiment analysis, chatbots and virtual assistants, speech recognition, text summarization, spam detection, and information retrieval.

Importance of NLP?
1. Better Human-Computer Interaction
2. Language Understanding and Processing
3. Automation and Efficiency
Goals of NLP?
1. Understanding and Interpreting Language.

2. Natural Language Generation.
3. Language Translation and Communication
4. Information Extraction and Retrieval.
Early Natural Language Processing (NLP) systems?
1. ELIZA (1966)
2. SHRDLU (1970)
3. MYCIN (1976)
4. PROLOG (1972)
5. LUNAR (1971)
Difference between surface and deep structure?