CS 180 – Artificial Intelligence Review Notes

PART 1: Concept Learning

What is MACHINE LEARNING?
- Any computer program that improves its performance at some task through experience
- A computer program is said to learn from experience E wrt some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves w/ experience E.

Well-posed Learning Problems
3 features:
1. Class of tasks
2. Measure of performance to be improved
3. Source of experience

Choosing the training EXPERIENCE
1. Direct/Indirect Feedback
2. Degree of control over sequence of training examples
3. How well it represents the distribution of examples – Independent and Identically Distributed (IID)

CONCEPT LEARNING
- Formal definition: Inferring a boolean-valued function from training examples of its input and output
- Acquiring the definition of a general category given a sample of positive and negative training examples of the category

Hypothesis Representation:
- “?” – any value is acceptable
- Specific value – only that value is acceptable
- “0” – no value is acceptable
- <?, ?, ?, ?> Most general hypothesis
- <0, 0, 0, 0> Most specific hypothesis

- Inductive Learning Hypothesis
  - Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over unobserved examples
- A hypothesis h is CONSISTENT with the set of training examples D if h(x) = c(x) for every training example <x, c(x)> in D.
  - Note: an instance x SATISFIES h if h(x) = 1 (the answer is positive); consistency also requires agreement w/ c(x)
- VERSION SPACE – subset of hypotheses from H consistent w/ training examples in D
- CANDIDATE-ELIMINATION ALGORITHM
  - General Boundary
    - starts w/ <?, ?, ?> and gets more specific
    - changes (specializes) w/ negative examples
  - Specific Boundary
    - starts w/ <0, 0, 0> and gets more general
    - changes (generalizes) w/ positive examples
- Inductive Bias
  - Set of assumptions that, together w/ the training data, deductively justify the classifications assigned by the learner to future instances
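The consistency test and the version space can be illustrated by brute force: list every hypothesis and keep those consistent w/ D. This is a sketch of the list-then-eliminate idea, not the Candidate-Elimination algorithm itself. The attribute domains and examples are made up, and the all-“0” hypothesis is left out for brevity.

```python
from itertools import product

ANY = "?"

def predict(h, x):
    """h(x): True if every constraint is '?' or equals the attribute value."""
    return all(c == ANY or c == v for c, v in zip(h, x))

def consistent(h, examples):
    """h is consistent with D if h(x) == c(x) for every <x, c(x)> in D."""
    return all(predict(h, x) == label for x, label in examples)

def version_space(domains, examples):
    """Enumerate all conjunctive hypotheses and keep the consistent ones."""
    slots = [list(values) + [ANY] for values in domains]
    return [h for h in product(*slots) if consistent(h, examples)]

# Illustrative (assumed) attribute domains and training examples.
domains = [("Sunny", "Rainy"), ("Warm", "Cold")]
D = [(("Sunny", "Warm"), True), (("Rainy", "Cold"), False)]

vs = version_space(domains, D)
print(vs)
```

In this toy case the surviving hypotheses are <Sunny, Warm> (the specific boundary) and <Sunny, ?>, <?, Warm> (the general boundary). Candidate-Elimination tracks only these boundaries instead of enumerating H.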

PART 2: Decision Tree Learning
- Method for approximating discrete-valued target functions, in w/c the learned function is represented by a decision tree
- Decision Trees
  - Classify instances by sorting them down from the root node to some leaf node, w/c provides the classification of the instance
  - Represent a disjunction (OR) of conjunctions (AND) of constraints on the attribute values of the instances

Appropriate Problems for Decision Tree Learning
1. Instances are represented by attribute-value pairs
2. The target function has discrete output values
3. Disjunctive descriptions may be required
4. The training data may contain errors
5. The training data may contain missing attribute values

The Basic ID3 Algorithm
- Answers the question: Which attribute is the best classifier?

Entropy
- Characterizes the purity/impurity of an arbitrary collection of samples (for two classes: 0 – all yes or all no, 1 – evenly distributed)
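The entropy measure can be sketched using the standard definition Entropy(S) = − Σ p_i log2(p_i), where p_i is the proportion of samples in class i. The function name and the yes/no labels are illustrative assumptions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    total = len(labels)
    counts = Counter(labels)
    # Sum of -p * log2(p) over the classes that actually occur.
    return sum(-(n / total) * log2(n / total) for n in counts.values())

pure = ["yes", "yes", "yes", "yes"]      # all one class
mixed = ["yes", "yes", "no", "no"]       # evenly distributed
skewed = ["yes", "yes", "yes", "no"]     # somewhere in between

print(entropy(pure), entropy(mixed), entropy(skewed))
```

The pure collection scores 0 and the evenly split one scores 1, matching the two extremes noted above.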

Information Gain
- Measures the expected reduction in entropy caused by partitioning the samples on an attribute
- Measures the effectiveness of an attribute in classifying the training data
- The attribute w/ the highest gain becomes the root
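A sketch of how the root attribute could be picked, using the standard definition Gain(S, A) = Entropy(S) − Σ_v (|S_v|/|S|) Entropy(S_v). The toy data set and the attribute names (outlook, windy) are assumptions made up for illustration.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Expected reduction in entropy from splitting on one attribute."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attribute]].append(label)   # partition labels by attribute value
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical training data: outlook decides the class, windy is irrelevant.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["yes", "yes", "no", "no"]

best = max(["outlook", "windy"], key=lambda a: information_gain(rows, labels, a))
print(best)
```

ID3 repeats this choice recursively on each branch, using only the samples that reach that branch.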

Inductive Bias: Shorter trees are preferred over longer trees

PART 3: Neural Networks
- Massively parallel distributed processor that has a natural propensity for storing experiential knowledge and making it available for use
- Resembles the brain in 2 respects:
  1. Knowledge is acquired through a learning process
  2. Interneuron connection strengths (synaptic weights) are used to store the knowledge

Model of a Neuron:
1. A set of SYNAPSES, each characterized by a weight
2. An ADDER, summing the input signals weighted by the respective synapses of the neuron
3. An activation function/SQUASHING FUNCTION for limiting the amplitude of the output of the neuron

Types of SQUASHING FUNCTION:
1. Threshold Function

2. Piecewise-Linear Function

3. Sigmoid Function
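The neuron model and the three squashing functions can be sketched as below. The exact breakpoints of the piecewise-linear function, the slope parameter `a`, and the bias term are common conventions assumed here; the notes do not specify them.

```python
import math

def threshold(v):
    """Threshold function: output jumps from 0 to 1 at v = 0."""
    return 1.0 if v >= 0 else 0.0

def piecewise_linear(v):
    """Linear on [-0.5, 0.5], saturating at 0 and 1 (assumed convention)."""
    return min(1.0, max(0.0, v + 0.5))

def sigmoid(v, a=1.0):
    """Logistic sigmoid with slope parameter a (assumed)."""
    return 1.0 / (1.0 + math.exp(-a * v))

def neuron_output(inputs, weights, bias, squash):
    """ADDER followed by a squashing function; bias is an assumed extra term."""
    v = sum(w * x for w, x in zip(weights, inputs)) + bias
    return squash(v)

for f in (threshold, piecewise_linear, sigmoid):
    print(f.__name__, neuron_output([1.0, 1.0], [0.5, -0.5], 0.0, f))
```

All three limit the output to the range [0, 1]. Only the sigmoid is differentiable everywhere, w/c matters for gradient-based training.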

Network Architectures
1. Single-layer feedforward networks
2. Multi-layer feedforward networks (w/ hidden layers)
3. Recurrent networks
4. Lattice structures

Output Encoding
1. Single output neuron – divide the range of output according to the # of classes
2. One output neuron for each class (n output neurons)
3. log2(n) output neurons, rounded up (binary code of the class)

Overfitting
- Small error in training set but large error in other sets
- Use validation set as a stopping criterion for training

Supervised vs. Unsupervised Learning
Supervised Learning
- Machine learning task used to infer a function from labeled (supervised) training data
- e.g. Backpropagation algorithm and Support Vector Machine (SVM)
Unsupervised Learning
- Seeks to determine how data is organized
- e.g. Self-Organizing Maps (SOM), Adaptive Resonance Theory (ART)

SOM
- Unsupervised neural network
- Used to produce a discretized representation (map) of the input space of the training samples

Back Propagation Algorithm
- Learns the weights for a multilayer feedforward network, given a network w/ fixed set of units & interconnections
- Employs gradient descent to attempt to minimize the squared error between the network output & the target output
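Going back to the three output encodings listed under Output Encoding, a small sketch of what the target vectors could look like for class index i out of n classes. The function names and the choice of sub-range midpoints for the single-neuron case are assumptions.

```python
import math

def single_output(index, n_classes):
    """One neuron: midpoint of the index-th of n equal sub-ranges of [0, 1]."""
    return (index + 0.5) / n_classes

def one_per_class(index, n_classes):
    """n neurons: only the neuron of the target class is on."""
    return [1 if i == index else 0 for i in range(n_classes)]

def binary_code(index, n_classes):
    """ceil(log2 n) neurons: binary representation of the class index."""
    width = max(1, math.ceil(math.log2(n_classes)))
    return [(index >> bit) & 1 for bit in reversed(range(width))]

print(single_output(2, 4), one_per_class(2, 4), binary_code(2, 4))
```

The one-neuron-per-class encoding needs the most output units, but the class can be read off directly as the most active neuron.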

The GRADIENT DESCENT RULE
- Used to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples.
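A minimal sketch of the gradient descent rule for a single linear unit, assuming the usual squared error E(w) = 1/2 Σ_d (t_d − o_d)^2 and the batch update Δw_i = η Σ_d (t_d − o_d) x_id. The learning rate, epoch count, and toy data are assumptions chosen so the example converges.

```python
def linear_output(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def squared_error(w, examples):
    """E(w) = 1/2 * sum over examples of (t - o)^2."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in examples)

def gradient_descent(examples, n_weights, eta=0.05, epochs=500):
    w = [0.0] * n_weights
    for _ in range(epochs):
        delta = [0.0] * n_weights
        for x, t in examples:
            o = linear_output(w, x)
            for i, xi in enumerate(x):
                delta[i] += eta * (t - o) * xi    # accumulate over all examples
        w = [wi + di for wi, di in zip(w, delta)]  # one batch update per epoch
    return w

# Toy target t = 1 + 2*x1; x0 = 1.0 is a constant input acting as the bias.
examples = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0)]
w = gradient_descent(examples, n_weights=2)
print(w, squared_error(w, examples))
```

If η is too large the updates overshoot and the error grows instead of shrinking, so the learning rate usually has to be tuned per problem.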

1. Propagate the input FORWARD through the network
2. Propagate the output errors BACKWARD through the network
   a. For each output unit k, calculate its error term δk
   b. For each hidden unit h, calculate its error term δh
   c. Update each network weight wij
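The formulas for these steps are not shown in the notes. The sketch below assumes the usual sigmoid-unit forms: δk = ok(1 − ok)(tk − ok), δh = oh(1 − oh) Σk wkh δk, and wij ← wij + η δj xij. The network size, learning rate η, and random initial weights are also assumptions.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w_hidden, w_out):
    """Step 1: forward pass. The last weight in each row is a bias weight."""
    h = [sigmoid(sum(w * v for w, v in zip(row, x + [1.0]))) for row in w_hidden]
    o = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in w_out]
    return h, o

def backprop_step(x, t, w_hidden, w_out, eta=0.5):
    """One forward/backward pass on a single example; returns the error before the update."""
    h, o = forward(x, w_hidden, w_out)
    # 2a. error term of each output unit k
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # 2b. error term of each hidden unit (computed with the not-yet-updated weights)
    delta_h = [hj * (1 - hj) * sum(w_out[k][j] * delta_o[k] for k in range(len(o)))
               for j, hj in enumerate(h)]
    # 2c. update every weight: w <- w + eta * delta * input
    for k, row in enumerate(w_out):
        for j, v in enumerate(h + [1.0]):
            row[j] += eta * delta_o[k] * v
    for j, row in enumerate(w_hidden):
        for i, v in enumerate(x + [1.0]):
            row[i] += eta * delta_h[j] * v
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, o))

random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]  # 2 inputs + bias
w_out = [[random.uniform(-0.5, 0.5) for _ in range(3)]]                       # 2 hidden + bias
errors = [backprop_step([1.0, 0.0], [1.0], w_hidden, w_out) for _ in range(200)]
print(errors[0], errors[-1])
```

Training on one example only shows the mechanics. A real run loops over the whole training set and stops based on a validation set.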

PART 4: Support Vector Machines (SVM)
- Supervised learning technique that uses an optimal separating hyperplane to distinguish data in two categories (binary classifier by default)
- Closely related to neural networks

Why SVM?
1. Optimal (separation of data)
2. Theoretically well-founded
3. Easily extendable to a multi-class classifier
4. Has been successfully applied in many fields

Shattering of Instances
- A set of instances S is shattered by a hypothesis space H iff for every dichotomy (partition into two subsets) of S, there exists some hypothesis in H consistent w/ this dichotomy.
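Shattering can be checked by brute force on a small example. The sketch below assumes a simple hypothesis space not taken from the notes: closed intervals [a, b] on the real line, where points inside the interval are labeled positive. Restricting the endpoints to the given points (plus the empty interval) is enough to cover every labeling an interval can produce on those points.

```python
from itertools import product

def interval_hypotheses(points):
    """Finite stand-in for H: intervals with endpoints among the points, plus the empty one."""
    pts = sorted(points)
    hyps = [lambda x: False]                              # empty interval
    for i, a in enumerate(pts):
        for b in pts[i:]:
            hyps.append(lambda x, a=a, b=b: a <= x <= b)  # bind a, b per hypothesis
    return hyps

def is_shattered(points, hypotheses):
    """True iff every dichotomy of the points is realized by some hypothesis."""
    for labels in product([False, True], repeat=len(points)):
        realized = any(all(h(x) == y for x, y in zip(points, labels))
                       for h in hypotheses)
        if not realized:
            return False
    return True

two = [1.0, 2.0]
three = [1.0, 2.0, 3.0]
print(is_shattered(two, interval_hypotheses(two)))
print(is_shattered(three, interval_hypotheses(three)))
```

Two points can be shattered by intervals, but three cannot: no single interval labels the outer two points positive and the middle one negative.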

Vapnik-Chervonenkis Dimension (VC Dimension)
- Measures the complexity (capacity to discriminate) of a hypothesis space H
- Measures the complexity of the hypothesis space by the # of distinct instances in X that can be shattered
- Formal definition: size of the largest finite subset of an instance space X shattered by a hypothesis space H

Structural Risk Minimization
- Order of hypotheses based on the VC Dimension
- VC Confidence – grows w/ the VC Dimension relative to the # of training samples (roughly VC Dimension/# of training samples)
- The idea is to find a machine that has the minimum sum of the empirical error (error in the training set) and the VC Confidence

Kernel Trick
- Used when data is not linearly separable
- Separation may be easier in higher dimensions
- Different types of kernel:
  1. Radial Basis Function
  2. Polynomial Function
  3. Sigmoid Function

Extending SVM to a Multi-class Classifier
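Going back to the three kernel types listed under Kernel Trick, a sketch of their common textbook forms. The parameter names (`gamma`, `degree`, `coef0`) and default values are assumptions. The last part illustrates the trick itself: a degree-2 polynomial kernel w/ coef0 = 0 gives the same value as a dot product in an explicitly mapped higher-dimensional space, w/o ever computing the mapping.

```python
import math

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=1.0):
    """Radial basis function: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """Polynomial: (x . y + coef0)^degree."""
    return (dot(x, y) + coef0) ** degree

def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    """Sigmoid: tanh(gamma * x . y + coef0)."""
    return math.tanh(gamma * dot(x, y) + coef0)

def phi(x):
    """Explicit feature map for the 2-D, degree-2, coef0 = 0 polynomial kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, y = (1.0, 2.0), (3.0, 4.0)
print(polynomial_kernel(x, y, degree=2, coef0=0.0), dot(phi(x), phi(y)))
```

Both printed values agree (up to rounding), w/c is the point of the kernel trick: the SVM only ever needs kernel values, never the mapped vectors.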

CREDITS:
Notes by Camille Salazar
Scanned by Rovie Doculan
Encoded by Emir Mercado