CS7641 FINAL EXAM QUESTIONS WITH VERIFIED
ACCURATE ANSWERS
Four optimization approaches - Answers - 1) Generate and test
2) Calculus
3) Newton's Method
4) Randomized Optimization
Hill Climbing Algorithm - Answers - Guess x∈X
Repeat the following:
Let n*=argmax_n∈N(x) f(n)
If f(n)>f(x): x=n
Else: stop
Disadvantage:
- Get stuck in local optima
Randomized Restart Hill Climbing - Answers - Same as Hill Climbing but once local
optimum reached, restart again with a different starting x
Advantage:
- Won't get stuck in local optimum
- Not much more expensive than HC (constant factor)
Disadvantage:
- May not do better than enumeration (depends on size of attraction basin around
global optimum)
Entropy - Answers - -∑p(s)log₂p(s)
Number of bits per symbol (probability of symbol X # of bits to describe that symbol)
Joint Entropy - Answers - H(x,y)=-∑p(x,y)log₂p(x,y)
Randomness contained in two variables together
Conditional Entropy - Answers - H(y|x)=-∑p(x,y)log₂p(y|x)
Randomness of one variable given the other variable
Entropy if x and y are independent - Answers - H(Y|X)=H(Y) Y doesn't get any info from
x
H(X,Y)=H(X)+H(Y) Joint entropy is sum
Mutual Information - Answers - I(x,y)=H(y)-H(y|x)=I(y,x)
,Measure of reduction of randomness of variable given some knowledge of another
variable.
Specific case of KL Divergence
Kullback-Leibler Divergence - Answers - Always non-negative
Zero when P is equal to Q
Measures distance between any two distributions
Supervised Learning - Answers - Use labeled training data to generalize labels to new
instances (function approximation)
Unsupervised Learning - Answers - Make sense out of unlabeled data (data description)
Single Linkage Clustering (SLC) Algorithm - Answers - - Consider each object a cluster
(n objects)
- Define intercluster distance as distance between closest two points in the two clusters
- Merge 2 closest clusters
- Repeat n-k times to make k clusters
Notes:
- Running time O(n²)x(~n)
k-Means Clustering Algorithm - Answers - - Pick k centers at random
- Each center claims closest points
- Recompute the centers by averaging the clustered points
- Repeat until convergence
Notes:
- Can get stuck
Expectation Maximization - Answers - Expectation - soft clustering:
E[Zᵢⱼ]=(P(x=xᵢ|µ=µⱼ)/(∑ᵏP(x=xᵢ|µ=µⱼ)
"Likelihood data element i comes from cluster j"
Maximization - calculating meas of clusters:
µⱼ=(∑E[Zᵢⱼ]xᵢ)/(∑E[Zᵢⱼ])
P(x=xᵢ|µ=µⱼ)=exp[-½σ²(xᵢ-µⱼ)²]
Properties:
- Monotonically non-decreasing likelihood
- Does not converge
- Will not diverge
- Can get stuck -> overcame with randomization
, - Works with any distribution
Clustering Properties - Answers - Richness
Scale Invariance
Consistency
Impossibility Theorem - Answers - No clustering scheme can achieve all three of:
- richness
- scale invariance
- consistency
Why Feature Selection? - Answers - 1) Knowledge Discovery - Interpretability and
insight
2) Curse of Dimensionality (number of data grows exponentially 2^n with number of
features)
Filtering versus Wrapping - Answers - Filtering - filters features before learning
algorithm
- Ignores learner
- Fast
Wrapping - search for features is wrapped around the learning algorithm
- Takes into account model bias
- Slow
Wrapping Methods of Feature Selection - Answers - - Hill Climbing
- Randomized Optimization
- Forward/Backward Selection
Strongly relevant - Answers - Removing xi degrades the Bayes Optimal Classifier
(BOC)
{Recall BOC is the classifier that takes the weighted average of all hypotheses based
on their probability of being the correct hypothesis}
Usefulness - Answers - Measures effect on particular predictor
(MINIMIZING ERROR|LEARNER)
Feature Transformation - Answers - Pre-processing a set of features to create a new
feature set, while retaining as much (relevant/useful) information as possible
Principal Component Analysis - Answers - Finds directions that maximize variance and
that are mutually orthogonal
ACCURATE ANSWERS
Four optimization approaches - Answers - 1) Generate and test
2) Calculus
3) Newton's Method
4) Randomized Optimization
Hill Climbing Algorithm - Answers - Guess x∈X
Repeat the following:
Let n*=argmax_n∈N(x) f(n)
If f(n)>f(x): x=n
Else: stop
Disadvantage:
- Get stuck in local optima
Randomized Restart Hill Climbing - Answers - Same as Hill Climbing but once local
optimum reached, restart again with a different starting x
Advantage:
- Won't get stuck in local optimum
- Not much more expensive than HC (constant factor)
Disadvantage:
- May not do better than enumeration (depends on size of attraction basin around
global optimum)
Entropy - Answers - -∑p(s)log₂p(s)
Number of bits per symbol (probability of symbol X # of bits to describe that symbol)
Joint Entropy - Answers - H(x,y)=-∑p(x,y)log₂p(x,y)
Randomness contained in two variables together
Conditional Entropy - Answers - H(y|x)=-∑p(x,y)log₂p(y|x)
Randomness of one variable given the other variable
Entropy if x and y are independent - Answers - H(Y|X)=H(Y) Y doesn't get any info from
x
H(X,Y)=H(X)+H(Y) Joint entropy is sum
Mutual Information - Answers - I(x,y)=H(y)-H(y|x)=I(y,x)
,Measure of reduction of randomness of variable given some knowledge of another
variable.
Specific case of KL Divergence
Kullback-Leibler Divergence - Answers - Always non-negative
Zero when P is equal to Q
Measures distance between any two distributions
Supervised Learning - Answers - Use labeled training data to generalize labels to new
instances (function approximation)
Unsupervised Learning - Answers - Make sense out of unlabeled data (data description)
Single Linkage Clustering (SLC) Algorithm - Answers - - Consider each object a cluster
(n objects)
- Define intercluster distance as distance between closest two points in the two clusters
- Merge 2 closest clusters
- Repeat n-k times to make k clusters
Notes:
- Running time O(n²)x(~n)
k-Means Clustering Algorithm - Answers - - Pick k centers at random
- Each center claims closest points
- Recompute the centers by averaging the clustered points
- Repeat until convergence
Notes:
- Can get stuck
Expectation Maximization - Answers - Expectation - soft clustering:
E[Zᵢⱼ]=(P(x=xᵢ|µ=µⱼ)/(∑ᵏP(x=xᵢ|µ=µⱼ)
"Likelihood data element i comes from cluster j"
Maximization - calculating meas of clusters:
µⱼ=(∑E[Zᵢⱼ]xᵢ)/(∑E[Zᵢⱼ])
P(x=xᵢ|µ=µⱼ)=exp[-½σ²(xᵢ-µⱼ)²]
Properties:
- Monotonically non-decreasing likelihood
- Does not converge
- Will not diverge
- Can get stuck -> overcame with randomization
, - Works with any distribution
Clustering Properties - Answers - Richness
Scale Invariance
Consistency
Impossibility Theorem - Answers - No clustering scheme can achieve all three of:
- richness
- scale invariance
- consistency
Why Feature Selection? - Answers - 1) Knowledge Discovery - Interpretability and
insight
2) Curse of Dimensionality (number of data grows exponentially 2^n with number of
features)
Filtering versus Wrapping - Answers - Filtering - filters features before learning
algorithm
- Ignores learner
- Fast
Wrapping - search for features is wrapped around the learning algorithm
- Takes into account model bias
- Slow
Wrapping Methods of Feature Selection - Answers - - Hill Climbing
- Randomized Optimization
- Forward/Backward Selection
Strongly relevant - Answers - Removing xi degrades the Bayes Optimal Classifier
(BOC)
{Recall BOC is the classifier that takes the weighted average of all hypotheses based
on their probability of being the correct hypothesis}
Usefulness - Answers - Measures effect on particular predictor
(MINIMIZING ERROR|LEARNER)
Feature Transformation - Answers - Pre-processing a set of features to create a new
feature set, while retaining as much (relevant/useful) information as possible
Principal Component Analysis - Answers - Finds directions that maximize variance and
that are mutually orthogonal