Learning has 3 components:
1) Representation - the space of allowed models
• Linear regression / Decision trees / Sets of rules / Instances / Graphical models / Neural networks ...
2) Evaluation - how to judge one model vs. another
• Accuracy / Precision & recall / Mean squared error / Likelihood
3) Optimization - a method to search among the models for the highest-scoring one
• Combinatorial optimization / Convex optimization / Constrained / Nonconvex
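The three components can be seen together in a very small sketch. Everything in it is an assumption made for illustration: the toy data, the choice of a one-feature threshold rule as the representation, accuracy as the evaluation, and exhaustive search as the optimization.

```python
# Toy data (hypothetical): (feature value, class label) pairs.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 1),
        (2.5, 0), (3.0, 1), (3.5, 1), (4.0, 1)]

# 1) Representation: the space of allowed models is
#    "predict class 1 if x > threshold", one model per threshold.
def predict(threshold, x):
    return 1 if x > threshold else 0

# 2) Evaluation: judge a model by its accuracy on the examples.
def accuracy(threshold, examples):
    correct = sum(1 for x, y in examples if predict(threshold, x) == y)
    return correct / len(examples)

# 3) Optimization: exhaustive search over candidate thresholds
#    for the highest-scoring model (ties go to the smallest threshold).
def fit(examples):
    candidates = sorted(x for x, _ in examples)
    return max(candidates, key=lambda t: accuracy(t, examples))

best_threshold = fit(data)
best_accuracy = accuracy(best_threshold, data)
print(best_threshold, best_accuracy)
```

Swapping any one component (for example, a decision tree as the representation, or mean squared error as the evaluation) gives a different learner while the other two stay in place.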
Supervised learning
• Correct output known for each training example
• Learn to predict output when given an input vector
• The most used type, with a wide area of academic and industrial applications
Methods:
• Support Vector Machine
• Artificial Neural Networks
• Decision Trees ...
Examples of specific tasks:
• Classification: discrete output
• Regression: real-valued output
Learns from data:
• Training data include desired outputs
Unsupervised learning
• Correct output is not known
• Create an internal representation of the input, capturing regularities/structure in data
Methods:
• K-means clustering
• Restricted Boltzmann Machine
Examples:
• Discover clusters
• Discover factors/structures
Learns from data:
• Training data does not include desired outputs
Reinforcement learning
• Learn action to maximize payoff
• Learning is based on new and ...
• Relation with game theory, control ...
Methods:
• Q-learning
• SARSA
Examples:
• Decision making process
• Control
• Strategic optimization
On-line learning:
• Learns from collected data by exploring the agent environment
Designing a learning system
[Figure: block diagram with the elements Environment / Experience, Training data, Learner, Knowledge, Testing data, and Performance Element]
Underfitting
The model is too simple to capture the underlying patterns in the data. It fails to learn from the training data adequately,
resulting in poor performance on both the training and testing datasets.
Causes:
• An overly simplistic model
• Insufficient training
• Lack of relevant features
Consequences: the model has a high bias, leading to inaccurate predictions. It may exhibit a high error rate even on the
training set.
Overfitting
The model learns the noise and random fluctuations in the training data instead of the actual underlying patterns. This results in
a model that performs exceptionally well on the training set but poorly on unseen data.
Causes:
• A model that is too complex
• Excessive training
• Insufficient data to train the model adequately
Consequences: the model has a high variance, meaning it is sensitive to small changes in the training data. It may
generalize poorly to new data.
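A minimal sketch of both failure modes, under assumptions that are not part of these notes: synthetic data drawn from a noisy sine curve, and k-nearest-neighbour regression as the model family. With k = 1 the model memorizes the training noise (overfitting); with k equal to the training set size it always predicts the global mean (underfitting); an intermediate k sits in between.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic data is reproducible

def make_data(xs, noise=0.3):
    # Hypothetical ground truth: y = sin(2*pi*x) plus Gaussian noise.
    return [(x, math.sin(2 * math.pi * x) + random.gauss(0, noise)) for x in xs]

n = 40
train = make_data([i / (n - 1) for i in range(n)])
test = make_data([(i + 0.5) / n for i in range(n)])  # unseen inputs

def knn_predict(x, k):
    # Average the targets of the k training points closest to x.
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(examples, k):
    return sum((knn_predict(x, k) - y) ** 2 for x, y in examples) / len(examples)

# k=1: too complex (high variance); k=n: too simple (high bias).
results = {k: (mse(train, k), mse(test, k)) for k in (1, 5, n)}
for k, (train_err, test_err) in results.items():
    print(f"k={k:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

The pattern to look for is a training error of zero with a clearly larger test error at k = 1, and a high error on both sets at k = n.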
Lesson 2
Data preparation for ML
Types of data attributes:
• Nominal -> ID numbers, eye color, zip codes
• Ordinal -> rankings, grades, height in {tall, medium, short}
• Interval -> calendar dates
• Ratio -> length, time, mass
Discrete attributes:
• Have only a finite or countably infinite set of values
• Zip codes, counts, or the set of words in a collection of documents
• Often represented as integer variables
• Binary attributes are a special case of discrete attributes
Continuous attributes:
• Have real numbers as attribute values
• Temperature, height, weight
• Represented as floating-point variables
Types of data sets
• Record: data matrix, document data, transaction data
• Graph: World Wide Web, molecular structures
• Ordered: spatial data, temporal data, sequential data, genetic sequence data
Characteristics of data
• Dimensionality: high-dimensional data brings a number of challenges
• Sparsity: only presence counts
• Resolution: patterns depend on the scale
• Size: type of analysis may depend on size of data
Data Preprocessing
• Aggregation: combining two or more attributes (or objects) into a single attribute
• Sampling: main technique employed for data reduction
• Feature extraction: transforms the data in the high-dimensional space to a space of fewer dimensions
• Feature subset selection: tries to find a representative subset of the original variables
• Feature creation: create new attributes that can capture the important information in a data set much more efficiently
than the original attributes
• Discretization and Binarization: discretization converts a continuous attribute into an ordinal attribute; binarization
converts an attribute into one or more binary attributes
• Attribute transformation: a function that maps the entire set of values of a given attribute to a new set of replacement
values such that each old value can be identified with one of the new values
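As an illustration of discretization and binarization from the list above, here is a minimal sketch. The equal-width binning rule, the one-hot encoding, the function names, and the temperature values are all assumptions chosen for the example; other binning schemes (equal-frequency, clustering-based) are equally valid.

```python
def discretize_equal_width(values, n_bins):
    # Map each continuous value to an ordinal bin index 0..n_bins-1.
    # Assumes the values are not all identical (otherwise width is 0).
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    bins = []
    for v in values:
        index = int((v - lo) / width)
        bins.append(min(index, n_bins - 1))  # the maximum goes in the last bin
    return bins

def binarize(categories, n_categories):
    # One binary attribute per category (one-hot encoding).
    return [[1 if c == j else 0 for j in range(n_categories)]
            for c in categories]

temperatures = [12.0, 15.5, 18.0, 21.0, 24.5, 30.0]  # hypothetical readings
bins = discretize_equal_width(temperatures, 3)
one_hot = binarize(bins, 3)
print(bins)
print(one_hot)
```

Here the continuous temperatures become an ordinal attribute with three levels, and each level then becomes its own binary attribute.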
Noise
Random errors or variations in the data that do not reflect the true underlying patterns. It can arise from various sources,
such as measurement errors, data entry errors, or inconsistencies in data collection processes.
Impact: it obscures the true relationship between the input features and the target variable.
Outliers
Data points that differ significantly from the majority of the dataset. They can occur due to variability in the measurement,
data entry errors, or they may represent significant anomalies.
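One common way to flag candidate outliers is the interquartile-range (IQR) rule. These notes do not prescribe a detection method, so the rule, the conventional factor of 1.5, and the sample readings below are assumptions for illustration only.

```python
import statistics

def iqr_outliers(values, factor=1.5):
    # Flag points lying more than factor * IQR outside the quartiles.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < low or v > high]

readings = [10, 12, 11, 13, 12, 11, 10, 95]  # hypothetical measurements
outliers = iqr_outliers(readings)
print(outliers)
```

A flagged point still needs inspection: it may be a data entry error to correct, or a genuine anomaly worth keeping.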
Missing values
Occur when data for a particular observation is not available. They can result from various factors, including data collection
errors, participant non-responses, or system failures
Types of missing values:
1. Missing Completely at Random (MCAR)
• Missingness of a value is independent of the attributes
• Fill in values based on the attribute
• Analysis may be unbiased overall
2. Missing at Random (MAR)
• Missingness is related to other variables
• Fill in values based on other values
• Almost always produces a bias in the analysis
3. Missing Not at Random (MNAR)
• Missingness is related to unobserved measurements
• Informative or non-ignorable missingness
It is not possible to know which situation applies from the data alone.
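The two fill-in strategies mentioned for MCAR and MAR can be sketched as follows. The records, the attribute names `group` and `income`, and the use of the mean as the fill value are hypothetical; the group-based version also assumes every group has at least one observed value.

```python
records = [  # hypothetical data; None marks a missing value
    {"group": "A", "income": 30.0},
    {"group": "A", "income": None},
    {"group": "A", "income": 34.0},
    {"group": "B", "income": 60.0},
    {"group": "B", "income": None},
    {"group": "B", "income": 64.0},
]

def mean(values):
    return sum(values) / len(values)

def impute_attribute_mean(rows, attr):
    # Fill based on the attribute itself: mean of its observed values.
    fill = mean([r[attr] for r in rows if r[attr] is not None])
    return [dict(r, **{attr: fill if r[attr] is None else r[attr]})
            for r in rows]

def impute_group_mean(rows, attr, by):
    # Fill based on other values: mean of the attribute within
    # the rows sharing the same value of another variable.
    observed = {}
    for r in rows:
        if r[attr] is not None:
            observed.setdefault(r[by], []).append(r[attr])
    fills = {g: mean(v) for g, v in observed.items()}
    return [dict(r, **{attr: fills[r[by]] if r[attr] is None else r[attr]})
            for r in rows]

by_attribute = impute_attribute_mean(records, "income")
by_group = impute_group_mean(records, "income", "group")
print(by_attribute[1]["income"], by_group[1]["income"], by_group[4]["income"])
```

Neither strategy helps with MNAR, where the missingness depends on the unobserved values themselves.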
Imbalanced data
The number of objects in some classes is much smaller than the number of objects in the other classes.
Possible approaches: resampling, collecting more data, choosing the right evaluation metrics and the right models
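A short sketch of why the evaluation metric matters on imbalanced data, followed by one resampling approach. The 95/5 class split, the "always predict the majority class" baseline, and random oversampling with replacement are assumptions picked for illustration.

```python
import random

labels = [0] * 95 + [1] * 5          # hypothetical 95/5 class split
predictions = [0] * len(labels)      # baseline: always predict the majority

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(y == 1 for y in labels)
print(accuracy, recall)  # high accuracy, yet no minority object is found

def random_oversample(rows, seed=0):
    # Duplicate randomly chosen minority rows until every class
    # is as large as the largest class.
    rng = random.Random(seed)
    by_class = {}
    for features, label in rows:
        by_class.setdefault(label, []).append((features, label))
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

rows = [([float(i)], y) for i, y in enumerate(labels)]
balanced = random_oversample(rows)
counts = {c: sum(1 for _, y in balanced if y == c) for c in (0, 1)}
print(counts)
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the test data.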
Similarity and Dissimilarity Measures
Similarity measures
• numerical measure of how alike two data objects are
• Is higher when objects are more alike
Dissimilarity measures
• numerical measure of how different two data objects are
• Lower when objects are more alike
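The two kinds of measure can be sketched with standard examples: Euclidean distance as a dissimilarity, the transform 1 / (1 + d) as one way to turn a distance into a similarity, and cosine similarity. These particular measures and the sample points are assumptions for illustration; the notes above do not single out specific formulas.

```python
import math

def euclidean_distance(a, b):
    # Dissimilarity: 0 for identical objects, larger when they differ more.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_to_similarity(d):
    # One common transform: 1 when d == 0, approaching 0 as d grows.
    return 1.0 / (1.0 + d)

def cosine_similarity(a, b):
    # Similarity based on the angle between two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q, r = (1.0, 2.0), (1.0, 2.0), (4.0, 6.0)  # hypothetical data objects
print(euclidean_distance(p, q), euclidean_distance(p, r))
print(distance_to_similarity(euclidean_distance(p, r)))
print(cosine_similarity(p, r))
```

Identical objects `p` and `q` get the lowest dissimilarity and the highest similarity, matching the two definitions.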