1. Unaltered data: collected from various sources.
2. Understand the characteristics: before modifying the raw data for final use.
Causes of data issues
1. Transmission errors from devices.
2. Human errors in data submission.
3. Presence of outliers: extreme values in the data that can be removed to
enhance its usefulness.
1. Cannot be used as is
2. Errors and noise need to be filtered out
3. Complexity and non-linearity should be identified
1. Deals with the transformation of the raw data to make it suitable for building a model.
2. Aims at discovering how well data can be presented for a given machine learning
method and task.
3. The underlying structure of the problem is understood in order to select the
appropriate machine learning method.
1. Imputation methods that help to deal with missing values.
2. Detection and removal of outliers
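
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the notes do not name a specific imputation method or outlier rule, so median imputation and the common 1.5 x IQR rule are used here, and the function names and sample readings are hypothetical.

```python
import statistics

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def remove_outliers_iqr(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical sensor readings: one missing value and one extreme value.
readings = [10.0, 12.0, None, 11.0, 13.0, 95.0, 12.0]
filled = impute_median(readings)       # None becomes the median, 12.0
cleaned = remove_outliers_iqr(filled)  # 95.0 is dropped as an outlier
```

The median is used rather than the mean because the mean would itself be pulled upward by the extreme reading before it is removed.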
1. Includes aggregation functions such as mean, mode, standard deviation, sum, etc.
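
As a small illustration, the aggregation functions listed above are available in Python's standard `statistics` module. The sample values are made up, and the sample (not population) standard deviation is assumed.

```python
import statistics

# Hypothetical column of values to summarise.
values = [4, 8, 8, 5, 10]

summary = {
    "mean": statistics.mean(values),    # arithmetic average
    "mode": statistics.mode(values),    # most frequent value
    "stdev": statistics.stdev(values),  # sample standard deviation
    "sum": sum(values),                 # total
}
```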
1. Finding the correlation among variables.
2. Selecting the appropriate features or variables that will be suitable without
complicating the modelling process.
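
One possible way to combine the two points above is to compute the Pearson correlation of each candidate feature with the target and keep only the strongly correlated ones. This is a sketch, not a prescribed method: the feature names, the data, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical housing data: two informative features and one irrelevant one.
features = {
    "area": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 5],
    "door_number": [7, 3, 9, 1, 5],
}
price = [150, 180, 240, 300, 360]

# Correlation of every feature with the target.
scores = {name: pearson(column, price) for name, column in features.items()}

# Keep features whose absolute correlation reaches the assumed threshold.
selected = [name for name, r in scores.items() if abs(r) >= 0.8]
```

Here `area` and `rooms` are kept, while `door_number` is discarded because it is only weakly correlated with the price.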
1. Makes the data ready for model building
2. Scales data to represent it in a way that the model will accept.
3. Encodes the given data to suit the model's context so that it can read and process the data.
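
Scaling can take several forms. The sketch below assumes min-max scaling to the range [0, 1], which is one common choice; the notes do not say which scaling method is meant, and the function name and data are hypothetical.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0.0 and the maximum to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no range to scale over.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical feature column.
ages = [18, 30, 42, 66]
scaled = min_max_scale(ages)
```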
1. Plays an important role in data-driven modelling
2. Selects training data from the data population
3. Helps in testing and validating data.
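
A minimal sketch of drawing training, validation, and test sets from a data population by random sampling. The 60/20/20 proportions, the fixed seed, and the function name are assumptions made for the example.

```python
import random

def split_dataset(rows, train=0.6, val=0.2, seed=0):
    """Shuffle a copy of the rows and cut it into train, validation, and test parts."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    train_part = shuffled[:n_train]
    val_part = shuffled[n_train:n_train + n_val]
    test_part = shuffled[n_train + n_val:]  # whatever remains
    return train_part, val_part, test_part

# Hypothetical population of ten records, identified by index.
population = list(range(10))
train_set, val_set, test_set = split_dataset(population)
```

Every record lands in exactly one of the three sets, so the model is never tested or validated on data it was trained on.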
Examples: Neural networks
1. Numerical values are always accepted.
2. Non-numerical values need to be converted or transformed into numerical values.
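
One common way to convert non-numerical values into numerical ones is one-hot encoding, sketched below. The notes do not specify an encoding scheme, so this choice, the function name, and the sample labels are assumptions.

```python
def one_hot_encode(labels):
    """Map each distinct label to a 0/1 vector with a single 1."""
    categories = sorted(set(labels))  # stable, alphabetical column order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        encoded.append(row)
    return categories, encoded

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)
```

Each row now contains only numbers, so it can be passed to a model such as a neural network.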