The breast cancer data set breast-cancer-wisconsin.data.txt from
http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/ (description at
http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Original%29 ) has missing
values.
1. Use the mean/mode imputation method to impute values for the missing data.
2. Use regression to impute values for the missing data.
3. Use regression with perturbation to impute values for the missing data.
For the mean/mode method, I first identified where the missing data occur: all of the missing
values are in variable/column “V7”. Counting the missing observations in “V7” gives about 2.3%
of the data (16 of 699 observations).
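A minimal sketch of this check in Python, not necessarily the code used for this write-up. It assumes the raw file is comma-separated with no header row, that missing entries are marked with "?", and that V7 is the seventh column. The rows in SAMPLE are made up for illustration, and missing_fraction is a hypothetical helper name.

```python
import csv
import io

# Made-up rows in the same layout as the raw file:
# ID, nine integer features, class label; "?" marks a missing value.
SAMPLE = """\
1,5,1,1,1,2,1,3,1,1,2
2,8,4,5,1,2,?,7,3,1,4
3,8,10,10,8,7,10,9,7,1,4
4,6,6,6,9,6,?,7,8,1,2
"""


def missing_fraction(rows, col_index, marker="?"):
    """Return the fraction of rows whose entry at col_index equals marker."""
    rows = list(rows)
    n_missing = sum(1 for row in rows if row[col_index] == marker)
    return n_missing / len(rows)


rows = list(csv.reader(io.StringIO(SAMPLE)))
frac_missing = missing_fraction(rows, col_index=6)  # V7 is the 7th column (index 6)
print(f"{frac_missing:.1%} of V7 is missing")
```

To run this on the real data, replace io.StringIO(SAMPLE) with an open handle to the downloaded file.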
I then determined the mode of V7, which is “1”, and used it as the imputed value for the missing
entries.
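Mode imputation can be sketched as follows. This is an illustrative Python version with a made-up column in which None stands for a missing entry; impute_mode is a hypothetical helper, not the function from the original analysis.

```python
from statistics import mode


def impute_mode(values):
    """Replace None entries with the most frequent observed value.

    Returns the filled list and the mode that was used.
    """
    observed = [v for v in values if v is not None]
    fill = mode(observed)
    filled = [fill if v is None else v for v in values]
    return filled, fill


# Made-up stand-in for column V7 (None = missing).
v7 = [1, 10, None, 1, 2, None, 1]
v7_imputed, v7_mode = impute_mode(v7)
print(v7_mode, v7_imputed)
```

The mode is computed from the observed entries only, so the missing-value marker itself can never be selected as the fill value.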
Next, I applied regression, and then regression with perturbation, to impute the missing values of
V7. The initial linear regression model used variables V2 through V10; based on the p-values of the
coefficients, it was reduced to the predictors V2, V4, V5 and V8.
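The two regression-based methods can be sketched in plain Python as below. Several things here are assumptions rather than the original analysis: the toy data use two made-up predictors instead of V2, V4, V5 and V8; the perturbation is normal noise with the residual standard deviation of the fitted model; and imputed values are rounded and clamped to the 1 to 10 scale. All function names are hypothetical.

```python
import random


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [u - f * v for u, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Xd = [[1.0] + list(row) for row in X]
    k = len(Xd[0])
    XtX = [[sum(r[i] * r[j] for r in Xd) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xd, y)) for i in range(k)]
    return solve(XtX, Xty)


def predict(beta, row):
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


def impute_regression(X, y, perturb=False, seed=0):
    """Fill None entries of y with predictions from a model fit on complete rows."""
    complete = [(row, yi) for row, yi in zip(X, y) if yi is not None]
    beta = fit_ols([row for row, _ in complete], [yi for _, yi in complete])
    # Residual standard deviation sets the scale of the perturbation (assumption).
    resid = [yi - predict(beta, row) for row, yi in complete]
    dof = max(len(complete) - len(beta), 1)
    sigma = (sum(e * e for e in resid) / dof) ** 0.5
    rng = random.Random(seed)
    filled = []
    for row, yi in zip(X, y):
        if yi is None:
            yi = predict(beta, row)
            if perturb:
                yi += rng.gauss(0.0, sigma)
            # Round and clamp to the 1..10 integer scale (assumption).
            yi = min(10, max(1, round(yi)))
        filled.append(yi)
    return filled


# Made-up data: two predictor columns and a response with two missing entries.
X = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8], [6, 2], [7, 7], [8, 4]]
y = [1, 3, 3, 5, 5, None, 8, None]
print(impute_regression(X, y))                # plain regression imputation
print(impute_regression(X, y, perturb=True))  # regression with perturbation
```

Plain regression imputation places every imputed value exactly on the fitted surface, which understates the variability of the imputed column. Adding noise restores some of that spread at the cost of less accurate individual values.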