techworldthink • March 06, 2022
4. Explain how to choose the value of k in the k-NN algorithm.
K-Nearest Neighbors (k-NN) is a supervised machine learning algorithm used for
classification and regression. It stores the training data and classifies new
test data based on a distance metric: it finds the k nearest neighbors to the test
point, and then classification is decided by a majority vote over their class labels.
Selecting the K value that maximises the model's accuracy is
always challenging for a data scientist.
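The majority-vote scheme above can be sketched in plain Python; the toy dataset and function name here are illustrative, not a specific library's API:

```python
from collections import Counter
import math

def knn_predict(train, test_point, k):
    """Classify test_point by a majority vote among its k nearest
    training points (Euclidean distance). `train` is a list of
    (features, label) pairs."""
    # Sort training points by distance to the query point and keep the k closest.
    neighbors = sorted(
        train,
        key=lambda pair: math.dist(pair[0], test_point),
    )[:k]
    # Majority vote over the k nearest labels.
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

# Toy 2-D dataset: two well-separated clusters, labels 'A' and 'B'.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_predict(train, (2, 2), k=3))  # prints A (query sits in the first cluster)
```

Note that k is fixed before prediction; everything that follows is about choosing it well.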
• There is no pre-defined statistical method for finding the most favorable value of K.
• Initialize a random K value and start computing.
• A small value of K leads to unstable decision boundaries.
• A larger K value is better for classification, as it smooths the decision boundaries.
• Plot the error rate against K over a defined range, then choose the K value
with the minimum error rate.
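The last point above can be sketched as a leave-one-out error curve over candidate K values; the dataset (with one deliberately mislabeled point as noise) is a toy illustration, and on real data you would plot these error rates against K:

```python
from collections import Counter
import math

def knn_predict(train, point, k):
    # Majority vote among the k nearest training points.
    neighbors = sorted(train, key=lambda p: math.dist(p[0], point))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_error(data, k):
    """Leave-one-out error rate for a given K: predict each point
    from all the others and count the misclassifications."""
    wrong = sum(
        knn_predict(data[:i] + data[i + 1:], x, k) != y
        for i, (x, y) in enumerate(data)
    )
    return wrong / len(data)

# Two clusters plus one noisy (mislabeled) point at (2, 2).
data = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((2, 2), "B"),
        ((8, 8), "B"), ((8, 9), "B"), ((9, 8), "B"), ((9, 9), "B")]
for k in range(1, 6):
    print(k, loo_error(data, k))  # error stays low, then jumps once K is too large
```

On this tiny set the error is flat for small K and rises at K=5, where the vote starts mixing in points from the far cluster; on real data the curve typically falls first as well.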
In KNN, finding the value of k is not easy. A small value of k means that noise will
have a higher influence on the result, while a large value makes the algorithm
computationally expensive.
There is no straightforward method to calculate the value of K in KNN. You have to
experiment with different values to find the optimal K; this process of choosing
the right value of K is called hyperparameter tuning.
The optimal K depends entirely on the dataset you are using: the best value of K
for KNN is highly data-dependent, and in different scenarios the optimum K may
vary. It is more or less a trial-and-error method.
You need to maintain a balance while choosing the value of K in KNN: K should be
neither too small nor too large. A small value of K means that noise will have a
higher influence on the result.
Increasing K initially improves accuracy by averaging out noise, but if K is too
large you are under-fitting your model and the error goes up again. So, at the same
time, you also need to prevent your model from under-fitting: it should retain its
generalization capabilities, otherwise there is a fair chance that it performs well
on the training data but fails drastically on real data. A larger K will also
increase the computational expense of the algorithm.
There is no single proper method for estimating the value of K in KNN. None of
these is a hard rule of thumb, but you should consider the following suggestions:
1. Square Root Method: take the square root of the number of samples in the training
dataset.
2. Cross Validation Method: use cross-validation to find the optimal value of K in
KNN. Start with K=1, run cross-validation (5 to 10 fold), measure the accuracy,
and keep repeating for K=2, 3, ... As K increases, the error usually goes down, then
stabilizes, and then rises again. Pick the optimal K at the beginning of the stable
zone. This is also called the Elbow Method.
3. Domain Knowledge also plays a vital role while choosing the optimum value of K.
4. K should be an odd number (for binary classification, so that a majority vote cannot tie).
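Suggestions 1, 2, and 4 above can be sketched together; the error rates passed to the elbow picker are hypothetical cross-validation results, and the helper names are illustrative:

```python
import math

def sqrt_rule_k(n_train):
    """Square-root heuristic: K is roughly sqrt(n), nudged to the
    nearest odd number so binary-classification votes cannot tie."""
    k = max(1, round(math.sqrt(n_train)))
    return k if k % 2 == 1 else k + 1

def elbow_k(errors_by_k):
    """Given {K: cross-validated error rate}, pick the smallest K
    that reaches the minimum error (start of the stable zone)."""
    best = min(errors_by_k.values())
    return min(k for k, e in errors_by_k.items() if e == best)

print(sqrt_rule_k(100))  # prints 11: sqrt(100) = 10, rounded up to odd

# Hypothetical CV error rates: they drop, stabilize, then rise again.
errors = {1: 0.20, 3: 0.12, 5: 0.10, 7: 0.10, 9: 0.10, 11: 0.15}
print(elbow_k(errors))   # prints 5: the first K at the minimum error
```

In practice the error dictionary would come from running k-fold cross-validation at each candidate K rather than being written by hand.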
5. Explain entropy and information gain.