KNN - Understanding K Nearest Neighbor Algorithm in Python

K Nearest Neighbors is a very simple and intuitive supervised learning algorithm.
A supervised learning algorithm is one in which you already know the result you want to find.
It works like this: you give your model a dataset containing data and the corresponding results
(labels), and using that data your model creates a decision boundary to predict the desired result.

It is a really intuitive and simple algorithm, and it can be used both as a classifier and as a
regressor. I will not give you a standardized definition of classification and regression, as it
might confuse you; let’s understand them with an example. Given a dataset of features of cats and
dogs, your job is to tell, by examining the features, whether a given animal is a cat or a dog:
this is a classification problem. On the other hand, if you are given a dataset of houses with
various features and their prices, and on examining the data you have to predict the price of a
house given to you as a query, this is a regression problem: here you are not classifying, you
are predicting a value.

Now first we will see and implement kNN and then we will see how it can be used both as a
classifier and a regressor.

kNN Algorithm

  1. Your algorithm receives a query point and has to give a result for this query point.

  2. Calculate the Euclidean distance of the query point from every point in the dataset.

  3. Store these distances in a list.

  4. Sort the list and keep the first k entries (the k nearest neighbors).

  5. Take the majority vote among these k neighbors and predict the result.

So, it is pretty simple: we first get a query; for example, on a 2-D feature set the query could be [2, 3].

We calculate the distance of each point from the query point and take the k nearest points.

Take a look at the picture above: we have the black dot as the query and k=3, so we find the 3
nearest points, store their distances in a list, and sort the list. As we can see, most of these
neighbors belong to the class plotted at the top, so the query is assigned to that class.

Let’s now see the code which would give you a much better idea about the algorithm.

  • First, we will import the important libraries: sklearn to make dummy data, matplotlib.pyplot
    for making plots, and NumPy, a very popular library for carrying out mathematical
    computations.

  • Now, we will create dummy data: 100 samples, each having two features.

  • Let’s first see what our data looks like by checking its dimensions and making a plot of it.
    X contains 100 samples of two-featured data; we will put feature 1 (indexed as 0) in the
    variable x_axis and feature 2 (indexed as 1) in y_axis and plot them in a scatter plot using
    the function plt.scatter(x_axis, y_axis). We have also plotted a query point, as you can see
    in the pic below.

  • We have created a dictionary, which I will talk about after we have completed the kNN
    algorithm. We have also created a distance function that calculates the Euclidean distance
    and returns it.

  • Finally, we have arrived at the implementation of the kNN algorithm so let’s see what we
    have done in the code below.

    • We have defined a kNN function to which we pass X, y, x_query (our query point), and k,
      which defaults to 5.

    • We store the number of training examples in the variable m using the statement
      m = X.shape[0].

    • We have created an empty list named distances. 

    • Now, we iterate over the samples, calculate the distance of each point from the query
      point, and append this distance along with the point’s label to the distances list.

    • After iterating over the whole dataset, we sort the list using the sorted function
      (you can use any other method if you want). We then slice the first k rows of the
      sorted distances list.

    • Converting the list to a NumPy array makes mathematical computations much easier. Since
      we have stored both the distance and the label, we create a new variable that holds the
      label column for all rows.

    • We now find the unique labels and the number of times each occurs, store them in the two
      variables ‘uniq_label’ and ‘counts’, and predict by picking the unique label with the
      maximum count using the argmax() function.

  • Now we can run a query on our model and see the prediction; note that we return the class
    name corresponding to the result by using dic[int(pred)]. A reconstruction of the whole
    pipeline is sketched just below.
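Since the original snippets are shown as images, here is a minimal sketch of the walkthrough
above. The use of make_blobs for the dummy data and the class names in the dictionary are
assumptions; the rest follows the steps described.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

# Dummy data: 100 samples with two features and two classes.
X, y = make_blobs(n_samples=100, n_features=2, centers=2, random_state=42)
print(X.shape, y.shape)  # (100, 2) (100,)

# Scatter plot of the two features plus the query point.
x_axis = X[:, 0]
y_axis = X[:, 1]
plt.scatter(x_axis, y_axis, c=y)
plt.scatter(2, 3, c='black', marker='*', s=100)  # query point [2, 3]
plt.show()

# Dictionary mapping numeric labels to readable names (names are assumed).
dic = {0: 'Class A', 1: 'Class B'}

def distance(p1, p2):
    # Euclidean distance between two points.
    return np.sqrt(np.sum((p1 - p2) ** 2))

def knn(X, y, x_query, k=5):
    m = X.shape[0]                        # number of training examples
    distances = []
    for i in range(m):
        d = distance(X[i], x_query)
        distances.append((d, y[i]))       # store distance together with the label
    distances = sorted(distances)         # sort by distance
    distances = np.array(distances[:k])   # keep only the k nearest neighbors
    labels = distances[:, 1]              # label column of the k nearest rows
    uniq_label, counts = np.unique(labels, return_counts=True)
    pred = uniq_label[counts.argmax()]    # majority vote
    return pred

pred = knn(X, y, np.array([2, 3]))
print(dic[int(pred)])
```

Running the script prints the dictionary name of the majority class among the 5 nearest
neighbors of the query point [2, 3].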

kNN As A Classifier

We can use kNN as a classifier, and the example above is also a classification problem, so I
don’t think there is any more need to explain kNN as a classifier. I will just show how we can
use kNN as a classifier to classify images of the famous MNIST dataset, but I won’t be
explaining it; only the code will be shown here. As a hint, it calculates the distance of the
query image from all the other images, takes the k nearest, and then predicts the result by
majority vote.

Using kNN for MNIST Handwritten Digits Classification
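As a placeholder for the code shown in the original post, here is a hedged sketch that uses
scikit-learn’s built-in 8x8 digits dataset as a small stand-in for MNIST (the full MNIST set can
be fetched with fetch_openml('mnist_784') but is much heavier); the split and the value of k are
assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 8x8 handwritten digits (a lightweight stand-in for MNIST).
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Fit kNN and report accuracy on the held-out test set.
clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))
```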


kNN As A Regressor

kNN can also be used as a regressor. Formally, regression is a statistical method for predicting
the value of a dependent variable, i.e. the output y, by examining a series of independent
variables, which are called features in machine learning.

We will use kNN to solve the Boston House prediction problem.

Boston House Price Prediction

  • Importing libraries — Matplotlib for plotting data, sklearn for model, dataset, and
    train_test_split.

  • Loading the Boston dataset; we will look at the shape and the keys of the data to get an
    idea of what it contains.

  • Splitting train and test set.

  • Here we are using sklearn to implement kNN. I chose it to show you the usage of sklearn, as
    until now we have coded kNN from scratch. It is pretty simple with sklearn: we predict for
    different values of k to see the variation, and there are two steps, first passing the
    parameters to the model and second fitting it on the train data. To examine the model we
    calculate the R-squared score (also called the r2 score) using the built-in score method of
    sklearn, and at the end we return the test_score, the training_score, and the list of models
    for the different values of k (a sketch of this workflow follows this list).

  • We are calling our function and saving the values in the respective variables.

  • Here are the predictions by model for varying k.

  • Examining performance for k=6.

  • Let’s plot a graph for both train and test data. The R-squared score is pretty low, which
    shows that kNN doesn’t work that well as a regressor here.
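Here is a sketch of the workflow described in the bullets above, using scikit-learn’s
KNeighborsRegressor. Note that load_boston was removed in scikit-learn 1.2, so this sketch
assumes an older version (or swapping in another regression dataset); the range of k values and
the split are assumptions.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston   # removed in scikit-learn 1.2+
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Load the dataset and look at its shape and keys.
boston = load_boston()
print(boston.data.shape, boston.keys())

# Split into train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    boston.data, boston.target, test_size=0.2, random_state=0)

def evaluate_knn(k_values):
    # For each k: set the parameters, fit on the train data, and record the
    # R-squared score (the built-in score method) on the train and test sets.
    train_scores, test_scores, models = [], [], []
    for k in k_values:
        model = KNeighborsRegressor(n_neighbors=k)
        model.fit(X_train, y_train)
        train_scores.append(model.score(X_train, y_train))
        test_scores.append(model.score(X_test, y_test))
        models.append(model)
    return test_scores, train_scores, models

k_values = list(range(1, 11))
test_score, training_score, models = evaluate_knn(k_values)

# Plot train and test R-squared for the different values of k.
plt.plot(k_values, training_score, label='train R^2')
plt.plot(k_values, test_score, label='test R^2')
plt.xlabel('k')
plt.ylabel('R-squared score')
plt.legend()
plt.show()
```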


What should be the value of K?

Choosing the value of k is crucial, as it determines the accuracy and can change the performance
of your algorithm. So the big question is: what should the value of k be? Let’s see…

  • The value of k should be reasonably large, as it is better to consider more neighbors, but
    not at the cost of speed; that should always be kept in mind.

  • The value of k should not be even (for a two-class problem), as the query might end up
    equally close to both classes. For example, suppose there are 2 categories and k=6; if the
    query point is nearest to 3 points of each category, the model can’t break the tie and give
    you a result.

You can take sqrt(n) as a good starting approximation for k, but it is advisable to try a few
values of k and choose the one that performs best.
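A quick, hedged way to compare a few values of k is cross-validation; the dataset and the
candidate k values below are just illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
print("sqrt(n) starting point:", int(np.sqrt(X.shape[0])))

# Try a handful of odd k values and compare mean cross-validated accuracy.
for k in [1, 3, 5, 7, 11, 21, 41]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(f"k={k:2d}  mean accuracy={scores.mean():.3f}")
```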

Advantages

  1. KNN is pretty intuitive and simple: K-NN algorithm is very simple to understand and
    equally easy to implement.

  2. K-NN has no assumptions: kNN is a non-parametric algorithm and hence makes no assumptions
    about the data, unlike other algorithms such as linear regression.

  3. No Training Step: K-NN does not explicitly build any model; it simply gives results on
    the new data entry, based on learning from historical data. A new data entry is tagged with
    the majority class among its nearest neighbors.

  4. It constantly evolves: k-NN is a memory-based approach. The classifier immediately
    adapts as we collect new training data. It allows the algorithm to respond quickly to
    changes in the input during real-time use.

  5. Very easy to implement for multi-class problems: Most classifier algorithms are easy to
    implement for binary problems and need extra effort for multi-class problems, whereas KNN
    adjusts to multi-class without any extra effort.

  6. Can be used both for Classification and Regression: One of the biggest advantages
    of KNN is that KNN can be used both for classification and regression problems.

  7. One Hyper Parameter: Choosing the right values of hyperparameters can be a very painful
    experience, which is much less of an issue with kNN as it has only one main hyperparameter, k.

  8. Variety of distance criteria to choose from: The KNN algorithm gives users the flexibility
    to choose a distance metric while building the model (a short example follows this list).
    There are various ways to calculate distance:

    Euclidean Distance
    Hamming Distance
    Manhattan Distance
    Minkowski Distance
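For example, in scikit-learn this choice is exposed through the metric parameter of the
estimator; the values below are illustrative (Minkowski with p=2 is the Euclidean distance and
p=1 the Manhattan distance).

```python
from sklearn.neighbors import KNeighborsClassifier

# The same estimator configured with different distance metrics.
knn_euclidean = KNeighborsClassifier(n_neighbors=5, metric='euclidean')
knn_manhattan = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn_minkowski = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=3)
knn_hamming = KNeighborsClassifier(n_neighbors=5, metric='hamming')  # for binary/categorical features
```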

Disadvantages

  1. Slow algorithm: K-NN might be very easy to implement, but as the dataset grows, the
    efficiency or speed of the algorithm declines very fast.

  2. Dimensionality: KNN works well with a small number of input variables, but as the number
    of variables grows, the K-NN algorithm struggles to predict the output for a new data
    point.

  3. KNN needs homogeneous features: If you decide to build k-NN using a common distance, like
    the Euclidean or Manhattan distance, it is absolutely necessary that the features have the
    same scale, since a given distance in one feature must mean the same in another feature
    (see the scaling sketch after this list).

  4. Optimal value of neighbors: One of the biggest issues with KNN is to choose the
    optimal value of hyperparameter k to be considered while classifying the new data entry.

  5. Imbalanced data causes problems: kNN doesn’t perform well on imbalanced data.
    If we consider two classes, A and B, and the majority of the training data is labeled as A,
    then the model will ultimately give a lot of preference to A. This might result in the less
    common class B being wrongly classified.

  6. Outlier sensitivity: The KNN algorithm is very sensitive to outliers as it simply chooses
    the neighbors based on distance criteria.
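As a follow-up to disadvantage 3, here is a minimal sketch of putting features on a common scale
with StandardScaler before kNN; the wine dataset and k=5 are just illustrative choices.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# kNN on raw features vs. kNN on standardized features.
raw = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
scaled = make_pipeline(StandardScaler(),
                       KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train)

print("without scaling:", raw.score(X_test, y_test))
print("with scaling:   ", scaled.score(X_test, y_test))
```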
