The Essentials of KNN Algorithm: Understanding with a Concrete Example
KNN with an example
K-Nearest Neighbors theory
K-Nearest Neighbors (KNN) is one of the simplest machine learning algorithms. KNN is an instance-based (lazy) supervised algorithm: rather than learning an explicit model, it stores the training data and compares new points against it. In this post, we will study what KNN is with a simple example. Supervised algorithms are classified as regression and classification.
The KNN algorithm can be used for both regression and classification. It is one of the simplest machine learning techniques: the algorithm classifies test data based on its similarity to the training data.
Suppose there are two categories, Class A and Class B. When KNN is applied to the data, a new data point is assigned to Class B if it is more similar (closer) to the Class B points than to the Class A points. The classification is based entirely on the similarities and dissimilarities between the objects.
In the above example, the new data point is closest to Class B, so it is classified as Class B. The similarity is quantified by a distance measure, most commonly Euclidean distance. Once the distance measure is fixed, the next step is to decide how many neighbors to compare against. This is the value K, and the prediction is made from these K nearest neighbors, which is where the name K-Nearest Neighbors comes from. K has to be chosen by the programmer: if K is too small, the model becomes sensitive to noise and the chance of error increases, while a very large K blurs the boundaries between classes. Common starting values are 3 and 5, but the optimum depends on the data and is usually found by trying several values, for example with cross-validation. For two-class problems, prefer odd values of K so that a vote between the neighbors cannot end in a tie.
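To make this concrete, here is a minimal from-scratch sketch of the idea (the function knn_predict and its interface are my own, not from any library): compute the Euclidean distance from the new point to every training point, then take a majority vote among the K closest labels.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

In practice you would use a tested implementation such as scikit-learn's KNeighborsClassifier, which is what the example below does.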
KNN example
In this example, I have used the classic Iris dataset.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
dataset = pd.read_csv('../input/iris/Iris.csv')
Summarize the dataset
dataset.shape
Output - (150, 6)
dataset.head(5)
dataset.describe()
dataset.groupby('Species').size()
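The Iris dataset is balanced, with exactly 50 rows per species, so the expected output is:
Output -
Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
dtype: int64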
Dividing Data into features and labels
# Only the two sepal measurements are used in this example
feature_columns = ['SepalLengthCm', 'SepalWidthCm']
X = dataset[feature_columns].values
y = dataset['Species'].values
Label encoder
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)
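It can be useful to check which integer each species received. LabelEncoder assigns the integers in alphabetical order of the class names:

# Iris-setosa -> 0, Iris-versicolor -> 1, Iris-virginica -> 2
print(le.classes_)
Output - ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']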
Splitting dataset into training and testing
from sklearn.model_selection import train_test_split
# Hold out 20% of the rows (30 of the 150) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Render plots right below the notebook cell
%matplotlib inline
from pandas.plotting import parallel_coordinates
plt.figure(figsize=(15,10))
parallel_coordinates(dataset.drop("Id", axis=1), "Species")
plt.title('Parallel Coordinates Plot', fontsize=20, fontweight='bold')
plt.xlabel('Features', fontsize=15)
plt.ylabel('Features values', fontsize=15)
plt.legend(loc=1, prop={'size': 15}, frameon=True,shadow=True, facecolor="white", edgecolor="black")
plt.show()
Making predictions
# Fitting classifier to the Training set
# Loading libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import cross_val_score
# Instantiate learning model (k = 3)
classifier = KNeighborsClassifier(n_neighbors=3)
# Fitting the model
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Evaluating predictions
# Calculating model accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print('Accuracy of our model is ' + str(round(accuracy, 2)) + ' %.')
Output - Accuracy of our model is 73.33 %.
The accuracy is modest because only the two sepal features were used; the petal measurements separate the three species much more cleanly.
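The confusion_matrix and cross_val_score imports above were never used. As a sketch of how they could round out the evaluation (the choice of 10 folds and the range of K values here are mine), the confusion matrix shows where the classifier goes wrong, and cross-validation offers a principled way to pick K:

# Rows are true classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))

# Try odd values of K and keep the one with the best cross-validated accuracy
k_scores = {}
for k in range(1, 21, 2):
    knn = KNeighborsClassifier(n_neighbors=k)
    k_scores[k] = cross_val_score(knn, X_train, y_train, cv=10, scoring='accuracy').mean()
print('Best K:', max(k_scores, key=k_scores.get))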
# Refit on the full dataset (default n_neighbors=5) and classify a new point;
# the input is [sepal length, sepal width] in cm, and the values are purely illustrative
model = KNeighborsClassifier().fit(X, y)
pred = model.predict([[1, 2]])
pred
Output - array([0])
Conclusion
The predicted output is 0, which corresponds to the first class, 'Iris-setosa' (recall that the label encoder numbered the species alphabetically).
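Rather than decoding the number by hand, the fitted LabelEncoder can map it back to the species name:

# Map the numeric prediction back to the species name
print(le.inverse_transform(pred))
Output - ['Iris-setosa']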
Thank you for reading.
Have a nice day!