KNN Classification
K-Nearest Neighbors (KNN) classification is a simple yet powerful non-parametric machine learning algorithm used for categorizing data points. The core idea is to classify a new data point based on the majority class among its 'k' nearest neighbors in the training dataset. The algorithm works by first calculating the distance (often Euclidean distance) between the new data point and all points in the training set. Then, it selects the 'k' training points that are closest to the new point. Finally, the new point is assigned to the class that is most frequent among those 'k' neighbors. For instance, if k=3, the new point will belong to the class that has the highest number of representatives among its 3 nearest neighbors.
KNN is considered a “lazy learner” because it doesn't build an explicit model during a training phase. Instead, it stores the entire training dataset and performs calculations only when a new data point needs to be classified. While this makes KNN easy to implement and adaptable to new data, it can be computationally expensive, especially with large datasets.
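To make the majority-vote mechanics concrete, here is a minimal from-scratch sketch in Python (an illustration with made-up points, separate from the module's dataset): it computes Euclidean distances to all training points, selects the k closest, and returns the most common label among them.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the k nearest labels
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Made-up example: four training points with two features, binary labels
X_demo = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]])
y_demo = np.array([0, 0, 1, 1])
print(knn_predict(X_demo, y_demo, np.array([1.2, 1.9])))   # prints 0, the majority class of the 3 nearest points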
This module outlines the following steps:
➤ 1. Import data
➤ 2. Prepare data
➤ 3. Split data
➤ 4. Model training
➤ 5. Prediction
➤ 6. Confusion Matrix
➤ 7. Classification Report
➤ 8. Strengths and Weaknesses
↪ 1. Import data
Import the data using the read_csv() function from the pandas library.
import numpy as np
import pandas as pd
data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv', header=0)
print(data.columns)
---Output---
# Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
#        'Low Price', 'Last Price', 'Close Price', 'Average Price',
#        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
#        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
#       dtype='object')
↪ 2. Prepare data
The approach aims to predict whether a stock price will rise or fall using the 20-day Simple Moving Average (SMA20) as an independent variable.
The code first removes any rows with missing values (NaNs) from the dataset. This step is crucial for most machine learning models, as they often cannot handle missing data. Then, it extracts the 'SMA20' column, which will be used as the independent variable (x) for the model.
Next, a new column 'UP2' is derived from the existing 'UP' column and used as the dependent variable y. If the value in the 'UP' column equals the string 'UP', the corresponding value in 'UP2' is set to 1; otherwise, it is set to 0. This converts the categorical 'UP' data into a binary numerical representation.
data = data.dropna()
data['UP2'] = np.where(data['UP'] == 'UP', 1, 0)
print(data.head(5))
x = data[['SMA20']]
y = data[['UP2']]
↪ 3. Split data
The code below performs a train-test split on the dataset. It utilizes the train_test_split() function to divide the data into training and testing sets. x represents the input features or independent variables, and y represents the corresponding target or dependent variable. The split is performed such that 20% of the data is allocated to the test set (test_size=0.2), while the remaining 80% is used for training. The shuffle=True argument ensures that the data is randomly shuffled before the split, which is crucial for preventing bias and ensuring that both training and test sets are representative of the overall dataset.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True)
print("x_train.shape: ", x_train.shape)
print("x_test.shape: ", x_test.shape)
print("y_train.shape: ", y_train.shape)
print("y_test.shape", y_test.shape)
---Output---
# x_train.shape: (160, 1)
# x_test.shape: (41, 1)
# y_train.shape: (160, 1)
# y_test.shape (41, 1)
↪ 4. Model training
Standardize the features. This step is generally recommended for KNN, since distance computations are significantly affected by feature scaling. The code below standardizes the training and testing feature data using the StandardScaler() class. It first initializes the scaler and then calls fit_transform() on the training data (x_train) to compute the mean and standard deviation of each feature and transform the data to zero mean and unit standard deviation.
The scaler then applies the same transformation using transform() on the test data (x_test), ensuring consistent scaling across both datasets based on the training data's statistics. This preprocessing step is crucial for many machine learning algorithms, including KNN, as it prevents features with larger scales from dominating the distance calculations.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(x_train)
X_test_scaled = scaler.transform(x_test)
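As a quick sanity check, the same transformation can be reproduced by hand. The sketch below (an illustration, not part of the original module) relies on StandardScaler's convention of dividing by the population standard deviation (ddof=0):

import numpy as np

# Illustrative check: reproduce StandardScaler manually.
# The scaler centers each column by its mean and divides by the
# population standard deviation, both computed on x_train only.
mean = x_train.values.mean(axis=0)
std = x_train.values.std(axis=0)          # ddof=0 by default in NumPy
manual_scaled = (x_train.values - mean) / std
print(np.allclose(manual_scaled, X_train_scaled))   # expected: True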
The code below demonstrates the initialization and training of a K-Nearest Neighbors (KNN) classifier using the scikit-learn library. It starts by importing the KNeighborsClassifier class from the sklearn.neighbors module. Then, an instance of the KNN classifier is created and stored in the variable knn_classifier. The n_neighbors=3 parameter specifies that the classifier will consider the 3 nearest neighbors when making predictions. Finally, the fit() method is called on the knn_classifier object, using X_train_scaled as the training features and y_train as the training labels. Because KNN is a lazy learner, fit() essentially stores the scaled training data for use at prediction time.
from sklearn.neighbors import KNeighborsClassifier
knn_classifier = KNeighborsClassifier(n_neighbors=3)
knn_classifier.fit(X_train_scaled, y_train)
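The choice of k=3 here is illustrative. A common way to choose k is cross-validation on the training set; the sketch below (an addition, not part of the original module) scores a few odd values of k with 5-fold cross-validation, since odd values avoid ties in binary voting:

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative sketch: evaluate several odd k values with 5-fold CV
# on the scaled training data and report the mean accuracy of each.
for k in [3, 5, 7, 9, 11]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_scaled, y_train.values.ravel(), cv=5)
    print(f"k={k}: mean CV accuracy = {scores.mean():.3f}")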
↪ 5. Prediction
The knn_classifier.predict() method generates predictions for each data point in X_test_scaled, which are then stored in a pandas DataFrame for easy comparison with the actual dependent values (y_test). The actual and predicted values are combined into a single DataFrame, allowing for a clear view of the model's performance.
prediction = knn_classifier.predict(X_test_scaled)
y_pred = pd.DataFrame(prediction, columns=['Pred'])
dframe = pd.concat([y_test.reset_index(drop=True), y_pred], axis=1)
print(dframe.head(5))
---Output---
#    UP2  Pred
# 0    1     0
# 1    1     0
# 2    1     1
# 3    0     0
# 4    0     1
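Beyond hard class labels, KNN can also report the fraction of the k neighbors voting for each class via predict_proba(); with n_neighbors=3 the possible values are 0, 1/3, 2/3, and 1. A brief sketch (an addition for illustration, not in the original module):

# Illustrative addition: per-class voting fractions for the first
# five test points. Each row gives the share of the 3 nearest
# neighbors belonging to class 0 and class 1, respectively.
proba = knn_classifier.predict_proba(X_test_scaled)
print(proba[:5])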
↪ 6. Confusion Matrix
The code below calculates and displays a confusion matrix to evaluate the performance of a classification model. It uses the confusion_matrix() function to generate a confusion matrix, which summarizes the model's predictions against the true labels.
From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, prediction)
print(cm)
---Output---
# [[ 7  8]
#  [11 15]]
The .ravel() method flattens the confusion matrix into a 1D array. A typical 2×2 confusion matrix (for binary classification) looks like this: [[TN, FP], [FN, TP]]
TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
print('True Positive (TP) = ', TP)
print('False Positive (FP) = ', FP)
print('True Negative (TN) = ', TN)
print('False Negative (FN) = ', FN)
---Output---
# True Positive (TP) =  15
# False Positive (FP) =  8
# True Negative (TN) =  7
# False Negative (FN) =  11
The accuracy score can be calculated using the following formula. With the counts above, accuracy = (15 + 7) / (15 + 8 + 7 + 11) = 22/41 ≈ 0.537.
accuracy = (TP+TN) / (TP+FP+TN+FN)
print('Accuracy of the classification = {:0.3f}'.format(accuracy))
---Output---
# Accuracy of the classification = 0.537
Alternatively, the accuracy_score() function calculates the accuracy of a classification model's predictions.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, prediction)
print(f"Accuracy: {accuracy}")
---Output---
# Accuracy: 0.5365853658536586
↪ 7. Classification Report
The classification_report() function generates a text report showing the main classification metrics for each class in the dataset.
from sklearn.metrics import classification_report
print(classification_report(y_test, prediction))
---Output---
#               precision    recall  f1-score   support
#
#            0       0.39      0.47      0.42        15
#            1       0.65      0.58      0.61        26
#
#     accuracy                           0.54        41
#    macro avg       0.52      0.52      0.52        41
# weighted avg       0.56      0.54      0.54        41
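These per-class figures follow directly from the confusion-matrix counts above: for class 1, precision = TP/(TP+FP) = 15/23 ≈ 0.65 and recall = TP/(TP+FN) = 15/26 ≈ 0.58. A small sketch (illustrative, not in the original module) reproduces them from the counts extracted earlier:

# Illustrative check: recompute class-1 precision, recall, and F1
# from the confusion-matrix counts TP, FP, FN extracted above.
precision_1 = TP / (TP + FP)              # 15 / 23 ≈ 0.652
recall_1 = TP / (TP + FN)                 # 15 / 26 ≈ 0.577
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # ≈ 0.612
print('Class 1: precision={:.2f}, recall={:.2f}, f1={:.2f}'.format(
    precision_1, recall_1, f1_1))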
↪ 8. Strengths and Weaknesses
Strengths
➤ Simple to understand and implement, with no explicit training phase.
➤ Non-parametric: it makes no assumptions about the underlying data distribution.
➤ Adapts easily to new data, since new training points can be added without retraining a model.
Weaknesses
➤ Prediction is computationally expensive on large datasets, because distances to all stored training points must be computed for every query.
➤ Sensitive to feature scaling, which is why the features were standardized before training.
➤ Performance depends on the choice of k and can degrade with noisy or irrelevant features.