KNN Classification

K-Nearest Neighbors (KNN) classification is a simple yet powerful non-parametric machine learning algorithm used for categorizing data points. The core idea is to classify a new data point based on the majority class among its 'k' nearest neighbors in the training dataset. The algorithm works by first calculating the distance (often Euclidean distance) between the new data point and all points in the training set. Then, it selects the 'k' training points that are closest to the new point. Finally, the new point is assigned to the class that is most frequent among those 'k' neighbors. For instance, if k=3, the new point will belong to the class that has the highest number of representatives among its 3 nearest neighbors.

KNN is considered a “lazy learner” because it doesn't build an explicit model during a training phase. Instead, it stores the entire training dataset and performs calculations only when a new data point needs to be classified. While this makes KNN easy to implement and adaptable to new data, it can be computationally expensive, especially with large datasets.
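To make the neighbor vote concrete, below is a minimal from-scratch sketch of KNN classification on toy data using NumPy. It is illustrative only; the rest of this module uses scikit-learn's implementation.

      import numpy as np
      from collections import Counter

      def knn_predict(X_train, y_train, x_new, k=3):
          # Euclidean distance from the new point to every training point
          distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
          # Indices of the k closest training points
          nearest = np.argsort(distances)[:k]
          # Majority vote among the k neighbors' labels
          return Counter(y_train[nearest]).most_common(1)[0][0]

      # Toy data: two 2-D clusters labeled 0 and 1
      X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
      y = np.array([0, 0, 1, 1])
      print(knn_predict(X, y, np.array([1.1, 0.9])))    # prints 0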

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Prepare data

  ➤ 3. Split data

  ➤ 4. Model training

  ➤ 5. Prediction

  ➤ 6. Confusion Matrix

  ➤ 7. Classification Report

  ➤ 8. Strengths and Weaknesses

↪ 1. Import data

Import the data using the read_csv() function from the pandas library.

      import numpy as np
      import pandas as pd
      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
      #       dtype='object')

↪ 2. Prepare data

The approach aims to predict whether the stock price will rise or fall, using the 20-day Simple Moving Average (SMA20) as the sole independent variable.

The code first removes any rows with missing values (NaNs) from the dataset. This step is important because most machine learning models cannot handle missing data.

Next, a new column 'UP2' is derived from the existing 'UP' column: where the value in 'UP' equals the string 'UP', 'UP2' is set to 1; otherwise it is set to 0. This converts the categorical 'UP' data into a binary numerical representation. Finally, the 'SMA20' column is extracted as the independent variable (x) and 'UP2' as the dependent variable (y).

      data = data.dropna()
      data['UP2'] = np.where(data['UP'] == 'UP', 1, 0)
      print(data.head(5))
      x = data[['SMA20']]
      y = data[['UP2']]
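
Because KNN's majority vote can be biased by class imbalance (see the weaknesses in step 8), it is worth checking how many observations fall in each class before modeling. The exact counts depend on the dataset.

      # Number of rows per class: 1 = 'UP', 0 = otherwise
      print(y['UP2'].value_counts())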

↪ 3. Split data

The code below performs a train-test split using the train_test_split() function. x holds the input features (independent variables), and y holds the corresponding target (dependent variable). 20% of the data is allocated to the test set (test_size=0.2), while the remaining 80% is used for training. The shuffle=True argument randomly shuffles the data before splitting, which helps keep both the training and test sets representative of the overall dataset. Note that no random_state is set, so the exact split (and the numbers shown below) will vary from run to run.

      from sklearn.model_selection import train_test_split
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True)
 
      print("x_train.shape: ", x_train.shape )
      print("x_test.shape: ", x_test.shape )
      print("y_train.shape: ", y_train.shape )
      print("y_test.shape", y_test.shape )

      ---Output---
      # x_train.shape:  (160, 1)
      # x_test.shape:  (41, 1)
      # y_train.shape:  (160, 1)
      # y_test.shape (41, 1)

↪ 4. Model training

Standardize the features. This step is strongly recommended for KNN, because the distance computations at its core are directly affected by feature scale. The code below standardizes the training and testing feature data with StandardScaler. It first initializes the scaler, then calls fit_transform() on the training data (x_train) to estimate the mean and standard deviation of each feature and transform it to zero mean and unit standard deviation.

The scaler then applies the same transformation using transform() on the test data (x_test), ensuring consistent scaling across both datasets based on the training data's statistics. This preprocessing step is crucial for many machine learning algorithms, including KNN, as it prevents features with larger scales from dominating the distance calculations.

      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(x_train)
      X_test_scaled = scaler.transform(x_test)
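
If desired, the statistics the scaler learned from the training data can be inspected through its fitted mean_ and scale_ attributes (the actual values depend on the split):

      # Per-feature mean and standard deviation estimated from x_train
      print('mean:', scaler.mean_, ' std:', scaler.scale_)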

The code below demonstrates the initialization and training of a K-Nearest Neighbors (KNN) classifier using the scikit-learn library. It imports the KNeighborsClassifier class from the sklearn.neighbors module and creates an instance stored in the variable knn_classifier. The n_neighbors=3 parameter specifies that the classifier will consider the 3 nearest neighbors when making predictions. Finally, the fit() method is called with X_train_scaled as the training features and the training labels, flattened to the 1-D array that scikit-learn expects. This step effectively trains the KNN model.

      from sklearn.neighbors import KNeighborsClassifier
      knn_classifier = KNeighborsClassifier(n_neighbors=3)
      # .values.ravel() flattens the single-column DataFrame y_train into
      # the 1-D label array that scikit-learn expects
      knn_classifier.fit(X_train_scaled, y_train.values.ravel())

↪ 5. Prediction

The knn_classifier.predict() method generates a prediction for each data point in X_test_scaled. The predictions are stored in a pandas DataFrame and combined with the actual labels (y_test) into a single DataFrame, allowing a clear side-by-side comparison of the model's output against the true values.

      prediction = knn_classifier.predict(X_test_scaled)
      y_pred = pd.DataFrame(prediction, columns=['Pred'])
      dframe = pd.concat([y_test.reset_index(drop=True),y_pred], axis=1)
      print(dframe.head(5))

      ---Output---
      #    UP2  Pred
      # 0    1     0
      # 1    1     0
      # 2    1     1
      # 3    0     0
      # 4    0     1
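
As an aside, KNeighborsClassifier also provides a predict_proba() method, which reports the fraction of the k neighbors that belong to each class; with n_neighbors=3 these fractions are multiples of 1/3.

      # Class-membership proportions among the 3 nearest neighbors
      probabilities = knn_classifier.predict_proba(X_test_scaled)
      print(probabilities[:5])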

↪ 6. Confusion Matrix

The code below calculates and displays a confusion matrix to evaluate the performance of a classification model. It uses the confusion_matrix() function to generate a confusion matrix, which summarizes the model's predictions against the true labels.

From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

      from sklearn.metrics import confusion_matrix
      cm = confusion_matrix(y_test, prediction)
      print(cm)

      ---Output---
      # [[ 7  8]
      #  [11 15]]

The .ravel() method flattens the confusion matrix into a 1D array. A typical 2×2 confusion matrix (for binary classification) looks like this: [[TN, FP], [FN, TP]]

      TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
      print('True Positive (TP)  = ', TP)
      print('False Positive (FP) = ', FP)
      print('True Negative (TN)  = ', TN)
      print('False Negative (FN) = ', FN)

      ---Output---
      # True Positive (TP)  =  15
      # False Positive (FP) =  8
      # True Negative (TN)  =  7
      # False Negative (FN) =  11

Accuracy score can be calculated using the following formula.

      accuracy = (TP+TN) / (TP+FP+TN+FN)
      print('Accuracy of the classification = {:0.3f}'.format(accuracy))

      ---Output---
      # Accuracy of the classification = 0.537

Alternatively, the accuracy_score() function calculates the accuracy of a classification model's predictions.

      from sklearn.metrics import accuracy_score
      accuracy = accuracy_score(y_test, prediction)
      print(f"Accuracy: {accuracy}")

      ---Output---
      # Accuracy: 0.5365853658536586

↪ 7. Classification Report

The classification_report() function generates a text report showing the main classification metrics for each class in the dataset. Each metric is defined below, and a hand computation from the earlier confusion-matrix counts follows the list.

      from sklearn.metrics import classification_report
      print(classification_report(y_test, prediction))

      ---Output---
      #               precision    recall  f1-score   support
      #
      #            0       0.39      0.47      0.42        15
      #            1       0.65      0.58      0.61        26
      #
      #     accuracy                           0.54        41
      #    macro avg       0.52      0.52      0.52        41
      # weighted avg       0.56      0.54      0.54        41
  • precision: The ratio of true positives to the total predicted positives (true positives + false positives). It measures how many of the positively predicted instances were actually positive.
  • recall (Sensitivity): The ratio of true positives to the total actual positives (true positives + false negatives). It measures how many of the actual positive instances were correctly predicted.
  • f1-score: The harmonic mean of precision and recall. It provides a balanced measure that considers both precision and recall. A higher f1-score indicates better performance.
  • support: The number of actual instances in each class in the test set.
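
These definitions can be verified against the report using the TP, FP, and FN counts extracted in step 6. For class 1 (TP = 15, FP = 8, FN = 11):

      precision = TP / (TP + FP)    # 15 / 23 ≈ 0.65
      recall = TP / (TP + FN)       # 15 / 26 ≈ 0.58
      f1 = 2 * precision * recall / (precision + recall)    # ≈ 0.61
      print('precision = {:0.2f}, recall = {:0.2f}, f1 = {:0.2f}'.format(precision, recall, f1))

These values match the class-1 row of the report above.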
↪ 8. Strengths and Weaknesses

Strengths

  • Easy to understand and implement. No complex training is involved, making it a good starting point.
  • Makes no assumptions about the underlying data distribution, useful for complex, non-linear data.
  • Can be used for both classification and regression problems.
  • Can naturally handle multi-class classification problems.
Weaknesses

  • Slow, especially with large datasets, as distance calculations are performed at prediction time.
  • Choice of 'n_neighbors' (number of neighbors) is crucial, and selecting the optimal value typically requires systematic testing or cross-validation (see the sketch after this list).
  • Requires storing the entire training dataset in memory.
  • Noisy data points or outliers in the training set can strongly influence predictions, particularly for small values of 'n_neighbors'.
  • Can be biased toward the majority class in imbalanced datasets.
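
As noted above, n_neighbors is usually tuned rather than fixed in advance. The sketch below shows one common pattern, a grid search with 5-fold cross-validation over the scaled training data from step 4; the candidate k values are illustrative, not prescribed by this module.

      from sklearn.model_selection import GridSearchCV
      from sklearn.neighbors import KNeighborsClassifier

      # Odd values of k avoid ties in a binary majority vote
      param_grid = {'n_neighbors': [1, 3, 5, 7, 9, 11]}
      grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
      grid.fit(X_train_scaled, y_train.values.ravel())
      print('Best n_neighbors:', grid.best_params_['n_neighbors'])
      print('Best CV accuracy: {:0.3f}'.format(grid.best_score_))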