SVM Classification

Support Vector Machines (SVMs) are powerful supervised learning algorithms used for classification tasks. The core idea behind SVM classification is to find the optimal hyperplane that best separates the classes in the feature space. This hyperplane isn't just any separating line (in two dimensions), plane (in three), or hyperplane (in higher dimensions); it's the one that maximizes the margin, which is the distance between the hyperplane and the nearest data points of each class (known as support vectors).
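
Formally, for a linear SVM the hyperplane is described by a weight vector w and an intercept b (the same quantities extracted later in the Visualize the hyperplane step), and maximizing the margin amounts to minimizing the length of w:

      hyperplane  :  w · x + b = 0
      margin width:  2 / ||w||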

SVMs are versatile because they can handle both linear and non-linear separations. For non-linear data, they employ the kernel trick. Kernels implicitly map the input data into a higher-dimensional space where a linear separation may be possible, without actually calculating the coordinates in that higher-dimensional space. This allows SVMs to capture complex relationships and achieve high accuracy in diverse scenarios such as image classification, text categorization, and bioinformatics.
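
To make this concrete, the short NumPy sketch below (illustrative only, not part of this module's pipeline) verifies that a degree-2 polynomial kernel computed in the original two-dimensional space gives the same value as an ordinary dot product taken after an explicit quadratic feature mapping:

      import numpy as np

      def poly2_kernel(x, z):
          # Degree-2 polynomial kernel computed directly in the original 2-D space
          return (x @ z) ** 2

      def phi(x):
          # Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2)
          return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

      x = np.array([1.0, 2.0])
      z = np.array([3.0, 0.5])
      print(poly2_kernel(x, z))   # 16.0
      print(phi(x) @ phi(z))      # 16.0 (up to floating-point rounding), without visiting the mapped space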

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Prepare data

  ➤ 3. Split data

  ➤ 4. Model training

  ➤ 5. Prediction

  ➤ 6. Classification Report

  ➤ 7. Confusion Matrix

  ➤ 8. Visualize the hyperplane

  ➤ 9. SVM Kernels

  ➤ 10. Strengths and Weaknesses

↪ 1. Import data

Import the data using the read_csv() function.

      import pandas as pd
      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
      #       dtype='object')

↪ 2. Prepare data

The code below performs the initial data preparation for the classification task. First, it removes rows containing any missing values using the dropna() function. It then creates a new binary column named 'UP2', which takes the value 1 where the corresponding value in the 'UP' column is 'UP', and 0 otherwise.

Next, it selects the 'Prev Close' and 'SMA20' columns from the dataframe as features (denoted X) and the 'UP2' column as the target variable (denoted y). The minimum and maximum values of the 'SMA20' column are printed to show the range of values; these values are useful later when drawing the hyperplane (see the Visualize the hyperplane step).

      import numpy as np
      data = data.dropna()
      data['UP2'] = np.where(data['UP'] == 'UP', 1, 0)
      #print(data.head(5))
      columnname = ['Prev Close','SMA20']
      X = data[columnname]
      y = data['UP2']
      print("SMA max", X['SMA20'].max())
      print("SMA min", X['SMA20'].min())

      ---Output---
      # SMA max 1674.2099999999998
      # SMA min 1394.8775

↪ 3. Split data

The data is split into training and testing sets using the train_test_split() function. 80% of the data is used for training and 20% is held out for testing, with a fixed random_state for reproducibility. This ensures that the model is trained on one portion of the data and evaluated on unseen data, providing an unbiased assessment of its performance.

      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
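
An optional check of the resulting shapes (a quick sketch, assuming the variables above) confirms the 80/20 proportion:

      print("Training set:", X_train.shape)
      print("Test set    :", X_test.shape)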

↪ 4. Model training

The code below standardizes the features with StandardScaler to prepare the data for machine learning. It fits the scaler on the training data (X_train), learning each feature's mean and standard deviation, via the fit_transform() function, and then applies the same transformation to the test data (X_test) via the transform() function, preventing data leakage. The scaled arrays, X_train_scaled and X_test_scaled, have a mean near zero and a standard deviation of one. This ensures that all features are on a similar scale, which can improve model performance.

      from sklearn.preprocessing import StandardScaler
      scaler = StandardScaler()
      X_train_scaled = scaler.fit_transform(X_train)
      X_test_scaled = scaler.transform(X_test)
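
An optional check (a sketch, not part of the original listing) confirms that the scaled training features have a mean near zero and a standard deviation near one:

      print("Means:", X_train_scaled.mean(axis=0))   # values close to 0
      print("Stds :", X_train_scaled.std(axis=0))    # values close to 1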

This code below initializes and trains a Support Vector Machine (SVM) classifier. It begins by creating an SVC object, representing the SVM classifier from the scikit-learn library. The kernel='linear' parameter specifies that a linear kernel will be used, meaning the model will try to find a straight-line boundary to separate the classes. The C=1.0 parameter is a regularization parameter that controls the trade-off between achieving a low error on the training data and having a simple decision boundary. A smaller value indicates greater emphasis on having a simpler decision boundary. Finally, random_state=0 is set for reproducibility, ensuring the same result if the code is run multiple times.

The fit() method is then called on the svm_classifier object, passing in the scaled training features (X_train_scaled) and the training target variable (y_train). This step is where the SVM model learns the optimal decision boundary from the training data. It determines the support vectors and the weights that define the separating hyperplane in the feature space. After this step, the svm_classifier is trained and ready to be used to make predictions on new, unseen data.

      from sklearn.svm import SVC
      svm_classifier = SVC(kernel='linear', C=1.0, random_state=0)
      svm_classifier.fit(X_train_scaled, y_train)

↪ 5. Prediction

The code below uses the trained Support Vector Machine (SVM) classifier to make predictions on unseen test data and then organizes those predictions for analysis. First, the predict() method of svm_classifier is called with the scaled test features (X_test_scaled). This generates the predicted class labels for each instance in the test set, which are stored in the prediction variable. These predictions are then converted into a Pandas DataFrame named y_pred, with the column labeled 'Pred'. This prepares the predicted results for easy comparison with the true labels.

Next, the code concatenates the true test labels (y_test) and the predicted labels (y_pred) into a single DataFrame named dframe. The .reset_index(drop=True) is used to reset the index of y_test to avoid any issues during concatenation, ensuring alignment of the predicted and true labels based on their position in the data. The axis=1 argument specifies that the concatenation should occur column-wise, placing the true labels and predictions side by side. This new dframe allows for direct comparison and further evaluation of the model's performance.

      prediction = svm_classifier.predict(X_test_scaled)
      # Compare the actual and predicted values
      y_pred = pd.DataFrame(prediction, columns=['Pred'])
      dframe = pd.concat([y_test.reset_index(drop=True),y_pred], axis=1)
      print(dframe.head(10))

      ---Output---
      #    UP2  Pred
      # 0    0     1
      # 1    0     1
      # 2    1     1
      # 3    0     1
      # 4    1     1
      # 5    0     1
      # 6    1     1
      # 7    1     1
      # 8    1     1
      # 9    1     1

↪ 6. Classification Report

The code below evaluates the performance of the trained Support Vector Machine (SVM) model. It imports two key metrics from scikit-learn: accuracy_score and classification_report. The accuracy_score() function is used to calculate the accuracy of the model by comparing the true test labels (y_test) with the model's predictions (prediction).

      from sklearn.metrics import accuracy_score, classification_report
      accuracy = accuracy_score(y_test, prediction)
      print(f"Accuracy: {accuracy:.2f}")

      ---Output---
      # Accuracy: 0.56

The classification_report provides more details, such as precision, recall, and F1-score, for each class.

      print("Classification Report:")
      print(classification_report(y_test, prediction))

      ---Output---
      # Classification Report:
      #               precision    recall  f1-score   support
      #
      #            0       0.00      0.00      0.00        18
      #            1       0.56      1.00      0.72        23
      #
      #     accuracy                           0.56        41
      #    macro avg       0.28      0.50      0.36        41
      # weighted avg       0.31      0.56      0.40        41

↪ 7. Confusion Matrix

The code below calculates and extracts values from the confusion matrix, a crucial tool for understanding the performance of a classification model. First, it imports the confusion_matrix() function. This function takes the true test labels (y_test) and the predicted labels (prediction) as input and computes the confusion matrix, which summarizes the counts of true positives, true negatives, false positives, and false negatives.

      from sklearn.metrics import confusion_matrix
      TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
      print('True Positive (TP)  = ', TP)
      print('False Positive (FP) = ', FP)
      print('True Negative (TN)  = ', TN)
      print('False Negative (FN) = ', FN)

      ---Output---
      # True Positive (TP)  = 23
      # False Positive (FP) = 18
      # True Negative (TN)  = 0
      # False Negative (FN) = 0

The accuracy score can be calculated using the formula below.

      accuracy =  (TP+TN) / (TP+FP+TN+FN)
      print('Accuracy of the binary classification = {:0.3f}'.format(accuracy))

      ---Output---
      # Accuracy of the binary classification = 0.561
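
The same counts also yield the precision, recall, and F1-score for the positive class reported earlier by classification_report. The lines below are a small sketch of those formulas (with the counts above they reproduce the class-1 row of the report: 0.56, 1.00, 0.72); in practice a zero denominator would need to be guarded against.

      precision = TP / (TP + FP)
      recall    = TP / (TP + FN)
      f1 = 2 * precision * recall / (precision + recall)
      print('Precision = {:0.3f}'.format(precision))
      print('Recall    = {:0.3f}'.format(recall))
      print('F1-score  = {:0.3f}'.format(f1))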

↪ 8. Visualize the hyperplane

The code trains a linear Support Vector Machine (SVM) classifier and then visualizes the decision boundary it learns. The model is refit on the unscaled training data (X_train and y_train) so that the boundary can be drawn in the original feature units. After training, the code extracts the coefficients (weights) and the intercept that define the linear decision boundary from the trained model.

Next, the code prepares for visualization by generating points along a line that represents the decision boundary using the calculated coefficients and intercept. It then creates a scatter plot where each point represents a data sample, colored according to its class (y-values). The decision boundary line is plotted over this scatter plot. Additionally, the code highlights support vectors, which are the key data points that influence the decision boundary, on the plot.

      svm_classifier = SVC(kernel='linear', C=1.0, random_state=0)
      svm_classifier.fit(X_train, y_train)
      clf = svm_classifier
 
      # Get the coefficients and intercept of the hyperplane
      w1 = clf.coef_[0,0]
      w2 = clf.coef_[0,1]
      b = clf.intercept_[0]
 
      # Generate points for the decision boundary line
      x_line = np.linspace(X['Prev Close'].min() - 1, X['Prev Close'].max() + 1, 300)
      y_line = -(w1 / w2) * x_line - (b / w2)
 
      # Scatter plot
      import matplotlib.pyplot as plt
      plt.scatter(X['Prev Close'], X['SMA20'], c=y, cmap=plt.cm.Paired, edgecolors='k', marker='o', s=30, label = 'UP')
      plt.xticks(np.arange(1350, 1750, step=50))
      plt.yticks(np.arange(1350, 1750, step=50))
 
      # Plot the decision boundary (hyperplane)
      plt.plot(x_line, y_line, color='r', linestyle='dashed', linewidth=2, label='Decision Boundary (Hyperplane)')
 
      # Highlight support vectors
      plt.scatter(clf.support_vectors_[:, 0], clf.support_vectors_[:, 1], s=50, facecolors='none', edgecolors='k', linewidth=1, label='Support Vectors')
 
      # Set labels, title, and legend
      plt.xlabel('Feature 1')
      plt.ylabel('Feature 2')
      plt.title('Hyperplane Visualization')
      plt.legend()
      plt.show()

[Figure: Hyperplane visualization]

↪ 9. SVM Kernels

Support Vector Machines (SVMs) are powerful classification algorithms. A kernel function is a way of implicitly mapping data into a higher-dimensional space, allowing SVMs to find linear decision boundaries in that higher space, even if the data isn't linearly separable in the original input space.

Linear Kernel: This is the simplest kernel, essentially performing a linear separation in the input space. It works best when the data is already linearly separable or when the number of features is much larger than the number of samples. Typical Use Cases: Text classification, sentiment analysis, and other scenarios where the feature space is high-dimensional.

      svm_classifier = SVC(kernel='linear', C=1.0, random_state=0)

Polynomial Kernel: This kernel introduces non-linearity by mapping the input data into a higher-dimensional space using a polynomial function. It's defined by the degree of the polynomial, allowing for more complex decision boundaries. Typical Use Cases: Image classification and other cases where data may have curved or polynomial-like separations.

      svm_classifier = SVC(kernel='poly', C=1.0, random_state=1)

Radial Basis Function (RBF) Kernel (also called the Gaussian Kernel): The RBF kernel is one of the most popular and versatile kernels. It maps data into an infinite-dimensional space using a Gaussian function. It is particularly effective at capturing complex relationships and is widely used when the decision boundary is expected to be non-linear and complicated. Typical Use Cases: Image classification, object recognition, bioinformatics, and many other applications where linear separation is unlikely.

      svm_classifier = SVC(kernel='rbf', C=1.0, random_state=1)

Sigmoid Kernel: This kernel resembles the activation function used in neural networks. It can be effective for some datasets, but it is less commonly used than the RBF and polynomial kernels, and in some cases it performs similarly to a linear kernel. Typical Use Cases: It can be applied to any dataset but is less common than RBF; it appears mainly in work related to neural networks.

      svm_classifier = SVC(kernel='sigmoid', C=1.0, random_state=1)
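
As an optional experiment, the loop below is a sketch that reuses the scaled training and test data from the earlier steps and compares the four kernels side by side; the actual scores depend on the data and parameters:

      from sklearn.svm import SVC
      from sklearn.metrics import accuracy_score

      for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
          clf = SVC(kernel=kernel, C=1.0, random_state=0)
          clf.fit(X_train_scaled, y_train)
          acc = accuracy_score(y_test, clf.predict(X_test_scaled))
          print(f"{kernel:8s} accuracy: {acc:.2f}")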

↪ 10. Strengths and Weaknesses

Strengths:

  • By using different kernel functions, SVMs can model complex, non-linear relationships in the data, allowing them to handle a wide range of classification problems.
  • With proper regularization (the 'C' parameter), SVMs can generalize well to unseen data, minimizing the risk of overfitting compared to other algorithms.
  • Once trained, the model is compact and only needs to store the support vectors (usually a small subset of data points), leading to more efficient prediction times and lower memory footprint compared to other models.

Weaknesses:

  • The performance of SVMs depends heavily on selecting the right kernel and tuning its associated parameters (e.g., gamma for the RBF kernel, degree for the polynomial kernel, and the regularization parameter C). Finding optimal parameters can be computationally expensive and requires careful cross-validation (see the sketch after this list).
  • Training an SVM can be computationally expensive for large datasets, particularly with non-linear kernels, because training time grows roughly quadratically (or worse) with the number of samples.
  • SVMs can be sensitive to noisy data and outliers, because noisy points can become support vectors and shift the decision boundary significantly.
  • While SVMs can achieve good results, it is often difficult to understand the reasoning behind the model's predictions: the decision boundary is determined by the support vectors, which are hard to interpret in practical use.
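
As an illustration of the tuning effort described above, the snippet below is a minimal sketch (hypothetical parameter grid, reusing the scaled training data from this module) of a cross-validated search with GridSearchCV:

      from sklearn.model_selection import GridSearchCV
      from sklearn.svm import SVC

      # Hypothetical grid for illustration; real grids depend on the problem
      param_grid = {
          'kernel': ['linear', 'rbf'],
          'C': [0.1, 1.0, 10.0],
          'gamma': ['scale', 0.1, 1.0],   # used by the rbf kernel, ignored by linear
      }
      grid = GridSearchCV(SVC(random_state=0), param_grid, cv=5, scoring='accuracy')
      grid.fit(X_train_scaled, y_train)
      print("Best parameters :", grid.best_params_)
      print("Best CV accuracy: {:.2f}".format(grid.best_score_))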