XGBoost Classification

XGBoost (Extreme Gradient Boosting) classification is a highly effective and widely used supervised machine learning algorithm for classification problems. It's an ensemble learning method that leverages the power of gradient boosting to combine the predictions of multiple decision trees into a single, more accurate classification model. The core principle of XGBoost classification involves iteratively building decision trees, where each tree is trained to correct the errors made by its predecessors, thereby boosting the overall performance. This process is enhanced by regularization techniques that prevent overfitting, leading to models that generalize well to unseen data.
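To make the boosting idea concrete, the sketch below illustrates the additive loop on synthetic data, using scikit-learn regression trees as the base learners. It is a simplified illustration only; XGBoost's actual implementation adds regularization, second-order gradient information, and many performance optimizations.

      # Simplified sketch of gradient boosting for binary classification.
      # Synthetic data and sklearn trees are used purely for illustration.
      import numpy as np
      from sklearn.tree import DecisionTreeRegressor

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 1))
      y = (X[:, 0] > 0).astype(int)              # toy binary target

      F = np.zeros(len(y))                       # raw scores (log-odds), start at 0
      learning_rate, n_rounds = 0.1, 50
      for _ in range(n_rounds):
          p = 1.0 / (1.0 + np.exp(-F))           # current probability estimates
          residual = y - p                       # negative gradient of the log loss
          tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
          F += learning_rate * tree.predict(X)   # each new tree corrects its predecessors

      print("Training accuracy:", ((F > 0).astype(int) == y).mean())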

XGBoost also incorporates features such as handling missing values and parallel processing, making it both accurate and efficient, especially with large datasets.

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Prepare data

  ➤ 3. Split data

  ➤ 4. Model training

  ➤ 5. Prediction

  ➤ 6. Confusion Matrix

  ➤ 7. Classification Report

  ➤ 8. Strengths and Weaknesses

XGBoost Classification

↪ 1. Import data

Import the data using the read_csv() function from the pandas library.

      import pandas as pd
      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
      #       dtype='object')
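Optionally, a quick look at the first few rows and the shape of the DataFrame confirms that the data loaded as expected; the snippet below is an illustrative check only.

      # Optional sanity check: preview the first rows and the overall shape.
      print(data.head())
      print(data.shape)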

XGBoost Classification

↪ 2. Prepare data

The approach aims to predict whether a stock price will rise or fall using the 20-day Simple Moving Average (SMA20) as an independent variable.

The code first removes any rows with missing values (NaNs) from the dataset. This step is important for most machine learning models, as they often cannot handle missing data.

Next, a new column 'UP2' is derived from the existing 'UP' column: if the value in 'UP' is the string 'UP', the corresponding value in 'UP2' is set to 1; otherwise, it is set to 0. This converts the categorical 'UP' data into a binary numerical representation. Finally, the 'SMA20' column is extracted as the independent variable (x), and 'UP2' is used as the dependent variable (y).

      import numpy as np
      data = data.dropna()
      data['UP2'] = np.where(data['UP'] == 'UP', 1, 0)
      # print(data.head(5))
      x = data[['SMA20']]
      y = data[['UP2']]
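As an optional sanity check, the class balance of the new binary target can be inspected before modelling; the snippet below assumes the data, x, and y objects created above.

      # Optional: inspect the class balance of the binary target.
      print(data['UP2'].value_counts())
      print(x.head())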

XGBoost Classification

↪ 3. Split data

The code below performs a train-test split on a dataset. It utilizes the train_test_split() function to divide the data into training and testing sets. x represents the input features or independent variables, and y represents the corresponding target or dependent variable. The split is performed such that 20% of the data is allocated to the test set (test_size=0.2), while the remaining 80% is used for training. The shuffle = True argument ensures that the data is randomly shuffled before the split, which is crucial for preventing bias and ensuring that both training and test sets are representative of the overall dataset.

      from sklearn.model_selection import train_test_split
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True)
 
      print("x_train.shape: ", x_train.shape )
      print("x_test.shape: ", x_test.shape )
      print("y_train.shape: ", y_train.shape )
      print("y_test.shape", y_test.shape )

      ---Output---
      # x_train.shape:  (160, 1)
      # x_test.shape:  (41, 1)
      # y_train.shape:  (160, 1)
      # y_test.shape (41, 1)
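Note that with shuffle = True and no fixed seed, each run produces a different split, so the numbers reported later will vary slightly. For reproducible results, a random_state can be passed, as in the illustrative variant below (the seed value 42 is an arbitrary choice).

      # Optional: fix the seed so the split, and all downstream metrics, are reproducible.
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                          shuffle=True, random_state=42)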

XGBoost Classification

↪ 4. Model training

The code below demonstrates the setup and training of an XGBoost classification model using the xgboost library. It begins by importing the XGBClassifier class, which provides the functionality for XGBoost-based classification. Then, an instance of the XGBClassifier is created and assigned to the variable xgb_classifier. During initialization, key hyperparameters are specified as follows.

    – n_estimators is set to 100, indicating that the model will use 100 boosting rounds (decision trees);
    – learning_rate is set to 0.1, controlling the step size at each round; and
    – max_depth is set to 3, limiting the maximum depth of individual decision trees.

These parameters control the model's complexity and learning process.

Following the model's setup, the fit() method is called on the xgb_classifier object, using x_train as the feature data and y_train as the corresponding target labels from the training dataset. This is the core step where the model learns from the provided training data by iteratively building decision trees and minimizing the classification error.

      from xgboost import XGBClassifier
      xgb_classifier = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
      xgb_classifier.fit(x_train, y_train)
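The hyperparameter values above are reasonable defaults rather than tuned choices. One common way to tune them is a cross-validated grid search; the sketch below uses scikit-learn's GridSearchCV with an illustrative grid.

      # Optional sketch: cross-validated grid search over the key hyperparameters.
      # The grid values are illustrative, not recommendations.
      from sklearn.model_selection import GridSearchCV
      param_grid = {'n_estimators': [50, 100, 200],
                    'learning_rate': [0.05, 0.1, 0.2],
                    'max_depth': [2, 3, 4]}
      search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring='accuracy')
      search.fit(x_train, y_train.values.ravel())   # flatten the one-column target
      print(search.best_params_)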

XGBoost Classification

↪ 5. Prediction

The xgb_classifier.predict() method generates predictions for each data point in the x_test, which are then stored in a pandas DataFrame for easy comparison with the actual dependent values (y_test). The actual and predicted values are combined into a single DataFrame, allowing for a clear visualization of the model's performance.

      prediction = xgb_classifier.predict(x_test)
      y_pred = pd.DataFrame(prediction, columns=['Pred'])
      dframe = pd.concat([y_test.reset_index(drop=True),y_pred], axis=1)
      print(dframe.head(5))

      ---Output---
      #    UP2  Pred
      # 0    1     0
      # 1    0     1
      # 2    0     0
      # 3    1     1
      # 4    1     1
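Besides hard class labels, the model can also return class probabilities via predict_proba(), which is useful when a custom decision threshold is preferred; the snippet below is optional.

      # Optional: class probabilities for the first test rows; column 1 is P(UP2 = 1).
      proba = xgb_classifier.predict_proba(x_test)
      print(proba[:5])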

XGBoost Classification

↪ 6. Confusion Matrix

The code below calculates and displays a confusion matrix to evaluate the performance of the classification model. It uses the confusion_matrix() function from scikit-learn, which summarizes the model's predictions against the true labels.

From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

      from sklearn.metrics import confusion_matrix
      cm = confusion_matrix(y_test, prediction)
      print(cm)

      ---Output---
      # [[10  8]
      #  [12 11]]

The .ravel() method flattens the confusion matrix into a 1D array. In scikit-learn, the 2×2 confusion matrix for binary classification is laid out as [[TN, FP], [FN, TP]], so unpacking the flattened array yields TN, FP, FN, and TP in that order.

      TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
      print('True Positive (TP)  = ', TP)
      print('False Positive (FP) = ', FP)
      print('True Negative (TN)  = ', TN)
      print('False Negative (FN) = ', FN)

      ---Output---
      # True Positive (TP)  =  11
      # False Positive (FP) =  8
      # True Negative (TN)  =  10
      # False Negative (FN) =  12
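The matrix can also be visualized as a heatmap; the optional snippet below uses scikit-learn's ConfusionMatrixDisplay and requires matplotlib.

      # Optional: plot the confusion matrix as a heatmap (requires matplotlib).
      import matplotlib.pyplot as plt
      from sklearn.metrics import ConfusionMatrixDisplay
      ConfusionMatrixDisplay(confusion_matrix=cm).plot()
      plt.show()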

The accuracy score can be calculated from these counts using the following formula.

      accuracy =  (TP+TN) / (TP+FP+TN+FN)
      print('Accuracy of the classification = {:0.3f}'.format(accuracy))

      ---Output---
      # Accuracy of the classification = 0.512

Alternatively, the accuracy_score() function calculates the accuracy of a classification model's predictions.

      from sklearn.metrics import accuracy_score
      accuracy = accuracy_score(y_test, prediction)
      print(f"Accuracy: {accuracy}")

      ---Output---
      # Accuracy: 0.5121951219512195

XGBoost Classification

↪ 7. Classification Report

The classification_report() function generates a text report showing the main classification metrics for each class in the dataset.

      from sklearn.metrics import classification_report
      print(classification_report(y_test, prediction))

      ---Output---
      #               precision    recall  f1-score   support
      #
      #            0       0.45      0.56      0.50        18
      #            1       0.58      0.48      0.52        23
      #
      #     accuracy                           0.51        41
      #    macro avg       0.52      0.52      0.51        41
      # weighted avg       0.52      0.51      0.51        41
  • precision: The ratio of true positives to the total predicted positives (true positives + false positives). It measures how many of the positively predicted instances were actually positive.
  • recall (Sensitivity): The ratio of true positives to the total actual positives (true positives + false negatives). It measures how many of the actual positive instances were correctly predicted.
  • f1-score: The harmonic mean of precision and recall. It provides a balanced measure that considers both precision and recall. A higher f1-score indicates better performance (these formulas are applied in the cross-check sketch after this list).
  • support: The number of actual instances in each class in the test set.
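As a cross-check, these metrics for the positive class can be computed directly from the confusion-matrix counts obtained earlier (TP, FP, FN); the values should match the '1' row of the report above.

      # Optional cross-check: precision, recall, and F1 for class 1 from the counts above.
      precision = TP / (TP + FP)
      recall = TP / (TP + FN)
      f1 = 2 * precision * recall / (precision + recall)
      print('precision = {:0.2f}, recall = {:0.2f}, f1 = {:0.2f}'.format(precision, recall, f1))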
XGBoost Classification

↪ 8. Strengths and Weaknesses

Strengths

  • Can effectively model non-linear relationships and complex interactions within the data.
  • Can handle missing data efficiently, which is a significant benefit in real-world datasets.
  • Designed for efficiency, enabling it to handle large datasets and complex models relatively quickly due to parallel processing and optimized implementations.
  • Offers a rich set of hyperparameters, allowing users to fine-tune the model's behavior and adapt it to diverse classification problems.
  • Provides parameters for handling imbalanced datasets, making it suitable for situations where one class is more common than others (a sketch of this follows at the end of this section).
Weaknesses

  • The presence of numerous hyperparameters can make the tuning process challenging and time-consuming, requiring expertise and a good understanding of the algorithm's internals.
  • Can be computationally intensive for large datasets or complex models, potentially requiring significant resources and time to train and optimize.
  • Like most tree-based methods, XGBoost can be difficult to interpret directly, making it a “black box” where the reasoning behind individual predictions is not easily apparent.
  • XGBoost requires all input features to be numerical; categorical features must be encoded or converted to a numerical representation.
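As noted in the strengths above, XGBoost exposes parameters for imbalanced data. A common illustrative pattern is to set scale_pos_weight to the ratio of negative to positive training examples, as sketched below using the training split created earlier; the variable name xgb_weighted is hypothetical.

      # Illustrative sketch: reweight the positive class for an imbalanced target.
      # scale_pos_weight is commonly set to (number of negatives / number of positives).
      neg = int((y_train['UP2'] == 0).sum())
      pos = int((y_train['UP2'] == 1).sum())
      xgb_weighted = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                                   scale_pos_weight=neg / pos)
      xgb_weighted.fit(x_train, y_train)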