XGBoost Classification
XGBoost (Extreme Gradient Boosting) classification is a highly effective and widely used supervised machine learning algorithm for classification problems. It's an ensemble learning method that leverages the power of gradient boosting to combine the predictions of multiple decision trees into a single, more accurate classification model. The core principle of XGBoost classification involves iteratively building decision trees, where each tree is trained to correct the errors made by its predecessors, thereby boosting the overall performance. This process is enhanced by regularization techniques that prevent overfitting, leading to models that generalize well to unseen data.
XGBoost also incorporates features such as handling missing values and parallel processing, making it both accurate and efficient, especially with large datasets.
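The boosting idea can be sketched in a few lines of plain Python. The example below is a simplified illustration only: it fits each new tree to the residual errors of the current ensemble on a synthetic regression problem, mirroring the additive, error-correcting principle described above, but it omits XGBoost's actual loss function, regularization, and tree-construction details.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                      # synthetic feature
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)      # synthetic target

learning_rate = 0.1
ensemble_prediction = np.zeros_like(y)                     # start from a constant prediction
for _ in range(100):                                       # 100 boosting rounds
    residual = y - ensemble_prediction                     # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    ensemble_prediction += learning_rate * tree.predict(X) # add a shrunken correction

print("Training MSE after boosting:", np.mean((y - ensemble_prediction) ** 2))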
This module outlines the following steps:
➤ 1. Import data
➤ 2. Prepare data
➤ 3. Split data
➤ 4. Model training
➤ 5. Prediction
➤ 6. Confusion Matrix
➤ 7. Classification Report
➤ 8. Strengths and Weaknesses
↪ 1. Import data
Import the data using the read_csv() function from the pandas library.
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv',
                   header=0)
print(data.columns)
---Output---
# Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
#        'Low Price', 'Last Price', 'Close Price', 'Average Price',
#        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
#        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
#       dtype='object')
↪ 2. Prepare data
The approach aims to predict whether a stock price will rise or fall using the 20-day Simple Moving Average (SMA20) as an independent variable.
The code first removes any rows with missing values (NaNs) from the dataset. This step is crucial for most machine learning models, as they often cannot handle missing data. Then, it extracts the 'SMA20' column, which will be used as the independent variable (x) for the model.
Next, a new column 'UP2' is created and used as the dependent variable y. It is derived from the existing 'UP' column: if the value in 'UP' is equal to the string 'UP', the corresponding value in 'UP2' is set to 1; otherwise, it is set to 0. This converts the categorical 'UP' data into a binary numerical representation.
import numpy as np

data = data.dropna()                                  # remove rows with missing values
data['UP2'] = np.where(data['UP'] == 'UP', 1, 0)      # encode 'UP' as 1/0
# print(data.head(5))
x = data[['SMA20']]                                   # independent variable
y = data[['UP2']]                                     # dependent variable
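Before training, it can also be helpful to check how balanced the two classes are, since a strong imbalance would affect how the later metrics should be read. A quick check, assuming the data frame prepared above:
# Count how many rows fall into each class (1 = 'UP', 0 = otherwise)
print(data['UP2'].value_counts())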
↪ 3. Split data
The code below performs a train-test split on a dataset. It utilizes the train_test_split() function to divide the data into training and testing sets. x represents the input features or independent variables, and y represents the corresponding target or dependent variable. The split is performed such that 20% of the data is allocated to the test set (test_size=0.2), while the remaining 80% is used for training. The shuffle = True argument ensures that the data is randomly shuffled before the split, which is crucial for preventing bias and ensuring that both training and test sets are representative of the overall dataset.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True)
print("x_train.shape: ", x_train.shape )
print("x_test.shape: ", x_test.shape )
print("y_train.shape: ", y_train.shape )
print("y_test.shape", y_test.shape )
---Output---
# x_train.shape:  (160, 1)
# x_test.shape:  (41, 1)
# y_train.shape:  (160, 1)
# y_test.shape (41, 1)
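Note that with shuffle=True and no fixed seed, every run draws a different random split, so the exact numbers shown in the rest of this module may differ between runs. If reproducibility is needed, a random_state can be passed; the value 42 below is arbitrary.
# Optional: fix the seed so the split (and downstream results) are reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                    shuffle=True, random_state=42)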
↪ 4. Model training
The code below demonstrates the setup and training of an XGBoost classification model using the xgboost library. It begins by importing the XGBClassifier class, which provides the functionality for XGBoost-based classification. Then, an instance of XGBClassifier is created and assigned to the variable xgb_classifier. During initialization, key hyperparameters are specified as follows:
– n_estimators is set to 100, indicating that the model will use 100 boosting rounds (decision trees);
– learning_rate is set to 0.1, controlling the step size at each round; and
– max_depth is set to 3, limiting the maximum depth of individual decision trees.
These parameters control the model's complexity and learning process.
Following the model's setup, the fit() method is called on the xgb_classifier object, using x_train as the feature data and y_train as the corresponding target labels from the training dataset. This is the core step where the model learns from the provided training data by iteratively building decision trees and minimizing the classification error.
from xgboost import XGBClassifier

xgb_classifier = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
xgb_classifier.fit(x_train, y_train)
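The regularization mentioned in the introduction is also exposed through hyperparameters. The sketch below shows one possible configuration; the parameter names (reg_lambda, reg_alpha, subsample, colsample_bytree) are standard XGBClassifier arguments, but the values are illustrative only and would normally be tuned, for example with cross-validation.
# Illustrative settings only; suitable values depend on the dataset
xgb_regularized = XGBClassifier(n_estimators=100,
                                learning_rate=0.1,
                                max_depth=3,
                                reg_lambda=1.0,        # L2 penalty on leaf weights
                                reg_alpha=0.0,         # L1 penalty on leaf weights
                                subsample=0.8,         # row sampling per boosting round
                                colsample_bytree=0.8)  # feature sampling per tree
xgb_regularized.fit(x_train, y_train)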
↪ 5. Prediction
The xgb_classifier.predict() method generates predictions for each data point in x_test, which are then stored in a pandas DataFrame for easy comparison with the actual values of the dependent variable (y_test). The actual and predicted values are combined into a single DataFrame, allowing a clear view of the model's performance.
prediction = xgb_classifier.predict(x_test)
y_pred = pd.DataFrame(prediction, columns=['Pred'])
dframe = pd.concat([y_test.reset_index(drop=True), y_pred], axis=1)
print(dframe.head(5))
---Output---
#    UP2  Pred
# 0    1     0
# 1    0     1
# 2    0     0
# 3    1     1
# 4    1     1
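Besides hard 0/1 labels, the classifier can also return class probabilities through predict_proba(), which is useful when a decision threshold other than the default is wanted. A short sketch, assuming the xgb_classifier and x_test objects from the previous steps:
# Column 0 holds the probability of class 0, column 1 the probability of class 1
proba = xgb_classifier.predict_proba(x_test)
print(proba[:5])

# predict() effectively applies a 0.5 threshold to the class-1 probability
print((proba[:5, 1] >= 0.5).astype(int))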
↪ 6. Confusion Matrix
The code below calculates and displays a confusion matrix to evaluate the performance of a classification model. It uses the confusion_matrix() function to generate a confusion matrix, which summarizes the model's predictions against the true labels.
From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, prediction)
print(cm)
---Output---
# [[10  8]
#  [12 11]]
The .ravel() method flattens the confusion matrix into a 1D array. A typical 2×2 confusion matrix (for binary classification) looks like this: [[TN, FP], [FN, TP]]
TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
print('True Positive (TP) = ', TP)
print('False Positive (FP) = ', FP)
print('True Negative (TN) = ', TN)
print('False Negative (FN) = ', FN)
---Output---
# True Positive (TP) =  11
# False Positive (FP) =  8
# True Negative (TN) =  10
# False Negative (FN) =  12
Accuracy score can be calculated using the following formula.
accuracy = (TP+TN) / (TP+FP+TN+FN)
print('Accuracy of the classification = {:0.3f}'.format(accuracy))
---Output--- # Accuracy of the classification = 0.512
Alternatively, the accuracy_score() function calculates the accuracy of a classification model's predictions.
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, prediction)
print(f"Accuracy: {accuracy}")
---Output--- # Accuracy: 0.5121951219512195
↪ 7. Classification Report
The classification_report() function generates a text report showing the main classification metrics (precision, recall, F1-score, and support) for each class in the dataset.
from sklearn.metrics import classification_report

print(classification_report(y_test, prediction))
---Output---
#               precision    recall  f1-score   support
#
#            0       0.45      0.56      0.50        18
#            1       0.58      0.48      0.52        23
#
#     accuracy                           0.51        41
#    macro avg       0.52      0.52      0.51        41
# weighted avg       0.52      0.51      0.51        41
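The per-class figures in the report can be reproduced directly from the confusion-matrix counts extracted earlier. The sketch below computes precision, recall, and F1-score for class 1; with the counts above, it should match the 0.58 / 0.48 / 0.52 row in the report.
# Metrics for class 1, derived from the TP/FP/FN counts above
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)
print('Precision (class 1) = {:0.2f}'.format(precision))
print('Recall (class 1)    = {:0.2f}'.format(recall))
print('F1-score (class 1)  = {:0.2f}'.format(f1))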
↪ 8. Strengths and Weaknesses
Strengths
– High predictive accuracy on tabular data: boosting combines many weak decision trees into a strong classifier that usually outperforms a single tree.
– Built-in L1/L2 regularization and tree constraints (such as max_depth) that help prevent overfitting.
– Native handling of missing values and support for parallel processing, making it efficient on large datasets.
– Provides feature-importance scores, which help with interpretation and feature selection.
Weaknesses
– Many hyperparameters (n_estimators, learning_rate, max_depth, regularization terms) that require careful tuning for good results.
– An ensemble of many trees is harder to interpret than a single decision tree or a linear model.
– Can still overfit small or noisy datasets if the number of boosting rounds and tree depth are not controlled.
– Training can be computationally and memory intensive for very large models.