XGBoost Regression

XGBoost (Extreme Gradient Boosting) regression is a powerful machine-learning algorithm that excels at regression tasks. It is an ensemble method that combines the predictions of multiple decision trees to produce a more accurate and robust final prediction. Unlike traditional decision trees, which can be prone to overfitting, XGBoost uses a boosting technique to iteratively build trees. Each new tree attempts to correct the errors of the previous trees, leading to a more accurate overall model.

Furthermore, XGBoost incorporates regularization techniques (L1 and L2) to prevent overfitting and improve generalization to unseen data. It is known for its exceptional performance and is widely used in machine learning competitions and real-world applications due to its ability to handle complex data relationships and provide high prediction accuracy.
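
The regularization strengths are exposed directly as hyperparameters on the regressor. The snippet below is a minimal sketch showing where the L1 (reg_alpha) and L2 (reg_lambda) penalties are set; the values shown are illustrative starting points, not tuned settings.

      from xgboost import XGBRegressor
      # reg_alpha applies an L1 penalty and reg_lambda an L2 penalty on the leaf weights;
      # larger values shrink each tree's contribution and help prevent overfitting.
      # Illustrative values only; tune them for a specific dataset.
      model = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3,
                           reg_alpha=0.0, reg_lambda=1.0)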

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Prepare data

  ➤ 3. Split data

  ➤ 4. Model training

  ➤ 5. Prediction

  ➤ 6. Actual vs. Predicted Graph

  ➤ 7. Evaluate the model

  ➤ 8. Predicting Close Price with generated input data

  ➤ 9. Strengths and Weaknesses


↪ 1. Import data

Import the data using the pandas read_csv() function.

      import pandas as pd
      import numpy as np
      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
      #       dtype='object')


↪ 2. Prepare data

      print(data.shape)

      ---Output---
      # (250, 19)

Choose the required columns for analysis. This exercise predicts the Close Price using the Previous Close Price.

      data2 = data[['Prev Close', 'Close Price']]
      data2 = data2.dropna()
      print(data2.head(5))

      ---Output---
      #    Prev Close  Close Price
      # 0     1401.55      1388.95
      # 1     1388.95      1394.85
      # 2     1394.85      1385.10
      # 3     1385.10      1380.30
      # 4     1380.30      1378.45

Define the feature (X) and target (y) columns:

      X = data2[['Prev Close']]
      y = data2['Close Price']


↪ 3. Split data

Split the data into training and test sets.

      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

The code above performs a train-test split on a dataset. It utilizes the train_test_split() function to divide the data into training and testing sets. X represents the input features or independent variables, and y represents the corresponding target or dependent variable. The split is performed such that 20% of the data is allocated to the test set (test_size=0.2), while the remaining 80% is used for training. The random_state=42 argument ensures that the split is reproducible.


↪ 4. Model training

The code below initializes and trains an XGBoost regression model using the xgboost library. It begins by importing the XGBRegressor class, which provides the functionality for XGBoost-based regression. Then, an XGBRegressor object is instantiated and assigned to the variable xgb. During instantiation, key hyperparameters are defined as follows.

    – n_estimators is set to 100, indicating that the model will use 100 boosting rounds (decision trees);
    – learning_rate is set to 0.1, controlling the step size at each round; and
    – max_depth is set to 3, limiting the maximum depth of individual decision trees.

These parameters determine the model's learning process and complexity.

Following the model's initialization, the fit() method is called on the xgb object, using X_train as the feature data and y_train as the corresponding target values from the training dataset. In this step, the model learns from the provided training data, iteratively building decision trees and minimizing the error between the predicted values and the actual values.

      from xgboost import XGBRegressor
      xgb = XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
      xgb.fit(X_train, y_train)


↪ 5. Prediction

The code below uses the trained XGBoost regression model to make predictions on new, unseen data.

      # Make predictions on the test set
      prediction = xgb.predict(X_test)

The code compares the model's predictions to the actual values from the test dataset (y_test). It creates a Pandas DataFrame to display the first 5 'Actual' versus 'Predicted' values, providing a quick visual check of the model's performance on unseen data.

      # Compare the actual and predicted values
      y_pred = pd.DataFrame(prediction, columns=['Pred'])
      dframe = pd.concat([y_test.reset_index(drop=True).astype(float),y_pred], axis=1)
      dframe.columns = ['Actual','Predicted']
      graph = dframe.head(5)
      print(graph)

      ---Output---
      #     Actual    Predicted
      # 0  1597.65  1617.206421
      # 1  1367.40  1355.175781
      # 2  1458.65  1454.996948
      # 3  1464.85  1481.486328
      # 4  1611.15  1507.974976

The code below demonstrates how to use a trained XGBoost regression model to predict an outcome for a new, unseen data point.

      from numpy import asarray
      # predict() expects a 2D array: one row per sample, one column per feature
      Predict_Close_Price = xgb.predict(asarray([[1508.80]]))
      print("Predicted Value: ", Predict_Close_Price)

      ---Output---
      # Predicted Value:  [1507.975]


↪ 6. Actual vs. Predicted Graph

Bar chart

The code below generates a bar chart that compares actual and predicted values. It uses the matplotlib.pyplot library to create a bar plot from the DataFrame “graph”.

      import matplotlib.pyplot as plt
      graph.plot(kind='bar')
      plt.title('Actual vs Predicted')
      plt.ylabel('Closing price')
      plt.show()

Actual vs Predicted



Line chart

The code below produces a visualization of the stock price predictions with two lines on the same graph: one representing the “Actual Close Price” from the y_test data, shown in blue, and another representing the “Predicted Close Price” from the prediction data, shown in red with a dashed line style.

      # Visualize the predicted vs. actual stock prices
      plt.figure(figsize=(12, 6))
      plt.plot(y_test.index, y_test.values, label='Actual Close Price', color='blue')
      plt.plot(y_test.index, prediction, label='Predicted Close Price', color='red', linestyle='dashed')
      plt.title('Stock Price Prediction with XGBoost')
      plt.xlabel('Index')   # x-axis shows the DataFrame row index of each test sample
      plt.ylabel('Stock Price')
      plt.legend()
      plt.show()

Actual vs Predicted


↪ 7. Evaluate the model

This code evaluates the performance of a regression model using two key metrics: Mean Squared Error (MSE) and R-squared (R²). The mean_squared_error function computes the average squared difference between the model's predictions (y_pred) and the true values (y_test). A lower MSE indicates better accuracy.

      from sklearn.metrics import mean_squared_error
      mse = mean_squared_error(y_test, y_pred)
      print(f"Mean Squared Error: {mse}")

      ---Output---
      # Mean Squared Error: 588.6288800881035

The MSE of approximately 588.63 represents the average squared difference between the model's predictions and the actual values in the test set. A lower MSE is better, indicating that the model's predictions are, on average, close to the true values.
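
Because the MSE is expressed in squared price units, it is often easier to interpret its square root, the Root Mean Squared Error (RMSE), which is in the same units as the Close Price. A short sketch, reusing the mse value computed above:

      import numpy as np
      rmse = np.sqrt(mse)        # same units as the Close Price, roughly 24.26 here
      print(f"Root Mean Squared Error: {rmse}")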

The r2_score function calculates the R-squared, a measure of how well the model fits the data. It typically ranges from 0 to 1, although it can be negative when a model fits worse than simply predicting the mean. A higher R² (closer to 1) suggests a better fit, indicating that the model explains a larger proportion of the variance in the dependent variable.

      from sklearn.metrics import  r2_score
      r2 = r2_score(y_test, y_pred)
      print(f"R-squared: {r2}")

      ---Output---
      # R-squared: 0.958235319285963

The R-squared value of 0.958 indicates that the model explains approximately 95.8% of the variance in the data. This is a high R-squared, suggesting a good model fit to the test data.
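
As a cross-check, R-squared can also be computed directly from its definition: one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses the prediction array from step 5 and should match the r2_score result above.

      import numpy as np
      ss_res = np.sum((y_test.values - prediction) ** 2)             # residual sum of squares
      ss_tot = np.sum((y_test.values - y_test.values.mean()) ** 2)   # total sum of squares
      print(f"R-squared (manual): {1 - ss_res / ss_tot}")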


↪ 8. Predicting Close Price with generated input data

Generate input values across the X_test range at an interval of 1, and predict the Close Price for each generated value.

      X_grid = np.arange(X_test.values.min(), X_test.values.max())
      # Reshape the data into  a len(X_grid)*1 array, i.e. to make a column out of the X_grid values 
      X_grid = X_grid.reshape((len(X_grid), 1)) 
      # Compare the predicted Close Price with the actual Close Price using scatter plot. 
      plt.figure(figsize=(16,8))
      plt.title('XGBoost Regression')
      plt.xlabel('Prices')
      plt.ylabel('Close Price')
      plt.scatter(X, y, color = "blue")
      plt.scatter(X_grid, xgb.predict(X_grid), color = 'red')
      plt.show()


↪ 9. Strengths and Weaknesses

Strengths

  • XGBoost is known for its exceptional performance in machine learning. It often achieves state-of-the-art results due to its sophisticated boosting algorithm and regularization techniques.
  • It can effectively capture complex non-linear relationships within the data due to its tree-based structure and boosting approach.
  • It can automatically handle missing data, which is a significant benefit in real-world data scenarios (a short sketch follows this list).
  • XGBoost is designed to be efficient and can handle large datasets due to its optimized implementation and parallel processing capabilities.
  • It can be more robust to outliers than some other regression models because of its boosting nature and tree-based structure.
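
As an illustration of the missing-data point above, XGBoost can be fit directly on features containing NaN values; during tree construction it learns a default direction for missing entries. The sketch below uses small synthetic data, not this module's dataset.

      import numpy as np
      from xgboost import XGBRegressor
      rng = np.random.default_rng(42)
      X_demo = rng.normal(size=(100, 1))
      y_demo = 3 * X_demo[:, 0] + rng.normal(scale=0.1, size=100)
      X_demo[::10, 0] = np.nan                  # introduce missing feature values
      model = XGBRegressor(n_estimators=50, max_depth=2)
      model.fit(X_demo, y_demo)                 # no imputation step required
      print(model.predict(np.array([[np.nan]])))
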
Weaknesses

  • XGBoost has many hyperparameters, which can make it challenging to tune and configure optimally for a given problem (a tuning sketch follows this list).
  • While scalable, training large XGBoost models can be computationally expensive and may require significant time and resources.
  • The tree-based nature of the model can make it difficult to interpret the predictions and understand why the model made certain decisions.
  • XGBoost requires all input features to be numerical; categorical features must be encoded before use.
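
One common way to manage the hyperparameter burden noted above is a cross-validated grid search over a small set of candidate values. The sketch below is one possible approach using scikit-learn's GridSearchCV together with the X_train and y_train from step 3; the grid values are illustrative, not recommendations.

      from sklearn.model_selection import GridSearchCV
      from xgboost import XGBRegressor
      # Illustrative grid; widen or refine it for a real search.
      param_grid = {
          'n_estimators': [50, 100, 200],
          'learning_rate': [0.05, 0.1, 0.2],
          'max_depth': [2, 3, 4],
      }
      search = GridSearchCV(XGBRegressor(), param_grid,
                            scoring='neg_mean_squared_error', cv=5)
      search.fit(X_train, y_train)
      print("Best parameters:", search.best_params_)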