Random Forest Regression

Random Forest Regression is a supervised machine-learning algorithm used to predict continuous dependent variables. It's an ensemble method, meaning it combines multiple decision trees to make more accurate predictions than a single tree could achieve. Each tree in the forest is built on a random subset of the data and a random subset of the features.
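To make the ensemble idea concrete, the short sketch below (an optional illustration on synthetic data, not part of this module's workflow) trains a handful of decision trees on bootstrap samples and averages their predictions, which is essentially what scikit-learn's RandomForestRegressor automates.

      # Illustrative only: bagging a few decision trees by hand on synthetic data.
      import numpy as np
      from sklearn.tree import DecisionTreeRegressor

      rng = np.random.RandomState(0)
      X = rng.uniform(0, 10, size=(200, 1))             # one synthetic feature
      y = np.sin(X).ravel() + rng.normal(0, 0.2, 200)   # noisy continuous target

      trees = []
      for i in range(5):
          idx = rng.randint(0, len(X), len(X))          # bootstrap sample (drawn with replacement)
          trees.append(DecisionTreeRegressor(random_state=i).fit(X[idx], y[idx]))

      x_new = np.array([[2.5]])
      print(np.mean([t.predict(x_new) for t in trees])) # forest prediction = average of the trees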

This module demonstrates a basic workflow of training a Random Forest Regressor, performing predictions, and evaluating model performance using Mean Squared Error and R-squared.

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Prepare data

  ➤ 3. Split data

  ➤ 4. Random Forest Regression

  ➤ 5. Actual vs. Predicted Graph

  ➤ 6. Evaluate the model

  ➤ 7. Predicting Close Price with generated input data

  ➤ 8. Effect of number of trees

  ➤ 9. Strengths and Weaknesses

Random Forest Regression

↪ 1. Import data

Import the pre-processed data for analysis. Subsequently, the data will be partitioned into training and test sets to facilitate the analysis.

      # Import required libraries.
      import pandas as pd                                 # Pandas provides the DataFrame structure for tabular data.
      import matplotlib.pyplot as plt                     # Matplotlib is the fundamental plotting library.
      import seaborn as sns                               # Seaborn builds upon Matplotlib, offering a
      import numpy as np                                  # higher-level interface for statistical visualization.
      from sklearn.metrics import mean_squared_error, r2_score

Set default style and color scheme for Seaborn plots.

      sns.set(style="ticks", color_codes=True)  

Import data using read_csv() function.

      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-cleaned.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty'],
      #       dtype='object')

The info() function can be used for checking the data type.

      data.info()

      ---Output---
      # <class 'pandas.core.frame.DataFrame'>
      # RangeIndex: 250 entries, 0 to 249
      # Data columns (total 15 columns):
      #  #   Column                  Non-Null Count  Dtype
      # ---  ------                  --------------  -----
      #  0   Symbol                  250 non-null    object
      #  1   Series                  250 non-null    object
      #  2   Date                    250 non-null    object
      #  3   Prev Close              250 non-null    float64
      #  4   Open Price              250 non-null    float64
      #  5   High Price              250 non-null    float64
      #  6   Low Price               250 non-null    float64
      #  7   Last Price              250 non-null    float64
      #  8   Close Price             250 non-null    float64
      #  9   Average Price           250 non-null    float64
      #  10  Total Traded Quantity   250 non-null    float64
      #  11  Turnover                250 non-null    float64
      #  12  No. of Trades           250 non-null    float64
      #  13  Deliverable Qty         250 non-null    float64
      #  14  % Dly Qt to Traded Qty  250 non-null    float64
      # dtypes: float64(12), object(3)
      # memory usage: 29.4+ KB

Random Forest Regression

↪ 2. Prepare data

      print(data.shape)

      ---Output---
      # (250, 15)

Choose the required columns for analysis. This exercise predicts the Close Price using the Previous Close Price.

      data2 = data[['Prev Close','Close Price']]
      print(data2.head(5))

      ---Output---
      #    Prev Close  Close Price
      # 0     1401.55      1388.95
      # 1     1388.95      1394.85
      # 2     1394.85      1385.10
      # 3     1385.10      1380.30
      # 4     1380.30      1378.45

Separate the dataset into an independent variable (x) and a dependent variable (y).

      x = data2.drop(['Close Price'], axis=1)
      y = data2[['Close Price']]

where,
  • x = data2.drop(['Close Price'], axis=1): This line creates a new DataFrame called x. It's derived from data2 by removing the 'Close Price' column. The parameter, axis=1 indicates that the operation should be performed on columns. If axis=0 were used, it would drop rows instead. Therefore, this line effectively extracts all columns except 'Close Price' to form the feature set.
  • y = data2[['Close Price']]: This line creates another DataFrame called y. It contains only the 'Close Price' column from data2. The double brackets [['Close Price']] are important; they ensure that the result is a DataFrame rather than a Series, as the short check below illustrates.
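The optional check below (not part of the main workflow) simply prints the types returned by single and double bracket selection:

      print(type(data2['Close Price']))     # <class 'pandas.core.series.Series'>
      print(type(data2[['Close Price']]))   # <class 'pandas.core.frame.DataFrame'>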
Random Forest Regression

↪ 3. Split data

Split the data into training and test sets.

      from sklearn.model_selection import train_test_split
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
      print("x_train.shape: ", x_train.shape )
      print("x_test.shape: ", x_test.shape )
      print("y_train.shape: ", y_train.shape )
      print("y_test.shape", y_test.shape )

      ---Output---
      # x_train.shape: (200, 1)
      # x_test.shape: (50, 1)
      # y_train.shape: (200, 1)
      # y_test.shape (50, 1)

Random Forest Regression

↪ 4. Random Forest Regression

The code below trains a Random Forest Regression model using the scikit-learn library. A RandomForestRegressor is initialized with 100 trees (n_estimators) and a fixed random seed (random_state) for reproducibility. The model is then trained on the training dataset. After training, the model predicts values for the test dataset (x_test), storing the predictions in rfr_pred.

      from sklearn.ensemble import RandomForestRegressor
      rfr = RandomForestRegressor(n_estimators = 100, random_state = 0)
      rfr.fit(x_train, y_train.values.ravel())   # ravel() flattens y to the 1-D array scikit-learn expects
      rfr_pred = rfr.predict(x_test)

The code compares the model's predictions to the actual values from the test dataset (y_test). It creates a Pandas DataFrame to display the first 10 'Actual' versus 'Predicted' values, providing a quick visual check of the model's performance on unseen data.

      y_pred = pd.DataFrame(rfr_pred, columns=['Pred'])
      dframe = pd.concat([y_test.reset_index(drop=True).astype(float), y_pred], axis=1)
      dframe.columns = ['Actual','Predicted']
      graph = dframe.head(10)
      print(graph)

      ---Output---
      #     Actual  Predicted
      # 0  1508.80  1501.9130
      # 1  1615.80  1642.5370
      # 2  1511.70  1515.4930
      # 3  1599.70  1618.6785
      # 4  1404.40  1417.0235
      # 5  1378.45  1372.3565
      # 6  1289.75  1313.1705
      # 7  1637.30  1617.0270
      # 8  1692.45  1685.2515
      # 9  1638.45  1632.1250

Random Forest Regression

↪ 5. Actual vs. Predicted Graph

The code snippet below generates a bar chart comparing actual and predicted closing prices by calling the DataFrame's .plot(kind='bar') method. The chart is then titled “Actual vs Predicted,” and the y-axis is labeled “Closing price” using Matplotlib's plt.title() and plt.ylabel() functions. Finally, plt.show() displays the generated chart.

      graph.plot(kind='bar')
      plt.title('Actual vs Predicted')
      plt.ylabel('Closing price')
      plt.show()

Actual vs Predicted

Random Forest Regression

↪ 6. Evaluate the model

This code evaluates the performance of a regression model using two key metrics: Mean Squared Error (MSE) and R-squared (R²). The mean_squared_error function computes the average squared difference between the model's predictions (y_pred) and the true values (y_test). A lower MSE indicates better accuracy.

      mse = mean_squared_error(y_test, y_pred)
      print(f"Mean Squared Error: {mse}")

      ---Output---
      # Mean Squared Error: 407.4686544840686

The MSE, which is 407.47, represents the average squared difference between the model's predictions and the actual values in the test set. A lower MSE is better, indicating that the model's predictions are, on average, close to the true values.
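Because the MSE is expressed in squared price units, it is often easier to interpret its square root, the Root Mean Squared Error (RMSE), which is back in the same units as the Close Price. The optional snippet below (not part of the original workflow) derives it from the MSE already computed:

      rmse = np.sqrt(mse)                           # square root puts the error back in price units
      print(f"Root Mean Squared Error: {rmse}")     # roughly 20.2 for an MSE of about 407.47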

The r2_score function calculates the R-squared, a measure of how well the model fits the data. A higher R² (closer to 1) suggests a better fit, indicating that the model explains a larger proportion of the variance in the dependent variable; values near 0 indicate a poor fit, and R² can even be negative for a model that performs worse than always predicting the mean.

      r2 = r2_score(y_test, y_pred)
      print(f"R-squared: {r2}")

      ---Output---
      # R-squared: 0.9661239499695331

The R-squared value of 0.97 indicates that the model explains approximately 97% of the variance in the data. This is a high R-squared, suggesting a good model fit to the test data.
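As a sanity check, R-squared can also be computed directly from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. The short sketch below (an optional verification, not part of the original module) should reproduce the r2_score value:

      actual = y_test.values.ravel()                    # actual Close Prices as a 1-D array
      ss_res = np.sum((actual - rfr_pred) ** 2)         # residual sum of squares
      ss_tot = np.sum((actual - actual.mean()) ** 2)    # total sum of squares
      print(1 - ss_res / ss_tot)                        # should match r2_score(y_test, y_pred)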

Random Forest Regression

↪ 7. Predicting Close Price with generated input data

Create input data spanning the x_test range at an interval of 1, and predict the Close Price for each generated value.

      x_grid = np.arange(x_test.values.min(), x_test.values.max())
      # Reshape into a len(x_grid) x 1 array, i.e. make a column out of the x_grid values.
      x_grid = x_grid.reshape((len(x_grid), 1))
      plt.figure(figsize=(16,8))
      plt.title('Random Forest Regression with Generated Input Data')
      plt.xlabel('Prices')
      plt.ylabel('Close Price')
      print("x", x.size)
      print("y", y.size)
      plt.scatter(x, y, color = "blue")
      plt.plot(x_grid, rfr.predict(x_grid), color = 'red')
      plt.show()

Random Forest Regression with Generated Input Data

Random Forest Regression

↪ 8. Effect of number of trees

The code below demonstrates the effect of changing the number of trees (n_estimators) in a Random Forest Regression model on its prediction. The code trains three separate Random Forest Regressors, each with a different number of trees (10, 100, and 200), using the same training data (x, y). The random_state is fixed to ensure consistent results across the runs. For each model, it predicts the closing price using a single input value ([[1508.80]]).

      # Trees = 10
      rfr = RandomForestRegressor(n_estimators = 10, random_state = 0)
      rfr.fit(x, y.values.ravel())
      Predict_Close_Price = rfr.predict([[1508.80]])
      print("Predicted Value: ", Predict_Close_Price)

      ---Output---
      # Predicted Value: [1516.94]

      # Trees = 100
      rfr = RandomForestRegressor(n_estimators = 100, random_state = 0)
      rfr.fit(x, y.values.ravel())
      Predict_Close_Price = rfr.predict([[1508.80]])
      print("Predicted Value: ", Predict_Close_Price)

      ---Output---
      # Predicted Value: [1512.8025]

      # Trees = 200
      rfr = RandomForestRegressor(n_estimators = 200, random_state = 0)
      rfr.fit(x, y.values.ravel())
      Predict_Close_Price = rfr.predict([[1508.80]])
      print("Predicted Value: ", Predict_Close_Price)

      ---Output---
      # Predicted Value: [1511.79]

The output shows that the predicted closing price varies slightly with the number of trees. With 10 trees, the prediction is 1516.94; increasing the number of trees to 100 refines it to 1512.8025, and increasing it further to 200 gives 1511.79. This suggests that, in this specific case, increasing the number of trees leads to a more stable and potentially more accurate prediction, although the improvement diminishes as the forest grows. The variation comes from averaging over different numbers of trees; because random_state is fixed, rerunning any one of these configurations produces the same result.
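The same comparison can be written more compactly as a loop. The sketch below (an optional rewrite of the three blocks above, not part of the original module) prints the prediction for several forest sizes so the convergence is easier to see:

      for n in [10, 50, 100, 200]:
          model = RandomForestRegressor(n_estimators=n, random_state=0)
          model.fit(x, y.values.ravel())                     # same training data as above
          print(n, "trees ->", model.predict([[1508.80]]))   # prediction for a Prev Close of 1508.80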

Random Forest Regression

↪ 9. Strengths and Weaknesses

Strengths

  • Random Forests often achieve high accuracy due to the ensemble nature of the algorithm. Averaging predictions from multiple trees reduces variance and improves predictive performance compared to a single decision tree.
  • The bagging technique makes them relatively robust to outliers in the data. Outliers are less likely to significantly impact the overall prediction because they may not be present in all subsets used to train the individual trees.
  • Random Forests can effectively handle datasets with a large number of independent variables (high dimensionality).
  • Random Forests can effectively model non-linear relationships between the independent and the dependent variable.

Weaknesses

  • Training a Random Forest can be computationally expensive, especially with large datasets and a large number of trees. The computational cost scales with the number of trees and the size of the dataset.
  • While feature importance is provided (see the short sketch after this list), Random Forests are often considered “black box” models. Understanding exactly how a particular prediction is made can be challenging due to the complexity of the ensemble of trees.
  • Random Forests can be biased toward categorical features with many levels, potentially leading to overfitting.
  • Memory Intensive: Storing many trees in memory can be resource-intensive, particularly for large models.
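Despite the interpretability limitation noted above, per-feature importance scores are directly available on a fitted model. A minimal sketch, assuming the rfr model trained earlier in this module is still in scope:

      importances = pd.Series(rfr.feature_importances_, index=x.columns)
      print(importances.sort_values(ascending=False))   # with a single feature, 'Prev Close' gets an importance of 1.0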