Random Forest Regression
Random Forest Regression is a supervised machine-learning algorithm used to predict continuous dependent variables. It's an ensemble method, meaning it combines multiple decision trees to make more accurate predictions than a single tree could achieve. Each tree in the forest is built on a random subset of the data and a random subset of the features.
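To make the averaging idea concrete, the minimal sketch below (on synthetic data, separate from the workflow that follows) shows that a fitted forest's prediction is simply the mean of its individual trees' predictions:

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic dataset: y is a noisy function of x.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y)

# The forest prediction equals the average of the individual trees'
# predictions, each tree having been grown on a bootstrap sample.
point = [[5.0]]
tree_preds = [tree.predict(point)[0] for tree in forest.estimators_]
print(forest.predict(point)[0])  # forest prediction
print(np.mean(tree_preds))       # mean of tree predictions (identical)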
This code demonstrates a basic workflow of training a Random Forest Regressor to predict a continuous target, performing predictions, and evaluating model performance using Mean Squared Error and R-squared.
This module outlines the following steps:
➤ 1. Import data
➤ 2. Prepare data
➤ 3. Split data
➤ 4. Random Forest Regression
➤ 5. Actual vs. Predicted Graph
➤ 6. Evaluate the model
➤ 7. Predicting Close Price with generated input data
➤ 8. Effect of number of trees
➤ 9. Strengths and Weaknesses
↪ 1. Import data
Import the pre-processed data for analysis. Subsequently, the data will be partitioned into training and test sets to facilitate the analysis.
# Import required libraries.
import pandas as pd
# Matplotlib is the fundamental plotting library.
import matplotlib.pyplot as plt
# Seaborn builds upon Matplotlib, offering a higher-level
# interface for statistical visualization.
import seaborn as sns
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
Set default style and color scheme for Seaborn plots.
sns.set(style="ticks", color_codes=True)
Import data using read_csv() function.
data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-cleaned.csv',
                   header=0)
print(data.columns)

The info() function can be used for checking the data types.
---Output---
# Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
#        'Low Price', 'Last Price', 'Close Price', 'Average Price',
#        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
#        '% Dly Qt to Traded Qty'],
#       dtype='object')
data.info()
---Output---
# RangeIndex: 250 entries, 0 to 249
# Data columns (total 15 columns):
#  #   Column                  Non-Null Count  Dtype
# ---  ------                  --------------  -----
#  0   Symbol                  250 non-null    object
#  1   Series                  250 non-null    object
#  2   Date                    250 non-null    object
#  3   Prev Close              250 non-null    float64
#  4   Open Price              250 non-null    float64
#  5   High Price              250 non-null    float64
#  6   Low Price               250 non-null    float64
#  7   Last Price              250 non-null    float64
#  8   Close Price             250 non-null    float64
#  9   Average Price           250 non-null    float64
#  10  Total Traded Quantity   250 non-null    float64
#  11  Turnover                250 non-null    float64
#  12  No. of Trades           250 non-null    float64
#  13  Deliverable Qty         250 non-null    float64
#  14  % Dly Qt to Traded Qty  250 non-null    float64
# dtypes: float64(12), object(3)
# memory usage: 29.4+ KB
↪ 2. Prepare data
print(data.shape)
---Output---
# (250, 15)
Choose the required columns for analysis. This exercise predicts the Close Price using the Previous Close Price.
data2 = data[['Prev Close','Close Price']]
print(data2.head(5))
---Output---
#    Prev Close  Close Price
# 0     1401.55      1388.95
# 1     1388.95      1394.85
# 2     1394.85      1385.10
# 3     1385.10      1380.30
# 4     1380.30      1378.45
Separate the dataset into an independent variable (x) and a dependent variable (y).

x = data2.drop(['Close Price'], axis=1)
y = data2[['Close Price']]

where x holds the predictor (Prev Close) and y holds the target (Close Price).
↪ 3. Split data
Split the data into training and test sets.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2)
print("x_train.shape: ", x_train.shape)
print("x_test.shape: ", x_test.shape)
print("y_train.shape: ", y_train.shape)
print("y_test.shape", y_test.shape)
---Output---
# x_train.shape:  (200, 1)
# x_test.shape:  (50, 1)
# y_train.shape:  (200, 1)
# y_test.shape (50, 1)
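Note that train_test_split shuffles the rows randomly, so each run produces a different split and slightly different downstream metrics. A minimal sketch of a reproducible alternative, assuming an arbitrary seed of 42:

# Fixing random_state makes the split, and therefore the reported
# metrics, repeatable across runs; the seed value itself is arbitrary.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)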
↪ 4. Random Forest Regression
The code below trains a Random Forest Regression model using the scikit-learn library. A RandomForestRegressor is initialized with 100 trees (n_estimators) and a fixed random seed (random_state) for reproducibility. The model is then trained using a training dataset. After training, the model predicts values for a test dataset (x_test), storing the predictions in rfr_pred.
from sklearn.ensemble import RandomForestRegressor
rfr = RandomForestRegressor(n_estimators = 100, random_state = 0)
rfr.fit(x_train, y_train)
rfr_pred = rfr.predict(x_test)
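Because each tree is grown on a bootstrap sample, the rows a tree never saw can serve as a built-in validation set. A brief sketch, assuming the same training data, of scikit-learn's out-of-bag scoring option:

# oob_score=True scores the forest on each tree's out-of-bag rows,
# giving a validation-style R-squared without a separate hold-out set.
# .values.ravel() flattens y into the 1-D array scikit-learn expects.
rfr_oob = RandomForestRegressor(n_estimators=100, random_state=0, oob_score=True)
rfr_oob.fit(x_train, y_train.values.ravel())
print("Out-of-bag R-squared:", rfr_oob.oob_score_)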
The code compares the model's predictions to the actual values from the test dataset (y_test). It creates a Pandas DataFrame to display the first 10 'Actual' versus 'Predicted' values, providing a quick visual check of the model's performance on unseen data.
y_pred = pd.DataFrame(rfr_pred, columns=['Pred'])
dframe = pd.concat([y_test.reset_index(drop=True).astype(float), y_pred], axis=1)
dframe.columns = ['Actual','Predicted']
graph = dframe.head(10)
print(graph)
---Output---
#     Actual  Predicted
# 0  1508.80  1501.9130
# 1  1615.80  1642.5370
# 2  1511.70  1515.4930
# 3  1599.70  1618.6785
# 4  1404.40  1417.0235
# 5  1378.45  1372.3565
# 6  1289.75  1313.1705
# 7  1637.30  1617.0270
# 8  1692.45  1685.2515
# 9  1638.45  1632.1250
↪ 5. Actual vs. Predicted Graph
The code snippet below generates a bar chart comparing actual and predicted closing prices. It calls the DataFrame's .plot(kind='bar') method, titles the chart "Actual vs Predicted" and labels the y-axis "Closing price" using Matplotlib's plt.title() and plt.ylabel() functions, and finally displays the chart with plt.show().
graph.plot(kind='bar')
plt.title('Actual vs Predicted')
plt.ylabel('Closing price')
plt.show()

↪ 6. Evaluate the model
This code evaluates the performance of a regression model using two key metrics: Mean Squared Error (MSE) and R-squared (R²). The mean_squared_error function computes the average squared difference between the model's predictions (y_pred) and the true values (y_test). A lower MSE indicates better accuracy.
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
---Output---
# Mean Squared Error: 407.4686544840686
The MSE of 407.47 represents the average squared difference between the model's predictions and the actual values in the test set. A lower MSE is better, indicating that the model's predictions are, on average, close to the true values. Note that MSE is expressed in squared price units, which makes it hard to read directly.
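A small follow-up sketch computes the root mean squared error (RMSE), which is back in the same units as the Close Price:

# RMSE is the square root of MSE and reads in the target's own units,
# answering "how far off is a typical prediction, in price terms?"
rmse = np.sqrt(mse)
print(f"Root Mean Squared Error: {rmse}")  # approx. 20.19 for this run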
The r2_score function calculates the R-squared, a measure of how well the model fits the data. An R² of 1 means the model explains all of the variance in the dependent variable, 0 means it does no better than always predicting the mean, and the score can even be negative for a model that fits worse than that baseline. A higher R² (closer to 1) therefore suggests a better fit.
r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")
---Output---
# R-squared: 0.9661239499695331
The R-squared value of 0.97 indicates that the model explains approximately 97% of the variance in the data. This is a high R-squared, suggesting a good model fit to the test data.
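For intuition, R² can be reproduced directly from its definition, one minus the ratio of the residual sum of squares to the total sum of squares. A short sketch using the y_test and y_pred from above:

# R-squared = 1 - SS_res / SS_tot, where SS_res sums the squared
# residuals and SS_tot sums the squared deviations from the mean.
actual = y_test.values.ravel()
predicted = y_pred.values.ravel()
ss_res = np.sum((actual - predicted) ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
print("Manual R-squared:", 1 - ss_res / ss_tot)  # matches r2_score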
↪ 7. Predicting Close Price with generated input data
Create input data spanning the x_test range at an interval of 1, and predict the Close Price for each generated input value.
x_grid = np.arange(x_test.values.min(), x_test.values.max())
# Reshape into a len(x_grid) x 1 array, i.e. make a column out of the x_grid values.
x_grid = x_grid.reshape((len(x_grid), 1))
plt.figure(figsize=(16,8))
plt.title('Random Forest Regression with Generated Input Data')
plt.xlabel('Prev Close')
plt.ylabel('Close Price')
print("x", x.size)
print("y", y.size)
plt.scatter(x, y, color = "blue")
plt.plot(x_grid, rfr.predict(x_grid), color = 'red')
plt.show()

↪ 8. Effect of number of trees
The code below demonstrates the effect of changing the number of trees (n_estimators) in a Random Forest Regression model on its prediction. The code trains three separate Random Forest Regressors, each with a different number of trees (10, 100, and 200), using the full dataset (x, y). The random_state is fixed to ensure consistent results across the runs. For each model, it predicts the closing price for a single input value ([[1508.80]]).
# Trees = 10
rfr = RandomForestRegressor(n_estimators = 10, random_state = 0)
rfr.fit(x, y)
Predict_Close_Price = rfr.predict([[1508.80]])
print("Predicted Value: ", Predict_Close_Price)
---Output---
# Predicted Value:  [1516.94]
# Trees = 100
rfr = RandomForestRegressor(n_estimators = 100, random_state = 0)
rfr.fit(x, y)
Predict_Close_Price = rfr.predict([[1508.80]])
print("Predicted Value: ", Predict_Close_Price)
---Output---
# Predicted Value:  [1512.8025]
# Trees = 200
rfr = RandomForestRegressor(n_estimators = 200, random_state = 0)
rfr.fit(x, y)
Predict_Close_Price = rfr.predict([[1508.80]])
print("Predicted Value: ", Predict_Close_Price)
---Output---
# Predicted Value:  [1511.79]
The output shows that the predicted closing price varies slightly with the number of trees. With 10 trees the prediction is 1516.94; increasing to 100 trees gives 1512.8025, and 200 trees gives 1511.79. This suggests that adding trees produces a more stable prediction, with diminishing returns as the forest grows. Note that with a fixed random_state each configuration is fully reproducible; the differences across the three runs come from averaging over differently sized ensembles, not from residual randomness.
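To see the stabilisation more systematically, one could sweep over several forest sizes and record the prediction for the same input. A minimal sketch, assuming the x and y prepared above (the list of sizes is arbitrary):

# Sweep over ensemble sizes and watch the prediction settle.
# The query value 1508.80 matches the example above.
for n in [10, 50, 100, 200, 500]:
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(x, y.values.ravel())
    pred = model.predict([[1508.80]])
    print(f"Trees = {n:4d} -> Predicted Close Price: {pred[0]:.4f}")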
↪ 9. Strengths and Weaknesses
Strengths
➤ Handles non-linear relationships without requiring feature scaling or transformation.
➤ Averaging many de-correlated trees reduces overfitting relative to a single decision tree.
➤ Robust to outliers and noisy features.
➤ Provides feature importance estimates as a by-product of training.
Weaknesses
➤ Less interpretable than a single decision tree.
➤ Training and prediction grow slower as the number of trees and the dataset size increase.
➤ Cannot extrapolate beyond the range of the training targets, so predictions flatten outside the observed data.