Random Forest Classification

Random Forest Classification is a supervised machine learning algorithm that classifies data points into different categories. It's an ensemble method, combining multiple decision trees to improve prediction accuracy and robustness. This code demonstrates a basic workflow of training a Random Forest Classifier, performing predictions, and evaluating model performance using a confusion matrix and accuracy.

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Prepare data

  ➤ 3. Preprocess data

  ➤ 4. Split data

  ➤ 5. Model training

  ➤ 6. Prediction

  ➤ 7. Confusion Matrix

  ➤ 8. Classification Report

  ➤ 9. Strengths and Weaknesses

↪ 1. Import data

Import the data using the pandas read_csv() function.

      import pandas as pd
      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
      #       dtype='object')

↪ 2. Prepare data

The approach aims to predict whether a stock price will rise or fall using the 20-day Simple Moving Average (SMA20) as an independent variable.

The code first removes any rows with missing values (NaNs) from the dataset. This step is crucial for most machine learning models, as they often cannot handle missing data. Next, it extracts the 'SMA20' column, which will be used as the independent variable (x) for the model. Finally, it extracts the 'UP' column as the dependent variable (y), which indicates whether the stock price went up ('UP') or down ('DOWN').

      print(data.head(5))
      data = data.dropna()
      print(data.head(5))
      x = data[['SMA20']]
      y = data[['UP']]
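
Before training a classifier, it is worth checking how balanced the two classes are, since a heavily skewed target can make accuracy misleading. A minimal check using the 'UP' column:

      # Count how many rows fall into each class ('UP' vs 'DOWN')
      print(y['UP'].value_counts())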

↪ 3. Preprocess data

Standardization brings all features to a similar scale and is a common requirement for many machine learning models, particularly distance- and gradient-based ones. Tree-based models such as Random Forests are largely insensitive to feature scaling, so this step is not strictly necessary here, but it is shown as standard pre-processing practice. Standardization typically transforms each feature to have a mean of 0 and a standard deviation of 1.

The code below creates a StandardScaler object to handle the standardization. Its fit_transform method first learns the mean and standard deviation of each feature in 'x' (the fit step) and then applies those statistics (the transform step), scaling the features to a mean of 0 and a standard deviation of 1.

      from sklearn.preprocessing import StandardScaler
      standardizer = StandardScaler()
      x = standardizer.fit_transform(x)
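
As a quick sanity check, the transformed array should now have a mean of approximately 0 and a standard deviation of approximately 1:

      # Verify the standardized feature
      print(x.mean(), x.std())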

↪ 4. Split data

The code below performs a train-test split on the dataset using the train_test_split() function. x holds the input features (independent variables) and y the corresponding target (dependent variable). 20% of the data is allocated to the test set (test_size=0.2), while the remaining 80% is used for training. The shuffle=True argument randomly shuffles the data before the split, which helps prevent ordering bias and keeps both sets representative of the overall dataset.

      from sklearn.model_selection import train_test_split
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True)
 
      print("x_train.shape: ", x_train.shape )
      print("x_test.shape: ", x_test.shape )
      print("y_train.shape: ", y_train.shape )
      print("y_test.shape", y_test.shape )

      ---Output---
      # x_train.shape: (160, 1)
      # x_test.shape: (41, 1)
      # y_train.shape: (160, 1)
      # y_test.shape (41, 1)
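
Because no random_state is passed, the split (and therefore every number reported below) changes from run to run. For reproducible results, and to preserve the class proportions in both sets, train_test_split() also accepts random_state and stratify arguments; a sketch with illustrative settings:

      # Reproducible, class-proportional split (illustrative settings)
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2,
                                                          shuffle=True, stratify=y,
                                                          random_state=42)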

↪ 5. Model training

This code imports the RandomForestClassifier class, which is used to create a random forest model. A RandomForestClassifier object is then instantiated with default parameters (the default number of trees and other settings). Finally, the .fit() method trains the model on x_train and y_train; y_train.values.ravel() flattens the single-column DataFrame into the 1-D label array that scikit-learn expects.

      from sklearn.ensemble import RandomForestClassifier
      # Instantiate the classifier with default hyperparameters
      model = RandomForestClassifier()
      model.fit(x_train, y_train.values.ravel()) 
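
The defaults can be overridden at construction time. A sketch of a few commonly tuned hyperparameters (the values below are illustrative, not tuned for this dataset):

      # Illustrative, untuned hyperparameter settings
      model = RandomForestClassifier(n_estimators=200,  # number of trees (default 100)
                                     max_depth=5,       # cap tree depth to curb overfitting
                                     random_state=42)   # reproducible results
      model.fit(x_train, y_train.values.ravel())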

↪ 6. Prediction

The model.predict() method generates predictions for each data point in x_test, which are then stored in a pandas DataFrame for easy comparison with the actual dependent values (y_test). The actual and predicted values are combined into a single DataFrame, allowing for a clear visualization of the model's performance.

      prediction = model.predict(x_test)
      y_pred = pd.DataFrame(prediction, columns=['Pred'])
      dframe = pd.concat([y_test.reset_index(drop=True),y_pred], axis=1)
      print(dframe.head(10))

      ---Output---
      #      UP  Pred
      # 0    UP    UP
      # 1  DOWN  DOWN
      # 2  DOWN    UP
      # 3    UP    UP
      # 4    UP  DOWN
      # 5    UP    UP
      # 6    UP  DOWN
      # 7  DOWN  DOWN
      # 8    UP  DOWN
      # 9  DOWN    UP
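
Beyond hard class labels, the predict_proba() method reports, for each test point, the fraction of trees that voted for each class, which can be useful for thresholding or ranking predictions:

      # Per-class probabilities; columns are ordered as in model.classes_
      proba = model.predict_proba(x_test)
      print(model.classes_)
      print(proba[:5])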

↪ 7. Confusion Matrix

The code below calculates and displays various metrics to evaluate the performance of a classification model. It uses the confusion_matrix function to generate a confusion matrix, which summarizes the model's predictions against the true labels.

From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

      from sklearn.metrics import confusion_matrix
      cm = confusion_matrix(y_test, prediction)
      print(cm)

      ---Output---
      # [[12 9]
      #  [ 8 12]]
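
The same matrix can also be visualized as a heatmap (this assumes matplotlib is installed):

      # Optional heatmap view of the confusion matrix
      import matplotlib.pyplot as plt
      from sklearn.metrics import ConfusionMatrixDisplay
      ConfusionMatrixDisplay(cm, display_labels=model.classes_).plot()
      plt.show()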

The .ravel() method flattens the confusion matrix into a 1D array. For binary classification, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]], with classes sorted alphabetically, so here 'DOWN' is treated as the negative class and 'UP' as the positive class.

      TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
      print('True Positive (TP)  = ', TP)
      print('False Positive (FP) = ', FP)
      print('True Negative (TN)  = ', TN)
      print('False Negative (FN) = ', FN)

      ---Output---
      # True Positive (TP) = 12
      # False Positive (FP) = 9
      # True Negative (TN) = 12
      # False Negative (FN) = 8

The accuracy score can be calculated from these counts using the following formula.

      accuracy =  (TP+TN) / (TP+FP+TN+FN)
      print('Accuracy of the classification = {:0.3f}'.format(accuracy))

      ---Output---
      # Accuracy of the classification = 0.585

Alternatively, the accuracy_score() function calculates the accuracy of a classification model's predictions.

      from sklearn.metrics import accuracy_score
      accuracy = accuracy_score(y_test, prediction)
      print(f"Accuracy: {accuracy}")

      ---Output---
      # Accuracy: 0.5853658536585366

↪ 8. Classification Report

The classification_report() function generates a text report showing the main classification metrics for each class in the dataset.

      from sklearn.metrics import classification_report
      print(classification_report(y_test, prediction))

      ---Output---
      #               precision    recall  f1-score   support
      #
      #         DOWN       0.60      0.57      0.59        21
      #           UP       0.57      0.60      0.59        20
      #
      #     accuracy                           0.59        41
      #    macro avg       0.59      0.59      0.59        41
      # weighted avg       0.59      0.59      0.59        41
  • precision: The ratio of true positives to the total predicted positives (true positives + false positives). It measures how many of the positively predicted instances were actually positive. (A worked check follows this list.)
  • recall (Sensitivity): The ratio of true positives to the total actual positives (true positives + false negatives). It measures how many of the actual positive instances were correctly predicted.
  • f1-score: The harmonic mean of precision and recall. It provides a balanced measure considering both precision and recall. A higher F1-score indicates better performance.
  • support: The number of actual instances in each class in the test set.
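
As a worked check, the report's numbers for the 'UP' class can be reproduced from the confusion-matrix counts extracted earlier (TP = 12, FP = 9, FN = 8):

      # Recompute the 'UP' row of the report from TP, FP, FN
      precision_up = TP / (TP + FP)    # 12 / 21 ≈ 0.571
      recall_up = TP / (TP + FN)       # 12 / 20 = 0.600
      f1_up = 2 * precision_up * recall_up / (precision_up + recall_up)  # ≈ 0.585
      print(precision_up, recall_up, f1_up)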

↪ 9. Strengths and Weaknesses

Strengths

  • Random Forests often achieve high accuracy due to their ensemble nature: combining predictions from multiple decision trees reduces variance and improves predictive power compared to individual trees.
  • The bagging technique (bootstrap aggregating) makes Random Forests relatively insensitive to outliers and noisy data points.
  • Random Forests can handle datasets with many independent variables (high dimensionality) effectively.
  • Random Forests can model complex, non-linear relationships between the independent and the dependent variables.

Weaknesses

  • Training a Random Forest can be computationally expensive, especially with large datasets and a high number of trees. The computational cost increases linearly with the number of trees.
  • Interpreting the model's decision-making process for individual predictions can be challenging (a partial aid is sketched after this list).
  • Random Forests can be biased toward categorical features with numerous levels, potentially leading to overfitting.
  • Storing numerous trees in memory can consume significant resources, especially for large models trained on massive datasets.
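
One built-in aid for the interpretability problem is the feature_importances_ attribute, which scores each feature's overall contribution across the forest. With the single SMA20 feature used here the score is trivially 1.0, but with more predictors it ranks their relative influence:

      # Global importance scores (they sum to 1.0 across features)
      for name, score in zip(['SMA20'], model.feature_importances_):
          print(name, round(score, 3))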