Random Forest Classification
Random Forest Classification is a supervised machine learning algorithm that classifies data points into different categories. It's an ensemble method, combining multiple decision trees to improve prediction accuracy and robustness. This code demonstrates a basic workflow of training a Random Forest Classifier, performing predictions, and evaluating model performance using a confusion matrix and accuracy.
This module outlines the following steps:
➤ 1. Import data
➤ 2. Prepare data
➤ 3. Preprocess data
➤ 4. Split data
➤ 5. Model training
➤ 6. Prediction
➤ 7. Confusion Matrix
➤ 8. Classification Report
➤ 9. Strengths and Weaknesses
↪ 1. Import data
Load the dataset into a pandas DataFrame with the read_csv() function, then print the column names to see what is available.
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv', header=0)
print(data.columns)
---Output---
# Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
#        'Low Price', 'Last Price', 'Close Price', 'Average Price',
#        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
#        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
#       dtype='object')
↪ 2. Prepare data
The approach aims to predict whether a stock price will rise or fall using the 20-day Simple Moving Average (SMA20) as an independent variable.
The code removes every row that contains a missing value (NaN), printing the first five rows before and after so the effect is visible. This step matters because many machine learning models, depending on the implementation, cannot handle missing data. Next, it extracts the 'SMA20' column as the independent variable (x) and the 'UP' column as the dependent variable (y), which indicates whether the stock price went up ('UP') or down ('DOWN'). The double brackets keep x and y as single-column DataFrames rather than Series, which is why their shapes later appear as (n, 1).
print(data.head(5))
data = data.dropna()
print(data.head(5))

x = data[['SMA20']]
y = data[['UP']]
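The effect of dropna() and of single versus double brackets can be checked on a small frame. The sketch below uses hypothetical toy values (not taken from the dataset) and assumes, for illustration only, that the missing values sit in the first rows of the SMA column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values and the position of the NaNs are
# assumptions made for illustration, not rows from the real dataset.
toy = pd.DataFrame({
    "Close Price": [100.0, 101.5, 99.8, 102.3, 103.1],
    "SMA20": [np.nan, np.nan, 100.4, 100.9, 101.7],
    "UP": ["UP", "DOWN", "UP", "UP", "DOWN"],
})

clean = toy.dropna()        # drops every row that contains at least one NaN

x_toy = clean[["SMA20"]]    # double brackets -> DataFrame with shape (n, 1)
y_toy = clean[["UP"]]       # double brackets -> DataFrame with shape (n, 1)
s_toy = clean["SMA20"]      # single brackets -> Series with shape (n,)

print(x_toy.shape, y_toy.shape, s_toy.shape)
print(list(clean.index))    # the original row labels are kept after dropna()
```

Note that dropna() keeps the original row labels, so the index of the cleaned frame no longer starts at 0.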
↪ 3. Preprocess data
Standardization rescales each feature to a mean of 0 and a standard deviation of 1. It is a common requirement for scale-sensitive models such as k-nearest neighbours, support vector machines, and regularized linear models. Decision trees, and therefore Random Forests, split on thresholds and are largely insensitive to feature scale, so the step is not strictly required here; it is kept as a general-purpose preprocessing step.
The code below creates a StandardScaler object and calls fit_transform(), which combines two steps: fit computes the mean and standard deviation of each feature in x, and transform applies those statistics so that each feature ends up with a mean of 0 and a standard deviation of 1. Note that the scaler is fitted on the full dataset before the split; a stricter workflow fits it on the training set only, so that no test-set statistics leak into training.
from sklearn.preprocessing import StandardScaler

standardizer = StandardScaler()
x = standardizer.fit_transform(x)
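To make the transformation concrete, here is a minimal sketch on made-up numbers (x_demo is a hypothetical array, not the SMA20 column). It compares StandardScaler with the manual formula (value - mean) / standard deviation, using the population standard deviation that StandardScaler relies on.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature data, chosen only for easy arithmetic.
x_demo = np.array([[10.0], [12.0], [14.0], [16.0], [18.0]])

scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_demo)   # fit, then transform, in one call

# Manual equivalent: subtract the column mean, divide by the population
# standard deviation (NumPy's default, ddof=0).
manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

print("learned mean:", scaler.mean_)
print("learned std: ", scaler.scale_)
print("scaled:      ", x_scaled.ravel())
```

The learned statistics stay on the scaler object, so the same scaling can later be applied to new data with scaler.transform().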
↪ 4. Split data
The code below uses the train_test_split() function to divide the data into training and testing sets. Here x holds the input features (independent variables) and y the corresponding target (dependent variable). With test_size=0.2, 20% of the rows go to the test set and the remaining 80% are used for training. The shuffle=True argument, which is also the default, shuffles the rows randomly before splitting so that both sets are drawn from the whole dataset. Because no random_state is set, the split, and therefore every result below, changes from run to run. For time-ordered data such as stock prices, shuffling also mixes earlier and later observations, so a chronological split (shuffle=False) is often preferred when the goal is a realistic estimate of forecasting performance.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True)

print("x_train.shape: ", x_train.shape )
print("x_test.shape: ", x_test.shape )
print("y_train.shape: ", y_train.shape )
print("y_test.shape", y_test.shape )
---Output---
# x_train.shape: (160, 1)
# x_test.shape: (41, 1)
# y_train.shape: (160, 1)
# y_test.shape (41, 1)
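The 160/41 split can be reproduced on placeholder data. The sketch below assumes 201 rows (160 + 41, as implied by the printed shapes) and uses made-up features and labels. It also shows two optional variations that are not part of the original code: random_state=42 (an arbitrary seed) for a repeatable split, and shuffle=False for a chronological one.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 201  # 160 + 41, the total implied by the shapes printed above

# Placeholder data: the feature is just the row number, labels alternate.
x_demo = np.arange(n_rows, dtype=float).reshape(-1, 1)
y_demo = np.where(np.arange(n_rows) % 2 == 0, "UP", "DOWN")

# The test size is rounded up: ceil(0.2 * 201) = ceil(40.2) = 41.
print(math.ceil(0.2 * n_rows))

# A fixed seed (arbitrary value) makes the shuffled split repeatable.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, shuffle=True, random_state=42
)
print(x_tr.shape, x_te.shape)

# Without shuffling, the test set is simply the last 41 rows.
x_first, x_last, _, _ = train_test_split(
    x_demo, y_demo, test_size=0.2, shuffle=False
)
print(x_last[0, 0], x_last[-1, 0])
```

This explains why the test set has 41 rows rather than 40: the fractional test size is rounded up and the training set receives the remainder.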
↪ 5. Model training
This code imports the RandomForestClassifier class and instantiates it as model with default parameters (in recent scikit-learn versions, n_estimators=100 trees). The .fit() method then trains the model on the training data x_train and y_train. Because y_train is a single-column DataFrame of shape (160, 1), .values.ravel() flattens it into the 1D array of labels that .fit() expects.
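from sklearn.ensemble import RandomForestClassifier

# Instantiate the classifier with default hyperparameters
model = RandomForestClassifier()

# Train on the training set; ravel() turns the (n, 1) label column into a 1D array
model.fit(x_train, y_train.values.ravel())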
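The ensemble idea can be inspected directly on a fitted forest. The following sketch uses synthetic data (not the stock dataset) and arbitrary settings (n_estimators=25, random_state=0). It illustrates that scikit-learn's forest averages the class probabilities of its individual trees and predicts the class with the highest average.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic, noisy one-feature problem; all values here are made up.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 1))
noise = rng.normal(scale=1.0, size=200)
y_demo = np.where(x_demo[:, 0] + noise > 0, "UP", "DOWN")

# Arbitrary illustrative settings; the seed makes the forest repeatable.
forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(x_demo, y_demo)

x_new = np.array([[-1.0], [0.0], [1.0]])

# Each fitted tree is available in forest.estimators_.
per_tree = np.stack([tree.predict_proba(x_new) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)          # average over the 25 trees

print(forest.classes_)                    # column order of the probabilities
print(averaged)                           # same values as forest.predict_proba(x_new)
print(forest.classes_[averaged.argmax(axis=1)])
print(forest.predict(x_new))
```

Setting random_state on the real model would likewise make its training repeatable, which the default call above does not guarantee.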
↪ 6. Prediction
The model.predict() method returns one predicted label for each row of x_test. The predictions are wrapped in a pandas DataFrame and concatenated column-wise with the actual values (y_test) so the two can be compared side by side. Because y_test still carries the shuffled row labels of the original data, reset_index(drop=True) is applied first so that both frames align on positions 0, 1, 2, and so on.
prediction = model.predict(x_test)
y_pred = pd.DataFrame(prediction, columns=['Pred'])
dframe = pd.concat([y_test.reset_index(drop=True), y_pred], axis=1)
print(dframe.head(10))
---Output---
#      UP  Pred
# 0    UP    UP
# 1  DOWN  DOWN
# 2  DOWN    UP
# 3    UP    UP
# 4    UP  DOWN
# 5    UP    UP
# 6    UP  DOWN
# 7  DOWN  DOWN
# 8    UP  DOWN
# 9  DOWN    UP
↪ 7. Confusion Matrix
The code below evaluates the classifier with a confusion matrix. The confusion_matrix function tabulates the model's predictions against the true labels: rows correspond to the actual classes and columns to the predicted classes.
From this matrix, the individual counts of True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP) are then extracted.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, prediction)
print(cm)
---Output---
# [[12  9]
#  [ 8 12]]
The .ravel() method flattens the confusion matrix into a 1D array. For binary classification, scikit-learn lays out the 2×2 matrix as [[TN, FP], [FN, TP]]. The class labels are sorted, so 'DOWN' is treated as the negative class and 'UP' as the positive class.
TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()

print('True Positive (TP) = ', TP)
print('False Positive (FP) = ', FP)
print('True Negative (TN) = ', TN)
print('False Negative (FN) = ', FN)
---Output---
# True Positive (TP) = 12
# False Positive (FP) = 9
# True Negative (TN) = 12
# False Negative (FN) = 8
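The layout of the matrix depends on the label order, which is easy to get wrong. As a small check, the sketch below builds hand-made label lists (y_true and y_hat are hypothetical, constructed only to reproduce the counts above) and passes the labels argument explicitly to show how the order determines which class counts as positive.

```python
from sklearn.metrics import confusion_matrix

# Hand-built labels that reproduce the counts shown above:
# 21 actual DOWN (12 predicted DOWN, 9 predicted UP) and
# 20 actual UP   (8 predicted DOWN, 12 predicted UP).
y_true = ["DOWN"] * 21 + ["UP"] * 20
y_hat = ["DOWN"] * 12 + ["UP"] * 9 + ["DOWN"] * 8 + ["UP"] * 12

# Explicit label order: first label = negative class, second = positive class.
cm_demo = confusion_matrix(y_true, y_hat, labels=["DOWN", "UP"])
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)
print(tn, fp, fn, tp)

# Reversing the label order makes DOWN the "positive" class and
# rearranges the same four counts.
cm_swapped = confusion_matrix(y_true, y_hat, labels=["UP", "DOWN"])
print(cm_swapped)
```

Passing labels explicitly is optional, but it documents which class the TP and FP counts refer to.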
Accuracy is the fraction of all predictions that are correct. It can be calculated from the four counts as (TP + TN) / (TP + FP + TN + FN).
accuracy = (TP+TN) / (TP+FP+TN+FN)
print('Accuracy of the classification = {:0.3f}'.format(accuracy))
---Output---
# Accuracy of the classification = 0.585
Alternatively, the accuracy_score() function calculates the accuracy of a classification model's predictions.
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, prediction)
print(f"Accuracy: {accuracy}")
---Output---
# Accuracy: 0.5853658536585366
↪ 8. Classification Report
The classification_report() function generates a text report with the main classification metrics for each class: precision, recall, F1-score, and support (the number of actual samples of that class).
from sklearn.metrics import classification_report

print(classification_report(y_test, prediction))
---Output---
#               precision    recall  f1-score   support
#
#         DOWN       0.60      0.57      0.59        21
#           UP       0.57      0.60      0.59        20
#
#     accuracy                           0.59        41
#    macro avg       0.59      0.59      0.59        41
# weighted avg       0.59      0.59      0.59        41
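The figures in the report follow directly from the four confusion-matrix counts. As a worked check, the sketch below recomputes them by hand; the helper name prf is hypothetical and introduced only for this illustration.

```python
# Counts taken from the confusion matrix above ('UP' is the positive class).
TP, FP, TN, FN = 12, 9, 12, 8

def prf(correct, false_alarms, missed):
    """Return (precision, recall, f1) for one class.

    correct      : samples of the class that were predicted as that class
    false_alarms : samples of the other class predicted as this class
    missed       : samples of this class predicted as the other class
    """
    precision = correct / (correct + false_alarms)
    recall = correct / (correct + missed)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

up = prf(TP, FP, FN)      # class 'UP'
down = prf(TN, FN, FP)    # class 'DOWN': the roles of the counts swap
accuracy = (TP + TN) / (TP + FP + TN + FN)

print("UP   precision/recall/f1: {:.2f} {:.2f} {:.2f}".format(*up))
print("DOWN precision/recall/f1: {:.2f} {:.2f} {:.2f}".format(*down))
print("accuracy: {:.2f}".format(accuracy))
```

With an accuracy of about 0.59 on two nearly balanced classes, the model performs only modestly better than guessing on this particular split.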
↪ 9. Strengths and Weaknesses
Strengths
Weaknesses