Decision Tree Classification
This code demonstrates a basic workflow of training a Decision Tree Classifier for binary classification, performing predictions, and evaluating model performance using a confusion matrix and accuracy.
This module outlines the following steps:
➤ 1. Import data
➤ 2. Prepare data
➤ 3. Preprocess data
➤ 4. Split data
➤ 5. Model training
➤ 6. Prediction
➤ 7. Confusion Matrix
➤ 8. Classification Report
➤ 9. Strengths and Weaknesses
Decision Tree Classification
↪ 1. Import data
Import the data using the read_csv() function.
import pandas as pd

data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv',
                   header=0)
print(data.columns)
---Output---
# Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
#        'Low Price', 'Last Price', 'Close Price', 'Average Price',
#        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
#        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
#       dtype='object')
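Before preparing the data, it can help to confirm what was actually loaded. A minimal sketch inspecting the DataFrame created above:

# Quick look at the loaded data: row/column count, column dtypes,
# and how many missing values each column carries.
print(data.shape)
print(data.dtypes)
print(data.isna().sum())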
Decision Tree Classification
↪ 2. Prepare data
The approach aims to predict whether a stock price will rise or fall using the 20-day Simple Moving Average (SMA20) as an independent variable.
The code first removes any rows with missing values (NaNs) from the dataset. This step is crucial for most machine learning models, as they often cannot handle missing data. Next, it extracts the 'SMA20' column, which will be used as the independent variable (x) for the model. Finally, it extracts the 'UP' column as the dependent variable (y), which indicates whether the stock price went up ('UP') or down ('DOWN').
print(data.head(5))
data = data.dropna()
print(data.head(5))

x = data[['SMA20']]
y = data[['UP']]
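It is also worth checking how balanced the two classes are, since a heavily skewed target makes plain accuracy misleading. A minimal sketch, assuming the 'UP' column holds the string labels 'UP' and 'DOWN' as described above:

# Count how many rows carry each label.
print(y['UP'].value_counts())

# Relative frequencies give a baseline against which to judge accuracy.
print(y['UP'].value_counts(normalize=True))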
Decision Tree Classification
↪ 3. Preprocess data
Standardization transforms each feature to have a mean of 0 and a standard deviation of 1, bringing all features onto a similar scale. It is a common requirement for scale-sensitive models such as linear models, SVMs, and k-nearest neighbours. Decision trees, by contrast, split on thresholds and are largely unaffected by feature scaling, so this step is not strictly necessary here; it is kept as standard pre-processing practice and does not harm the tree.
The code below creates a StandardScaler object to handle the standardization. Its fit step computes the mean and standard deviation of the feature in 'x', and its transform step applies those statistics to rescale the data; fit_transform() performs both in a single call.
from sklearn.preprocessing import StandardScaler

standardizer = StandardScaler()
x = standardizer.fit_transform(x)
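As a quick sanity check, the transformed column should now have a mean close to 0 and a standard deviation close to 1. A minimal sketch, assuming 'x' is the NumPy array returned by fit_transform() above:

import numpy as np

# After standardization the column should have mean ~0 and standard
# deviation ~1 (up to floating-point rounding).
print("mean:", np.round(x.mean(axis=0), 6))
print("std: ", np.round(x.std(axis=0), 6))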
Decision Tree Classification
↪ 4. Split data
Split the data into training and test sets.
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, shuffle=True)
print("x_train.shape: ", x_train.shape)
print("x_test.shape: ", x_test.shape)
print("y_train.shape: ", y_train.shape)
print("y_test.shape", y_test.shape)
---Output---
# x_train.shape:  (160, 1)
# x_test.shape:  (41, 1)
# y_train.shape:  (160, 1)
# y_test.shape (41, 1)
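Because shuffle=True draws a different random split on every run, the numbers reported later in this module (accuracy, confusion matrix, classification report) will vary from run to run. Fixing a seed makes the split, and therefore the metrics, reproducible. A minimal sketch; random_state=42 and the stratify option are illustrative choices, not part of the original code:

from sklearn.model_selection import train_test_split

# random_state fixes the shuffle so every run produces the same split;
# stratify keeps the UP/DOWN proportions similar in the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, shuffle=True, random_state=42, stratify=y)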
Decision Tree Classification
↪ 5. Model training
The code below demonstrates a basic implementation of a decision tree classifier. It begins by importing the DecisionTreeClassifier class, which is the core component for building decision tree models. An instance of this class is then created, representing a blank decision tree model ready for training.
The fit() method is then used to train the model using training data (x_train) and corresponding target labels (y_train). This process involves the decision tree algorithm learning the relationships between the features and the target variable, effectively building the tree structure to make predictions on new data.
After training, the model variable holds a fitted decision tree, ready to be used for predictions.
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(x_train, y_train)    # fit() is the training method
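With default settings the tree keeps splitting until its leaves are pure, which tends to overfit, especially with a single noisy feature such as SMA20. Constraining the tree's growth is the usual remedy. A minimal sketch; the specific values of max_depth and min_samples_leaf are illustrative assumptions, not tuned for this dataset:

from sklearn.tree import DecisionTreeClassifier

# max_depth caps how many levels the tree may grow, and min_samples_leaf
# forces each leaf to cover several training rows; both reduce overfitting
# at the cost of some training accuracy.
pruned_model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pruned_model.fit(x_train, y_train)
print("Train accuracy:", pruned_model.score(x_train, y_train))
print("Test accuracy: ", pruned_model.score(x_test, y_test))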
Decision Tree Classification
↪ 6. Prediction
The model.predict() method generates predictions for each data point in x_test, which are then stored in a pandas DataFrame for easy comparison with the actual dependent values (y_test). The actual and predicted values are combined into a single DataFrame, allowing for a clear visualization of the model's performance.
prediction = model.predict(x_test)
y_pred = pd.DataFrame(prediction, columns=['Pred'])
dframe = pd.concat([y_test.reset_index(drop=True), y_pred], axis=1)
print(dframe.head(10))
---Output---
#      UP  Pred
# 0  DOWN    UP
# 1    UP  DOWN
# 2  DOWN  DOWN
# 3    UP    UP
# 4    UP    UP
# 5  DOWN    UP
# 6    UP    UP
# 7  DOWN  DOWN
# 8  DOWN    UP
# 9    UP  DOWN
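Besides hard labels, the tree can also report the class proportions in the leaf that each test point falls into via predict_proba(). For an unconstrained tree the leaves are usually pure, so these values are typically 0.0 or 1.0; a depth-limited tree gives more graded values. A minimal sketch using the trained model above:

# Column order follows model.classes_ (here 'DOWN' and 'UP').
proba = model.predict_proba(x_test)
print(model.classes_)
print(proba[:5])    # leaf class proportions for the first five test rows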
Decision Tree Classification
↪ 7. Confusion Matrix
The code below calculates and displays various metrics to evaluate the performance of a binary classification model. It uses the confusion_matrix function to generate a confusion matrix, which summarizes the model's predictions against the true labels.
From this matrix, the code unpacks the four counts with ravel(), which for a binary problem returns them in the order True Negatives (TN), False Positives (FP), False Negatives (FN), and True Positives (TP); here 'DOWN' is treated as the negative class and 'UP' as the positive class.
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, prediction)
TN, FP, FN, TP = cm.ravel()
# Equivalent one-liner:
# TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
print('True Positive (TP) = ', TP)
print('False Positive (FP) = ', FP)
print('True Negative (TN) = ', TN)
print('False Negative (FN) = ', FN)
---Output---
# True Positive (TP) =  11
# False Positive (FP) =  13
# True Negative (TN) =  11
# False Negative (FN) =  6
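The same matrix can be drawn as a labelled heatmap, which is often easier to read than raw counts. A minimal sketch, assuming matplotlib is available in the environment:

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Draws the 2x2 matrix with the class names on the axes.
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
disp.plot()
plt.show()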
Finally, the code calculates the accuracy of the model, which represents the proportion of correctly classified instances, and prints all these values for analysis.
accuracy = (TP+TN) / (TP+FP+TN+FN)
print('Accuracy of the binary classification = {:0.3f}'.format(accuracy))
---Output---
# Accuracy of the binary classification = 0.537
Alternatively, the accuracy_score() function computes the same metric directly from the true and predicted labels. Note that the exact values vary between runs, because the train-test split is re-shuffled each time the code is executed without a fixed random seed.
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, prediction)
print(f"Accuracy: {accuracy}")
---Output---
# Accuracy: 0.5753658536585366
Decision Tree Classification
↪ 8. Classification Report
The classification_report() function generates a text report showing the main classification metrics for each class in the dataset.
from sklearn.metrics import classification_report

print(classification_report(y_test, prediction))
---Output---
#               precision    recall  f1-score   support
#
#         DOWN       0.42      0.47      0.44        17
#           UP       0.59      0.54      0.57        24
#
#     accuracy                           0.51        41
#    macro avg       0.51      0.51      0.50        41
# weighted avg       0.52      0.51      0.52        41
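The per-class numbers in the report follow directly from the confusion-matrix counts. A minimal sketch computing precision, recall, and F1 for the positive ('UP') class from the TP, FP, and FN values extracted earlier (the printed values will match whichever run produced those counts, not necessarily the report above):

# Precision: of all points predicted UP, how many really were UP.
precision = TP / (TP + FP)
# Recall: of all points that truly were UP, how many were predicted UP.
recall = TP / (TP + FN)
# F1 score: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.2f}  recall={recall:.2f}  f1={f1:.2f}")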
Decision Tree Classification
↪ 9. Strengths and Weaknesses
Strengths
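➤ Easy to interpret and visualize: the fitted tree can be drawn and read as a set of if-then rules.
➤ Requires little data preparation: no feature scaling is needed, and both numerical and categorical inputs can be handled.
➤ Captures non-linear relationships and feature interactions without manual feature engineering.
➤ Fast to train and to use for prediction on datasets of moderate size.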
Weaknesses
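➤ Prone to overfitting, especially when grown deep without pruning or depth limits.
➤ Unstable: small changes in the training data can produce a very different tree.
➤ Greedy, locally optimal splits do not guarantee a globally optimal tree.
➤ Biased toward the majority class on imbalanced data, and a single tree is usually less accurate than ensembles such as random forests or gradient boosting.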