Decision Tree Classification

This code demonstrates a basic workflow of training a Decision Tree Classifier for binary classification, making predictions, and evaluating model performance using a confusion matrix and accuracy.

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Prepare data

  ➤ 3. Preprocess data

  ➤ 4. Split data

  ➤ 5. Model training

  ➤ 6. Prediction

  ➤ 7. Confusion Matrix

  ➤ 8. Classification Report

  ➤ 9. Strengths and Weaknesses

↪ 1. Import data

Import the data using the pandas read_csv() function.

      import pandas as pd
      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-technical.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty', 'SMA20', 'SMA50', 'diff', 'UP'],
      #       dtype='object')
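
A quick structural check (a small addition, not part of the original listing) confirms the load before moving on:

      print(data.shape)                     # (rows, columns) in the raw file
      print(data[['SMA20', 'UP']].head())   # the two columns used in later steps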

↪ 2. Prepare data

The approach aims to predict whether a stock price will rise or fall using the 20-day Simple Moving Average (SMA20) as an independent variable.

The code first removes any rows with missing values (NaNs) from the dataset. This step is crucial for most machine learning models, as they often cannot handle missing data. Next, it extracts the 'SMA20' column, which will be used as the independent variable (x) for the model. Finally, it extracts the 'UP' column as the dependent variable (y), which indicates whether the stock price went up ('UP') or down ('DOWN').

      print(data.head(5))
      data = data.dropna()
      print(data.head(5))
      x = data[['SMA20']]
      y = data[['UP']]
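
As a quick sanity check (an addition, not part of the original code), the balance of the target classes can be inspected; a heavily skewed 'UP'/'DOWN' split would make plain accuracy a misleading metric:

      # Count the remaining 'UP' and 'DOWN' labels after dropna()
      print(data['UP'].value_counts())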

↪ 3. Preprocess data

Standardization brings all features to a common scale, typically transforming the data to have a mean of 0 and a standard deviation of 1. It is a common requirement for many machine learning models, particularly distance- and gradient-based ones. Decision trees are a notable exception: splits depend only on the ordering of feature values, so scaling does not change the tree's structure. The step is kept here because it is a standard part of a preprocessing workflow and makes it easy to swap in scale-sensitive models later.

The code below creates a StandardScaler object to handle the standardization. Its fit_transform method first computes the mean and standard deviation of each feature in 'x' (the fit step) and then applies these learned statistics to scale the features (the transform step), producing data with a mean of 0 and a standard deviation of 1.

      from sklearn.preprocessing import StandardScaler
      standardizer = StandardScaler()
      x = standardizer.fit_transform(x)
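
A short verification (again an addition) confirms the effect of the scaler; after fit_transform, each column of 'x' should have a mean of approximately 0 and a standard deviation of approximately 1:

      import numpy as np
      print(np.round(x.mean(axis=0), 6))   # ~0 for the SMA20 column
      print(np.round(x.std(axis=0), 6))    # ~1 for the SMA20 column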

↪ 4. Split data

Split the data into training and test sets. Note that shuffle=True without a fixed random_state produces a different split on each run, which is why the exact figures in the output blocks below can vary between executions.

      from sklearn.model_selection import train_test_split
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True)
 
      print("x_train.shape: ", x_train.shape )
      print("x_test.shape: ", x_test.shape )
      print("y_train.shape: ", y_train.shape )
      print("y_test.shape", y_test.shape )

      ---Output---
      # x_train.shape:  (160, 1)
      # x_test.shape:  (41, 1)
      # y_train.shape:  (160, 1)
      # y_test.shape (41, 1)
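
For repeatable experiments, a fixed seed can be supplied, and stratify keeps the 'UP'/'DOWN' proportions similar in both subsets. This variant is a suggested alternative, not part of the original code:

      x_train, x_test, y_train, y_test = train_test_split(
          x, y,
          test_size=0.2,
          shuffle=True,
          stratify=y,        # preserve the class ratio in both splits
          random_state=42    # fixed seed, so every run produces the same split
      )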

↪ 5. Model training

The code below demonstrates a basic implementation of a decision tree classifier. It begins by importing the DecisionTreeClassifier class, which is the core component for building decision tree models. An instance of this class is then created, representing a blank decision tree model ready for training.

The fit() method is then used to train the model using training data (x_train) and corresponding target labels (y_train). This process involves the decision tree algorithm learning the relationships between the features and the target variable, effectively building the tree structure to make predictions on new data.

After training, the model variable holds a fitted decision tree, ready to be used for predictions.

      from sklearn.tree import DecisionTreeClassifier
      model = DecisionTreeClassifier()
      model.fit(x_train, y_train)                      #fit() is the training method
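
To inspect what the model actually learned, sklearn's export_text helper prints the tree's split rules as plain text. This is an optional addition; note that the feature was standardized, so the thresholds are in scaled units:

      from sklearn.tree import export_text
      print(export_text(model, feature_names=['SMA20']))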

↪ 6. Prediction

The model.predict() method generates predictions for each data point in x_test, which are then stored in a pandas DataFrame for easy comparison with the actual dependent values (y_test). The actual and predicted values are combined into a single DataFrame, allowing for a clear visualization of the model's performance.

      prediction = model.predict(x_test)
      y_pred = pd.DataFrame(prediction, columns=['Pred'])
      dframe = pd.concat([y_test.reset_index(drop=True),y_pred], axis=1)
      print(dframe.head(10))

      ---Output---
      #       UP  Pred
      # 0   DOWN    UP
      # 1     UP  DOWN
      # 2   DOWN  DOWN
      # 3     UP    UP
      # 4     UP    UP
      # 5   DOWN    UP
      # 6     UP    UP
      # 7   DOWN  DOWN
      # 8   DOWN    UP
      # 9     UP  DOWN
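
Besides hard labels, the classifier can also return class probabilities through predict_proba (an optional addition). For a fully grown tree these are often exactly 0.0 or 1.0, because most leaves end up pure:

      proba = model.predict_proba(x_test)
      print(model.classes_)    # column order, alphabetical: ['DOWN' 'UP']
      print(proba[:5])         # probability of each class for the first 5 rows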

↪ 7. Confusion Matrix

The code below calculates and displays various metrics to evaluate the performance of a binary classification model. It uses the confusion_matrix function to generate a confusion matrix, which summarizes the model's predictions against the true labels.

From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN). Because the class labels are sorted alphabetically ('DOWN' before 'UP'), 'UP' acts as the positive class, and cm.ravel() returns the counts in the order TN, FP, FN, TP.

      from sklearn.metrics import confusion_matrix
      cm = confusion_matrix(y_test, prediction)
      TN, FP, FN, TP = cm.ravel()
      #TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
      print('True Positive (TP)  = ', TP)
      print('False Positive (FP) = ', FP)
      print('True Negative (TN)  = ', TN)
      print('False Negative (FN) = ', FN)

      ---Output---
      # True Positive (TP)  =  11
      # False Positive (FP) =  13
      # True Negative (TN)  =  11
      # False Negative (FN) =  6
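
For a visual check (an optional addition that assumes matplotlib is installed), the same matrix can be rendered as a labelled heatmap:

      import matplotlib.pyplot as plt
      from sklearn.metrics import ConfusionMatrixDisplay
      # Reuse the cm computed above; labels follow the alphabetical class order
      disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['DOWN', 'UP'])
      disp.plot()
      plt.show()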

Finally, the code calculates the accuracy of the model, which represents the proportion of correctly classified instances, and prints all these values for analysis.

      accuracy =  (TP+TN) / (TP+FP+TN+FN)
      print('Accuracy of the binary classification = {:0.3f}'.format(accuracy))

      ---Output---
      # Accuracy of the binary classification = 0.537

Alternatively, the accuracy_score() function computes the same metric directly. (As noted in step 4, the split is reshuffled on every run, so the figure below comes from a different run than the confusion matrix above.)

      from sklearn.metrics import accuracy_score
      accuracy = accuracy_score(y_test, prediction)
      print(f"Accuracy: {accuracy}")

      ---Output---
      # Accuracy: 0.5853658536585366

↪ 8. Classification Report

The classification_report() function generates a text report showing the main classification metrics for each class in the dataset.

      from sklearn.metrics import classification_report
      print(classification_report(y_test, prediction))

      ---Output---
      #               precision    recall  f1-score   support
      #
      #         DOWN       0.42      0.47      0.44        17
      #           UP       0.59      0.54      0.57        24
      #
      #     accuracy                           0.51        41
      #    macro avg       0.51      0.51      0.50        41
      # weighted avg       0.52      0.51      0.52        41

  • precision: The ratio of true positives to the total predicted positives (true positives + false positives). It measures how many of the positively predicted instances were actually positive.
  • recall (Sensitivity): The ratio of true positives to the total actual positives (true positives + false negatives). It measures how many of the actual positive instances were correctly predicted.
  • f1-score: The harmonic mean of precision and recall. It provides a balanced measure considering both precision and recall. A higher F1-score indicates better performance.
  • support: The number of actual instances of each class in the test set.
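
The same per-class figures can be reproduced with sklearn's individual metric functions; a minimal sketch, treating 'UP' as the positive class:

      from sklearn.metrics import precision_score, recall_score, f1_score
      y_true = y_test.values.ravel()   # flatten the (n, 1) DataFrame to 1-D
      print('precision (UP):', precision_score(y_true, prediction, pos_label='UP'))
      print('recall (UP):   ', recall_score(y_true, prediction, pos_label='UP'))
      print('f1-score (UP): ', f1_score(y_true, prediction, pos_label='UP'))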

↪ 9. Strengths and Weaknesses

Strengths

  • The tree structure visually represents the decision-making process, making it highly interpretable.
  • Decision trees can work with both numerical and categorical data.
  • Decision trees do not require extensive data preparation such as feature scaling or normalization.
  • Decision tree training and prediction are relatively fast.

Weaknesses

  • Deep trees can easily overfit the training data, leading to poor generalization to unseen data; pruning controls such as max_depth can mitigate this, as sketched in the example after this list.
  • Small changes in the data can lead to significant changes in the resulting tree structure.
  • If the training data is biased, the resulting tree will reflect that bias, leading to unfair or inaccurate predictions.
  • It may struggle to capture complex, non-linear relationships between the independent and the dependent variable compared to algorithms like Support Vector Machines or Neural Networks.
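
To illustrate the overfitting weakness above, a fully grown tree can be compared against a depth-limited one on the split from step 4. This is a sketch with illustrative, untuned parameter values:

      from sklearn.tree import DecisionTreeClassifier

      # Fully grown tree: fits the training data closely, may generalize poorly
      deep = DecisionTreeClassifier().fit(x_train, y_train)
      # Pruned tree: max_depth and min_samples_leaf cap model complexity
      pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(x_train, y_train)

      print('deep   - train: {:0.3f}  test: {:0.3f}'.format(
          deep.score(x_train, y_train), deep.score(x_test, y_test)))
      print('pruned - train: {:0.3f}  test: {:0.3f}'.format(
          pruned.score(x_train, y_train), pruned.score(x_test, y_test)))

A near-perfect training score paired with a much lower test score for the deep tree, and a smaller gap for the pruned tree, is the typical signature of overfitting.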