Decision Tree Regression

Decision tree regression, a versatile and interpretable technique for predicting continuous values, helps reveal the underlying relationships between independent and dependent variables. It trains a model in the form of a tree structure to make predictions about future data.
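As a quick illustration before working with real data, the core idea can be sketched in a few lines (a minimal example on made-up numbers, independent of the dataset used below):

      import numpy as np
      from sklearn.tree import DecisionTreeRegressor

      X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one independent variable
      y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])             # continuous dependent variable
      toy_tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
      print(toy_tree.predict([[2.5]]))                    # prediction for an unseen input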

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Review data

  ➤ 3. Split data

  ➤ 4. Fit decision tree regressor

  ➤ 5. Compare actual and predicted values

  ➤ 6. Decision Tree

  ➤ 7. Predicting Close Price with generated input data

  ➤ 8. Predict 20 days into the future

  ➤ 9. Strengths and Weaknesses


↪ 1. Import data

Import the pre-processed data for analysis. Subsequently, the data will be partitioned into training and test sets to facilitate the analysis.

      # Import required libraries.
      import pandas as pd                                 # Pandas provides DataFrame-based data handling.
      import matplotlib.pyplot as plt                     # Matplotlib is the fundamental plotting library.
      import seaborn as sns                               # Seaborn builds on Matplotlib with a higher-level
      import numpy as np                                  # interface for statistical visualization; NumPy
                                                          # supplies numerical array operations.

Set default style and color scheme for Seaborn plots.

      sns.set(style="ticks", color_codes=True)  

Import data using read_csv() function.

      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-cleaned.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty'],
      #       dtype='object')
The info() function can be used to check each column's data type.

      data.info()

      ---Output---
      # <class 'pandas.core.frame.DataFrame'>
      # RangeIndex: 250 entries, 0 to 249
      # Data columns (total 15 columns):
      #  #   Column                  Non-Null Count  Dtype
      # ---  ------                  --------------  -----
      #  0   Symbol                  250 non-null    object
      #  1   Series                  250 non-null    object
      #  2   Date                    250 non-null    object
      #  3   Prev Close              250 non-null    float64
      #  4   Open Price              250 non-null    float64
      #  5   High Price              250 non-null    float64
      #  6   Low Price               250 non-null    float64
      #  7   Last Price              250 non-null    float64
      #  8   Close Price             250 non-null    float64
      #  9   Average Price           250 non-null    float64
      #  10  Total Traded Quantity   250 non-null    float64
      #  11  Turnover                250 non-null    float64
      #  12  No. of Trades           250 non-null    float64
      #  13  Deliverable Qty         250 non-null    float64
      #  14  % Dly Qt to Traded Qty  250 non-null    float64
      # dtypes: float64(12), object(3)
      # memory usage: 29.4+ KB


↪ 2. Review data

The shape attribute displays the dimensions of the data.

      print(data.shape)

      ---Output---
      # (250, 15)

Visualize the closing prices.

      plt.figure(figsize=(16,8))
      plt.title('Andromeda')
      plt.xlabel('Days')
      plt.ylabel('Close Price')
      plt.plot(data['Close Price'])
      plt.show()

[Figure: line plot of the Andromeda Close Price over 250 days]


↪ 3. Split data

Choose the required columns for analysis. This exercise predicts the Close Price using the Previous Close Price.

      data = data[['Prev Close','Close Price']]
      x = data[['Prev Close']]
      y = data[['Close Price']]

Split the data into training and test sets.

      from sklearn.model_selection import train_test_split 
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
      print("x_train.shape: ", x_train.shape )
      print("x_test.shape: ", x_test.shape )
      print("y_train.shape: ", y_train.shape )
      print("y_test.shape", y_test.shape )

      ---Output---
      # x_train.shape:  (200, 1)
      # x_test.shape:  (50, 1)
      # y_train.shape:  (200, 1)
      # y_test.shape (50, 1)


↪ 4. Fit decision tree regressor

The DecisionTreeRegressor is the specific class representing a decision tree regression model.

      from sklearn.tree import DecisionTreeRegressor
      tree = DecisionTreeRegressor().fit(x_train, y_train)

The above code snippet imports the DecisionTreeRegressor class used to create a decision tree regression model. The second line creates an instance of the model and trains it on the provided training data (x_train and y_train). After this, the variable 'tree' holds a trained decision tree, ready to make predictions on new, unseen data.
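By default, DecisionTreeRegressor grows the tree until every leaf is pure, which invites overfitting. Scikit-learn exposes hyperparameters to constrain growth; the values below are illustrative assumptions, not tuned settings for this dataset.

      from sklearn.tree import DecisionTreeRegressor
      # Illustrative, untuned settings: cap the depth and require a minimum
      # number of samples per leaf so the tree cannot memorize single points.
      pruned_tree = DecisionTreeRegressor(max_depth=5,
                                          min_samples_leaf=5,
                                          random_state=0).fit(x_train, y_train)

A depth-limited tree trades some training accuracy for smoother, more generalizable predictions.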

Predicting Close Price

      Predict_Close_Price = tree.predict(pd.DataFrame({'Prev Close': [1508.80]}))
      print("Predicted Value: ", Predict_Close_Price)

      ---Output---
      # Predicted Value:  [1508.35]
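A tree's prediction is the mean of the training targets in the leaf where the input lands. As a sanity check, this can be reproduced with the model's apply() method (a minimal sketch reusing the tree and split fitted above; with a fully grown tree the leaf often holds a single training row):

      query = pd.DataFrame({'Prev Close': [1508.80]})
      leaf_id = tree.apply(query)[0]               # leaf the query falls into
      train_leaves = tree.apply(x_train)           # leaf index of every training row
      leaf_mean = y_train.values[train_leaves == leaf_id].mean()
      print("Leaf mean: ", leaf_mean)              # matches tree.predict(query)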


↪ 5. Compare actual and predicted values

      # tree_pred variable stores the predicted values output by the decision tree model.
      tree_pred = tree.predict(x_test)  
      # y_pred variable holds the DataFrame containing the predictions.
      y_pred = pd.DataFrame(tree_pred, columns=['Pred'])                               
      # dframe variable holds the combined DataFrame                                                                                
      dframe = pd.concat([y_test.reset_index(drop=True).astype(float),y_pred], axis=1)
      dframe.columns = ['Actual','Predicted']
      # dframe.head(10) selects the first 10 rows of the DataFrame and assigns them to the variable graph.
      graph = dframe.head(10)
      print(graph)

      ---Output---
      #     Actual  Predicted
      # 0  1446.15    1472.15
      # 1  1365.05    1353.75
      # 2  1587.80    1575.80
      # 3  1671.90    1658.45
      # 4  1673.10    1631.80
      # 5  1597.50    1599.15
      # 6  1627.30    1619.75
      # 7  1688.15    1670.30
      # 8  1647.30    1670.30
      # 9  1393.60    1391.80
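Beyond eyeballing rows, summary metrics quantify the fit on the whole test set. A minimal sketch using sklearn.metrics (the choice of MAE, RMSE, and R² here is a common convention, not something the steps above prescribe):

      from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
      print("MAE:  ", mean_absolute_error(y_test, tree_pred))          # mean absolute error
      print("RMSE: ", np.sqrt(mean_squared_error(y_test, tree_pred)))  # root mean squared error
      print("R2:   ", r2_score(y_test, tree_pred))                     # variance explained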

Visualize the comparison with a bar chart.

      graph.plot(kind='bar')
      plt.title('Actual vs Predicted')
      plt.ylabel('Closing price')
      plt.show()

[Figure: bar chart comparing actual and predicted closing prices]


↪ 6. Decision Tree

The decision tree algorithm splits the data set into two parts at the threshold that minimizes the mean squared error. Applied recursively to each resulting part, this splitting forms the tree structure.
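To make the splitting criterion concrete, the sketch below shows how a single best threshold for one feature could be chosen by minimizing the weighted mean squared error of the two resulting groups. This is plain NumPy for illustration, not scikit-learn's actual implementation, and the helper name best_split is made up:

      import numpy as np

      def best_split(x, y):
          # Try the midpoint between each pair of consecutive sorted x values
          # and keep the threshold with the lowest weighted MSE.
          order = np.argsort(x)
          x_sorted, y_sorted = x[order], y[order]
          best_t, best_mse = None, np.inf
          for i in range(1, len(x_sorted)):
              t = (x_sorted[i - 1] + x_sorted[i]) / 2
              left, right = y_sorted[:i], y_sorted[i:]
              mse = (len(left) * left.var() + len(right) * right.var()) / len(y)
              if mse < best_mse:
                  best_t, best_mse = t, mse
          return best_t

      # Toy usage: the best threshold falls between the two target clusters.
      print(best_split(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 1.1, 3.0, 3.2])))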

The plot_tree() function from the sklearn.tree module creates a graphical representation of a decision tree. The following code draws the fitted tree up to a depth of two, with color-filled nodes, showing how successive splits on 'Prev Close' partition the training data.

      from sklearn.tree import plot_tree
      fig = plt.figure(figsize=(25,20))
      plot_tree(tree, feature_names=['Prev Close'], filled=True, max_depth=2)
      plt.show()

[Figure: decision tree diagram, depth limited to 2]
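For a plain-text view of the same structure, the sklearn.tree module also provides export_text(), which prints the tree as indented decision rules:

      from sklearn.tree import export_text
      # Print the top two levels of the fitted tree as indented if/else rules.
      print(export_text(tree, feature_names=['Prev Close'], max_depth=2))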


↪ 7. Predicting Close Price with generated input data

Create input data across the x_test range at intervals of 1, and predict the Close Price for each generated value. Because a tree predicts a constant value within each leaf, the predictions trace a step-shaped curve.

      X_grid = np.arange(x_test.values.min(), x_test.values.max())
      # Reshape the data into a len(X_grid) x 1 array, i.e. make a column out of the X_grid values.
      X_grid = X_grid.reshape((len(X_grid), 1)) 
      # Compare the predicted Close Price with the actual Close Price using scatter plot. 
      plt.figure(figsize=(16,8))
      plt.title('Decision Tree Regression')
      plt.xlabel('Prev Close')
      plt.ylabel('Close Price')
      plt.scatter(x, y, color = "blue")
      plt.scatter(X_grid, tree.predict(X_grid), color = 'red')
      plt.show()

[Figure: scatter plot of actual (blue) and predicted (red) Close Price against Prev Close]


↪ 8. Predict 20 days into the future

      print(data.tail(5))

      ---Output---
      #      Prev Close  Close Price
      # 245     1615.80      1609.60
      # 246     1609.60      1615.80
      # 247     1615.80      1635.50
      # 248     1635.50      1636.75
      # 249     1636.75      1610.85

The last Close Price is used to predict the next day's Close Price; each prediction then becomes the 'Prev Close' input for the following day.

      future_days = 20                  # Define the number of future dates for predicting
                                        # the Close Price
      pp = data['Close Price'].iloc[-1] # Last Close Price : [1610.85]
 
      for i in range(future_days):
         cp_pred = tree.predict(pd.DataFrame({'Prev Close': [pp.item()]}))
         new_row = pd.DataFrame({"Prev Close": [pp.item()], "Close Price": [cp_pred.item()]})
         data = pd.concat([data, new_row], ignore_index=True)
         x = data[['Prev Close']]
         y = data[['Close Price']]
         x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
         tree = DecisionTreeRegressor().fit(x_train, y_train)   # refit on the extended data
         pp = cp_pred                                           # feed the prediction forward


Create a plot of the historical data together with the predicted values.

      plt.figure(figsize=(16,8))
      plt.title('Andromeda Prediction')
      plt.xlabel('Days')
      plt.ylabel('Close Price')
      plt.plot(data['Close Price'].iloc[:250])      # 250 historical values
      plt.plot(data['Close Price'].iloc[249:])      # last actual value plus the 20 predictions
      plt.show()

[Figure: Andromeda Close Price history with the 20-day prediction appended]


↪ 9. Strengths and Weaknesses

Strengths

  • Decision trees are easy to understand and visualize, making them great for explaining predictions.
  • Decision trees can capture complex relationships between the independent and the dependent variables.
  • Decision trees can handle both numerical and categorical features.

Weaknesses

  • Decision trees can easily overfit the training data, meaning they perform well on the training data but poorly on unseen data.
  • Small changes in the training data can lead to large changes in the tree structure.
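The overfitting weakness can be demonstrated by comparing train and test R² scores for an unconstrained tree against a depth-limited one. A minimal sketch, assuming x and y as defined in step 3 (the max_depth value is an illustrative assumption):

      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeRegressor

      x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=0)
      full = DecisionTreeRegressor(random_state=0).fit(x_tr, y_tr)        # grows until leaves are pure
      capped = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_tr, y_tr)
      # A near-perfect train score paired with a lower test score signals overfitting.
      print("full  : train %.3f  test %.3f" % (full.score(x_tr, y_tr), full.score(x_te, y_te)))
      print("capped: train %.3f  test %.3f" % (capped.score(x_tr, y_tr), capped.score(x_te, y_te)))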