Decision Tree Regression

Decision tree regression, a versatile and interpretable technique for predicting continuous values, helps reveal the underlying relationships between independent and dependent variables. It trains a model in the form of a tree structure to make predictions about future data.
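As a quick illustration before working with real data, the core idea can be sketched in a few lines (a minimal example on made-up numbers, independent of the dataset used below):

      import numpy as np
      from sklearn.tree import DecisionTreeRegressor

      X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # one independent variable
      y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])             # continuous dependent variable
      toy_tree = DecisionTreeRegressor(max_depth=2).fit(X, y)
      print(toy_tree.predict([[2.5]]))                    # prediction for an unseen input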

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Review data

  ➤ 3. Split data

  ➤ 4. Fit decision tree regressor

  ➤ 5. Compare actual and predicted values

  ➤ 6. Decision Tree

  ➤ 7. Predicting Close Price with generated input data

  ➤ 8. Predict 20 days into the future

  ➤ 9. Strengths and Weaknesses


↪ 1. Import data

Import the pre-processed data for analysis. Subsequently, the data will be partitioned into training and test sets to facilitate the analysis.

      # Import required libraries.
      import pandas as pd                                 # Pandas provides DataFrame-based data handling.
      import matplotlib.pyplot as plt                     # Matplotlib is the fundamental plotting library.
      import seaborn as sns                               # Seaborn builds on Matplotlib with a higher-level
      import numpy as np                                  # interface for statistical visualization; NumPy
                                                          # supplies numerical array operations.

Set default style and color scheme for Seaborn plots.

      sns.set(style="ticks", color_codes=True)  

Import data using read_csv() function.

      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-cleaned.csv', 
                       header=0)
      print(data.columns)

      ---Output---
      # Index(['Symbol', 'Series', 'Date', 'Prev Close', 'Open Price', 'High Price',
      #        'Low Price', 'Last Price', 'Close Price', 'Average Price',
      #        'Total Traded Quantity', 'Turnover', 'No. of Trades', 'Deliverable Qty',
      #        '% Dly Qt to Traded Qty'],
      #       dtype='object')
The info() function can be used to check each column's data type.

      data.info()

      ---Output---
      # <class 'pandas.core.frame.DataFrame'>
      # RangeIndex: 250 entries, 0 to 249
      # Data columns (total 15 columns):
      #  #   Column                  Non-Null Count  Dtype
      # ---  ------                  --------------  -----
      #  0   Symbol                  250 non-null    object
      #  1   Series                  250 non-null    object
      #  2   Date                    250 non-null    object
      #  3   Prev Close              250 non-null    float64
      #  4   Open Price              250 non-null    float64
      #  5   High Price              250 non-null    float64
      #  6   Low Price               250 non-null    float64
      #  7   Last Price              250 non-null    float64
      #  8   Close Price             250 non-null    float64
      #  9   Average Price           250 non-null    float64
      #  10  Total Traded Quantity   250 non-null    float64
      #  11  Turnover                250 non-null    float64
      #  12  No. of Trades           250 non-null    float64
      #  13  Deliverable Qty         250 non-null    float64
      #  14  % Dly Qt to Traded Qty  250 non-null    float64
      # dtypes: float64(12), object(3)
      # memory usage: 29.4+ KB


↪ 2. Review data

The shape attribute displays the dimensions of the data.

      print(data.shape)

      ---Output---
      # (250, 15)

Visualize the closing prices.

      plt.figure(figsize=(16,8))
      plt.title('Andromeda')
      plt.xlabel('Days')
      plt.ylabel('Close Price')
      plt.plot(data['Close Price'])
      plt.show()

[Figure: line plot of the Andromeda Close Price over 250 days]


↪ 3. Split data

Choose the required columns for analysis. This exercise predicts the Close Price using the Previous Close Price.

      data = data[['Prev Close','Close Price']]
      x = data[['Prev Close']]
      y = data[['Close Price']]

Split the data into training and test sets.

      from sklearn.model_selection import train_test_split 
      x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)
      print("x_train.shape: ", x_train.shape )
      print("x_test.shape: ", x_test.shape )
      print("y_train.shape: ", y_train.shape )
      print("y_test.shape", y_test.shape )

      ---Output---
      # x_train.shape:  (200, 1)
      # x_test.shape:  (50, 1)
      # y_train.shape:  (200, 1)
      # y_test.shape (50, 1)


↪ 4. Fit decision tree regressor

The DecisionTreeRegressor is the specific class representing a decision tree regression model.

      from sklearn.tree import DecisionTreeRegressor
      tree = DecisionTreeRegressor().fit(x_train, y_train)

The above code snippet imports the DecisionTreeRegressor class used to create a decision tree regression model. The second line creates an instance of the model and trains it on the provided training data (x_train and y_train). After this, the variable 'tree' holds a trained decision tree, ready to make predictions on new, unseen data.
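By default, DecisionTreeRegressor grows the tree until every leaf is pure, which invites overfitting. Scikit-learn exposes hyperparameters to constrain growth; the values below are illustrative assumptions, not tuned settings for this dataset.

      from sklearn.tree import DecisionTreeRegressor
      # Illustrative, untuned settings: cap the depth and require a minimum
      # number of samples per leaf so the tree cannot memorize single points.
      pruned_tree = DecisionTreeRegressor(max_depth=5,
                                          min_samples_leaf=5,
                                          random_state=0).fit(x_train, y_train)

A depth-limited tree trades some training accuracy for smoother, more generalizable predictions.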

Predicting Close Price

      Predict_Close_Price = tree.predict(pd.DataFrame({'Prev Close': [1508.80]}))
      print("Predicted Value: ", Predict_Close_Price)

      ---Output---
      # Predicted Value:  [1508.35]
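A tree's prediction is the mean of the training targets in the leaf where the input lands. As a sanity check, this can be reproduced with the model's apply() method (a minimal sketch reusing the tree and split fitted above; with a fully grown tree the leaf often holds a single training row):

      query = pd.DataFrame({'Prev Close': [1508.80]})
      leaf_id = tree.apply(query)[0]               # leaf the query falls into
      train_leaves = tree.apply(x_train)           # leaf index of every training row
      leaf_mean = y_train.values[train_leaves == leaf_id].mean()
      print("Leaf mean: ", leaf_mean)              # matches tree.predict(query)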


↪ 5. Compare actual and predicted values

      # tree_pred variable stores the predicted values output by the decision tree model.
      tree_pred = tree.predict(x_test)  
      # y_pred variable holds the DataFrame containing the predictions.
      y_pred = pd.DataFrame(tree_pred, columns=['Pred'])                               
      # dframe variable holds the combined DataFrame                                                                                
      dframe = pd.concat([y_test.reset_index(drop=True).astype(float),y_pred], axis=1)
      dframe.columns = ['Actual','Predicted']
      # dframe.head(10) selects the first 10 rows of the DataFrame and assigns them to the variable graph.
      graph = dframe.head(10)
      print(graph)

      ---Output---
      #     Actual  Predicted
      # 0  1446.15    1472.15
      # 1  1365.05    1353.75
      # 2  1587.80    1575.80
      # 3  1671.90    1658.45
      # 4  1673.10    1631.80
      # 5  1597.50    1599.15
      # 6  1627.30    1619.75
      # 7  1688.15    1670.30
      # 8  1647.30    1670.30
      # 9  1393.60    1391.80
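Beyond eyeballing rows, summary metrics quantify the fit on the whole test set. A minimal sketch using sklearn.metrics (the choice of MAE, RMSE, and R² here is a common convention, not something the steps above prescribe):

      from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
      print("MAE:  ", mean_absolute_error(y_test, tree_pred))          # mean absolute error
      print("RMSE: ", np.sqrt(mean_squared_error(y_test, tree_pred)))  # root mean squared error
      print("R2:   ", r2_score(y_test, tree_pred))                     # variance explained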

Visualize the comparison with a bar chart.

      graph.plot(kind='bar')
      plt.title('Actual vs Predicted')
      plt.ylabel('Closing price')
      plt.show()

[Figure: bar chart comparing actual and predicted closing prices]


↪ 6. Decision Tree

The decision tree algorithm splits the data set into two parts at the threshold that minimizes the mean squared error. Applied recursively to each resulting part, this splitting forms the tree structure.
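To make the splitting criterion concrete, the sketch below shows how a single best threshold for one feature could be chosen by minimizing the weighted mean squared error of the two resulting groups. This is plain NumPy for illustration, not scikit-learn's actual implementation, and the helper name best_split is made up:

      import numpy as np

      def best_split(x, y):
          # Try the midpoint between each pair of consecutive sorted x values
          # and keep the threshold with the lowest weighted MSE.
          order = np.argsort(x)
          x_sorted, y_sorted = x[order], y[order]
          best_t, best_mse = None, np.inf
          for i in range(1, len(x_sorted)):
              t = (x_sorted[i - 1] + x_sorted[i]) / 2
              left, right = y_sorted[:i], y_sorted[i:]
              mse = (len(left) * left.var() + len(right) * right.var()) / len(y)
              if mse < best_mse:
                  best_t, best_mse = t, mse
          return best_t

      # Toy usage: the best threshold falls between the two target clusters.
      print(best_split(np.array([1.0, 2.0, 3.0, 4.0]), np.array([1.0, 1.1, 3.0, 3.2])))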

The plot_tree() function from the sklearn.tree module creates a graphical representation of a decision tree. The following code draws the fitted tree up to a depth of two, with color-filled nodes, showing how successive splits on 'Prev Close' partition the training data.

      from sklearn.tree import plot_tree
      fig = plt.figure(figsize=(25,20))
      plot_tree(tree, feature_names=['Prev Close'], filled=True, max_depth=2)
      plt.show()

[Figure: decision tree diagram, depth limited to 2]
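For a plain-text view of the same structure, the sklearn.tree module also provides export_text(), which prints the tree as indented decision rules:

      from sklearn.tree import export_text
      # Print the top two levels of the fitted tree as indented if/else rules.
      print(export_text(tree, feature_names=['Prev Close'], max_depth=2))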


↪ 7. Predicting Close Price with generated input data

Create input data across the x_test range at intervals of 1, and predict the Close Price for each generated value. Because a tree predicts a constant value within each leaf, the predictions trace a step-shaped curve.

      X_grid = np.arange(x_test.values.min(), x_test.values.max())
      # Reshape the data into a len(X_grid) x 1 array, i.e. make a column out of the X_grid values.
      X_grid = X_grid.reshape((len(X_grid), 1)) 
      # Compare the predicted Close Price with the actual Close Price using scatter plot. 
      plt.figure(figsize=(16,8))
      plt.title('Decision Tree Regression')
      plt.xlabel('Prev Close')
      plt.ylabel('Close Price')
      plt.scatter(x, y, color = "blue")
      plt.scatter(X_grid, tree.predict(X_grid), color = 'red')
      plt.show()

[Figure: scatter plot of actual (blue) and predicted (red) Close Price against Prev Close]


↪ 8. Predict 20 days into the future

      print(data.tail(5))

      ---Output---
      #      Prev Close  Close Price
      # 245     1615.80      1609.60
      # 246     1609.60      1615.80
      # 247     1615.80      1635.50
      # 248     1635.50      1636.75
      # 249     1636.75      1610.85

The last Close Price is used to predict the next day's Close Price; each prediction then becomes the 'Prev Close' input for the following day.

      future_days = 20                  # Define the number of future dates for predicting
                                        # the Close Price
      pp = data['Close Price'].iloc[-1] # Last Close Price : [1610.85]
 
      for i in range(future_days):
         cp_pred = tree.predict(pd.DataFrame({'Prev Close': [pp.item()]}))
         new_row = pd.DataFrame({"Prev Close": [pp.item()], "Close Price": [cp_pred.item()]})
         data = pd.concat([data, new_row], ignore_index=True)
         x = data[['Prev Close']]
         y = data[['Close Price']]
         x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
         tree = DecisionTreeRegressor().fit(x_train, y_train)   # refit on the extended data
         pp = cp_pred                                           # feed the prediction forward


Create a plot of the historical data together with the predicted values.

      plt.figure(figsize=(16,8))
      plt.title('Andromeda Prediction')
      plt.xlabel('Days')
      plt.ylabel('Close Price')
      plt.plot(data['Close Price'].iloc[:250])      # 250 historical values
      plt.plot(data['Close Price'].iloc[249:])      # last actual value plus the 20 predictions
      plt.show()

[Figure: Andromeda Close Price history with the 20-day prediction appended]


↪ 9. Strengths and Weaknesses

Strengths

  • Decision trees are easy to understand and visualize, making them great for explaining predictions.
  • Decision trees can capture complex relationships between the independent and the dependent variables.
  • Decision trees can handle both numerical and categorical features.

Weaknesses

  • Decision trees can easily overfit the training data, meaning they perform well on the training data but poorly on unseen data.
  • Small changes in the training data can lead to large changes in the tree structure.
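The overfitting weakness can be demonstrated by comparing train and test R² scores for an unconstrained tree against a depth-limited one. A minimal sketch, assuming x and y as defined in step 3 (the max_depth value is an illustrative assumption):

      from sklearn.model_selection import train_test_split
      from sklearn.tree import DecisionTreeRegressor

      x_tr, x_te, y_tr, y_te = train_test_split(x, y, test_size=0.2, random_state=0)
      full = DecisionTreeRegressor(random_state=0).fit(x_tr, y_tr)        # grows until leaves are pure
      capped = DecisionTreeRegressor(max_depth=3, random_state=0).fit(x_tr, y_tr)
      # A near-perfect train score paired with a lower test score signals overfitting.
      print("full  : train %.3f  test %.3f" % (full.score(x_tr, y_tr), full.score(x_te, y_te)))
      print("capped: train %.3f  test %.3f" % (capped.score(x_tr, y_tr), capped.score(x_te, y_te)))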