Multilinear Regression

Multilinear regression is a statistical technique used to model the relationship between a dependent variable and two or more independent variables. Unlike simple linear regression, which considers only one independent variable, multilinear regression accounts for the influence of multiple factors simultaneously on the target variable.

The fundamental assumption is that there's a linear relationship between the dependent variable and a combination of the independent variables. This relationship is represented by a mathematical equation where the dependent variable (y) is expressed as a weighted sum of the independent variables (x1, x2, … xn), plus an intercept term or constant.

      y = b0 + b1*x1 + b2*x2 + ... + bn*xn

The model aims to find the best set of weights (coefficients) that minimize the difference between predicted and observed values of the target.
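To make this concrete, the sketch below fits coefficients by ordinary least squares with NumPy. All names and numbers here are illustrative synthetic data, not the stock dataset used later in this module.

      # A minimal sketch: ordinary least squares picks the coefficients b
      # that minimize the sum of squared errors ||y - X*b||^2.
      import numpy as np
      rng = np.random.default_rng(0)
      X_demo = rng.random((100, 3))                         # 100 samples, 3 predictors
      true_b = np.array([1.5, -0.7, 3.0])                   # illustrative "true" weights
      y_demo = 2.0 + X_demo @ true_b + rng.normal(0, 0.1, 100)
      Xb = np.column_stack([np.ones(len(X_demo)), X_demo])  # prepend a column of 1s for the intercept b0
      coef, *_ = np.linalg.lstsq(Xb, y_demo, rcond=None)    # least-squares solution
      print(coef)                                           # approximately [2.0, 1.5, -0.7, 3.0]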

This module covers the following steps:

  ➤ 1. Import Data

  ➤ 2. Coefficient of determination

  ➤ 3. Predict and Test the model

  ➤ 4. Compare the actual and predicted values

  ➤ 5. Actual vs. Predicted Graph

  ➤ 6. Metrics

  ➤ 7. Predicting Close Price

  ➤ 8. Strengths and Weaknesses


↪ 1. Import Data

This module demonstrates how to predict a stock's closing price using the previous day's closing price, the current day's opening price, and the total trading volume. The code begins by importing pre-processed data, which is then divided into training and testing sets.

      # Import required libraries.
      import pandas as pd                                 # Pandas for data loading and manipulation
      import numpy as np                                  # NumPy for numerical operations
      import matplotlib.pyplot as plt                     # Matplotlib is the fundamental plotting library
      import seaborn as sns                               # Seaborn builds on Matplotlib, offering a higher-level
                                                          # interface for statistical visualization
 
      # Set default style and color scheme for Seaborn plots.
      sns.set(style="ticks", color_codes=True)  
 
      # Import data
      data = pd.read_csv('https://raw.githubusercontent.com/csxplore/data/main/andromeda-cleaned.csv', 
                       header=0)
      X = data[['Prev Close','Open Price','Total Traded Quantity']]
      y = data[['Close Price']]
 
      # Split the data into training and test sets.
      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
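
A quick, optional sanity check confirms the 80/20 split produced the expected shapes; the exact row counts depend on the length of the CSV.

      # Optional: verify the split sizes.
      print("Train:", X_train.shape, y_train.shape)
      print("Test: ", X_test.shape, y_test.shape)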


↪ 2. Coefficient of determination

The coefficient of determination, commonly known as R-squared (R²), is a statistical measure used in linear regression. It quantifies the proportion of variance in the dependent variable (y) that is predictable from the independent variables (x1, x2, … xn), indicating the model's goodness of fit.

      from sklearn.linear_model import LinearRegression   # linear regression model
      lm = LinearRegression()
      lm.fit(X_train, y_train)
 
      print("Slope: ", lm.coef_)
      print("Intercept: ", lm.intercept_)

      ---Output---
      # Slope:  [[1.46742456e-02 9.63624960e-01 4.91555129e-07]]
      # Intercept:  [30.90481917]

The score() method returns the model's coefficient of determination for the supplied data; here it is evaluated on the training set.

      print('Coefficient of determination: ', lm.score(X_train, y_train))

      ---Output---
      # Coefficient of determination:  0.9835526790784476

The model explains 98.35% of the variance in the dependent variable 'Close Price', indicating a strong relationship between 'Prev Close', 'Open Price', 'Total Traded Quantity' and 'Close Price'.
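
The same value can be recomputed from the definition R² = 1 - SS_res/SS_tot, which makes the "proportion of variance explained" reading explicit. A short sketch using the fitted model above:

      # Recompute R² on the training set from its definition.
      y_hat = lm.predict(X_train)
      ss_res = ((y_train.values - y_hat) ** 2).sum()                    # residual sum of squares
      ss_tot = ((y_train.values - y_train.values.mean()) ** 2).sum()    # total sum of squares
      print('R² (manual): ', 1 - ss_res / ss_tot)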


↪ 3. Predict and Test the model

      predictions = lm.predict(X_test)

The code below calculates and prints the R-squared (R²) score, also known as the coefficient of determination, for the model's predictions on the test set. It imports the r2_score() function from the sklearn.metrics module, which is designed for evaluating regression models, and calls it with two arguments: 1) y_test, the actual values of the dependent variable from the test dataset, and 2) predictions, the values of the dependent variable generated by the trained regression model.

      from sklearn.metrics import r2_score 
      print('Coefficient of determination: ', r2_score(y_test, predictions))

      ---Output---
      # Coefficient of determination:  0.9718303109010307

The output shows that the R² score is approximately 0.9718. An R² score ranges from 0 to 1, where 1 indicates a perfect fit of the model to the data. In this case, a score of 0.9718 suggests that the model explains a very high proportion (97.18%) of the variance in the target variable, indicating a strong fit between the model's predictions and the actual values in the test set.


↪ 4. Compare the actual and predicted values

      y_pred = pd.DataFrame(predictions, columns=['Pred'])
      dframe = pd.concat([y_test.reset_index(drop=True).astype(float),y_pred], axis=1)
      dframe.columns = ['Actual','Predicted']
      graph = dframe.head(10)
      print(graph)

      ---Output---
      #      Actual    Predicted
      # 0   1671.80  1661.674921
      # 1   1617.65  1627.239377
      # 2   1409.80  1405.884155
      # 3   1600.65  1597.242314
      # 4   1568.30  1600.170675
      # 5   1644.10  1635.130887
      # 6   1564.35  1575.210479
      # 7   1389.55  1409.412835
      # 8   1486.10  1448.353305
      # 9   1590.90  1568.568980


↪ 5. Actual vs. Predicted Graph

      graph.plot(kind='bar')
      plt.title('Actual vs Predicted')
      plt.ylabel('Closing price')
      plt.show()

[Figure: Actual vs. Predicted bar chart]
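
As an alternative view, a scatter of actual against predicted values puts accurate predictions near the 45° diagonal. A sketch using the Seaborn import from step 1:

      # Scatter of actual vs. predicted; the dashed diagonal marks perfect predictions.
      sns.scatterplot(x='Actual', y='Predicted', data=dframe)
      lims = [dframe.min().min(), dframe.max().max()]
      plt.plot(lims, lims, linestyle='--', color='gray')
      plt.title('Actual vs Predicted (scatter)')
      plt.show()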


↪ 6. Metrics

The appropriate error metric depends on the problem being solved. Commonly used regression metrics are:

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)

Mean Absolute Error (MAE) is the mean of the absolute differences between the predicted and actual values. The MAE indicates how large an error can be expected from the model's predictions on average.

Mean Squared Error (MSE), also called Mean Squared Deviation (MSD), is the average squared distance between the actual and predicted values; it represents the average squared residual.

Squaring the differences eliminates negative values and ensures that the MSE is positive. However, squaring increases the impact of larger errors, so the MSE disproportionately penalizes larger errors over smaller ones.

Variance is the average squared deviation of the observations from their mean. The MSE, in contrast, is the average squared deviation of the predictions from the actual values (the residuals).

Root Mean Squared Error (RMSE) measures the average difference between the actual and predicted values. The RMSE is the standard deviation of the residuals, which represent the distance between the data points and the regression line.
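
These three metrics follow directly from their definitions. The sketch below computes them with NumPy on the test-set predictions from step 3; the equivalent scikit-learn calls follow.

      # Compute MAE, MSE, and RMSE directly from their definitions.
      errors = y_test.values.astype(float) - predictions    # test-set residuals
      mae = np.abs(errors).mean()                           # mean absolute error
      mse = (errors ** 2).mean()                            # mean squared error
      rmse = np.sqrt(mse)                                   # root mean squared error
      print('MAE: ', mae, ' MSE: ', mse, ' RMSE: ', rmse)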


      from sklearn import metrics
      import math

      print('Mean Absolute Error:     ', metrics.mean_absolute_error(y_test.astype(float), y_pred))
      print('Mean Squared Error:      ', metrics.mean_squared_error(y_test.astype(float), y_pred))
      print('Root Mean Squared Error: ', math.sqrt(metrics.mean_squared_error(y_test.astype(float), y_pred)))

      ---Output---
      # Mean Absolute Error:      13.217473604160523
      # Mean Squared Error:       299.3529767486309
      # Root Mean Squared Error:  17.3018200415052


↪ 7. Predicting Close Price

      Predict_Close_Price = lm.predict([[1508.80, 1440, 5000000]])
      print("Predicted Value: ", Predict_Close_Price)

      ---Output---
      # Predicted Value: [[1443.12303828]]
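
Recent scikit-learn versions emit a warning when a model fitted on a DataFrame receives a plain list at predict time. Passing a one-row DataFrame with the original column names avoids this; a sketch with the same illustrative inputs:

      # Same prediction, with named features matching the training data.
      new_day = pd.DataFrame([[1508.80, 1440, 5000000]],
                             columns=['Prev Close', 'Open Price', 'Total Traded Quantity'])
      print("Predicted Value: ", lm.predict(new_day))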


↪ 8. Strengths and Weaknesses

Strengths

  • Relatively easy to understand, implement, and interpret. The coefficients directly indicate the relationship between each independent variable and the dependent variable (while holding the other predictors constant).
  • Compared to more complex models, multilinear regression is computationally inexpensive, making it suitable for large datasets.
  • A well-understood statistical method with a solid theoretical foundation, allowing for statistical inference (hypothesis testing, confidence intervals, etc.).

Weaknesses

  • Assumes a linear relationship between the predictors and the response variable. If this assumption is violated, the model's performance suffers significantly.
  • Outliers can have a disproportionate impact on the model, skewing the coefficients and potentially leading to poor predictions.
  • Multicollinearity: High correlation between independent variables can lead to unstable coefficient estimates and inflated variance, making interpretation difficult (a diagnostic sketch follows this list).
  • Overfitting: With a large number of predictors relative to the number of data points, the model may overfit the training data, resulting in poor generalization to new data.
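
As a diagnostic for the multicollinearity point above, variance inflation factors (VIF) are a common check; rules of thumb flag values above roughly 5–10. The sketch below uses statsmodels, which is not imported elsewhere in this module, on the feature matrix X from step 1.

      # Multicollinearity check: variance inflation factor per predictor.
      import statsmodels.api as sm
      from statsmodels.stats.outliers_influence import variance_inflation_factor
      Xc = sm.add_constant(X.astype(float))               # add an intercept column
      for i, name in enumerate(Xc.columns):
          if name != 'const':                             # skip the intercept itself
              print(name, variance_inflation_factor(Xc.values, i))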