Naive Bayes Classification

Naive Bayes classification is a probabilistic machine learning algorithm based on Bayes' theorem, with a key simplifying assumption: that all features are independent of each other, given the class label. This “naive” assumption is what gives the algorithm its name. Despite this simplification, Naive Bayes can be surprisingly effective, especially for text classification and other problems with high-dimensional data. In essence, it calculates the probability of a data point belonging to each possible class and assigns the data point to the class with the highest probability. This calculation relies on the frequency of feature values observed in the training data for each class. The independence assumption greatly simplifies the calculations, making Naive Bayes computationally efficient and fast.
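
To make the decision rule concrete, here is a minimal sketch in Python (illustrative only; the priors and likelihoods below are made-up values, not learned from data): multiply each class prior by the per-feature likelihoods, then pick the class with the highest score.

      # Minimal Naive Bayes decision rule (hypothetical values for illustration)
      priors = {'malignant': 0.37, 'benign': 0.63}            # P(class)
      likelihoods = {                                         # P(feature | class)
          'malignant': {'large_radius': 0.80, 'smooth': 0.30},
          'benign':    {'large_radius': 0.15, 'smooth': 0.70},
      }
      observed = ['large_radius', 'smooth']                   # features of one data point

      scores = {}
      for c in priors:
          score = priors[c]
          for f in observed:
              score *= likelihoods[c][f]                      # the "naive" independence assumption
          scores[c] = score                                   # proportional to P(class | features)

      print(max(scores, key=scores.get))                      # class with the highest score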

This module outlines the following steps:

  ➤ 1. Import data

  ➤ 2. Split data

  ➤ 3. Model training

  ➤ 4. Prediction

  ➤ 5. Confusion Matrix

  ➤ 6. Classification Report

  ➤ 7. Strengths and Weaknesses

↪ 1. Import data

The “Breast Cancer” dataset, a toy binary classification dataset included in the scikit-learn library, serves as a popular example for demonstrating machine learning classification tasks. The dataset's features are calculated from digitized images of fine needle aspirates (FNA) of breast masses and are used to predict if a mass is malignant (cancerous) or benign (non-cancerous).

The sklearn.datasets package provides access to the “Breast Cancer” dataset, which is loaded using the load_breast_cancer function. The returned dataset object is stored in the variable bc. This object contains the features, targets (labels), and descriptive metadata. Specifically, bc.feature_names holds the names of all 30 features (for example, “mean radius” and “mean texture”).

      # Load Breast Cancer dataset
      from sklearn.datasets import load_breast_cancer
      bc = load_breast_cancer()
 
      # Access the data and target attributes
      print(bc.feature_names)

      ---Output---
      # ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
      #  'mean smoothness' 'mean compactness' 'mean concavity'
      #  'mean concave points' 'mean symmetry' 'mean fractal dimension'
      #  'radius error' 'texture error' 'perimeter error' 'area error'
      #  'smoothness error' 'compactness error' 'concavity error'
      #  'concave points error' 'symmetry error' 'fractal dimension error'
      #  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
      #  'worst smoothness' 'worst compactness' 'worst concavity'
      #  'worst concave points' 'worst symmetry' 'worst fractal dimension']

The dataset contains a total of 569 instances (or data points).

      data = bc.data
      print("Number of Instances, Number of features", data.shape)

      ---Output---
      # Number of Instances, Number of features (569, 30)

The bc.target_names attribute provides the names of the two classes:

  • Malignant (0): Indicates a cancerous breast mass.
  • Benign (1): Indicates a non-cancerous breast mass.

      print(bc.target_names)
      target = bc.target

      ---Output---
      # ['malignant' 'benign']
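
As an optional check (not part of the original walkthrough), numpy's bincount() shows how many samples fall into each class: the dataset contains 212 malignant and 357 benign instances.

      import numpy as np
      print(np.bincount(bc.target))

      ---Output---
      # [212 357]   # 212 malignant, 357 benign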

↪ 2. Split data

Split the dataset into training and testing sets. The code below divides the “Breast Cancer” dataset into training and testing sets using the train_test_split() function. bc.data represents the input features (independent variables), and bc.target represents the corresponding target (dependent variable). The split is performed such that 20% of the data is allocated to the test set (test_size=0.2), while the remaining 80% is used for training. The random_state=42 argument ensures that the split is reproducible across different runs by setting a specific seed for the random number generator, making the train and test splits consistent.

      from sklearn.model_selection import train_test_split
      X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target, test_size=0.2, random_state=42)
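
As a quick sanity check (optional), printing the shapes confirms the 80/20 split: 455 of the 569 rows go to training and 114 to testing.

      print(X_train.shape, X_test.shape)

      ---Output---
      # (455, 30) (114, 30)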

↪ 3. Model training

The code below demonstrates the initialization and training of a Gaussian Naive Bayes classifier using the scikit-learn library. First, it imports the GaussianNB class from the sklearn.naive_bayes module. Then, an instance of the GaussianNB classifier is created and assigned to the variable gnb. This establishes the model that will be used for classification. Finally, the fit() method is called on the gnb object, passing in the training features X_train and the corresponding labels y_train; this is the step that actually trains the model.

      from sklearn.naive_bayes import GaussianNB
      gnb = GaussianNB()
 
      # Train the classifier on the training set
      gnb.fit(X_train, y_train)
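
During fit(), GaussianNB estimates a prior probability for each class and a per-class mean and variance for every feature; these are what the model uses at prediction time. The learned parameters can be inspected afterwards (an optional step, not part of the original module; note that var_ was named sigma_ in scikit-learn versions before 1.0):

      print(gnb.class_prior_)    # P(class) estimated from y_train
      print(gnb.theta_.shape)    # (2, 30): per-class mean of each feature
      print(gnb.var_.shape)      # (2, 30): per-class variance of each feature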

↪ 4. Prediction

The code below uses the predict() method of the trained gnb object to generate predictions on the test data, X_test, and the predicted labels are stored in the prediction variable. Using the pandas library, it creates a DataFrame dframe containing two columns: 'Actual', populated with the true labels from y_test, and 'Pred', populated with the predicted labels from the prediction variable. Finally, it prints the first 5 rows of dframe using the head(5) method, providing a quick look at the model's performance.

      prediction = gnb.predict(X_test)
 
      # Compare the actual and predicted values
      import pandas as pd
      dframe = pd.DataFrame({'Actual': y_test, 'Pred': prediction}, columns=['Actual', 'Pred'])
      print(dframe.head(5))

      ---Output---
      #    Actual  Pred
      # 0       1     1
      # 1       0     0
      # 2       0     0
      # 3       1     1
      # 4       1     1

The code below demonstrates how to use the trained Naive Bayes classifier to predict the outcome for a new, unseen data point. The bc_in variable contains an array of 30 feature values computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. This input is passed to the predict() method of the trained Gaussian Naive Bayes classifier.

      import numpy as np
      bc_in = np.array([[ 5.57,     5.77,     30.9,    132.,       0.08474,   0.07864,
                          0.0869,   0.07017,   0.1812,   0.05667,   0.5435,    0.7339,
                          3.398,   74.08,      0.005225, 0.01308,   0.0186,    0.0134,
                          0.01389,  0.003532, 24.99,    23.41,    158.8,    1956.,
                          0.1238,   0.1866,    0.2416,   0.186,     0.275,     0.08902 ]])
 
      Predict_BC = gnb.predict(bc_in)
      print("Predicted Value: ", Predict_BC)

      ---Output---
      # Predicted Value:  [0]
      # 0 -> Indicates a cancerous breast mass.
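
Beyond the hard class label, the predict_proba() method returns the model's estimated posterior probability for each class, which gives a sense of how confident the prediction is (the exact values depend on the trained model):

      proba = gnb.predict_proba(bc_in)
      print("P(malignant), P(benign):", proba[0])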

↪ 5. Confusion Matrix

The code below calculates and displays a confusion matrix to evaluate the performance of a classification model. It uses the confusion_matrix() function to generate a confusion matrix, which summarizes the model's predictions against the true labels.

From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

      from sklearn.metrics import confusion_matrix
      cm = confusion_matrix(y_test, prediction)
      print(cm)

      ---Output---
      # [[40  3]
      #  [ 0 71]]

The .ravel() method flattens the confusion matrix into a 1D array. In scikit-learn's convention, the 2×2 confusion matrix for binary classification is laid out as [[TN, FP], [FN, TP]].

      TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()
      print('True Positive (TP)  = ', TP)
      print('False Positive (FP) = ', FP)
      print('True Negative (TN)  = ', TN)
      print('False Negative (FN) = ', FN)

      ---Output---
      # True Positive (TP)  = 71
      # False Positive (FP) = 3
      # True Negative (TN)  = 40
      # False Negative (FN) = 0

The accuracy score can then be calculated from these counts using the following formula.

      accuracy = (TP+TN) / (TP+FP+TN+FN)
      print('Accuracy of the classification = {:0.3f}'.format(accuracy))

      ---Output---
      # Accuracy of the classification = 0.974

Alternatively, the accuracy_score() function calculates the accuracy of a classification model's predictions.

      from sklearn.metrics import accuracy_score
      accuracy = accuracy_score(y_test, prediction)
      print(f"Accuracy: {accuracy}")

      ---Output---
      # Accuracy: 0.9736842105263158

↪ 6. Classification Report

The classification_report() function generates a text report showing the main classification metrics for each class in the dataset.

      from sklearn.metrics import classification_report
      print(classification_report(y_test, prediction))

      ---Output---
      #               precision    recall  f1-score   support
      #
      #            0       1.00      0.93      0.96        43
      #            1       0.96      1.00      0.98        71
      #
      #     accuracy                           0.97       114
      #    macro avg       0.98      0.97      0.97       114
      # weighted avg       0.97      0.97      0.97       114
  • precision: The ratio of true positives to the total predicted positives (true positives + false positives). It measures how many of the positively predicted instances were actually positive.
  • recall (Sensitivity): The ratio of true positives to the total actual positives (true positives + false negatives). It measures how many of the actual positive instances were correctly predicted.
  • f1-score: The harmonic mean of precision and recall. It provides a balanced measure that considers both precision and recall. A higher f1-score indicates better performance.
  • support: The number of actual instances in each class in the test set.
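
These metrics can be reproduced by hand from the TP, FP, TN, and FN counts extracted earlier; a quick sketch for the positive class (benign, label 1):

      precision = TP / (TP + FP)                                   # 71 / 74 ≈ 0.96
      recall    = TP / (TP + FN)                                   # 71 / 71 = 1.00
      f1        = 2 * precision * recall / (precision + recall)    # ≈ 0.98
      print('precision = {:0.2f}, recall = {:0.2f}, f1 = {:0.2f}'.format(precision, recall, f1))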

↪ 7. Strengths and Weaknesses

Strengths

  • Naive Bayes algorithms are computationally efficient and easy to implement, making them suitable for large datasets and real-time applications.
  • It performs well in high-dimensional settings, i.e., when there are many features.
  • Naive Bayes is naturally suited for datasets with categorical features.
  • It can often achieve good performance with relatively small amounts of training data compared to more complex algorithms.
Weaknesses

  • The core assumption that features are independent of each other is often unrealistic, and this can negatively affect the model's accuracy.
  • Naive Bayes is, in most of its common variants, effectively a linear classifier, making it less powerful than more complex algorithms for learning non-linear relationships in data.
  • Gaussian Naive Bayes assumes that features are normally distributed, which may not be the case in reality, potentially affecting its accuracy.