Naive Bayes Classification
Naive Bayes classification is a probabilistic machine learning algorithm based on Bayes' theorem, with a key simplifying assumption: that all features are independent of each other, given the class label. This "naive" assumption is what gives the algorithm its name. Despite this simplification, Naive Bayes can be surprisingly effective, especially for text classification and other problems with high-dimensional data. In essence, it calculates the probability of a data point belonging to each possible class and assigns the data point to the class with the highest probability. This calculation relies on simple per-class statistics of the feature values observed in the training data: counts for categorical features, or distribution parameters such as the mean and variance for continuous features. The independence assumption greatly simplifies these calculations, making Naive Bayes computationally efficient and fast.
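To make the decision rule concrete, the sketch below works through a tiny, hypothetical example (the class priors and likelihoods are made-up numbers, not drawn from any dataset). It shows how, under the independence assumption, the per-feature likelihoods are simply multiplied together with the class prior, and the class with the highest score wins.

# Toy sketch of the Naive Bayes decision rule (hypothetical numbers)
# Score for a class c is P(c) * product over features of P(feature | c)
priors = {'spam': 0.4, 'ham': 0.6}            # P(class)
likelihoods = {                               # P(feature | class)
    'spam': {'has_link': 0.7, 'all_caps': 0.5},
    'ham':  {'has_link': 0.2, 'all_caps': 0.1},
}

def nb_score(cls, features):
    score = priors[cls]
    for f in features:
        score *= likelihoods[cls][f]          # independence assumption
    return score

observed = ['has_link', 'all_caps']
scores = {c: nb_score(c, observed) for c in priors}
print(scores)                                 # spam ~ 0.14, ham ~ 0.012
print(max(scores, key=scores.get))            # -> 'spam'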
This module outlines the following steps:
➤ 1. Import data
➤ 2. Split data
➤ 3. Model training
➤ 4. Prediction
➤ 5. Confusion Matrix
➤ 6. Classification Report
➤ 7. Strengths and Weaknesses
↪ 1. Import data
The “Breast Cancer” dataset, a toy binary classification dataset included in the scikit-learn library, serves as a popular example for demonstrating machine learning classification tasks. The dataset's features are calculated from digitized images of fine needle aspirates (FNA) of breast masses and are used to predict if a mass is malignant (cancerous) or benign (non-cancerous).
The sklearn.datasets package provides access to the "Breast Cancer" dataset, which is loaded using the load_breast_cancer function. The returned dataset object is stored in the variable bc. This object contains the features, targets (labels), and descriptive metadata. Specifically, bc.feature_names holds a list of the names of all 30 features (for example, "mean radius" and "mean texture").
# Load Breast Cancer dataset
from sklearn.datasets import load_breast_cancer

bc = load_breast_cancer()

# Access the data and target attributes
print(bc.feature_names)
---Output---
# ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
#  'mean smoothness' 'mean compactness' 'mean concavity'
#  'mean concave points' 'mean symmetry' 'mean fractal dimension'
#  'radius error' 'texture error' 'perimeter error' 'area error'
#  'smoothness error' 'compactness error' 'concavity error'
#  'concave points error' 'symmetry error' 'fractal dimension error'
#  'worst radius' 'worst texture' 'worst perimeter' 'worst area'
#  'worst smoothness' 'worst compactness' 'worst concavity'
#  'worst concave points' 'worst symmetry' 'worst fractal dimension']
The dataset contains a total of 569 instances (or data points), each described by 30 features.
data = bc.data
print("Number of Instances, Number of features", data.shape)
---Output---
# Number of Instances, Number of features (569, 30)
The bc.target_names attribute provides the labels for the classifications. It has two classes:
Malignant (0): Indicates a cancerous breast mass.
Benign (1): Indicates a non-cancerous breast mass.
print(bc.target_names)
target = bc.target
---Output---
# ['malignant' 'benign']
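As an optional check (not part of the original walkthrough), numpy's bincount() counts how many instances fall into each class; the dataset contains 212 malignant and 357 benign samples.

import numpy as np

# Count the number of instances per class (0 = malignant, 1 = benign)
counts = np.bincount(bc.target)
for name, count in zip(bc.target_names, counts):
    print(name, count)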
↪ 2. Split data
Split the dataset into training and testing sets. The code below divides the "Breast Cancer" dataset into training and testing sets using the train_test_split() function. bc.data holds the input features (independent variables), and bc.target holds the corresponding target (dependent variable). The split allocates 20% of the data to the test set (test_size=0.2) and uses the remaining 80% for training. The random_state=42 argument seeds the random number generator so that the split is reproducible across runs, keeping the train and test sets consistent.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(bc.data, bc.target,
                                                    test_size=0.2,
                                                    random_state=42)
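As a quick sanity check (an optional addition to the module), printing the array shapes confirms the 80/20 split: 455 of the 569 instances go to training and 114 to testing.

# Verify the sizes of the resulting splits
print("Train:", X_train.shape, "Test:", X_test.shape)
# Train: (455, 30) Test: (114, 30)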
↪ 3. Model training
The code below demonstrates the initialization and training of a Gaussian Naive Bayes classifier using the scikit-learn library. First, it imports the GaussianNB class from the sklearn.naive_bayes module. Then, an instance of the GaussianNB classifier is created and assigned to the variable gnb; this establishes the model that will be used for classification. Finally, the fit() method is called on the gnb object with the training data, X_train and y_train; during fitting, GaussianNB estimates the mean and variance of each feature for each class.
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()

# Train the classifier on the training set
gnb.fit(X_train, y_train)
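After fitting, the learned parameters can be inspected (an optional step). In recent scikit-learn versions (1.0 and later), GaussianNB exposes the per-class feature means and variances as the theta_ and var_ attributes; older versions named the variance sigma_.

# Per-class mean and variance of each feature, shape (n_classes, n_features)
print(gnb.theta_.shape)   # (2, 30): mean of each feature for each class
print(gnb.var_.shape)     # (2, 30): variance of each feature for each class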
↪ 4. Prediction
The code below uses the predict() method of the trained gnb object to generate predictions on the test data, X_test; the predicted labels are stored in the prediction variable. Using the pandas library, it creates a DataFrame, dframe, containing two columns: 'Actual', populated with the true labels from y_test, and 'Pred', populated with the predicted labels from prediction. Finally, it prints the first 5 rows of dframe using the head(5) method, providing a quick look at the model's performance.
prediction = gnb.predict(X_test)

# Compare the actual and predicted values
import pandas as pd

dframe = pd.DataFrame({'Actual': y_test, 'Pred': prediction}, columns=['Actual', 'Pred'])
print(dframe.head(5))
---Output---
#    Actual  Pred
# 0       1     1
# 1       0     0
# 2       0     0
# 3       1     1
# 4       1     1
The code below demonstrates how to use the trained Naive Bayes classifier to predict the outcome for a new, unseen data point. The bc_in variable contains an array of 30 feature values computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. This input is then passed to the predict() method of the trained Gaussian Naive Bayes classifier.
import numpy as np

bc_in = np.array([[ 5.57, 5.77, 30.9, 132., 0.08474, 0.07864,
                    0.0869, 0.07017, 0.1812, 0.05667, 0.5435,
                    0.7339, 3.398, 74.08, 0.005225, 0.01308,
                    0.0186, 0.0134, 0.01389, 0.003532, 24.99,
                    23.41, 158.8, 1956., 0.1238, 0.1866, 0.2416,
                    0.186, 0.275, 0.08902]])

Predict_BC = gnb.predict(bc_in)
print("Predicted Value: ", Predict_BC)
---Output---
# Predicted Value: [0]
# 0 -> Indicates a cancerous breast mass.
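Beyond the hard class label, the predict_proba() method returns the model's estimated posterior probability for each class, which gives a sense of how confident the prediction is. This is an optional addition; the exact probabilities depend on the fitted model.

# Estimated posterior probabilities, one column per class [malignant, benign]
proba = gnb.predict_proba(bc_in)
print("Class probabilities:", proba)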
↪ 5. Confusion Matrix
The code below calculates and displays a confusion matrix to evaluate the performance of the classification model. The confusion_matrix() function summarizes the model's predictions against the true labels.
From this matrix, the code extracts the individual counts for True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, prediction)
print(cm)
---Output---
# [[40  3]
#  [ 0 71]]
The .ravel() method flattens the confusion matrix into a 1D array. A typical 2×2 confusion matrix (for binary classification) looks like this: [[TN, FP], [FN, TP]]
TN, FP, FN, TP = confusion_matrix(y_test, prediction).ravel()

print('True Positive (TP) = ', TP)
print('False Positive (FP) = ', FP)
print('True Negative (TN) = ', TN)
print('False Negative (FN) = ', FN)
---Output---
# True Positive (TP) =  71
# False Positive (FP) =  3
# True Negative (TN) =  40
# False Negative (FN) =  0
Accuracy score can be calculated using the following formula.
accuracy = (TP + TN) / (TP + FP + TN + FN)
print('Accuracy of the classification = {:0.3f}'.format(accuracy))
---Output---
# Accuracy of the classification = 0.974
Alternatively, the accuracy_score() function calculates the accuracy of a classification model's predictions.
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, prediction)
print(f"Accuracy: {accuracy}")
---Output---
# Accuracy: 0.9736842105263158
↪ 6. Classification Report
The classification_report() function generates a text report showing the main classification metrics for each class in the dataset: precision, recall, F1-score, and support (the number of true instances of each class).
from sklearn.metrics import classification_report

print(classification_report(y_test, prediction))
---Output---
#               precision    recall  f1-score   support
#
#            0       1.00      0.93      0.96        43
#            1       0.96      1.00      0.98        71
#
#     accuracy                           0.97       114
#    macro avg       0.98      0.97      0.97       114
# weighted avg       0.97      0.97      0.97       114
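These per-class metrics follow directly from the confusion-matrix counts. As an optional cross-check, the sketch below recomputes precision, recall, and F1-score for class 1 (benign, the positive class in the counts extracted earlier) by hand:

# Recompute class-1 metrics from the confusion-matrix counts
precision = TP / (TP + FP)    # 71 / (71 + 3) ~ 0.96
recall = TP / (TP + FN)       # 71 / (71 + 0) = 1.00
f1 = 2 * precision * recall / (precision + recall)
print('Precision = {:0.2f}, Recall = {:0.2f}, F1 = {:0.2f}'.format(precision, recall, f1))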
↪ 7. Strengths and Weaknesses
Strengths
➤ Fast to train and to predict: the independence assumption reduces learning to simple per-class, per-feature statistics.
➤ Performs well on high-dimensional data, such as text classification.
➤ Requires relatively little training data to estimate its parameters.
➤ Handles multi-class problems naturally.
Weaknesses
➤ The conditional-independence assumption rarely holds exactly, so accuracy can suffer when features are strongly correlated.
➤ Predicted probabilities tend to be poorly calibrated, even when the predicted class labels are accurate.
➤ GaussianNB assumes each feature is normally distributed within each class, which may not match the data.
➤ In categorical variants, a feature value never seen with a class during training receives zero probability (the "zero-frequency problem"), typically mitigated with Laplace smoothing.