A Deep Dive into the Naïve Bayes Classifier and k-Fold Cross-Validation
Naïve Bayes Classification
Naïve Bayes is a probabilistic machine learning technique used to classify data. It is based on Bayes' theorem, which gives the probability of a hypothesis given observed evidence. In a classification problem, the Naïve Bayes method estimates the probability that an instance belongs to each class based on its attributes, and assigns the most probable class.
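For reference, Bayes' theorem can be written as follows for a class C and an observed feature vector x. The symbols C and x are notation introduced here for illustration and do not appear elsewhere in the article.

```latex
P(C \mid x) = \frac{P(x \mid C)\, P(C)}{P(x)}
```

Here P(C) is the prior probability of the class, P(x | C) is the likelihood of the features under that class, and P(C | x) is the posterior probability that the classifier compares across classes.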
The “naïve” in Naïve Bayes refers to the assumption that all of the instance’s attributes are independent of one another given the class. In other words, once the class is known, the method assumes that the value of one feature has no influence on the probability of any other feature's value.
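To make the independence assumption concrete, here is a minimal from-scratch sketch of a Gaussian Naïve Bayes classifier. It is not the scikit-learn implementation used later in the article; the function names (`log_gaussian`, `fit`, `predict`) and the toy data are hypothetical and chosen only for illustration. The key step is in `predict`: because features are treated as independent given the class, their log-likelihoods are simply added.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def fit(rows, labels):
    """Estimate a prior and a per-feature (mean, variance) pair for each class."""
    stats = {}
    for label in set(labels):
        group = [r for r, l in zip(rows, labels) if l == label]
        prior = len(group) / len(rows)
        params = []
        for column in zip(*group):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            params.append((mean, var))
        stats[label] = (prior, params)
    return stats

def predict(stats, row):
    """Return the class with the highest log-posterior (up to a constant)."""
    scores = {}
    for label, (prior, params) in stats.items():
        score = math.log(prior)
        # Independence assumption: per-feature log-likelihoods add up.
        for value, (mean, var) in zip(row, params):
            score += log_gaussian(value, mean, var)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical toy data: [size in square metres, number of rooms] -> price band
rows = [[50, 2], [55, 2], [60, 3], [120, 4], [130, 5], [140, 5]]
labels = ["low", "low", "low", "high", "high", "high"]

stats = fit(rows, labels)
prediction = predict(stats, [58, 2])
print(prediction)
```

Working in log space avoids numerical underflow when many small probabilities would otherwise be multiplied together.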
Naïve Bayes is well known for its simplicity and its efficiency on large datasets with high-dimensional feature spaces. It needs relatively little training data and trains quickly. Despite its simplicity, Naïve Bayes has proven useful in a wide range of real-world applications.
K-Fold Cross Validation
K-fold cross-validation is a technique for evaluating the performance of a machine learning model by splitting the dataset into k equal-sized subsets, called folds.
The dataset is randomly divided into k disjoint sets, each referred to as a fold. The model is trained on k-1 folds and tested on the remaining fold, and this is repeated k times so that each fold serves as the test set exactly once. The results are averaged to obtain the final performance estimate. Because every score is computed on data the model did not see during training, the estimate is less likely to be inflated by overfitting.
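The splitting procedure described above can be sketched in a few lines of plain Python. This is an illustrative version, not the scikit-learn `KFold` implementation; the helper name `k_fold_indices` and the sample count are assumptions made for the example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for fold_number, (train, test) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train on {len(train)} samples, test on {sorted(test)}")
```

Every sample lands in exactly one test fold, so each sample is used for testing once and for training k-1 times.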
The value of k is chosen based on the size of the dataset. For example, if the dataset has 100 instances, k = 10 or k = 20 gives test folds of 10 or 5 instances each. K-fold cross-validation is mainly used when the dataset is small or imbalanced, or when the performance of the model needs to be estimated reliably.
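When classes are imbalanced, a plain random split can leave some folds with very few minority samples. One common remedy, which the code in this article does not use, is stratified splitting. The sketch below assumes scikit-learn's `StratifiedKFold` and uses made-up labels (90 of one class, 10 of the other) to compare how many minority samples each test fold receives.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))  # placeholder features; only the labels matter here

splitters = {
    "KFold": KFold(n_splits=5, shuffle=True, random_state=0),
    "StratifiedKFold": StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
}

minority_counts = {}
for name, splitter in splitters.items():
    # Count how many minority-class samples land in each test fold.
    minority_counts[name] = [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]
    print(name, minority_counts[name])
```

With stratification, each of the five test folds receives two of the ten minority samples, whereas the plain split can distribute them unevenly.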
Let’s consider an example that predicts the price for a real-estate dataset using a Naïve Bayes classifier and validates it with k-fold cross-validation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
#Read the data
df = pd.read_csv("dataset.csv")
df
#Preprocess the data
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df["Neighborhood"] = le.fit_transform(df["Neighborhood"])
#Check for null values
df.isnull().sum()
# Split the data into features (X) and target (y)
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
# Set up K-Fold cross-validation with 11 shuffled folds
kfold = KFold(n_splits=11, shuffle=True, random_state=0)
# Create an empty list to store the accuracy of each fold
accuracy_scores = []
# Iterate over the folds and store each fold's accuracy in the list
for train_index, test_index in kfold.split(X):
    # Split the data into training and testing sets
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    # Train a fresh model on the training folds
    model = GaussianNB()
    model.fit(X_train, y_train)
    # Evaluate on the held-out fold
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)
#Calculate Mean Accuracy Score
mean_accuracy = np.mean(accuracy_scores)
print("Mean accuracy score:", mean_accuracy*100)
# Print the accuracy score, classification report and confusion matrix
# (y_test and y_pred here come from the last fold only)
print(accuracy_score(y_test, y_pred)*100)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
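# Visualization: actual vs. predicted values for the first 50 rows of the last fold
# (columns are ordered to match the legend; the row index is not plotted)
test = pd.DataFrame({'Actual': y_test.values, 'Predicted': y_pred})
fig = plt.figure(figsize=(16, 8))
plt.plot(test[:50])
plt.legend(['Actual', 'Predicted'])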
First, the required libraries are imported and the dataset is read. The ‘Neighborhood’ column is then converted from strings to numbers, because the Gaussian Naïve Bayes model expects numeric input, and the data is checked for null values. Next, the data is split into features and target, and k-fold cross-validation with 11 folds is applied: in each fold the model is trained on the training portion and tested on the held-out portion, and the accuracy is stored in a list. Finally, the mean accuracy is computed, and the accuracy, classification report and confusion matrix are displayed, along with a plot of the actual and predicted values.
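The manual loop over folds can also be expressed more compactly with scikit-learn's `cross_val_score`. The sketch below is an alternative, not the article's own code, and since the article's dataset is not available here it uses synthetic data from `make_classification` as a stand-in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the article's dataset (an assumption for this example).
X, y = make_classification(n_samples=220, n_features=5, random_state=0)

kfold = KFold(n_splits=11, shuffle=True, random_state=0)

# One accuracy value per fold; a fresh copy of the model is fitted for each fold.
scores = cross_val_score(GaussianNB(), X, y, cv=kfold, scoring="accuracy")

print("Per-fold accuracy:", np.round(scores, 3))
print("Mean accuracy: %.2f%% (std %.2f%%)" % (scores.mean() * 100, scores.std() * 100))
```

Reporting the standard deviation next to the mean gives a rough sense of how much the score varies from fold to fold.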
To predict the price for new, unseen real-estate data, read the dataset, convert the ‘Neighborhood’ column from strings to numbers, and pass the data to the model.
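# Read the data
test = pd.read_csv("test_dataset.csv")
# Transform the data with the encoder already fitted on the training data
# (calling fit_transform again could assign different codes to the same neighborhoods)
test["Neighborhood"] = le.transform(test["Neighborhood"])
# Prediction (the model here is the one trained in the last fold;
# test must contain the same feature columns as X)
y_pred = model.predict(test)
y_pred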
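One detail worth checking when encoding new data is whether the category-to-number mapping matches the one used during training. The sketch below uses made-up neighborhood names to show how `LabelEncoder.transform` reuses the training mapping, while a fresh `fit_transform` on the new data may produce different codes for the same category.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical neighborhood names, used only for illustration.
train = pd.DataFrame({"Neighborhood": ["Rural", "Suburb", "Urban", "Suburb"]})
new = pd.DataFrame({"Neighborhood": ["Urban", "Urban", "Rural"]})

le = LabelEncoder()
train_codes = le.fit_transform(train["Neighborhood"])  # learns the mapping from training data

reused = le.transform(new["Neighborhood"])  # same mapping as training
refit = LabelEncoder().fit_transform(new["Neighborhood"])  # mapping relearned from new data

print("classes learned on training data:", list(le.classes_))
print("transform (consistent):", list(reused))
print("fit_transform (refit):", list(refit))
```

In this example "Urban" is encoded as 2 by the training mapping but as 1 after refitting, which would silently change the meaning of the feature for the model. Note that `transform` raises a `ValueError` if the new data contains a category that was not seen during fitting, so such values need to be handled separately.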
Summing up
In this article, we have seen what a Naïve Bayes classifier and k-fold cross-validation are, how to implement them on a real-estate dataset, and how to predict real-estate prices with the trained model.