Parallel Naïve Bayesian Classifier
March 2, 2011 3 Comments
Parallel Naïve Bayesian Classifier
The Naïve Bayesian classifier is a simple probabilistic classifier algorithm based on the Bayes theorem. It is used in data mining for the classification of new input. Naive Bayes reduces a high-dimensional density estimation task to one dimensional density estimation by assuming class conditional independence . Its assumption of independence among the variables of a given training set doesn’t deny the fact that it is comparable in performance to decision trees and neural networks because this assumption doesn’t greatly affect the posterior probabilities and the algorithm continues to work well . In this paper a parallel approach for implementing the naïve Bayesian classifier is discussed and implemented. The parallel approach is done through the parallel integration of a set of classifiers into a single classifier . Besides the parallel integration of a set of classifiers into one, the parallel approach also includes attribute parallelization where attributes can be assigned to different processors which allow the parallel computations of their probabilities. The parallel implementation of this algorithm is expected to improve the performance of the naïve Bayesian classifier and increase its accuracy.
Naïve Bayesian classifier is a statistical classifier that classifies the class label of new entities based on the probabilities of variables given a class from training data. This algorithm assumes class conditional independence which means that the effect of an attribute value on a given class is independent of the values of the other attributes. The algorithm takes an input which has values for specified attributes and is required to identify the class of the input by computing the conditional probabilities of each class given this input and then choosing the largest conditional probability and denoting the input with the selected class. The algorithm is based on the Bayes theorem:
where X=<x1,…,xn> is an input for n attributes, each xi is an input value for the ith attribute and Ci is a class value for supposing that there are m classes.
Since P(X) is constant for all classes only P(X | Ci)*P(Ci) needs to be maximized.
The naïve Bayesian classifier proceeds as follows:
1- Find the probabilities of all classes:
Where , Pk is the probability of havingk, r is the total number of records, and rk is the number of records havingk
2- For the given input X=<x1,…,xn> and class labels C=<C1,C2,C3,…,Cm> find P(Xi | Ck) for each given input value for a given attribute and for 1<= k <= m (all classes):
If the attribute is categorical value:
P (Xi | Ck) =
Where rik is the number of records havingk and the value Xi for the ith attribute.
If the attribute is continuous-valued, the attribute is typically assumed to have a Gaussian distribution with a mean µ and standard deviation s:
3- For each class find the probability P(X|Ci) by applying the following formula for :
P(X|Ci ) = P(xk|Ci)
= P(x1|Ci) * P(x2|Ci)*….*P(xn|Ci)
4- In order to predict the class label of X, P(Ci|X) = P(X|Ci)*P(Ci) is evaluated for each class for and then the class Cj having the highest P(Cj|X) is chosen to be the class of the given input.
In order to improve the performance of the naïve Bayesian classifier, the algorithm is implemented in parallel. Several parallel implementations exist including:
 The naïve Bayesian classifier is parallelized by first dividing the training set into k subsets and then applying this algorithm to each subset so that k classifiers are obtained. These classifiers are then integrated into a single classifier to find the decision rules . In order to classify an unknown sample X, P (C i |X) is calculated for each class value:
Assign X Ci if
For i=1,2,…,m; j=1,2,…k;
In order to classify an unknown sample X, P(Ci | X) is calculated for all classes. The calculation of P(Ci|X) is shown in the above equation, where from each classifier P(X|Ci)*P(Ci) is calculated then its multiplied by an assigned weight for each classifier. These values are added then divided by the total number of classifiers which is k. This is how the classifiers are integrated. The weight of the classifier is calculated by first finding the error rate of the classifier, subtracting it from 1 then dividing it by
Where fj is the error rate for classifier Cj.
By using integration of multiple classifiers, the recognition error rate can be reduced and robust of classification can be improved.  Thus the research of integration of multiple classifiers becomes important. At present, recognition based on integration of multiple classifiers was applied in many fields, such as handwritten and text recognition , face recognition , time-series prediction , etc.
In effect, naïve Bayesian classification reduces a high-dimensional density estimation task to one-dimensional kernel density estimation , because by assuming variable independence, the conditional probabilities can be calculated separately for each variable. Furthermore, the assumption does not seem to greatly affect the posterior probabilities, especially in regions near decision boundaries, thus, leaving the classification task unaffected.
The parallel implementation of naïve Bayesian classifier is done by dividing the attributes into p subsets, where p is the number of available processors. Each subset would contain n/p attributes, where n is the number of attributes.
These subsets are assigned to different processors and thus the calculation of the probabilities P(Xi|Cj) for each class can be found in parallel with other attributes probabilities. After finding all the conditional probabilities P(Xi|Cj) for all classes, the conditional probabilities that belong to the same class are multiplied in order to obtain P(Cj|X). Then we find the maximum P(Cj|X) and assign class Cj to input X.
The parallel algorithm is preceded by a pre-processing phase where the data list is organized into data structures where for each attribute a data structure is constructed containing the attribute name, its distinct values, and the class count for each class for each distinct value.
Implementation and Analysis
The parallel naïve Bayesian classifier is implemented as follows:
Compute P(Cj) for all class values
Divide the attributes among different processors
For Each attribute processed in parallel
Compute P(Cj|Xi)=P(Xi|Cj)*P(Cj) for all classes
For Each class
multiply P(Cj|Xi) for all input
Choose the highest P(Cj|Xi) and label the input with Cj class
The implementation of the parallel naïve Bayesian classifier is shown in the above algorithm, where evaluation of the conditional probabilities P(Xi|Cj) is done in parallel, by distributing the attributes among different processors. After finding the conditional probabilities, the conditional probabilities that belong to the same class are multiplied and then the class having the maximum obtained probability is chosen as the class label for the input X. Record parallelization was also used for parallelizing naive Bayesian classifier by allowing the processors to participate in computing P(Cj|Xi) for each sing class by distributing the records of the database among them.
The implementation of parallel naive Bayesian classifier would significantly reduce the time complexity from O (ND) to O (N/p * D/p) where N is the number of training records and D is the number of attributes.
Experiments and Results
The goal of this experiment was to study the effects of parallelizing naive Bayesian classifier in order to speed up the learning process when large data sets are present for training the system. Iris database of size 16 MB was used to train the system which is “perhaps the best known database to be found in the pattern recognition literature” . The data set contains three classes, where each class refers to a type of iris plant. Five attributes are present in this database which are Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute which can contain three values: “Iris-setosa”, “Iris-virginica” and “Iris-versicolour”. These datasets were preprocessed before running the algorithm by building new data structures so that they can fit in memory. The experiment was implemented on two machines one having a single processor and the other having seven processors and the obtained results were compared. By applying the above mentioned parallel procedures, the obtained execution time which is 3.89 seconds was approximately the same as the execution time of the serial approach which is 4.542 seconds. Also the time complexity of the algorithm would be reduced from O (ND) to O (N/p * D/p) where N is the number of training records, D is the number of attributes and p is the number of available processors. In the time complexity, N was replaced by N/p and D was replaced by D/p because in the parallel naive Bayesian classifier, instead of processing N attributes and D records in a serial manner, these numbers can be divided among the p processors so that each processor can now process a subset of attributes of size N/p and a subset of records of size D/p in a parallel manner. The accuracy of the algorithm was also calculated by using the holdout method, where two third of the data was used for training the system and one third of the data was used for testing the system. The training and testing datasets were chosen randomly from the database and the accuracy was calculated. This process is repeated ten times and the obtained average accuracy of the parallel algorithm was 33%. The parallel implementation of this algorithm didn’t result in speeding up the naive Bayesian algorithm.
In this paper parallel naïve Bayesian classifier was implemented through a new approach that hasn’t been addressed before through attribute and record parallelization. The parallel implementation of this algorithm didn’t result in an important increase in the speed of the naïve Bayesian algorithm. The implementation of the ensemble method along with attribute and record parallelization are taken into consideration for future work.
 PENG-TAO JIA, HUA-CAN HE, WEI LIN
“DECISION BY MAXIMUM OF POSTERIOR PROBABILITY AVERAGE WITH WEIGHTS: A METHOD OF MULTIPLE CLASSIFIERS COMBINATION”
 J. Kittler, M. Hatef, Duin R. P. W., and J. Matas, “On combining classifiers”, IEEE Transactions on Pattern analysis and Machine Intelligence, Vol 20, No. 3, pp. 226-239, Mar. 1998.
 Duin R. P. W., and Tax D. M. J., “Experiments with Classifier Combining Rules”, presented at the 1st International Workshop on Multiple Classifier System, Cagliari, Italy, pp. 16-29, Jun. 2000.
 Xu L, Krzyzak A, and Suen C Y, “Methods for Combining Multiple Classifiers and Their Applications to Handwriting Recognition”, IEEE Transactions on Systems，Man，and Cybernetics, Vol 22, No. 3, pp. 418-435, May. 1992.
 Xiaoguang Lv, Yunhong Wang and AK Jain, “Combining Classifiers for Face Recognition”, presented at the IEEE International Conference on Multimedia &Exp, Jul. 2003.
 C. Dietrich, F. Schwenker, and G. Palm, “Classification of time series utilizing temporal and decision fusion”, Proceedings of Multiple Classifier Systems (MCS), Cambridge, pp. 378-387. Feb. 2001.
 Richard O. Duba, Peter E. Hart, and David G. Stork, Pattern Classification(2nd Edition), John Wiley & Sons, Inc., 2001.