Parallel Naïve Bayesian Classifier


Abstract

The naïve Bayesian classifier is a simple probabilistic classifier based on Bayes' theorem. It is used in data mining to classify new input. Naïve Bayes reduces a high-dimensional density estimation task to one-dimensional density estimation by assuming class conditional independence [7]. Although it assumes independence among the attributes of a given training set, it remains comparable in performance to decision trees and neural networks, because the assumption does not greatly affect the posterior probabilities and the algorithm continues to work well [7]. In this paper a parallel approach for implementing the naïve Bayesian classifier is discussed and implemented. The parallel approach integrates a set of classifiers, each trained on a subset of the data, into a single classifier [1]. It also includes attribute parallelization, where attributes are assigned to different processors so that their probabilities can be computed in parallel. The parallel implementation of this algorithm is expected to improve the performance of the naïve Bayesian classifier and increase its accuracy.

 

Introduction

The naïve Bayesian classifier is a statistical classifier that predicts the class label of a new entity from the probabilities of attribute values given each class, estimated from training data. The algorithm assumes class conditional independence, which means that the effect of an attribute value on a given class is independent of the values of the other attributes. Given an input with values for the specified attributes, the algorithm identifies its class by computing the conditional probability of each class given the input and choosing the class with the largest conditional probability. The algorithm is based on Bayes' theorem:

P(Ci|X) = P(X|Ci) * P(Ci) / P(X)

 

 

where X=<x1,…,xn> is an input over n attributes, each xi is the input value for the ith attribute, and Ci is a class value for 1 <= i <= m, supposing that there are m classes.

Since P(X) is constant for all classes, only P(X|Ci)*P(Ci) needs to be maximized.

The naïve Bayesian classifier proceeds as follows:

1-     Find the probabilities of all classes:

Pk = rk / r

Where Pk is the probability of class Ck, r is the total number of records, and rk is the number of records having class Ck.

2-     For the given input X=<x1,…,xn> and class labels C=<C1,C2,…,Cm>, find P(xi|Ck) for each input value xi of the ith attribute and for 1 <= k <= m (all classes):

If the attribute is categorical:

P(xi|Ck) = rik / rk

Where rik is the number of records having class Ck and the value xi for the ith attribute.

If the attribute is continuous-valued, the attribute is typically assumed to have a Gaussian distribution with mean µ and standard deviation σ:

g(x, µ, σ) = (1 / (√(2π) σ)) * e^(-(x - µ)² / (2σ²))

P(xk|Ci) = g(xk, µCi, σCi)

where µCi and σCi are the mean and standard deviation of the kth attribute over the training records of class Ci.

 

3-     For each class Ci, find the probability P(X|Ci) by applying the following formula for 1 <= k <= n:

P(X|Ci) = ∏k=1..n P(xk|Ci) = P(x1|Ci) * P(x2|Ci) * … * P(xn|Ci)

 

4-     In order to predict the class label of X, P(Ci|X) = P(X|Ci)*P(Ci) is evaluated for each class Ci, 1 <= i <= m, and the class Cj having the highest P(Cj|X) is chosen as the class of the given input.
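As a concrete illustration of the four steps above (restricted to categorical attributes for brevity), the following minimal Python sketch, written for this article with illustrative names rather than taken from the authors' code, trains a naïve Bayesian classifier and classifies one input:

from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    # Step 1: P(Ck) = rk / r for every class value Ck
    r = len(records)
    class_counts = Counter(labels)
    priors = {c: class_counts[c] / r for c in class_counts}

    # rik: number of records of class Ck having value xi for the ith attribute
    value_counts = defaultdict(Counter)       # key: (attribute index, class)
    for row, c in zip(records, labels):
        for i, value in enumerate(row):
            value_counts[(i, c)][value] += 1
    return priors, value_counts, class_counts

def classify(x, priors, value_counts, class_counts):
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        # Steps 2 and 3: P(X|Ck) is the product of P(xi|Ck) = rik / rk
        likelihood = 1.0
        for i, value in enumerate(x):
            likelihood *= value_counts[(i, c)][value] / class_counts[c]
        # Step 4: keep the class maximizing P(X|Ck) * P(Ck)
        score = likelihood * prior
        if score > best_score:
            best_class, best_score = c, score
    return best_class

For continuous attributes, the per-attribute likelihood would instead be evaluated with the Gaussian g(xk, µCi, σCi) given above.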

 

In order to improve the performance of the naïve Bayesian classifier, the algorithm is implemented in parallel. Several parallel implementations exist, including the following:

In [1], the naïve Bayesian classifier is parallelized by first dividing the training set into k subsets and then applying the algorithm to each subset, so that k classifiers are obtained. These classifiers are then integrated into a single classifier to find the decision rules [1]. In order to classify an unknown sample X, P(Ci|X) is calculated for each class value:

Assign X to class Ci if

P(Ci|X) = (1/k) * Σj=1..k wj * Pj(X|Ci) * Pj(Ci)

is the maximum, for i = 1, 2, …, m and j = 1, 2, …, k,

Where

wj = (1 - fj) / Σj=1..k (1 - fj)

and fj is the error rate of classifier Cj.

In order to classify an unknown sample X, P(Ci|X) is calculated for all classes as shown in the equation above: each classifier j computes Pj(X|Ci)*Pj(Ci), the result is multiplied by that classifier's assigned weight wj, the weighted values are summed, and the sum is divided by the total number of classifiers, k. This is how the classifiers are integrated. The weight of a classifier is obtained by first finding its error rate fj, subtracting it from 1, and then dividing by the sum of these values over all k classifiers.
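A minimal sketch of this weighted integration, assuming each of the k classifiers already provides its estimate of P(X|Ci)*P(Ci) for every class and has a measured error rate fj (all names below are illustrative, not the authors' code):

def combine_classifiers(scores_per_classifier, error_rates, class_values):
    # scores_per_classifier[j][ci] holds Pj(X|Ci)*Pj(Ci) from classifier j
    k = len(scores_per_classifier)
    total = sum(1.0 - f for f in error_rates)
    weights = [(1.0 - f) / total for f in error_rates]   # wj = (1 - fj) / sum(1 - fl)

    combined = {}
    for ci in class_values:
        combined[ci] = sum(weights[j] * scores_per_classifier[j][ci]
                           for j in range(k)) / k
    # Assign X to the class with the highest combined posterior estimate
    return max(combined, key=combined.get)

For example, with two classifiers having error rates 0.2 and 0.4, the weights become 0.8/1.4 ≈ 0.57 and 0.6/1.4 ≈ 0.43.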

By integrating multiple classifiers, the recognition error rate can be reduced and the robustness of classification can be improved [3]. Thus the research on integration of multiple classifiers becomes important. At present, recognition based on the integration of multiple classifiers has been applied in many fields, such as handwriting and text recognition [4], face recognition [5], time-series prediction [6], etc.

In effect, naïve Bayesian classification reduces a high-dimensional density estimation task to one-dimensional kernel density estimation [7], because by assuming variable independence, the conditional probabilities can be calculated separately for each variable. Furthermore, the assumption does not seem to greatly affect the posterior probabilities, especially in regions near decision boundaries, thus, leaving the classification task unaffected.

 

Proposed Method

The parallel implementation of the naïve Bayesian classifier is done by dividing the attributes into p subsets, where p is the number of available processors. Each subset contains roughly n/p attributes, where n is the number of attributes.

These subsets are assigned to different processors, so the probabilities P(Xi|Cj) for each class can be computed in parallel across attributes. After finding all the conditional probabilities P(Xi|Cj) for all classes, the conditional probabilities that belong to the same class are multiplied together with the class prior P(Cj) in order to obtain P(Cj|X). The maximum P(Cj|X) is then found and class Cj is assigned to input X.
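One simple way to form these p subsets, sketched here for illustration (the round-robin split is an assumption; any roughly even split serves the same purpose):

def partition_attributes(n_attributes, p):
    # Round-robin assignment of attribute indices to the p processors;
    # each processor receives roughly n_attributes / p attributes.
    return [list(range(i, n_attributes, p)) for i in range(p)]

# e.g. partition_attributes(5, 2) -> [[0, 2, 4], [1, 3]]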

The parallel algorithm is preceded by a pre-processing phase in which the data set is organized into data structures: for each attribute, a structure is constructed containing the attribute name, its distinct values, and the class count of each class for each distinct value.
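A possible shape for these per-attribute structures, sketched in Python (the type and field names are assumptions made for this article, not taken from the original implementation):

from collections import Counter
from dataclasses import dataclass, field

@dataclass
class AttributeSummary:
    name: str
    # distinct attribute value -> Counter of class labels observed with that value
    class_counts_per_value: dict = field(default_factory=dict)

def preprocess(records, labels, attribute_names):
    summaries = [AttributeSummary(name) for name in attribute_names]
    for row, label in zip(records, labels):
        for i, value in enumerate(row):
            counts = summaries[i].class_counts_per_value.setdefault(value, Counter())
            counts[label] += 1
    return summaries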

Implementation and Analysis

The parallel naïve Bayesian classifier is implemented as follows:

ClassifyInput (X)
{
Compute P(Cj) for all class values
Divide the attributes among the available processors
For each attribute xi, processed in parallel
{
Compute P(xi|Cj) for all classes
}
For each class Cj
{
P(Cj|X) = P(Cj) * product of P(xi|Cj) over all input values xi
}
Choose the highest P(Cj|X) and label the input with class Cj
}

The above algorithm shows the implementation of the parallel naïve Bayesian classifier: the conditional probabilities P(xi|Cj) are evaluated in parallel by distributing the attributes among different processors. After finding the conditional probabilities, those that belong to the same class are multiplied together with the class prior, and the class having the maximum resulting probability is chosen as the class label for the input X. Record parallelization was also used to parallelize the naïve Bayesian classifier: the records of the database are distributed among the processors, which all participate in computing P(Cj|X) for each single class.
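A minimal Python sketch of this attribute-parallel classification using a process pool (written for this article, not the authors' code; it assumes categorical attributes and per-attribute summaries of the kind built in the pre-processing phase, i.e. a mapping from attribute value to per-class record counts):

from multiprocessing import Pool

def attribute_likelihoods(args):
    # Runs on one worker: P(xi|Cj) for a single attribute value, over all classes
    value_counts, value, class_counts = args   # value_counts: attribute value -> per-class counts
    counts = value_counts.get(value, {})
    return {c: counts.get(c, 0) / rc for c, rc in class_counts.items()}

def classify_input(x, attribute_summaries, priors, class_counts, processors=4):
    # One task per attribute; the pool spreads the tasks over the available processors
    tasks = [(attribute_summaries[i], x[i], class_counts) for i in range(len(x))]
    with Pool(processes=processors) as pool:
        per_attribute = pool.map(attribute_likelihoods, tasks)

    scores = {}
    for c, prior in priors.items():
        score = prior                        # P(Cj)
        for likelihoods in per_attribute:
            score *= likelihoods[c]          # multiply the P(xi|Cj) belonging to the same class
        scores[c] = score
    return max(scores, key=scores.get)       # label the input with the most probable class

If the AttributeSummary structures from the earlier sketch are used, attribute_summaries would simply be [s.class_counts_per_value for s in summaries].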

The implementation of parallel naive Bayesian classifier would significantly reduce the time complexity from O (ND) to O (N/p * D/p) where N is the number of training records and D is the number of attributes.

 

Experiments and Results

The goal of this experiment was to study the effect of parallelizing the naïve Bayesian classifier in order to speed up the learning process when large data sets are used to train the system. The Iris database, of size 16 MB, was used to train the system; it is “perhaps the best known database to be found in the pattern recognition literature” [8]. The data set contains three classes, where each class refers to a type of iris plant. Five attributes are present in this database: Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute, which can take three values: “Iris-setosa”, “Iris-virginica” and “Iris-versicolour”. The data was preprocessed before running the algorithm by building new data structures so that it could fit in memory. The experiment was run on two machines, one having a single processor and the other having seven processors, and the obtained results were compared. By applying the above-mentioned parallel procedures, the obtained execution time of 3.89 seconds was approximately the same as the execution time of the serial approach, which was 4.542 seconds. The time complexity of the algorithm would also be reduced from O(ND) to O(N/p * D/p), where N is the number of training records, D is the number of attributes and p is the number of available processors. In the time complexity, N was replaced by N/p and D was replaced by D/p because in the parallel naïve Bayesian classifier, instead of processing N records and D attributes in a serial manner, these can be divided among the p processors so that each processor processes a subset of records of size N/p and a subset of attributes of size D/p in parallel. The accuracy of the algorithm was also calculated using the holdout method, where two thirds of the data was used for training the system and one third for testing it. The training and testing datasets were chosen randomly from the database and the accuracy was calculated; this process was repeated ten times, and the obtained average accuracy of the parallel algorithm was 33%. The parallel implementation of this algorithm didn’t result in speeding up the naïve Bayesian algorithm.
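The holdout evaluation described above can be sketched as follows (illustrative Python; train_and_classify is a hypothetical placeholder for any routine that trains on the given records and returns a classification function):

import random

def holdout_accuracy(records, labels, train_and_classify, repetitions=10):
    # Two thirds of the data train the system, one third tests it; repeated ten times
    accuracies = []
    for _ in range(repetitions):
        indices = list(range(len(records)))
        random.shuffle(indices)
        cut = (2 * len(indices)) // 3
        train_idx, test_idx = indices[:cut], indices[cut:]
        predict = train_and_classify([records[i] for i in train_idx],
                                     [labels[i] for i in train_idx])
        correct = sum(1 for i in test_idx if predict(records[i]) == labels[i])
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)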

Conclusion

In this paper a parallel naïve Bayesian classifier was implemented through a new approach, attribute and record parallelization, that hasn’t been addressed before. The parallel implementation of this algorithm didn’t result in a significant increase in the speed of the naïve Bayesian algorithm. The implementation of the ensemble method along with attribute and record parallelization is considered for future work.

References

[1] Peng-Tao Jia, Hua-Can He, and Wei Lin, “Decision by Maximum of Posterior Probability Average with Weights: A Method of Multiple Classifiers Combination”.

 

[2] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, No. 3, pp. 226-239, Mar. 1998.

 

[3] R. P. W. Duin and D. M. J. Tax, “Experiments with Classifier Combining Rules”, presented at the 1st International Workshop on Multiple Classifier Systems, Cagliari, Italy, pp. 16-29, Jun. 2000.

 

[4] L. Xu, A. Krzyzak, and C. Y. Suen, “Methods for Combining Multiple Classifiers and Their Applications to Handwriting Recognition”, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 22, No. 3, pp. 418-435, May 1992.

 

[5] Xiaoguang Lu, Yunhong Wang, and A. K. Jain, “Combining Classifiers for Face Recognition”, presented at the IEEE International Conference on Multimedia & Expo, Jul. 2003.

 

[6] C. Dietrich, F. Schwenker, and G. Palm, “Classification of time series utilizing temporal and decision fusion”, Proceedings of Multiple Classifier Systems (MCS), Cambridge, pp. 378-387. Feb. 2001.

 

[7] Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification (2nd Edition), John Wiley & Sons, Inc., 2001.

 

[8] http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.names

Parallel k-Nearest Neighbor


K-Nearest Neighbor, or the KNN algorithm, is a supervised learning method that has been used in many applications including data mining, statistical pattern recognition, and image processing. The algorithm doesn’t build a classification model; instead it classifies directly from the training values kept in storage or memory. To identify the class of an input, the algorithm chooses the class to which the majority of the input’s k closest neighbors belong. KNN is considered one of the simplest machine learning algorithms. However, it is computationally expensive, especially when the size of the training set becomes large, which can make the classification task very slow. Several attempts have been made to parallelize KNN on the GPU by taking advantage of the GPU’s naturally parallel architecture [5]. In this paper, however, the KNN algorithm was parallelized on the CPU by distributing the distance computations of the k nearest neighbors among different processors. The parallel implementation greatly increased the speed of the KNN algorithm by reducing its time complexity from O(D), where D is the number of records, to O(D/p), where p is the number of processors.

Keywords: K Nearest Neighbor, GPU, manycore, CPU, parallel processors.

INTRODUCTION

The KNN algorithm is a widely applied method for classification in machine learning and pattern recognition. It was known to be computationally intensive when given large training sets, and did not gain popularity until the 1960s when increased computing power became available.

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k “nearest neighbors” of the unknown tuple. “Closeness” is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is:

 

dist(X1, X2) = √( Σi=1..n (x1i - x2i)² )     (1)

 

The above formula applies to numerical data. For categorical attributes, a simple method is to compare the corresponding values of the attributes in tuple X1 with those in tuple X2. If the two are identical, the difference between them is taken as 0; if they are different, the difference is considered to be 1. Other methods may incorporate more sophisticated schemes for differential grading.
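A minimal sketch of this mixed numeric and categorical distance computation (illustrative only; it assumes the attribute types are known in advance):

from math import sqrt

def distance(x1, x2, is_numeric):
    # Squared difference for numeric attributes, 0/1 mismatch for categorical ones
    total = 0.0
    for a, b, numeric in zip(x1, x2, is_numeric):
        if numeric:
            total += (a - b) ** 2
        else:
            total += 0.0 if a == b else 1.0
    return sqrt(total)

# Example: three numeric attributes followed by one categorical attribute
d = distance((5.1, 3.5, 1.4, "red"), (4.9, 3.0, 1.3, "blue"),
             is_numeric=(True, True, True, False))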

When computed in this way on a serial machine, the time complexity is clearly linear with respect to the number of data points. Hence, there is an interest in mapping the process onto a highly parallel machine in order to further optimize the running time of the algorithm. It should be noted, however, that serial implementations of the k-NN rule employing branch and bound search algorithms [1] (a systematic method for solving optimization problems) can scale sublinearly, such that the asymptotic time complexity may be constant with respect to the number of data points. Nonetheless, a fully parallel hardware implementation should still be much faster than the most efficient serial implementations.

Many parallel methods have been developed to increase the speed of the KNN algorithm, including:

1.     One method uses neural networks to construct a multi-layer feed-forward network that implements exactly a 1-NN rule. The advantage of this approach is that the resulting network can be implemented efficiently. The disadvantage is that the training time can grow exponentially with the dimension of the pattern space, which could make it impractical.

 

2.     A CUDA implementation of the “brute force” kNN search is described in [6]. The advantage of this method is the highly parallel architecture of the GPU.

 

In this paper, the “brute force” kNN is studied and implemented on the CPU rather than the GPU, where the degree of parallelism is determined by the number of available cores or processors. The proposed algorithm is not expected to outperform state-of-the-art GPU implementations but rather to provide equivalent performance on the CPU. Hence, the benefit becomes the ability to share load between the CPU and the GPU without degradation or loss of speed when switching between the two processor architectures.

PROPOSED METHOD

 

The brute force kNN algorithm is highly parallelizable [2] by nature, since the computation of the distance between the input sample and any single training sample is independent of the distance computation to any other sample. This allows the computation to be partitioned with minimal synchronization effort. In fact, no inter-processor communication or message passing is required at all while each processor is computing the distances between the samples in its local storage and the input sample. When all processors have finished the distance computation, the final step is for a master processor to collect the results from all processors, sort the distances in ascending order, and then use the first k measures to determine the class of the input sample.

 

The proposed algorithm is described in the following steps:

1.     Select one processor to be the master; the other N-1 processors are slaves.

2.     The master divides the training samples into N subsets and distributes one subset to each processor, keeping one subset for local processing (the master participates in distance computation too).

3.     Each processor independently computes the distance measures for its subset, storing the computed measures in a local array.

4.     When a processor finishes its distance calculations, it transmits a message to the master indicating the end of processing.

5.     The master notes the end of processing for the sender and acquires the computed measures by copying them into its own array.

6.     After the master has collected all distance measures from all processors, the following steps are performed:

a.      Sort all distance measures in ascending order.

b.     Select the top k measures.

c.      Count the occurrences of each class among the top k measures.

d.     The input element is assigned the class having the highest count among the top k measures.
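A minimal Python sketch of the steps above, using a process pool in place of explicit master and slave message passing (illustrative; the Euclidean distance and the default values of k and the processor count are assumptions for this article):

from collections import Counter
from math import sqrt
from multiprocessing import Pool

def euclidean(x1, x2):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def distances_for_subset(args):
    # Each worker computes the distances between its local subset and the query
    subset, query = args
    return [(euclidean(row, query), label) for row, label in subset]

def parallel_knn(training, query, k=3, processors=8):
    # The master splits the training samples into one subset per processor
    subsets = [training[i::processors] for i in range(processors)]
    with Pool(processes=processors) as pool:
        partial = pool.map(distances_for_subset, [(s, query) for s in subsets])

    # The master collects all measures, sorts them ascending and keeps the top k
    all_measures = sorted(m for part in partial for m in part)
    votes = Counter(label for _, label in all_measures[:k])
    return votes.most_common(1)[0][0]       # majority class among the k nearest neighbors

Here training is assumed to be a list of (attribute tuple, class label) pairs.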

EXPERIMENTS & RESULTS

 

The goal of this experiment was to study the performance of parallelizing KNN on the CPU for large data sets versus parallel GPU implementations. The Iris database was used to train the system; it is “perhaps the best known database to be found in the pattern recognition literature” [7]. The data set contains three classes of fifty instances each, where each class refers to a type of iris plant. Five attributes are present in this database: Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute, which can take three values: “Iris-setosa”, “Iris-virginica” and “Iris-versicolour”. The data was preprocessed before running the algorithm by building new data structures so that it could fit in memory. However, given the small number of records in the Iris database, the experiment would not yield solid results, so the 50 records were cloned and randomly appended 1000 times onto a new, larger Iris database of 50,000 total records. The experiment was run on an Intel 8-core machine and the obtained results were compared with a corresponding implementation on a 64-core GPU. In order to account for the larger number of cores on the GPU, the degree of parallelism was set to 8 to match the number of cores on the CPU, so 8 of the available 64 GPU cores were used. By applying the above-mentioned parallel procedures, the CPU program was able to compete with the GPU, showing equivalent performance and even outperforming the GPU after the test was repeated several times. This was due to cache locality: after several repeated runs, cache misses became less frequent, so the time spent fetching from memory was diminished. This also improved bus utilization and hence reduced power consumption. These results were expected because KNN is by nature highly parallelizable and scales well with many-core architectures. The complexity of the parallel algorithm is O(D/p), where D is the number of records and p is the number of available cores; hence, with a number of processors on the order of the number of records, the complexity approaches constant time O(1).

 

CONCLUSION

 

In this paper a parallel KNN algorithm was implemented by applying a new approach to distribute the computation of the distance measures across all data points. The parallel technique reduced the running time of the algorithm on the CPU, making it faster and more efficient than the serial kNN and a competitor to state-of-the-art GPU-based implementations.