## Parallel k-Nearest Neighbor

February 26, 2011 4 Comments

**K-Nearest Neighbor or KNN algorithm is part of supervised learning that has been used in many applications including data mining, statistical pattern recognition, and image processing. The algorithm doesn’t build a classification model but instead it is based on values found in storage or memory. To identify the class of an input, the algorithm chooses the class to which the majority of the input’s k closest neighbors belong to. The KNN algorithm is considered as one of the simplest machine learning algorithms. However it is computationally expensive especially when the size of the training set becomes large which would cause the classification task to become very slow. Several attempts have been made to parallelize KNN on the GPU by taking advantage of the natural parallel architecture of GPU [5]. However, in this paper the KNN algorithm was parallelized on CPU by distributing the distance computations of the k nearest neighbors among different processors. The parallel implementation greatly increased the speed of the KNN algorithm by reducing its time complexity from O(D) ,where D is the number of records, to O(D/p) where p is the number of processors.**

** **

*Keywords: K Nearest Neighbor, GPU, manycore, CPU, parallel processors.*** **

**INTRODUCTION**

** **

The KNN algorithm is a widely applied method for classification in machine learning and pattern recognition. It was known to be computationally intensive when given large training sets, and did not gain popularity until the 1960s when increased computing power became available.

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by *n *attributes. Each tuple represents a point in an *n*-dimensional space. In this way, all of the training tuples are stored in an *n*-dimensional pattern space. When given an unknown tuple, a ** k**-nearest-neighbor classifier searches the pattern space for the

*k*training tuples that are closest to the unknown tuple. These

*k*training tuples are the

*k*“nearest neighbors” of the unknown tuple. “Closeness” is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say,

**1 = (**

*X**x*11,

*x*12, …,

*x*1

*n*) and

**2 = (**

*X**x*21,

*x*22, …,

*x*2

*n*), is:

*dist(X1,X2)*** = ** (1)

The above algorithm applies for numerical data. For categorical attributes, a simple method is to compare the corresponding values of the attributes in tuple ** X1 **with those in tuple

**. If the two are identical, then the difference between the two is taken as 0. If the two are different, then the difference is considered to be 1. Other methods may incorporate more sophisticated schemes for differential grading.**

*X*2When computed in this way on a serial machine, the time complexity is clearly linear with respect to the number of data points. Hence, there is an interest in mapping the process onto a highly parallel machine in order to further optimize the running time of the algorithm. It should be noted, however, that serial implementations of the k-NN rule employing branch and bound search algorithms[1] (a systematic method for solving optimization problems) can scale sublinearly, such that the asymptotic time complexity may be constant with respect to the number of data points. Nonetheless, a fully parallel hardware implementation should still be much faster than the most efficient serial implementations.

Many parallel methods were conducted to increase the speed of the KNN algorithm including:

1. The method uses Neural Networks for constructing a multi-layer feed-forward network that implements exactly a 1-NN rule. The advantage of this approach is that the resulting network can be implemented efficiently. The disadvantage is that the training time can’t grow exponentially for high dimensional pattern spaces, which could make it impractical.

2. A CUDA implementation of the “brute force” kNN search described in [6] is performed. The advantage of this method is the highly parallelizable architecture of the GPU.

In this paper, the “brute force” kNN is studied and implemented on CPU rather than a GPU where the degree of parallelism is indicated by the number of available cores or processors. The proposed algorithm is not expected to outperform state-of-the art GPU implementations but rather, to provide an equivalent performance on CPU. Hence, the benefit becomes the ability of load sharing between CPU and GPU without degradation or loss of speed upon switching between any of the two processor architectures.

**PROPOSED METHOD**

The nature of the brute force kNN algorithm can be assumed to be highly parallelizable[2] by nature, since computation of the distance between the input sample and any single training sample is independent of the distance computation to any other sample. This allows for partitioning the computation work with least synchronization effort. In fact, no inter communication or message passing is required at all during the time each processor is computing the distance between samples in its local storage and the input sample. When all processors terminate the distance computation procedure, the final step is to select a master processor to collect the results from all processors, sort the distances in ascending order, and then use the first 8 measures to determine the class of the input sample.

The proposed algorithm is described in the following steps:

1. Select 1 processor to be the master, the other N-1 processors are slaves.

2. Master divides the training samples to N subsets, and distributes 1 subset for each processor, keeping 1 subset for local processing (Master participates in distance computation too).

3. Each individual processor now computes the distance measures independently and storing the computes measures in a local array

4. When each processor terminates distance calculation, the processor transmits a message to the master indicating end of processing

5. Master then notes the end of processing for the sender processor and acquires the computes measures by copying them into its own array

6. After the master has claimed all distance measures from all processors, the following steps are performed:

a. Sort all distance measures in ascending order

b. Select top k measures

c. Count the number of classes in the top k measures

d. The input element’s class will belong the class having the higher count among top k measures

** **

**EXPERIMENTS& RESULTS**

The goal of this experiment was to study the performance of parallelizing KNN on CPU for large data sets versus parallel GPU implementations. Iris database was used to train the system which is “*perhaps the best known database to be found in the pattern recognition literature*” [7]. The data set contains three classes of fifty instances each, where each class refers to a type of iris plant. Five attributes are present in this database which are Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute which can contain three values: “*Iris-setosa*”, “*Iris-virginica*” and “*Iris-versicolour*”. These datasets were preprocessed before running the algorithm by building new data structures so that they can fit in memory. However, given the small number of records in the Iris database, the experiment would not reflect solid results, thus all the 50 records were cloned and randomly appended 1000 times on to a new larger Iris database of 50,000 total records. The experiment was implemented on an Intel 8-core machine and the obtained results were compared with respective implementation on a 64-core GPU. In order to accommodate for the biased number of cores on the GPU, the degree of parallelism was set to 8 in order to properly compare with the number of cores on the CPU, so 8 cores were used out of the available 64-cores. By applying the above mentioned parallel procedures, the CPU program was able to compete with the GPU showing equivalent performance but outperforming the GPU after repeating the test for several times. This was due to cache locality because after several repeated runs, chances of cache misses became less frequent and therefore fetching time from memory incurred by continuous trips was diminished. This also improved Bus utilization and hence power consumption was reduced. These results were expected because of the nature of KNN is that it highly parallelizable and it scales well with many-core architectures. Complexity of the parallel algorithm is **O(D/p) **where D is the number of records and p is the number of available cores. Hence, given p processors, complexity reduces to constant time **O(1)**.

**CONCLUSION**

In this paper parallel KNN algorithm was implemented by applying a new approach to facilitate computation of the distance measures of all data points. The implementation of the parallel technique reduced the running time of the algorithm on CPU which would make the algorithm a faster, more efficient than the serial kNN and competitor to state-of-the art GPU based implementation.** **