Parallel k-Nearest Neighbor

[Figure: example of k-nearest neighbour classification, image via Wikipedia]

K-Nearest Neighbor (KNN) is a supervised learning algorithm that has been used in many applications, including data mining, statistical pattern recognition, and image processing. The algorithm does not build a classification model; instead, it classifies directly from the training records kept in storage or memory. To identify the class of an input, the algorithm assigns the class to which the majority of the input's k closest neighbors belong. KNN is considered one of the simplest machine learning algorithms, but it is computationally expensive, especially when the training set becomes large, which makes classification very slow. Several attempts have been made to parallelize KNN on the GPU by taking advantage of its naturally parallel architecture [5]. In this paper, however, the KNN algorithm is parallelized on the CPU by distributing the distance computations of the k nearest neighbors among different processors. The parallel implementation greatly increases the speed of the KNN algorithm by reducing its time complexity from O(D), where D is the number of records, to O(D/p), where p is the number of processors.

Keywords: K-Nearest Neighbor, GPU, many-core, CPU, parallel processors.

INTRODUCTION

The KNN algorithm is a widely applied method for classification in machine learning and pattern recognition. It has long been known to be computationally intensive for large training sets and did not gain popularity until the 1960s, when increased computing power became available.

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k “nearest neighbors” of the unknown tuple. “Closeness” is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is:

 

dist(X1, X2) = sqrt( (x11 − x21)² + (x12 − x22)² + … + (x1n − x2n)² )        (1)

 

The above formula applies to numerical data. For categorical attributes, a simple method is to compare the corresponding values of the attributes in tuple X1 with those in tuple X2. If the two are identical, the difference between them is taken as 0; if they are different, the difference is considered to be 1. Other methods may incorporate more sophisticated schemes for differential grading. A minimal sketch of such a mixed distance function is given below.
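As an illustration only, here is a minimal VB.NET sketch of the combined numeric/categorical distance. It assumes each tuple is stored as an Object array whose numeric attributes are Double values and whose categorical attributes are String values; the function name and data layout are hypothetical and not taken from the paper.

    ' Minimal sketch of the combined distance described above. Assumes each
    ' tuple is an Object array: numeric attributes are Double, categorical
    ' attributes are String. The layout is illustrative only.
    Function MixedDistance(x1 As Object(), x2 As Object()) As Double
        Dim sum As Double = 0
        For i As Integer = 0 To x1.Length - 1
            If TypeOf x1(i) Is Double Then
                ' Numeric attribute: squared difference (Euclidean term).
                Dim d As Double = CDbl(x1(i)) - CDbl(x2(i))
                sum += d * d
            Else
                ' Categorical attribute: difference is 0 if identical, 1 otherwise.
                sum += If(x1(i).Equals(x2(i)), 0.0, 1.0)
            End If
        Next
        Return Math.Sqrt(sum)
    End Function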

When computed in this way on a serial machine, the time complexity is clearly linear in the number of data points. Hence, there is interest in mapping the process onto a highly parallel machine in order to further optimize the running time of the algorithm. It should be noted, however, that serial implementations of the k-NN rule employing branch-and-bound search algorithms [1] (a systematic method for solving optimization problems) can scale sublinearly, such that the asymptotic time complexity may be constant with respect to the number of data points. Nonetheless, a fully parallel hardware implementation should still be much faster than the most efficient serial implementations.

Several parallel methods have been developed to increase the speed of the KNN algorithm, including:

1. A method that uses neural networks to construct a multi-layer feed-forward network implementing exactly a 1-NN rule. The advantage of this approach is that the resulting network can be implemented efficiently. The disadvantage is that the training time can grow exponentially for high-dimensional pattern spaces, which could make it impractical.

 

2. A CUDA implementation of the “brute force” kNN search described in [6]. The advantage of this method is that it exploits the highly parallel architecture of the GPU.

 

In this paper, the “brute force” kNN is studied and implemented on the CPU rather than the GPU, where the degree of parallelism is determined by the number of available cores or processors. The proposed algorithm is not expected to outperform state-of-the-art GPU implementations but rather to provide equivalent performance on the CPU. The benefit is therefore the ability to share load between CPU and GPU without degradation or loss of speed when switching between the two processor architectures.

PROPOSED METHOD

 

The brute-force kNN algorithm is highly parallelizable [2] by nature, since the computation of the distance between the input sample and any single training sample is independent of the distance computation to any other sample. This allows the work to be partitioned with minimal synchronization effort. In fact, no inter-processor communication or message passing is required at all while each processor computes the distances between the samples in its local storage and the input sample. When all processors have finished the distance computations, the final step is for a designated master processor to collect the results from all processors, sort the distances in ascending order, and use the first k measures to determine the class of the input sample.

 

The proposed algorithm is described in the following steps (a brief VB.NET sketch follows the list):

1. Select one processor to be the master; the other N-1 processors are slaves.

2. The master divides the training samples into N subsets and distributes one subset to each processor, keeping one subset for local processing (the master participates in the distance computation too).

3. Each processor then computes its distance measures independently and stores the computed measures in a local array.

4. When a processor finishes its distance calculations, it sends a message to the master indicating the end of processing.

5. The master notes the end of processing for the sending processor and acquires the computed measures by copying them into its own array.

6. After the master has collected the distance measures from all processors, the following steps are performed:

a. Sort all distance measures in ascending order.

b. Select the top k measures.

c. Count the occurrences of each class among the top k measures.

d. The input element's class is the class with the highest count among the top k measures.
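The following is a minimal VB.NET sketch of steps 2 through 6, not the authors' implementation: Parallel.For stands in for the explicit master/slave message passing, and the names Classify, train, labels, and query are hypothetical.

    Imports System.Linq
    Imports System.Threading.Tasks

    Module ParallelKnnSketch

        ' Classify a query point against in-memory training data.
        Function Classify(train As Double()(), labels As String(),
                          query As Double(), k As Integer) As String
            Dim n As Integer = train.Length
            Dim dist(n - 1) As Double

            ' Steps 2-4: each worker computes the distances for its own chunk
            ' of training samples; no communication is needed in this phase.
            Parallel.For(0, n,
                Sub(i)
                    Dim sum As Double = 0
                    For j As Integer = 0 To query.Length - 1
                        sum += (train(i)(j) - query(j)) ^ 2
                    Next
                    dist(i) = Math.Sqrt(sum)
                End Sub)

            ' Steps 5-6 ("master"): sort by distance, take the k closest
            ' samples, and return the class with the highest count among them.
            Return Enumerable.Range(0, n) _
                .OrderBy(Function(i) dist(i)) _
                .Take(k) _
                .GroupBy(Function(i) labels(i)) _
                .OrderByDescending(Function(g) g.Count()) _
                .First().Key
        End Function

    End Module

Replacing Parallel.For with explicit per-processor partitions and a final merge at the master would mirror the message-passing description above more literally.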

EXPERIMENTS & RESULTS

 

The goal of this experiment was to study the performance of parallelizing KNN on the CPU for large data sets versus parallel GPU implementations. The Iris database, “perhaps the best known database to be found in the pattern recognition literature” [7], was used to train the system. The data set contains three classes of fifty instances each, where each class refers to a type of iris plant. Five attributes are present: Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label, which can take three values: “Iris-setosa”, “Iris-virginica”, and “Iris-versicolour”. The data was preprocessed before running the algorithm by building new data structures so that it fits in memory. Given the small number of records in the Iris database, however, the experiment would not produce solid results, so the 50 records were cloned and randomly appended 1000 times to form a larger Iris database of 50,000 records in total.

The experiment was run on an Intel 8-core machine, and the obtained results were compared with a corresponding implementation on a 64-core GPU. To account for the larger number of cores on the GPU, its degree of parallelism was set to 8 to match the number of CPU cores, so only 8 of the available 64 GPU cores were used. By applying the parallel procedure described above, the CPU program was able to compete with the GPU, showing equivalent performance and even outperforming the GPU after the test was repeated several times. This was due to cache locality: after several repeated runs, cache misses became less frequent, so the time spent on repeated trips to main memory diminished. This also improved bus utilization and hence reduced power consumption.

These results were expected because KNN is highly parallelizable by nature and scales well with many-core architectures. The complexity of the parallel algorithm is O(D/p), where D is the number of records and p is the number of available cores; in the limit where the number of processors approaches the number of records, the distance-computation step approaches constant time.

 

CONCLUSION

 

In this paper, a parallel KNN algorithm was implemented by applying a new approach that facilitates computation of the distance measures of all data points. The parallel technique reduced the running time of the algorithm on the CPU, making it faster and more efficient than serial kNN and a competitor to state-of-the-art GPU-based implementations.

Parallel Apriori algorithm for frequent pattern mining

Apriori is a frequent pattern mining algorithm for discovering association rules. Together with the FP-Growth algorithm, it is one of the best-known algorithms for discovering frequent patterns. However, as storage for very large databases has advanced and the number of transactions has grown tremendously, sequential Apriori has become a bottleneck because of its long running time. In this paper, our goal is to develop a new parallel version of Apriori that reduces the overall running time of the algorithm. Although Apriori is not known to be highly parallelizable, several attempts have been made to parallelize it in various ways, either by using parallel I/O hardware to optimize database scans or by distributing the workload over multiple processors. However, many of the parallel approaches suffer from noticeable latencies when synchronizing the results collected from each individual processor after a parallel iteration terminates. Our approach tries to maximize the workload executed in parallel and to minimize the synchronization-point delays by adopting a parallel pre-computing scheme during generation of the superset. By applying our new approach, the running time of the algorithm is reduced by an order of magnitude compared to other parallel implementations of the same algorithm.

INTRODUCTION

Apriori is a frequent pattern mining algorithm for discovering association rules, originally developed by Rakesh Agrawal and Ramakrishnan Srikant [4]. It operates on a list of transactions containing items (for example, products in a supermarket). Frequent co-occurrences of items are mined by Apriori to discover relationships among different items. A set of items is called an itemset. Apriori uses a minimum support value as the main constraint to determine whether an itemset is frequent. In the first pass, the algorithm constructs the candidate 1-itemsets. It then generates the frequent 1-itemsets by pruning the candidate 1-itemsets whose support values are lower than the minimum support. After the algorithm finds all the frequent 1-itemsets, it joins them with each other to construct the candidate 2-itemsets and prunes the infrequent ones to create the frequent 2-itemsets. This process is repeated until no more candidate itemsets can be created. A brief sketch of one such iteration is given below.
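As a point of reference only, here is a minimal, sequential VB.NET sketch of one Apriori iteration (support counting with pruning, plus the join step). The names and data layout are illustrative assumptions and do not come from the original paper.

    Imports System.Collections.Generic
    Imports System.Linq

    Module AprioriSketch

        ' Count each candidate itemset over the transactions and keep those
        ' meeting the minimum support (the prune step).
        Function FrequentItemsets(transactions As List(Of HashSet(Of String)),
                                  candidates As List(Of HashSet(Of String)),
                                  minSupport As Integer) As List(Of HashSet(Of String))
            Dim frequent As New List(Of HashSet(Of String))
            For Each c In candidates
                ' Support = number of transactions that contain every item of c.
                Dim support As Integer =
                    transactions.Count(Function(t) c.All(Function(item) t.Contains(item)))
                If support >= minSupport Then frequent.Add(c)
            Next
            Return frequent
        End Function

        ' Join step: build candidate (k+1)-itemsets by unioning pairs of
        ' frequent k-itemsets that differ in exactly one item.
        Function GenerateCandidates(frequentK As List(Of HashSet(Of String))) As List(Of HashSet(Of String))
            Dim result As New List(Of HashSet(Of String))
            For i As Integer = 0 To frequentK.Count - 2
                For j As Integer = i + 1 To frequentK.Count - 1
                    Dim joined As New HashSet(Of String)(frequentK(i))
                    joined.UnionWith(frequentK(j))
                    If joined.Count = frequentK(i).Count + 1 Then result.Add(joined)
                Next
            Next
            Return result
        End Function

    End Module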

Apriori has a bottleneck when the number of transactions is very large, since multiple database scans are performed in every iteration. In order to improve the scalability of Apriori, several data structures and methods have been devised to boost performance. These include:

Transaction Reduction: A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration.

Partitioning: A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min_sup, then the minimum support count for a partition is min_sup multiplied by the number of transactions in that partition. For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.

Sampling: The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy against efficiency. The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.

Dynamic itemset counting: A dynamic itemset counting technique [1] was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent.

However, even with the above-mentioned approaches, the performance of Apriori decreases substantially when a large number of transactions is involved. Therefore, many parallel approaches have been developed to increase the speed of the Apriori algorithm, including:

1. Partition the transactions into N partitions, where N is the number of processors; in each iteration every processor computes local supports and then exchanges results with the other N-1 processors to generate the next superset. [2]

2. Use a master/slave scheme by selecting one processor to be the master and N-1 processors to be workers. In this approach, the input transactions are not partitioned; instead, the master processor generates unique itemsets and divides the list of itemsets equally among the N-1 processors, which compute their supports. Once the support computations complete, all N-1 processors send their results to the master in order to compute the next superset and divide the work further. [3]

In this paper, a variation of the second approach is studied and implemented, in which an additional data structure is used to predict the itemsets of the next iteration and cache them for later retrieval, thus reducing database scans. The second approach was selected because it provides high flexibility in how itemsets are distributed between cores and, mainly, because of the similarity between its master/slave architecture and the proposed one.

PROPOSED METHOD

Our approach focuses on the algorithmic part of Apriori: it tries to maximize the workload executed in parallel and to minimize, as much as possible, the synchronization-point delays by allowing a portion of the superset generation that forms the candidates to be performed in parallel as well. Only the remaining part of the superset is generated at the synchronization point.

The proposed algorithm is described in the following steps (a brief VB.NET sketch follows the list):

1. Select one processor to be the master; the other N-1 processors are slaves.

2. Create a local hash table at the master; we refer to it as G for the remainder of this paper.

3. Initialize K, the itemset size, to 1.

4. Generate the 1-itemsets normally using the master processor only.

5. At the master, divide the K-itemsets into N-1 groups and assign each group to exactly one of the available N-1 processors. Then, for each processor:

a. Look up the support of each local itemset in G; if it is not found, compute the support.

b. Prune itemsets whose support is less than the minimum support, the threshold value used to decide whether to drop or keep an itemset (the minimum support is global to the entire system).

c. Generate the candidate (K+1)-itemsets from the local itemsets and find their support.

d. Store the results of the (K+1)-itemsets (each itemset and its support) into G.

e. Send the master processor a signal indicating end of processing.

6. At the master processor, on each arrival of a finalization signal:

a. Generate the (K+1) superset using the itemsets in G.

7. Increment K by 1.

8. Go to step 5 and repeat until there are no more itemsets to generate.
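A minimal VB.NET sketch of steps 5 and 6 is shown below. It is an assumption-laden illustration rather than the paper's implementation: a ConcurrentDictionary plays the role of the hash table G, Parallel.ForEach stands in for explicit master/slave messaging, and candidate generation for the next level is omitted.

    Imports System.Collections.Concurrent
    Imports System.Collections.Generic
    Imports System.Linq
    Imports System.Threading.Tasks

    Module ParallelAprioriSketch

        ' Shared cache G: canonical itemset key -> support count.
        ReadOnly G As New ConcurrentDictionary(Of String, Integer)

        ' Steps 5a-5b for one level: workers look up (or compute) supports in
        ' parallel and prune itemsets below the minimum support.
        Function ProcessLevel(transactions As List(Of HashSet(Of String)),
                              kItemsets As List(Of HashSet(Of String)),
                              minSupport As Integer) As List(Of HashSet(Of String))
            Dim frequent As New ConcurrentBag(Of HashSet(Of String))

            Parallel.ForEach(kItemsets,
                Sub(itemset)
                    ' Step 5a: reuse a cached support if a previous iteration
                    ' already stored it in G, otherwise count it now.
                    Dim key As String = String.Join(",", itemset.OrderBy(Function(x) x))
                    Dim support As Integer = G.GetOrAdd(key,
                        Function(unused) transactions.Count(
                            Function(t) itemset.All(Function(item) t.Contains(item))))

                    ' Step 5b: prune itemsets whose support is too low.
                    If support >= minSupport Then frequent.Add(itemset)
                End Sub)

            ' Step 6 (master): the surviving itemsets feed generation of the
            ' (K+1) superset; candidate generation itself is omitted here.
            Return frequent.ToList()
        End Function

    End Module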

EXPERIMENTS & RESULTS

The goal of this experiment was to study the performance of parallelizing Apriori on the CPU for large data sets. The Iris database, “perhaps the best known database to be found in the pattern recognition literature” [7], was used to train the system. The data set contains three classes of fifty instances each, where each class refers to a type of iris plant. Five attributes are present: Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label, which can take three values: “Iris-setosa”, “Iris-virginica”, and “Iris-versicolour”. The data was preprocessed before running the algorithm by building new data structures so that it fits in memory. Given the small number of records in the Iris database, however, the experiment would not produce solid results, so the 50 records were cloned and randomly appended 1000 times to form a larger Iris database of 50,000 records in total.

The experiment was run on an Intel 8-core machine, and the obtained results were recorded and compared with performance statistics of state-of-the-art Apriori. By applying the parallel procedure described above, the CPU program was able to compete with state-of-the-art Apriori. Results showed that the proposed method is as fast as the state-of-the-art method but better in terms of bus utilization: 75% bus utilization was incurred by state-of-the-art Apriori versus 34% by our proposed method. This greatly reduced the total power consumption of the algorithm.

CONCLUSION

In this paper, a parallel Apriori algorithm was implemented by applying a new approach that uses an intermediate data structure to compute and cache the (k+1) superset for use in future iterations, thereby facilitating computation of the superset in parallel and increasing the degree of parallelism. The parallel technique matched state-of-the-art Apriori in speed but outperformed it in bus utilization by roughly 40 percentage points, which makes the algorithm more energy-efficient than the current best Apriori.



Sieve of Eratosthenes

Sieve of Eratosthenes is an ancient algorithm for finding prime numbers developed by the great mathematician Eratosthenes. Although the algorithm is centuries old, it is still widely in use and considered one of the most common algorithms for generating prime numbers. There are a couple of algorithms that outperform this one, the Sieve of Atkin and the Sieve of Sundaram. I will discuss these in future posts, but for the moment let's dig into Eratosthenes' algorithm:

The algorithm takes as input an integer N, and generates all primes from 2 up to N. It works as follows:

  • generate all numbers from 2 up to N in ascending order
  • since 2 is the first prime, walk through the list of numbers and eliminate every second number from it
  • grab the next integer that survived the elimination (which is 3), and eliminate every third number from the list
  • then grab the next integer that survived that elimination (which is 5) and repeat the same process until the square of the next grabbed number is greater than N

Below is a graphical representation of the algorithm from Wikipedia:

It is easy to implement and test this algorithm in a few lines of code. Create a console application, replace the Main method with the following one, and we're done 🙂 (Special greetings to Mohammad Selek, who contributed the source code).

     Sub Main()
        Console.Write("Find primes up to: ")
        Dim s As String = Console.ReadLine()
        Dim upperLimit As Integer = Integer.Parse(s.Trim())

        ' prime(i) = 0 means "i is still considered prime";
        ' prime(i) = 1 means "i has been eliminated (composite)".
        Dim prime(upperLimit) As Integer
        Dim c1, c2, c3 As Integer
        For c1 = 2 To upperLimit
            prime(c1) = 0
        Next
        prime(0) = 1
        prime(1) = 1

        ' Only sieve with numbers up to sqrt(upperLimit): any composite above
        ' that bound has a smaller factor that already eliminated it.
        For c2 = 2 To CInt(Math.Sqrt(upperLimit)) + 1
            If prime(c2) = 0 Then
                ' c2 survived, so it is prime: eliminate all of its multiples.
                c1 = c2
                c3 = 2 * c1
                While c3 < upperLimit + 1
                    prime(c3) = 1
                    c3 = c3 + c1
                End While
            End If
        Next

        ' Print every number that was never eliminated.
        For c1 = 0 To upperLimit
            If prime(c1) = 0 Then
                Console.Write(c1 & "     ")
            End If
        Next
        Console.ReadKey()
    End Sub

Here is the output when N=1000

Finding the Majority element in an array

In its simplest terms, the majority element is the element that occurs the most within a given array of numbers. By “the most” we mean that the number of occurrences of the element must be more than half the size of the array. In mathematical terms, given an unsorted array of n integers, an element is a majority element if it occurs at least ⌊n/2⌋ + 1 times.

The majority element is an important problem in computer science and is used as a basis for more complex algorithms. We also encounter the majority element in daily life. For example, the ownership of a large company is determined by shareholders holding more than half the total shares. Another example is an election, where the winner is determined by obtaining more than half the votes.

In this article, we will look at a fast algorithm that finds the majority element in linear time, using two O(n) passes. The algorithm was described by Rengakrishnan Subramanian in December 2001; his paper can be found here.

The algorithm approaches the problem in two steps. In the first step, we call the method FindElement, which takes an array of integers as input and returns an element that is “most likely” to be the majority. In the second step, the algorithm calls FindMajority, which takes the same array as input along with the candidate element determined earlier, and returns the element if it turns out to be the majority, or nothing if it is not.

The FindElement algorithm works as follows:

1- declare two variables; we'll call the first one count and the second one canbeMajority, which stores the candidate majority element.

2- assume that the first number in the array is the majority and assign its value to canbeMajority.

3- traverse the input array from start to end while applying the following rules:

  • If the next element in the array is equal to canbeMajority, then count is incremented by 1.
  • If the next element in the array is not equal to canbeMajority, then count is decremented by 1.
  • If, at the next element, count is equal to zero, then count is set to one and that element becomes the new canbeMajority (in the implementation below, this check is applied first).

Here is the implementation of FindElement and FindMajority in VB.NET:

    ' First pass: scan the array once, keeping a running candidate and a
    ' counter; the surviving candidate is the only possible majority element.
    Private Function FindElement(ByRef arr() As Integer) As Integer
        Dim count As Integer = 0
        Dim element As Integer = 0
        For i As Integer = 0 To arr.Length - 1
            If count = 0 Then
                ' Counter exhausted: adopt the current value as the new candidate.
                count = 1
                element = arr(i)
            ElseIf arr(i) = element Then
                count += 1
            Else
                count -= 1
            End If
        Next
        Return element
    End Function

    ' Second pass: verify the candidate really occurs more than n/2 times.
    ' Returns Nothing (a null Integer?) when there is no majority element.
    Private Function FindMajority(ByRef arr() As Integer) As Integer?
        Dim element As Integer = FindElement(arr)
        Dim count As Integer = 0
        For i As Integer = 0 To arr.Length - 1
            If arr(i) = element Then count += 1
        Next
        If count > arr.Length / 2 Then Return element Else Return Nothing
    End Function

I have created a simple application that reads or generates an arbitrary array of numbers and finds its majority element, but I will leave the implementation of the application as an exercise for the reader :). However, you can compare your results with mine by looking at the picture below.

[Screenshot: sample output of the MajorityElement application]