Parallel Apriori algorithm for frequent pattern mining

Apriori is a frequent pattern mining algorithm for discovering association rules. It is one of the most well-known algorithms for discovering frequent patterns along with FP-Growth algorithm. However, as a result of the current advances in the area of storage of very large databases and the tremendous growth in number of transactions, sequential Apriori becomes a bottleneck because of the long running time of the algorithm. In this paper, our goal is to develop a new parallel version of Apriori that aims to reduce the overall running time of the algorithm. Although Apriori is not known to be highly parallelizable, several attempts have been made to parallelize it in various ways, either by using parallel I/O hardware to optimize database scans or by distributing the workload on multiple processors. However, many of the parallel approaches suffer from noticeable latencies in synchronizing results being collected from each individual processor after a parallel iteration terminates. Our approach focuses on trying to maximize the workload being executed in parallel and to minimize the synchronization point delays by adopting a parallel pre-computing scheme during generation of the superset. By applying our new approach, the running time of the algorithm is reduced by an order of magnitude compared to other parallel implementations of the same algorithm.

INTRODUCTION

Apriori is a frequent pattern mining algorithm for discovering association rules originally developed by Rakesh Agrawal and Ramakrishnan Srikant[4]. It operates on a list of transactions containing items (for example, products in a supermarket). Frequent occurrences of items with each other are mined by Apriori to discover relationship among different items. A single transaction is called an Itemset. Apriori uses a minimum support value as the main constraint to determine whether a set of items is frequent. In the first pass of the algorithm, it constructs the candidate 1-itemsets. The algorithm then generates the frequent 1-itemsets by pruning some candidate 1-itemsets if their support values are lower than the minimum support. After the algorithm finds all the frequent 1-itemsets, it joins the frequent 1-itemsets with each other to construct the candidate 2-itemsets and prune some infrequent itemsets from the candidate 2-itemsets to create the frequent 2-itemsets. This process is repeated until no more candidate itemsets can be created.

Apriori has a bottleneck when the number of transactions is very large since multiple database scans will be performed by apriori in every iteration. In order to improve the scalability of Apriori, several data structures and methods were constructed that can boost performance. These include:

Transaction Reduction: Atransaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration

Partitioning: A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min sup, then the minimum support count for a partition is minsup, the number of transactions in that partition. For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.

Sampling: The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy gainst efficiency. The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.

Dynamic itemset counting: A dynamic itemset counting technique [1] was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent.

However, it seems that even with the above mentioned approaches, performance of Apriori decreases substantially when large number of transactions is involved. Therefore, many parallel approaches have been conducted to increase the speed of the Apriori algorithm including:

1. Partitioning the transactions into N partitions where N is the number of processors then for each iteration computes local support then exchange results with all N-1 processors to generate the next superset.[2]

2. Use a master/slave scheme by selecting one processor to be the master and N-1 processors to be working processors. In this approach, the input transactions are not partitioned but the master processor generates unique itemsets and divides the list of itemsets equally to the N-1 processors to compute their support. Once computations of supports complete, all N-1 processors send their results to the master in order to compute the next superset and divide the work further.[3]

In this paper, a variation of the 2nd approach is studied and implemented, where an additional data structure is used to predict itemsets of the next iteration and cache them for later retrieval thus reducing database scans. The 2nd approach is selected because it provides high flexibility with the way itemsets get computed between different cores and mainly because of the similarity of the master-slave architecture between this method and the proposed one.

PROPOSED METHOD

Our approach focuses on the algorithm part of Apriori by trying to maximize the workload being executed in parallel and to minimize, as much as possible, the synchronization point delays by allowing a portion of the superset generation forming the candidates to be generated in parallel as well. Only the remaining part of the superset will be generated at the synchronization point.

The proposed algorithm is described in the following steps:

1.     Select one processor to be the master, the other N-1 processors are slaves

2.     Create a local Hash Table, at the master; we refer to it as G during the remaining part of this paper.

3.     Initialize K, the itemset size, to 1

4.     Generate the 1-itemsets normally using the master processor only

5.     At the master, divide the k-itemsets into N-1 groups and assign each group to exactly 1 of the available N-1 processors. Then for each processor:

a.      Lookup the support of each local itemset in G, if it’s not found, compute support.

b.     Prune itemsets that have their support less than the minimum support where minimum support is the threshold value to decide upon whether to drop or keep an itemset. It is also global to the entire system.

c.      Generate the candidates of the K+1 itemsets from the local items and find their support.

d.     Store the results of the K+1 itemsets (the itemset and its support) into G.

e.      Send the master processor a signal indicating end of processing.

6.      At the master processor, at the end of each arrival of a finalization signal:

a.      Generate the K+1 superset using itemsets in G.

7.     Increment K by 1.

8.     Go to step 2 and repeat until there are no more itemsets to generate.

EXPERIMRNTS & RESULTS

The goal of this experiment was to study the performance of parallelizing Apriori on CPU for large data sets. Iris database was used to train the system which is “perhaps the best known database to be found in the pattern recognition literature” [7]. The data set contains three classes of fifty instances each, where each class refers to a type of iris plant. Five attributes are present in this database which are Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute which can contain three values: “Iris-setosa”, “Iris-virginica” and “Iris-versicolour”.  These datasets were preprocessed before running the algorithm by building new data structures so that they can fit in memory. However, given the small number of records in the Iris database, the experiment would not reflect solid results, thus all the 50 records were cloned and randomly appended 1000 times on to a new larger Iris database of 50,000 total records. The experiment was implemented on an Intel 8-core machine and the obtained results were recorded were compared to performance statistics of state-of-the art Apriori. By applying the above mentioned parallel procedures, the CPU program was able to compete with the state-of-the art Apriori. Results showed that the proposed method is as fast as state-of-the art method but is better regarding BUS utilization where 75% bus utilization was incurred by state-of-the art apriori and 34% bus utilization was incurred by our proposed method. This greatly reduced total power consumption of the algorithm.

CONCLUSION

In this paper parallel Apriori algorithm was implemented by applying a new approach using intermediate data structure for computing and caching k+1 superset for use in future iterations to facilitate computation of the superset in parallel and increase degree of parallelism. The implementation of the parallel technique showed competing results with state-of-the art apriori in speed but outperformed it in bus utilization by 40% the running time of the algorithm which would make the algorithm efficient for energy consumption than the current best Apriori.