## Finding Adjacent Prime Numbers using Linq

Suppose you have a range of numbers defined by a lower bound L and an upper bound U, the problem is to find 2 pairs of prime numbers, one of which will be the closest primes and the second pair will have the farthest distance in between. For example, given a range of integers from 1 to 20, the two most adjacent primes are 2 and 3 having a distance 1 while the least adjacent primes are 7 and 11 having a distance of 4.

Given a list of integers from 1 to 10,000 can you tell the least adjacent prime numbers? (the answer is 9551 and 9587)  🙂

Writing an algorithm that find adjacent primes can be tricky but is also short and straightforward. below is a naive vb.net implementation for this algorithm. I will not explain code in detail but basically what it does is to loop over the given range from L up to square root of U and flags all prime numbers within the range. then we compute the distance using linq and another query orders by distance in asc and desc in order to select the top 1 from each query.

Sub Main()
While True

Dim PrimeNumbers As New List(Of Integer)

Console.WriteLine(“———-Prime Distance———-“)
Console.Write(“Enter L: “)
Dim lowerLimit As String = Console.ReadLine().Trim
Console.Write(“Enter U: “)
Dim upperLimit As Integer = Console.ReadLine().Trim
Dim prime(upperLimit) As Integer
Dim c1, c2, c3 As Integer
For c1 = 2 To upperLimit
prime(c1) = 0
Next
prime(0) = 1
prime(1) = 1

For c2 = 2 To Math.Sqrt(upperLimit) + 1
If prime(c2) = 0 Then
c1 = c2
c3 = 2 * c1
While c3 < upperLimit + 1
prime(c3) = 1
c3 = c3 + c1
End While
End If
Next
For c1 = 0 To upperLimit
If prime(c1) = 0 Then
End If
Next
Dim ordered = From a In PrimeNumbers Order By a Ascending Select a
Dim query = From a In ordered, b In ordered Where b > a Select P1 = a, P2 = b, Distance = Math.Abs(a – b)
query = query.GroupBy(Function(x) x.P1).Select(Function(x) x.First) ‘distinct
Dim minDistance = (From a In query Order By a.Distance Ascending Select a).Take(1).SingleOrDefault
Dim maxDistance = (From a In query Order By a.Distance Descending Select a).Take(1).SingleOrDefault
Console.WriteLine(“———————————“)
Console.WriteLine(“{0},{1} are the closest primes having distance {2}”, minDistance.P1, minDistance.P2, minDistance.Distance)
Console.WriteLine(“———————————“)
Console.WriteLine(“{0},{1} are the most distant primes having distance {2}”, maxDistance.P1, maxDistance.P2, maxDistance.Distance)
Console.WriteLine(“———————————“)
End While
End Sub

## SQL Query Optimization and Performance Guide

Contents

Windows Settings    3

Storage on 2 Physical disks – standard solution    7

Storage on 3 Physical disks – extended solution    7

Storage on 4 physical disks – optimal solution    7

Storage on external hard disk or flash memory    9

Database Auto growth    9

Index Management    9

Stored procedures    12

Cursors    12

Query optimization    13

Scheduled Maintenance Plan    13

Check disk usage by top tables    14

## Windows Settings

• Adjust performance for background services
• Configure size of windows paging file(virtual memory) to be twice the size of physical memory
• Turn off system protection feature for all disks except for C
• Disable unneeded sql services
• Configure weekly defragmentation schedule for all disks

## Storage on 2 Physical disks – standard solution

• Store log file on C
• Store data file on the second disk

## Storage on 3 Physical disks – extended solution

• Store data file on second disk
• Store log file on third disk
• Store Windows Paging File(Virtual Memory) on C

## Storage on 4 physical disks – optimal solution

• Store windows paging file on C
• Store primary data file on second disk
• Store log file on third disk
• Create Secondary data file on fourth disk to store indexes

When creating indexes, select the secondary file group

## Storage on external hard disk or flash memory

• Not allowed
• This will disable all optimizations performed by SQL Server
• This will limit read/write speed to 25Mbits/sec for external hard disk and slower for flash memory

## Database Auto growth

• Do not set to grow in %. Use static value instead(100MB is good option)

## Index Management

• Create non-clustered index only for tables that have rate of SELECT much larger than rate of INSERT and UPDATE
• Do not create indexes for all table columns
• For indexes created on base tables(for example Person, Admission, AdmissionRequestDetail), set the index fill factor to 70 instead of 80(default option). This will enhance performance for INSERT.
• Use the following procedure to analyze index fragmentation. You should “reorganize” indexes when the External Fragmentation value for the index is between 10-15 and the Internal Fragmentation value is between 60-75. Otherwise, you should rebuild indexes.

ALTER
procedure
[dbo].[CheckIndexFragmentation]
as

SELECT
object_name(dt.object_id)
Tablename,si.name

IndexName,dt.avg_fragmentation_in_percent
AS

ExternalFragmentation,dt.avg_page_space_used_in_percent
AS

InternalFragmentation

FROM

(

SELECT
object_id,index_id,avg_fragmentation_in_percent,avg_page_space_used_in_percent

FROM
sys.dm_db_index_physical_stats
(db_id(‘Clis3’),null,null,null,‘DETAILED’

)

WHERE
index_id
<> 0)
AS
dt
INNER
JOIN
sys.indexes
si
ON
si.object_id=dt.object_id

AND
si.index_id=dt.index_id AND dt.avg_fragmentation_in_percent>10

AND
dt.avg_page_space_used_in_percent<75 ORDER
BY avg_fragmentation_in_percent

DESC

GO

• Create non-clustered indexes only for columns that appear in:
• Where clause
• Order By clause
• Join clause
• Distinct
• All foreign keys
• Avoid indexing small tables
• Create indexed views for columns used by Linq query in the where clause, if the index is not created on the table but do not create the index on both.
• Do not use Year(), Month() ,Day() functions in the where clause on a column even if it has an index. Using these functions and other similar functions will disable the index and will do a full table scan. So change the query to make use of the index. Example:

select
*
from
Person
where
YEAR(BirthDate)=1986

select
*
from
Person
where
BirthDate
>=
‘1986-01-01’
and
BirthDate
<
‘1987-01-01’

• Use the index usage report to check if there are unused indexes on the table in order to delete them. If the number of seeks is 0. The index can be deleted.

## Stored procedures

• Do not name the procedure sp_something. This will cause a delay when the procedure executes.
• Use Set Nocount On at the top of the procedure to avoid additional round trip to the server
• Try to avoid using exec of dynamic sql statements

## Cursors

• Do not use a cursor to iterate over a set of records. Creating and using a cursor is very expensive and resource consuming. Instead, use a while loop with defined upper bound. This option is not available for SQL 2000. Example:

declare
@count
int

select
@count=COUNT(*)
from
SubDepartment

declare
@i
int=1

declare
@temp
table(id
int,rownumber
int)

insert
into
@temp
select
ID,ROW_NUMBER()
over(order
by
[order])
from SubDepartment

while (@i<=@count)

begin

update
SubDepartment
set
[Order]=@i
where
ID=
(select
ID
from
@temp
where
rownumber=@i)

set
@i=@i+1

end

## Query optimization

• Do not use OR in the where clause. Instead, use UNION ALL to implement OR functionality but only if there is an index on the column used in OR operator. If there is no index, this optimization will slow down performance by doing multiple table scans. If there is an index this approach will speed up the query. Example:

select
*
from
Person
where
person.Sex=0 or person.Sex=1

select
*
from
person
where
person.Sex=0 union
all

select
*
from
person
where
Person.sex=1

• Avoid using inline queries, use joins instead
• Avoid using inline function in the query. Instead, create a precomputed column with the function formula to be stored for selection in the query. May not be applicable for all cases.
• When writing queries containing NOT IN, the query will have poor performance as the optimizer need to use nested table scan to perform this activity. This can be avoided by using EXISTS or NOT EXISTS
• When you have the choice to use IN or BETWEEN clause in the query, always use BETWEEN. This will speed up the query
• When having multiple columns in where clause separated by AND operator, make sure to order the columns from least likely true to most likely true if the expected result is most likely true. Otherwise, order the columns from most likely false to least likely false. In other words, the condition that filters the least number of records must appear first in the where clause.
• For queries that require immediate feedback to the user, use FAST n option to immediately return first n records while the query continues to fetch the remaining records. This option can be used in procedures and views but it does not apply for linq queries because .ToList() ignores this option. Example:

SELECT
*
FROM
person
WHERE
firstname
like
‘ali’
OPTION(FAST 100)

## Scheduled Maintenance Plan

• Use daily maintenance plan to perform daily cleanup, update statistics, check integrity, rebuild indexes, organize indexes, shrink database files and perform a full backup.
• Do not include system databases in the plan. Select the option “User databases only”
• Make sure SQL Server Agent startup option is set to Automatic
• Schedule the plan to run daily at night. 1am is a good option.
• The below plan is standard and can be used for most cases.

## Check disk usage by top tables

• Monitor disk usage by most used tables. These will be the target for potential indexes.
• Use this approach to check if the application is performing repeated unnecessary reads

## My presentation on Automatic Loop Parallelization for multicore systems

This is my ppt presentation I made recently on automatic loop parallelization. It is an explanation for a new innovation on how to perform automatic loop parallelization efficiently. Below is the download link and a summary for each slide. enjoy!

AutomaticLoopParallelization

1

2

3:

With the advance of multi-core and many core systems, developers can now make use of parallel programming and this made it essential for developing better programs.

Towards taking advantage of multicore platforms, sequential programs are being replaced by parallel programs.

Developing parallel applications is a difficult and can be prone to errors. Therefore, the necessity for automatic parallelization becomes important

4:

Loops are the most important target for parallelization.

We chose loops because they are the most important factor that affects the performance of any program.

Most programs spend a lot of their running time iterating through one or more compute intensive loops.

According to the 90/10 rule, 10% code is a loop.

In this thesis we focus on Automatic Loop Parallelization.

5:

Available loop parallelization techniques are not efficient because of having:

Long running time

Huge processors communication

Large memory usage

6.

In this thesis, we propose a new loop parallelization algorithm that:

Extracts parallelism among different loop iterations and also inside statements of the same iteration.

And Incurs less processor, memory and time overheads.

7.

For example, we can see from the two figures that parallel loops:

utilize the available architecture effectively by dirtributing the workload among the processors and take less time to execute.

8.

Our algorithm takes as an input:

A loop whose array references are unknown except at runtime.

And an access array that defines the indices B and C of the access array for every loop iteration.

The output is a parallel schedule that defines the execution order of all the statements of the loop. Statements within the same set are run concurrently, whereas statements belonging to different sets are run one after the other.

9.

In the course of loop execution, 3 types of dependencies can exist among the loop statements.

10.

11.

How to obtain a full parallel schedule of the loop?

Using dependency information, we build a list of sets containing independent items.

Independent items in each set run in parallel.

12.

Many algorithms have been developed for parallelizing loops including:

The first one is

13.

Zhu and Yew was the first approach for parallelizing loops.

It consists of 2 phases:

The registration phase where a parallel schedule is constructed.

And the execution phase where the obtained schedule is executed.

14.

The algorithm stores in memory

For every array element a key value.

And for every loop iteration a D value to indicate if the iteration is assigned to a set in the schedule.

15.

In the 1st step, The keys are initialized to infinity.

Int the 2nd step, we set the key to the iteration number where the element was first accessed.

In the final step, we check if the key is equal to the iteration number then add it to the set.

16.

We take as an example this access table.

In the first round of the algorithm,

the keys are initialized to infinity and the D values are initialized to 1.

We fill the keys of all the array elements.

after passing across all the loop iterations and comparing the keys to the iteration numbers. We add iterations 1, 3 and 7 to the first set.

We set the D values of iterations 1,3 and 7 to 0.

This process is reapted until all iterations are assigned to sets

17.

The main advantage of zhu and yew algorithm is that it is simple.

but:

Many memory accesses and runs.

It doesn’t introduce parallelization within statements of the same iteration.

18.

Midkiff and Padua algorithm is similar to zhu and yew algorithm but with one improvement in that it allows concurrent reads.

it consists of 2 phases: the registration and the execution phase.

19.

The algorithm stores for each array element 2 keys:

Integer key stores the iteration number that does the first write for the array element.

bit key stores a true value if there any reads before the first write of the array element.

The algorithm also stores a D array for identify the iterations that are done.

20.

First, integer keys are initialized to N+1

Second, we set the integer key to the iteration number where the element was 1st written.

Third, we set its bit key to true if there are any read operation before the 1st write.

Finally, we check for 4 conditions to determine if the iteration is added to the set:

21.

We take as an example this access table.

In the first round, we compute the integer and bit key values of all array elements. We then pass by all loop iterations and check if any iteration meets the 4 conditions. Iterations 1, 4 and 7 are added to the first set. We set their d values to 1 which means that they are done.

Repeating this processes will produces the final result sets.

22.

The main advantage of this algorithm is that it allows concurrent reads. However, it performs many memory accesses and rounds. Also, it doesn’t allow for intra-parallelization.

23.

Leung and Zahorjan algorithm introduced a new idea based on parallelizing the inspection phase.

The iterations are divided among the available processors and each processor constructs its own sub schedule for its assigned iterations.

Finally the sub schedules of all the processors are concatenated together to form the final schedule.

24.

For example, assume we have 15 iterations in a loop to be divided among 3 processors.

Each processor is assigned 5 iterations.

The sub schedules which are constructed by each processor are grouped together to form the final schedule where the last set of the sub schedule of processor 1 is followed by the first set of the sub schedule constructed by processor 2 and the last set of the sub schedule of processor 2 is followed by the first set of the sub schedule constructed by processor 3.

25.

Is faster, but

it decrease the degree of parallelism in cases where multiple iterations that are independent are assigned to different processors, this would result in having the iterations placed in different sets in the final parallel schedule instead of being placed together in the same set to be run concurrently.

26.

The chen, yew, and torellas algorithm builds a ticket table for every dependence chain.

The ticket table stores the sequence numbers for accessing the array elements in the correct order.

It cosists of the inspection and the execution phase.

27.

Assume we have this dependence chain for a specific array element.

The iterations are divided among the available processors, and each processor builds its own ticket table.

The first processor can mark its iterations correctly because it is the first processor. Processor 1 can easily set  0, 1 and 2 respectively

The following processors can’t mark their iterations correctly because they don’t know their order in the preceding processors. So we set a ‘?’ to the heads of each processor.

Set the key value of the array elements in the dependence chain following the head to the number of all previous elements.

Processors then communicate globally to send the last value of the dependence chain that this processor contains. Processor1 sends a 2 to processor2. Processor 2 sends a ? to processor 3.

28.

For each iteration, the element gets added to the set if its key is equal to the ticket value otherwise increment the key by 1.

29.

The main advantages of this algorithm is that it allows for intra-parallelization and it performs the inspection phase in parallel. However, it incurs huge communication overhead among the processors to maintain a correct ticket table. Also it doesn’t allow for concurrent reads.

30.

The xu and chaudhary algorithm is a timestamp algorithm for parallelizing loops. It allows for concurrent reads.

This algorithm assigns a 2 dimensional value A(I,j) for every array element where I denotes the sequence number of the array element and j denotes the read group size in case of a read operation.

31.

A(i,j) denotes the array element of the current iteration and A(m,n) refers to its direct predecessor.

L-closure and R-closure are used to indicate the beginning and the end of a read group.

The u value denotes an undefined value.

r denotes the processor number of the current iteration.

g(r) represents the total number of elements contained in all the previous processors.

h(r) denotes the number of references in the same block r.

32.

If the current operation is the head of the dependence chain the we set its time stamp value to 1 if it is the first processor or u otherwise.

The following timestamp values are assigned based on the type of the current operation and its previous operation. If we have a read preceded by a write or a read preceded by a read, or a write preceded by a read, or a write preceded by a write.

33.

If we have this access table.

We take as an example the dependence chains of array elements 3 and 10.

After applying the timestamp rules we obtain this table.

34.

In the second step of the algorithm we modify the timestamp values according to these rules and depending whether this operation is a read or write.

35.

After applying the rules of the second step, we obtain the following table where we notice that every read operation has its j value equal to the read group size.

36.

In the last step of the algorithm and for each dependence chain we assign the time values based on this rules. If we have a write then time is incremented by 1. If we have a read then time is incremented by 1/readgroupsize.

37.

After applying the rules of the last step we obtain the times shown here. The decimal values are then rounded to the upper bound so that consecutive reads would have the same time value and thus would be run in parallel.

38.

The main advantages of this algorithm is that it allows for concurrent reads. It performs the inspection phase in parallel. And it allows intra-parallelization. However, it incurs the largest cpu, memory and time overheads.

39.

Kao, ayng and tsend algorithm is a technique for parallelizing loops by building a table containing the iteration number where a given element is read and written for each iteration.

For every loop iteration:

The array element A(B(i)) is written at this iteration: W(A(B(i)))=i

The array element A(C(i)) is read at this iteration: R(A(C(i)))=i

WF(i)= WF(Max[W(A(B(i-1))), R(A(B(i-1))), and W(A(C(i-1)))]) + 1.

40.

We take this access table as an example.

In iteration 1, A(3) is written and A(1) is read so we store 1 for w(3) and r(1). The wavefront is initialized to 1.

In iteration 2, all the values are copied and we set w(7) and r(3) to 2. The maximum of w(7),r(7) and w(3) is 1. So we set wf(2)=wf(1)+1=2.

We follow the same steps until obtaining the final parallel schedule.

41.

This algorithm allows concurrent reads, but it incurs memory overhead and doesn’t allow intra-parallelization.

42.

We propose a new approach for parallelizing loops based on extracting independent sets.

Similar to the algorithms in the related work section, our proposed approach consists of 2 phases:

The inspector phase where a parallel schedule is constructed.

The executer phase where the parallel schedule is executed.

43.

The key idea behind our algorithm is that if statement S depends on previous statements, then S is not allowed to proceed until the previous statements are executed.

And to make sure that S is executed after the statements that it depends on:

S is assigned a larger sequence number.

44.

The algorithm takes as an input a loop that contains an array whose references are indexed data that are determined at runtime. The access array defines the accessed array elements for every loop iteration.

The algorithm produces as an output a parallel schedule that defines the execution order of all the loop statements for all the iterations.

45.

The inspection phase consists of 2 steps:

Sequence Numbering: every statement is assigned the sequence number of the statement it depends on incremented by 1, except in the case when there are consecutive reads then both are assigned the same sequence number on order to allow for concurrent reads.

Dependency Storage: the index of the current statement and the type of the operation are stored in a dependency table for the used array element in order to keep track of dependencies.

46.

In step 1 of the inspection phase we use a sequence array for ordering the execution of the loop statements.

Assuming we have a loop of size N and the loop body contains 2 statements (R & W), we create a sequence array having a size of 2N.

47.

If a statement S1 is dependent on another statement S2 then S1 is not allowed to execute before S2 terminates.

S1 is assigned a sequence number greater than S2 by 1.

For example, if statement of index 9 is dependent on statement of index 2 then S(9)=S(2) + 1

48.

In step 2 of the inspection phase a dependency array is used to identify the dependent statements of a given loop.

Assuming we have an input array of size M, we create a new array for storing dependencies having a size of 2M.

I stores the index of the last statement that used the array element.

T is used to store the type of the last operation that used the array element if it is a Read or Write.

49.

The algorithm of the inspection phase proceeds as follows:

We loop through all the elements of the access table AT where each element is identified by its index k.

50.

We take as an example this access table.

The second table shows the index of each cell of the access table.

For simplicity we take the dependence chain of array element 10.

51.

The first element of the dependence chain of A(10) whose index is 4 has no previous dependencies so it is assigned a sequence number of 1 and the current index and operation type are stored in the I and T values of the dependency.

Each subsequent statement is assigned the sequence number of its previous statement incremented by 1 except in the case where of consecutive reads.

S(9)=S(4) +1

S(12)=S(9)+1

S(27)=S(20)+1

S(29)=S(27)+1

52.

The final sequence number is shown and the statements are distributed among the sets in an increasing order of sequence number. Statements having the same sequence number are placed in the same set.

53.

In the execution phase of the algorithm, items of the same set are run concurrently, and when one set  finishes its execution, it allows the next set to execute. This is done until all of the sets get executed.

54.

We implemented all of the algorithms discussed in the related work section using vb.net.

We used 3 loops found in the Perfect Club Benchmark for testing.

The experiments were intended to show the difference in performance between proposed approach and the state of the art algorithms in terms of:

Time taken to build the parallel schedule

CPU Consumption

Memory Usage

55.

The prefect club benchmark is an acronym for PERFormance Evaluation by Cost-effective transformations.

It was created at the Center for Supercomputing Research and Development (CSRD) at the university of Illinois at urbana champaign  in order to focus the performance evaluation at the applications level.

56.

The three Loops were used by the other algorithms for testing.

The first loop is a nested loop of size 250 by 250 and its array size is 1000.

The second and third loops are single loops of size 8000 and of array size 24000.

57.

Implementation of all the algorithms was executed on the same PC having intel processor core i7 of speed 2.3 GHz, a memory of 8GB and a 64 bit operating system.

58.

We implemented our approach using the vb.net language. We added 3 classes:

AccessList class for storing the access table, LastOperationType class for storing the dependency table, and parallellist class that corresponds to each set of the final schedule containing the independent items.

59.

We filled the access table at runtime.

60.

The build schedule algorithm is shown here.

61.

We measure the time, cpu, memory and thread consumption for building the parallel schedule by our proposed algorithm and by each algorithm discussed in the related work part.

62.

The time plot graph shows that our algorithm takes the least time for building the schedule.

63.

The cpu plot graph shows that our algorithm, midkiff and padua, and zhu and yew algorithms incur the least cpu overheads.

64.

The memory plot graph shows that our algorithm along with the chen yew torellas algorithm incur the least memory overheads.

65.

Threads are best utilized by our proposed algorithm because the number of hardware threads is equal to the number of software threads so there is no context switching.

66.

From comparison with other algorithms, our proposed algorithm is the fastest in building the parallel schedule. It incurs the least cpu and memory overheads. And it it is the best in utilizing the threads.

67.

The limitation of our proposed approach is that in the loop body we used only one array similar to the algorithms discussed in the related work part that use only one array too in the loop body.

In case of having many arrays that show cross dependencies among each other we need to add more logic to the algorithm.

68.

In conclusion

We described a tool that performs automatic loop parallelization.

We evaluated the effectiveness of our tool by testing it against 6 algorithms and using 3 loops obtained from perfect club benchmark.

In comparing to state of the art loop parallelization techniques, the proposed algorithm achieved a 980% (9.8X) speedup and a 10% reduction in memory usage.

69.

The proposed algorithm has four main advantages:

More parallelism by allowing the statements inside an iteration to be executed in parallel.

Less time overhead for finding independent sets.

70.

As part of our future work we intend to:

Conduct an empirical evaluation of our vb.net implementation against more algorithms.

Use arrays that exhibit cross dependencies.

Use loops from other benchmarks.

Provide implementation of the proposed approach in other languages.

## Practical Data Mining

 American university of beirut Advanced Tutorial with Visual Studio & SQL Server 2008R2 Ali Tarhini 04/19/2011

Contents    i

## Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft Sql server 2008. The data mining algorithms and tools in Sql server 2008 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components in Sql server 2008 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction. For more information on choosing between the two environments, see “Choosing Between SQL Server Management Studio and Business Intelligence Development Studio” in SQL Server Books Online.

All of the data mining tools exist in the data mining editor. Using the editor you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see “Viewing a Data Mining Model” in SQL Server Books Online.

Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool you can compare the predictive accuracy of your models and determine the best model.

To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see “Data Mining Extensions (DMX) Reference” in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.

ou use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you pass it, and it translates them into a mining model — it is the engine behind the process. Sql server 2008 includes nine algorithms:

• Microsoft Decision Trees
• Microsoft Clustering
• Microsoft Naïve Bayes
• Microsoft Sequence Clustering
• Microsoft Time Series
• Microsoft Association
• Microsoft Neural Network
• Microsoft Linear Regression
• Microsoft Logistic Regression

Using a combination of these nine algorithms, you can create solutions to common business problems. These algorithms are described in more detail later in this tutorial.

Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. Sql server 2008 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see “DTS Data Mining Tasks and Transformations” in SQL Server Books Online.

In order to demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with Sql server 2008, and it supports OLAP and data mining functionality. In order to make the sample database available, you need to select the sample database at the installation time in the “Advanced” dialog for component selection.

If you are new to data mining, download “Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis Services” (msdn.microsoft.com/library/default.asp?url=/servers/books/sqlserver/mining.asp).

AdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington with 500 employees, and several regional sales teams are located throughout their market base.

Adventure Works sells products wholesale to specialty shops and to individuals through the Internet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.

Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

• North America (83%)
• Europe (12%)
• Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004.

The products in the database are broken down by subcategory, model, and product.

Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE environment in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.

Working in an IDE is beneficial for the following reasons:

• You have powerful customization tools available to configure Business Intelligence Development Studio to suit your needs.
• You can integrate your Analysis Services project with a variety of other business intelligence projects encapsulating your entire solution into a single view.
• Full source control integration enables your entire team to collaborate in creating a complete business intelligence solution.

The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.

If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected the server. In this way, changes are reflected directly on the server without having to deploy the solution.

SQL Server Management Studio

SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you are working in a connected environment where actions are propagated to the server as soon as you save your work.

After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed within Business Intelligence Development Studio. Using the Business Intelligence Development Studio tools, you develop and test the data mining solution, using an iterative process to determine which models work best for a given situation. When the developer is satisfied with the solution, it is deployed to an Analysis Services server. From this point, the focus shifts from development to maintenance and use, and thus SQL Server Management Studio. Using SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing, and creating predictions from mining models.

Data Transformation Services

Data Transformation Services (DTS) comprises the Extract, Transform, and Load (ETL) tools in Sql server 2008. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data before using the data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.

DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy the packages to a server and run them on a regularly scheduled basis. This is useful if, for example, you collect data weekly data and want to perform the same cleaning transformations each time in an automated fashion.

You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution, by adding each project to a solution in Business Intelligence Development Studio.

Data mining algorithms are the foundation from which mining models are created. The variety of algorithms included in Sql server 2008 allows you to perform many types of analysis. For more specific information about the algorithms and how they can be adjusted using parameters, see “Data Mining Algorithms” in SQL Server Books Online.

### Microsoft Decision Trees

The Microsoft Decision Trees algorithm supports both classification and regression and it works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.

In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then it uses the input attributes with the strongest relationship to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen to cause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes create a split that provides an improved prediction over the existing node. The model seeks to find a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, therefore allowing you to predict the outcome of the predicted attribute.

### Microsoft Clustering

The Microsoft Clustering algorithm uses iterative techniques to group records from a dataset into clusters containing similar characteristics. Using these clusters, you can explore the data, learning more about the relationships that exist, which may not be easy to derive logically through casual observation. Additionally, you can create predictions from the clustering model created by the algorithm. For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the country. Observing how these clusters are distributed, you can better understand how the records in a dataset interact, as well as how that interaction affects the outcome of a predicted attribute.

### rosoft Naïve Bayes

The Microsoft Naïve Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. The Microsoft Naïve Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how various input attributes are distributed in the different states of the predicted attribute.

### Microsoft Time Series

The Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.

Using the algorithm, you can choose one or more variables to predict, but they must be continuous. You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a length of several months or years.

A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.

### Microsoft Association

The Microsoft Association algorithm is specifically designed for use in market basket analyses. The algorithm considers each attribute/value pair (such as product/bicycle) as an item. An itemset is a combination of items in a single transaction. The algorithm scans through the dataset trying to find itemsets that tend to appear in many transactions. The SUPPORT parameter defines how many transactions the itemset must appear in before it is considered significant. For example, a frequent itemset may contain {Gender=”Male”, Marital Status = “Married”, Age=”30-35″}. Each itemset has a size, which is number of items it contains. In this case, the size is 3.

Often association models work against datasets containing nested tables, such as a customer list followed by a nested purchases table. If a nested table exists in the dataset, each nested key (such as a product in the purchases table) is considered an item.

that C is predicted by A and B. The probability threshold is a parameter that determines the minimum probability before a rule can be considered. The probability is also called “confidence” in data mining literature.

Association models are also useful for cross sell or collaborative filtering. For example, you can use an association model to predict items a user may want to purchase based on other items in their basket.

### Microsoft Sequence Clustering

The Microsoft Sequence Clustering algorithm analyzes sequence-oriented data that contains discrete-valued series. Usually the sequence attribute in the series holds a set of events with a specific order (such as a click path). By analyzing the transition between states of the sequence, the algorithm can predict future states in related sequences.

The Microsoft Sequence Clustering algorithm is a hybrid of sequence and clustering algorithms. The algorithm groups multiple cases with sequence attributes into segments based on similarities of these sequences. A typical usage scenario for this algorithm is Web customer analysis for a portal site. A portal Web site has a set of affiliated domains such as News, Weather, Money, Mail, and Sport. Each Web customer is associated with a sequence of Web clicks on these domains. The Microsoft Sequence Clustering algorithm can group these Web customers into more-or-less homogenous groups based on their navigations patterns. These groups can then be visualized, providing a detailed understanding of how customers are using the site.

### Microsoft Neural Network

In Microsoft Sql server 2008 Analysis Services, the Microsoft Neural Network algorithm creates classification and regression mining models by constructing a multilayer perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm provider, given each state of the predictable attribute, the algorithm calculates probabilities for each possible state of the input attribute. The algorithm provider processes the entire set of cases , iteratively comparing the predicted classification of the cases with the known actual classification of the cases. The errors from the initial classification of the first iteration of the entire set of cases is fed back into the network, and used to modify the network’s performance for the next iteration, and so on. You can later use these probabilities to predict an outcome of the predicted attribute, based on the input attributes. One of the primary differences between this algorithm and the Microsoft Decision Trees algorithm, however, is that its learning process is to optimize network parameters toward minimizing the error while the Microsoft Decision Trees algorithm splits rules in order to maximize information gain. The algorithm supports the prediction of both discrete and continuous attributes.

### Microsoft Linear Regression

The Microsoft Linear Regression algorithm is a particular configuration of the Microsoft Decision Trees algorithm, obtained by disabling splits (the whole regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes.

### Microsoft Logistic Regression

The Microsoft Logistic Regression algorithm is a particular configuration of the Microsoft Neural Network algorithm, obtained by eliminating the hidden layer. The algorithm supports the prediction of both discrete and continuous attributes.

Throughout this tutorial you will work in Business Intelligence Development Studio (as depicted in Figure 1). For more information about working in Business Intelligence Development Studio, see “Using SQL Server Management Studio” in SQL Server Books Online.

The tutorial is broken up into three sections: Preparing the SQL Server Database, Preparing the Analysis Services Database, and Building and Working with the Mining Models.

### Preparing the SQL Server Database

The AdventureWorksDW database, which is the basis for this tutorial, is installed with SQL Server (not by default, but as an option at installation time) and already contains views that will be used to create the mining models. If it was not installed at the installation time, you can add it by selecting Change button from Control Panel à Add/Remove Programs à Microsoft Sql server 2008. Look for AdventureWorksDW Sample Data Warehouse under Books online and Samples of Workstation Components.

### Preparing the Analysis Services Database

Before you begin to create and work with mining models, you must perform the following tasks:

1. Create a new Analysis Services project
2. Create a data source.
3. Create a data source view.

#### Creating an Analysis Services Project

Each Analysis Services project defines the schema for the objects in a single Analysis Services database. The Analysis Services database is defined by the mining models, OLAP cubes, and supplemental objects that it contains. For more information about Analysis Services projects, see “Creating an Analysis Services Project in Business Intelligence Development Studio” in SQL Server Books Online.

To create an Analysis Services project

1. Open Business Intelligence Development Studio.
2. Select New and Project from the File menu.
3. Select Analysis Services Project as the type for the new project and name it AdventureWorks.
4. Click Ok.

The new project opens in Business Intelligence Development Studio.

#### Creating a Data Source

A data source is a data connection that is saved and managed within your project and deployed to your Analysis Services database. It contains the server name and database where your source data resides, as well as any other required connection properties.

To create a data source

1. Right-click the Data Source project item in Solution Explorer and select New Data Source.
2. On the Welcome page, click Next.
4. The Connection Manager dialog box opens. In the Server name drop-down box, select the server where AdventureWorksDW is hosted (for example, localhost), enter your credentials, and then in the Select the database on the server drop-down box select the AdventureWorksDW database.
5. Click OK to close the Connection Manager dialog box.
6. Click Next.
7. By default the data source is named Adventure Works DW. Click Finish

The new data source, Adventure Works DW, appears in the Data Sources folder in Solution Explorer.

#### ating a Data Source View

A data source view provides an abstraction of the data source, enabling you to modify the structure of the data to make it more relevant to your project. Using data source views, you can select only the tables that relate to your particular project, establish relationships between tables, and add calculated columns and named views without modifying the original data source.

For more information, see “Working with Data Source Views” in SQL Server Books Online.

To create a data source view

1. In Solution Explorer, right-click Data Source
View, and then click New Data Source View.
2. On the Welcome page, click Next.
3. The Adventure Works DW data source you created in the last step is selected by default in the Relational data sources window. Click Next.
4. If you want to create a new data source, click New Data Source to launch the Data Source Wizard.
5. Select the tables in the following list and click the right arrow button to include them in the new data source view:
6. vAssocSeqLineItems
7. vAssocSeqOrders
8. vTargetMail
9. vTimeSeries
10. Click Next.
11. By default the data source view is named Adventure Works DW. Click Finish.

Data Source View Editor opens to display the Adventure Works DW data source view, as shown in Figure 2. Solution Explorer is also updated to include the new data source view. You can now modify the data source view to better work with the data.

Figure 2   Adventure Works DW data source view

Using Data Source View Editor, you can make changes to the way you see the data in the data source. For example, you can change the name of any object to something that may be more relevant to the project. The name in the original data source is not modified, but you can refer to the object by this friendly name in your project.

In order to create the market basket and sequence clustering scenarios, you need to create a new many-to-one relationship between vAssocSeqOrders and vAssocSeqLineItems. Using this relationship you can make vAssocSeqLineItems a nested table of vAssocSeqOrders for creating the models.

To create a new relationship

1. In the data source view, select OrderNumber from the vAssocSeqLineItems table.
2. Drag the selected column into the vAssocSeqOrders table, and place it on the OrderNumber column.

A new many-to-one relationship exists between vAssocSeqOrders and vAssocSeqLineItems.

The data mining editor (shown in Figure 4) contains all of the tools and viewers that you will use to build and work with the mining models. For more information about the data mining editor, see “Using the Data Mining Tools” in SQL Server Books Online.

Figure 4   Data mining editor

Throughout this tutorial you will work through the following scenarios:

• Targeted mailing
• Forecasting
• Sequence clustering

In the targeted mailing scenario, you will build the models, compare their predictive capabilities using the Mining Accuracy Chart view, and create predictions using Prediction Query Builder. In the other scenarios, you will build and explore the models.

The marketing department of Adventure Works is interested in increasing sales by targeting specific customers for a mailing campaign. By investigating the attributes of known customers, they want to discover some kind of pattern that can be applied to potential customers, which can then be used to predict who is more likely to purchase a product from Adventure Works.

Additionally, the marketing department wants to find any logical groupings of customers already in their database. For example, a grouping may contain customers with similar buying patterns and demographics.

Adventure Works contains a list of past customers and a list of potential customers.

Upon completion of this task, the marketing department will have the following:

• A set of mining models that will be able to suggest the most likely customers from a list of potential customers
• A clustering of their current customers

In order to complete the scenario, you will use the Microsoft Naïve Bayes, Microsoft Decision Trees, and Microsoft Clustering algorithms. The scenario consists of five tasks:

• Create the mining model structure.
• Create the mining models.
• Explore the mining models.
• Test the accuracy of the mining models.
• Create predictions from the mining models.

The first step is to use the Mining Model Wizard to create a new mining structure. The Mining Model Wizard also creates an initial mining model based on the Microsoft Decision Trees algorithm.

To create the targeted mailing mining structure

1. In Solution Explorer, right-click Mining Structures, and then click New Mining Structure.

The Mining Model Wizard opens.

2. On the Welcome page, click Next.
3. Click From existing relational database or data warehouse, and then click Next.
4. Under Which data mining technique do you want to use?, click Microsoft Decision Trees.

You will create several models based on this initial structure, but the initial model is based on the Microsoft Decision Trees algorithm.

5. Click Next.

By default the Adventure Works DW is selected in the Select Data Source View window. You may click Browse to view the tables in the data source view inside of the wizard.

6. Click Next.
7. Select the Case check box next to the vTargetMail table, and then click Next.
8. Select the Key check box next to the CustomerKey column.

If the source table from the data source view indicates a key, the Mining Model Wizard automatically chooses that column as a key for the model.

9. Select the Input and Predictable check boxes next to the BikeBuyer column.

This action enables the column for prediction in new datasets. When you indicate that a column is predictable, the Suggest button is enabled. Clicking Suggest opens the Suggest Related Column dialog box, which lists the columns that are most closely related to the predictable column.

The Suggest
Related Columns dialog box orders the attributes by their correlation with the predictable attribute. Columns with a value higher than 0.05 are automatically selected to be included in the model. If you agree with the suggestion, click OK, which marks the selected columns as inputs in the wizard. If you don’t agree, you can either modify the suggestion or click Cancel.

10. Input check boxes next to the columns listed in the following table.

## Motion Mining & Detection in Video Streams

Motion Mining & Detection in Video Streams

Ali Tarhini, Fatima Hamdan, Hazem Hajj

Department of Electrical and Computer Engineering

American University of Beirut

Beirut, Lebanon

{fsh08, hh63}@aub.edu.lb

Abstract–Motion detection is the process of finding and extracting motion in continuous media. There are many approaches for motion detection in continuous video streams. All of them are based on comparing the current video frame with one from the previous frames or with something known as the “background”. This method is useful in video compression when estimating changes in the frame is the only requirement, not the whole frame. Although this algorithm is simple, it suffers from disadvantages. If the object is moving smoothly, only small changes from frame to frame is detected. Another problem is when the object is moving slowly, the algorithms will not give any useful result at all. Another approach is to compares the current frame to the first frame in the video sequence rather than comparing with the previous one. But, this approach fails in practice because it will always have motion detected on the same region since the initial frame is static. In this paper, we propose a new method which overcomes the problems of previous methods by creating a moving frame over a time interval and comparing against this frame. Our tests show that the proposed algorithm outperforms the previous ones with speed and accuracy.

Keywords: Motion Mining, Video Stream, Detection.

INTRODUCTION

Motion detection is the process of finding and extracting motion in continuous media. There are many approaches for motion detection in continuous video streams. All of them are based on comparing the current video frame with one from the previous frames or with something known as the “background”. One of the most common approaches is to compare the current frame with the previous one. This method is useful in video compression when estimating changes in the frame is the only requirement, not the whole frame. Although this algorithm is straightforward in motion mining, it suffers from disadvantages. If the object is moving smoothly, only small changes from frame to frame is detected. So it’s impossible to get the whole moving object. Another problem is when the object is moving slowly, the algorithms will not give any useful result at all. There is another approach that compares the current frame the first frame in the video sequence rather than comparing with the previous one. So, if there were no objects in the initial frame, comparison of the current frame with the first one will give us the whole moving object independently of its motion speed. But, this approach fails in practice because if there was, for example, a car on the first frame, but then it is gone then this method will always have motion detected on the same place where the car was. Although one workaround over this problem is to renew the initial frame at regular intervals, but still it will not yield good results in the cases where the probability of guaranteeing that the first frame will contain only static background. But there can be an inverse situation. For example, if we add a picture on the wall in the room we will get motion detected continuously until the initial frame is renewed.

The most efficient motion mining algorithms are based on building a “background”, also known as the “scene” and comparing each current frame with that scene. There are many approaches to build the scene, but most of them are too complex and require significant computation time which consumes excessive processing power and introduce latency which is not accepted in real time systems. In this paper, we introduce a new approach for motion mining using the scene building method while improving performance over the current algorithms.

BACKGROUND & PREVIOUS WORK

One of the widely known algorithms in motion detection is the Mask Motion Object Detection(MMOD) algorithm. This algorithm works in steps that involve background subtraction, frame difference, frame difference mask and background difference mask are generated and utilized to detect moving objects. First, a threshold value is calculated under the assumption that camera noise obeys Gaussian distribution and background change is caused mainly by camera noise. The frame difference is found by subtracting the current frame from the previous frame. The background difference is found by subtracting the current frame from the background frame. Then the frame difference mask and the background difference mask are found by comparing the obtained frame difference and background difference with the threshold. N(x, y, t) denotes the total frame number that pixel (x,y) may belong to background region. If N(x, y, t) >= fthreshold which shows that the probability which pixel belong to background region is high, then the background difference mask is used to detect moving objects otherwise the frame difference mask is used. To detect moving objects, the current frame is anded with the chosen mask.

Motion Region Estimation Technique(MRET) is another algorithm which also uses image subtraction for motion detection. It starts by processing the output image obtained from image subtraction after thresholding and noise removal. Thresholding determine the areas of output image (from subtraction) consists of pixels whose values lie within the threshold value range. Threshold value also indicates the sensitivity of motion to
detect. After thresholding, the image may still contain a small amount of noise. Noise is removed by using median filtering method. Median filtering is more effective than convolution when the goal is to simultaneously reduce noise and preserve edges. The gray level of each pixel is replaced by the median of the gray levels in a neighborhood of that pixel. Motion region information can be obtained by using AND and OR double difference image IO during different time t and frame n from video frames.

PROPOSED METHOD

Given a background frame representing the initial frame, a current frame representing the latest frame captured from the video source. As a preprocessing approach, both the background frame and the current frame are converted to grayscale. These grayscale images will be used throughout the rest of the algorithm. The algorithm then proceeds to compare the current frame with the background frame. But this would give the results we discussed earlier that suffer from many problems. The tweak here is to “move” the background frame in time at regular intervals to approach the current frame, but not reach it because if the two frames match, the difference becomes zero and therefore motion will not be determined. In order to approach the current frame from the background frame, the color value of the pixels in the background frame is changed by one level per frame.

The algorithm takes two parameters as input, the background frame and the current frame then it works as follows:

1. Create a grayscale version of the background frame
2. Create a grayscale version of the current frame
3. Subtract the current frame from the background frame to yield the difference frame
4. Render the difference on screen
5. Check whether X frame count have passed(X=threshold)
6. If step5 evaluates to true then merge background frame with the current frame. Otherwise go back to step 3 and repeat.

EXPERIMENTS& RESULTS

The goal of this experiment was to study the performance of the proposed algorithm. A camera is setup in a room and its video streams is fed into 3 motion detection algorithms. The first one uses the background for comparison, the 2nd one used the previous frame for comparison and the 3rd one is our proposed algorithm that uses a moving background for comparison with the current frame.

Figure1: Comparison with background

Figure2: Comparison with previous frame

Figure3: Comparison with moving background

In figure1, the comparison with the background frame detects the entire person in the picture. This looks good for the first sight but it’s obvious that the approach is not feasible in practice because the background happened to be static, were it dynamic(i.e clock on the wall)…

In figure2, the person was walking slowly and with minimum movement. As we can see, the detected region of motion appears to be cluttered and disconnected.

In figure3, the person was also moving slowly and with minimum movement. We can see the great improvement in the detected region as all the body was detected as one object. The reason for the improvement from the algorithm used in figure2 is that the comparison was done with a combination of previous frames in order to approximate the state of the background frame a little while ago. This approach is better because if the comparison was with the previous frame, only a small difference will be detected. This shows that our algorithm outperforms the previous algorithms.

CONCLUSION

In this paper, a motion detection algorithm was implemented by applying a new approach to improve the accuracy of existing motion detection algorithms. The implementation this technique increased the efficiency of the detected region when the object is subject to slow and/or smooths movement.

## Mining Association Rules in Wireless Sensor Network

Mining Association Rules in Wireless Sensor Network

Ali Tarhini, Fatima Hamdan

Department of Electrical and Computer Engineering

American University of Beirut

Beirut, Lebanon

{aat16, fsh08}@aub.edu.lb

Abstract Wireless Sensor Networks produce large amounts of data during their lifetime operation and the data must be transferred to the base station. Transferring large scale data consumes energy because of the long transmission time. Recently, frequent pattern mining received attention in extracting knowledge from WSNs data. In particular, mining for association rules on the sensor data provides useful information for different applications. In this paper, we study one of the most widely used association rules mining algorithms, the Apriori, and show the effects of implementing this algorithm in a WSN in order to predict the source of future events and to identify sets of temporally correlated sensors therefore thereby improving the Quality of Service of WSANs in various ways.

INTRODUCTION

Wireless Sensor Networks produce large amounts of data during their lifetime operation and the data must be transferred to the base station. Transferring large scale data consumes energy because of the long transmission time. Recently, frequent pattern mining received attention in extracting knowledge from WSNs data. In particular, mining for association rules on the sensor data provides useful information for applications that require predicting the source of future events or faulty nodes. For example, we are expecting to receive an event from a certain node, and it does not occur. It may also be used to identify the source of the next event in the case of the emergency preparedness class of applications. Identifying correlated sensors can also be helpful in compensating for the undesirable effects of unreliable wireless communication such as missed reading and in the resource management process such as deciding which nodes can be switched safely to a sleep mode without affecting the coverage of the network. In this paper, we study the possibility of extracting association rules between sensors using one of the most widely used data mining algorithms for frequent patterns, The Apriori algorithm. The sensors’ association rules will capture the set of sensors that report events within the same time interval. An example of such rules could be (s1s2 ⇒ s3) which can be interpreted as follows: if we received events from sensor s1 and sensor s2, then we may receive events from sensor s3 with λ units of time. Generating these rules requires a redefinition of the association rule mining problem, as the definition originally was proposed for business data, in order to fit with WSN terminologies. Also, an efficient extraction methodology is needed to extract the required data for the mining algorithms.

RELATED WORK

The mining association rules problem was first introduced in [1], where it was originally proposed in terms of transactional databases. These rules were able to predict the items that can be purchased within the same transaction. Such rules have a great impact on making decisions about which item should be put on sale or which items should be placed near to each other.

Mining association rules is a costly process and the complexity grows exponentially with the number of the items presented in the database. Many algorithms have been developed to generate association rules, each of them adopting a different optimization technique that applies either to the structure of the data or to the algorithm that traverses the search space. Among these algorithms are Apriori [1], FP-growth [4], and DHP (Directed Hashing and Pruning) [5]. A comprehensive survey of the recent algorithms for mining association rules is presented in [6]. In [7], a framework for extracting association rules from sensor networks is proposed. The association involves the sensors’ values as the main objects in the rule. In their methodology, the time is divided into intervals, and the sensors’ values at that interval formulate the transaction to be stored in the database. Each different value of a sensor is regarded as single element and it is assumed that sensors take on a finite number of discrete states. Each transaction is associated with a weight value indicating the validity of the transaction. The support of the pattern is then defined to be the total length of all non overlapping intervals in which the pattern occurs. The main differences between this methodology and the one presented in this paper, is that we assume objects in the same context are the sensors themselves, not their values, and the support of the pattern is the number of epochs in which the pattern occurs as a subset. This makes the formulation presented in this paper a more general one.

Most of the data mining techniques applied to sensor data are based on centralized data extraction, where the data is collected at a single site and then the mining technique is applied. However, there are still some techniques that use the distributed nature of sensors and try to apply distributed mining algorithms. In [8] they propose a distributed framework for building a classifier. In their framework, most of the work is pushed to the sensors themselves to build local models.

APRIORI IN WSNs

Apriori is a frequent pattern mining algorithm for discovering association rules originally developed by Rakesh Agrawal and Ramakrishnan Srikant[4]. It operates on a list of transactions containing items (for example, products in a supermarket). Frequent occurrences of items with each other are mined by Apriori to discover relationship among different items. A single transaction is called an Itemset. Apriori uses a minimum support value as the main constraint to determine whether a set of items is frequent. In the first pass of the algorithm, it constructs the candidate 1-itemsets. The algorithm then generates the frequent 1-itemsets by pruning some candidate 1-itemsets if their support values are lower than the minimum support. After the algorithm finds all the frequent 1-itemsets, it joins the frequent 1-itemsets with each other to construct the candidate 2-itemsets and prune some infrequent itemsets from the candidate 2-itemsets to create the frequent 2-itemsets. This process is repeated until no more candidate itemsets can be created.

In order to allow Apriori to be used in WSN, we will need to map the sensors to objects so that each sensor in the WSN domain will represent an item in the apriori domain and hence a transmission of multiple sensors within the same time interval T in the WSN domain will represent an item set in the Apriori domain. Formally, Let S = {s1, s2, . . . , sm} be a set of sensors in a particular sensor network. We assume that the time is divided into equal sized slots {t1, t2, . . . , tn} such that ti+1 – ti = λ, for all 1 < i < n. λ is the size of each time slot. A transaction T = {s1, s2, . . . , sk} S is called an itemset, or simply, a pattern of sensors sharing common transmission behavior.

PROPOSED METHOD

Our algorithm runs on the base station, where transmissions are received from sensors every time interval. Upon receiving of each message, the system will record the sensors participating in the message which represents an itemset. The itemset is them stored for later processing when the Apriori runs. This process of logging sensors participating in the transmission is a continuous process and it is up to the system administrator to determine the proper number of itemsets to store. Off course, the more itemsets stores, there will be more likely to detect additional patterns but this will require more memory and energy consumption. On the other hand, if only a small number of itemsets are stored, the mining accuracy of Apriori will decrease and the resultant patterns will not be necessarily useful or meaningful in the WSN.

Once the determined number of itemsets has been stored, the database is now ready for association rule mining and Apriori algorithm is triggered to start processing. The algorithm works as follows:

First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1.Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property, presented below, is used to reduce the search space. We will first describe this property, and then show an example illustrating its use.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

Note that a two-step process is followed to compute Lk from Lk-1 for k > 2; these consist of join and prune actions:

1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
2. The prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.

EXPERIMRNTS & RESULTS

Will be available in the full paper.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD Conference. Management of Data, Washington, D.C., United States May 1993.

[2] Intel Lab Data, http://berkeley.intel-research.net/labdata/.

[3] G. Mathur, P. Desnoyers, D. Ganesan, and P Shenoy. Ultra-Low Power Data Storage for Sensor Networks.In proceedinig of the Fifth IEEE/ACM Conference on Information Processing in Sensor Networks (IPSNSPOTS), Nashville TN, April 19 -21, 2006 [4] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD, 2000.

[5] J.S. Park, M. Chen, P.S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 175-186, May

1995.