Data Mining in SQL Server 2008 & Visual Studio


Creating a Project in the Business Intelligence Development Studio

Follow these steps to create a new project. To start BIDS, click the Start button and go to All Programs->Microsoft SQL Server 2008->SQL Server Business Intelligence Development Studio. In BIDS, select File->New->Project. In the New Project dialog, select the Business Intelligence Projects template group and click the Analysis Services Project template. Type “AnalysisServices2008Tutorial” as the project name, select the directory in which you want to create this project, and click OK to create the project.

The Solution Explorer Pane

The Solution Explorer contains the following:

1) Data source objects: They contain details of a connection to a data source, which include server name, catalog or database name, and login credentials. You establish connections to relational servers by creating a data source for each one.

2) Data Source Views: When working with a large operational data store you don’t always want to see all the tables in the database. With Data Source Views (DSVs), you can limit the number of visible tables by including only the tables that are relevant to your analysis.

3) Cubes: A collection of measure groups (built from the fact tables) and a collection of dimensions form a cube. Each measure group is composed of a set of measures. Despite the name, cubes are not necessarily three-dimensional objects; they can have many more than three dimensions.

4) Dimensions: They are the set of tables that are used for building the cube. Attributes that are needed for the analysis task are selected from each table.

5) Mining Structures: Data mining is the process of analyzing raw data using algorithms that help discover interesting patterns not typically found by ad hoc analysis. Mining Structures are objects that hold information about a data set. A collection of mining models forms a mining structure. Each mining model is built using a specific data mining algorithm and can be used for analyzing patterns in existing data or predicting new data values.


The Properties Pane


If you click an object in the Solution Explorer, the properties for that object appear in the Properties pane. Items that cannot be edited are grayed out. If you click a particular property, the description of that property appears in the Description pane at the bottom of the Properties pane.




Data Mining in SQL Server 2008


The data mining process can be regarded as a series of steps, which include the following:

1) Creating a Data Source:

Cubes and dimensions of an Analysis Services database must retrieve their data values from tables in a relational data store. This data store, typically part of a data warehouse, must be defined as a data source.

To create a data source, follow these steps:

a) Select the Data Sources folder in the Solution Explorer.

b) Right-click the Data Sources folder and click New Data Source. This launches the Data Source Wizard.

c) In the Data Source Wizard you provide the connection information for the relational data source that contains the “Adventure Works DW 2008” database. Click the New button under Data Connection Properties to specify the connection details: the server name, the database name, and one of the two authentication modes, either SQL Server authentication or Windows authentication.

d) In the Impersonation Information page you need to specify the impersonation details that Analysis Services will use to connect to the relational data source. There are four options. You can provide a domain username and password to impersonate, or select the Analysis Services instance’s service account for the connection. The option “Use the credentials of the current user” is primarily used for data mining, where you retrieve data from the relational server for prediction. If you use the Inherit option, Analysis Services uses the impersonation information specified for the database.

e) On the final page, the Data Source Wizard chooses the relational database name you have selected as the name for the data source object you are creating. You can choose the default name specified or specify a new name here.

2) Creating a Data Source View (DSV)

The Adventure Works DW database contains 25 tables. The cube you build in this chapter uses 10 tables. Data Source Views give you a logical view of the tables that will be used within your OLAP database.

To create a Data Source View, follow these steps:

a) Select the Data Source Views folder in the Solution Explorer.

b) Right-click Data Source Views and select New Data Source View. This launches the Data Source View Wizard.

c) In the Data Source View Wizard you can select the tables and views that are needed for the Analysis Services database you are creating. Click the > button to move the tables to the Included Objects list. Include the following set of tables in the data source view: FactInternetSales, FactResellerSales, DimProduct, DimReseller, DimPromotion, DimCurrency, DimEmployee, DimSalesTerritory, DimTime, DimCustomer, and DimGeography.

d) On the final page of the DSV Wizard you can specify your own name for the DSV object or use the default name. Specify “Adventure Works DW” as the DSV name and click Finish.

If you open the data source view from the Solution Explorer, the Data Source View editor opens. It contains three main areas: the Diagram Organizer, the Tables pane, and the Diagram pane. The Diagram pane shows a diagram of all the added tables and the relationships among them, and the Tables pane lists all the tables contained in this data source view. In the Diagram Organizer you can right-click the pane to create a new diagram and drag and drop the tables you wish to add, or add any table you want, then right-click it and choose Show Related Tables to bring in all the tables related to the chosen table.

To add a new field to a table, right-click the table in the Diagram pane and choose New Named Calculation. In the dialog that appears, enter the name of the new field and the expression from which it is derived. For example, to add a new field named FullName to the employee table, you write the following expression: FirstName + ‘ ‘ + MiddleName + ‘ ‘ + LastName.

There are different layouts in the data source view. You can switch between rectangular layout and diagonal layout in the DSV by right-clicking in the DSV Designer and selecting the layout type of your choice.

To see a sample of the data specified by your DSV, right-click a table in the DSV Designer and select Explore Data. The data presented is only a subset of the underlying table data. By default the first 5,000 rows are retrieved and shown within this window. You can change the number of rows retrieved by clicking the Sampling Options button. Clicking the Sampling Options button launches the Data Exploration Options dialog, where you can change the sampling method, sample count, and number of states per chart, which is used for displaying data in the chart format.

When you click the Pivot Table tab you get an additional window called PivotTable Field List that shows all the columns of the table. You can drag and drop these columns inside the pivot table in the row, column, details, or filter areas. The values in the row and column provide you with an intersection point for which the detailed data is shown.

3) Creating New Dimensions

Dimensions help you define the structure of your cube so as to facilitate effective data analysis. Specifically, dimensions provide you with the capability of slicing data within a cube, and these dimensions can be built from one or more dimension tables.

a) Create the DimGeography dimension:

 Launch the Dimension Wizard by right-clicking Dimensions in the Solution Explorer and selecting New Dimension.

 In the Select Creation Method screen, select the “Use an existing table” option and click Next.

 In the Specify Source Information page, you need to select the DSV for creating the dimension, select the main table from which the dimension is to be designed, specify the key columns for the dimension, and optionally specify a name column for the dimension key value. By default, the first DSV in your project is selected. Because the current project has only one DSV (the Adventure Works DW DSV), it is selected. Select the DimGeography table from the Main table drop-down list.

 Click the Next button to proceed to the next step in the Dimension Wizard.

 The Dimension Wizard now analyzes the DSV to detect any outward-facing relationships from the DimGeography table. An outward-facing relationship is a relationship between the DimGeography table and another table, such that a column in the DimGeography table is a foreign key related to another table. The Select Related Tables screen shows that the wizard detected an outward relationship between the DimGeography table and the DimSalesTerritory table. In this example you will model DimGeography as a star schema table instead of a snowflake schema. Deselect the DimSalesTerritory table and click Next.

 The Select Dimension Attributes screen of the Dimension Wizard displays the columns of the main table that have been selected for the dimension you’re creating.

 Select all the attributes of the DimGeography table, leave their Attribute Type as Regular, allow them to be browsed, and click Next.

 The final screen of the Dimension Wizard shows the attributes that will be created for the dimension based on your choices in the wizard. Click the Finish button.

Open the DimGeography dimension by double-clicking it in the Solution Explorer. In the Dimension Structure tab you can see all the table attributes that have been added to this dimension. In the Hierarchies pane, drag and drop the English Country Region Name attribute, followed by State Province Name, then City, and then Postal Code. Next, build the relationships among these attributes in the hierarchy: click the Attribute Relationships tab, then drag the Postal Code attribute onto City, which means that the postal code value determines the city. Similarly, drag City onto State Province Name, and drag State Province Name onto English Country Region Name. This builds the functional dependencies among the attributes in the hierarchy.

You also have to ensure that a city value is unique in determining the state name value: set the KeyColumns property of the City attribute to both State Province Code and City, and set its NameColumn to the City column. Similarly, set the KeyColumns of the Postal Code attribute to the Postal Code, City, and State Province Code columns, and set its NameColumn to Postal Code.

Deploy the project by right-clicking the project name and choosing Deploy. After a successful deployment, you can browse the dimension by selecting the Browser tab, where you can see all the data of the DimGeography table arranged according to their hierarchical levels.

b) Create the DimTime dimension

 Launch the Dimension Wizard by right-clicking Dimensions in the Solution Explorer and selecting New Dimension. When the welcome screen of the Dimension Wizard opens, click Next.

 In the Select Creation Method page of the wizard, select the “Use an existing table” option and click Next.

 In the Specify Source Information page, select DimTime as the main table from which the dimension is to be designed and click Next.

  In the Select Dimension Attributes page, in addition to the Date Key attribute, enable the checkboxes for the following attributes: Calendar Year, Calendar Semester, Calendar Quarter, English Month Name, and Day Number of Month.

 Set the Attribute Type for the “Calendar Year” attribute to Date Calendar Year.

 Set the Attribute Type for the “Calendar Semester” attribute to Date Calendar Half Year.

 Set the Attribute Type for the “Calendar Quarter” attribute to Date Calendar Quarter.

 Set the Attribute Type for the “English Month Name” attribute to Date Calendar Month.

 Set the Attribute Type for the “Day Number of Month” attribute to Date Calendar Day of Month.

 Create a multilevel hierarchy Calendar Date with the levels Calendar Year, Calendar Semester, Calendar Quarter, Month (renamed from English Month Name), and Day (renamed from Day Number of Month).

 Save the project and deploy it to the analysis services instance.

 Switch to the Browser pane of the DimTime dimension, where you can see that the date hierarchy is arranged according to the hierarchy that we defined above.

c) Create the DimEmployee dimension

 Launch the Dimension Wizard by right-clicking Dimensions in the Solution Explorer and selecting New Dimension. If the welcome screen of the Dimension Wizard opens, click Next.

 Make sure the “Use an existing table” option is selected and click Next.

 In the Specify Source Information page, select DimEmployee as the main table from which the dimension is to be designed and click Next.

 On the Select Related Tables screen, uncheck the DimSalesTerritory table and click Next.

 In the Select Dimension Attributes dialog, the Dimension Wizard has detected three columns of the DimEmployee table to be included as attributes. The Dimension Wizard selects columns that are either the primary key of the table, a parent-child key, or a foreign key to another table in the DSV. The attributes suggested by the wizard in this example are the key attribute Employee Key, the parent-child attribute Parent Employee Key, and Sales Territory Key, which is a foreign key column to the DimSalesTerritory table.

 Select all the columns of the DimEmployee table as attributes and click Next.

 Double-click the DimEmployee dimension in the Solution Explorer to open the Dimension Designer.

 Change the NameColumn property of the key attribute Dim Employee to the FullName column and deploy the project to your Analysis Services instance.

When you browse the parent-child hierarchy, you will see the members of the hierarchy showing the full names of the employees.

4) Creating a Cube Using the Cube Wizard

Cubes are the principal objects of an OLAP database that help in data analysis. Cubes are multidimensional structures primarily composed of dimensions and facts. The data from the fact tables stored within the cube for analysis are called measures.

To build a new cube, follow these steps:

a) Right-click the Cubes folder and select New Cube. Click Next on the introduction page to proceed.

b) In the Select Creation Method page you have the option to build a cube from existing tables, create an empty cube, or create a cube based on a template and generate new tables in the data source. Choose to build the cube from the existing tables in the Adventure Works DW data source. Click Next to proceed to the next step in the Cube Wizard.

c) The next page of the Cube Wizard is the Measure Group Tables selection page. You now must select one or more tables that will serve as fact tables for your measure groups. Click the Suggest button to have the Cube Wizard scan the DSV and automatically select candidate measure group tables. The Cube Wizard scans the DSV to detect the fact and dimension tables: any table that has an outgoing relationship is identified as a candidate fact table, whereas a table that has an incoming relationship is detected as a dimension table. Select both FactResellerSales and FactInternetSales as the fact tables, and then select the measures from these fact tables that you need to include for the analysis task.

d) In the Select Existing Dimensions page, the Cube Wizard displays a list of all existing dimensions defined in the project. Accept the selection of all the dimensions and click Next.

e) The Cube Wizard asks you to select any new dimensions to be created from existing tables in the data source that are not already used by dimensions in the project. You can deselect dimensions that are not needed for your cube on this page. This illustration uses the fact tables only as measure groups and not as dimensions, so deselect the Fact Reseller Sales and Fact Internet Sales dimensions on this page and click Next.

f) In the final page of the Cube Wizard you can specify the name of the cube to be created and review the measure groups, measures, dimensions, attributes, and hierarchies. Use the default name Adventure Works DW suggested by the Cube Wizard and click Finish.

After the cube is created, the new dimensions are created automatically, but these dimensions will have only their primary and foreign key attributes selected. You have to open each created dimension and select the attributes that you need to add from each table.

g) Press F5 to build, deploy, and process the cube. Deploying the cube means creating the cube on the server according to the structure that you have defined, while processing the cube means computing all the aggregation values for all the cells in the cube.

You can add a new calculated measure to the cube by right-clicking in the Script Organizer pane of the Calculations tab and entering the formula for the new measure.

Now that the cube has been deployed, switch the BIDS Cube Designer view to the Browser page. In the Browser page you will see three panes: a Measure Group pane, a Filter pane, and a Data pane. Suppose you want to analyze the Internet sales of products based on the promotions offered to customers and the marital status of those customers. First, drag and drop [DimPromotion].[English Promotion Type] from the Measure Group pane to the OWC rows area. Next, drag and drop [DimCustomer].[Marital Status] from the Measure Group pane to the OWC columns area. Finally, drag and drop the measure [Sales Amount] from the Fact Internet Sales measure group to the Drop Totals or Detail Fields Here area of the OWC pane.

You can also use MDX queries to query the cube. Just as SQL (Structured Query Language) is a query language used to retrieve data from relational databases, MDX (Multidimensional Expressions) is a query language used to retrieve data from multidimensional databases.

The format of an MDX query is shown below:

SELECT [<axis expression> [, <axis expression> ...]]

FROM <cube expression>

[WHERE <slicer expression>]
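For example, the browse scenario above can be expressed as an MDX query. The measure, dimension, and hierarchy names below are assumptions based on the cube built in this tutorial and may differ slightly from the names the wizards generated in your project:

```mdx
SELECT
    [Dim Customer].[Marital Status].MEMBERS ON COLUMNS,
    [DimPromotion].[English Promotion Type].MEMBERS ON ROWS
FROM [Adventure Works DW]
WHERE ([Measures].[Sales Amount])
```

This returns the Sales Amount measure at every intersection of promotion type and marital status, the same cross-tabulation shown in the OWC pivot table.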

5) Creating a Mining Structure

Analysis Services 2008 provides nine data mining algorithms that can be utilized to solve various business problems. These algorithms can be broadly classified into five categories based on the nature of the business problem they can be applied to. They are:

1) Classification

2) Regression

3) Segmentation

4) Sequence analysis

5) Association

Here the goal is to group customers that share similar characteristics, which is a segmentation (clustering) task.

To create the mining model, follow these steps:

a) Right-click the Mining Structures folder in the Solution Explorer and select New Mining Structure to launch the Data Mining Wizard, which helps you create data mining structures and models. Click the Next button.

b) Select the “From existing cube” radio button and click Next.

c) Select Microsoft Clustering and click Next.

d) Choose the Customer table as the primary table and enter the following attributes as inputs for building the clusters: Age, Yearly Income, Number of Cars Owned, Number of Children at Home, and Occupation.

You will now see the clustering mining model represented as several nodes with lines between them. By default the clustering mining model groups the customers into ten clusters; the number of clusters generated can be changed through a property of the clustering mining model. Each cluster is shown as a node in the cluster viewer. Darker shading on a node indicates that the cluster favors a specific input column. If two clusters are similar, a line connects the two nodes; the darker the line, the stronger the similarity. If you move the slider on the left of the cluster diagram from All Links toward Strongest Links, the weaker relationships between the clusters are no longer displayed. You can rename a cluster by right-clicking it and selecting Rename. You can select an input column of the mining model from the Shading Variable drop-down to see the effect of that column on the various clusters; when you choose a shading variable column, you also need to choose one of the states of the column to be used as the shading value for the clusters.

The Cluster Profiles view shows the relationship between the mining columns of the model and the clusters in a matrix format. The intersection cell of a specific column and a cluster shows a histogram bar of the various values of the column that are part of the cluster. The size of each bar reflects the number of items used to train the model.

The Cluster Characteristics tab shows the characteristics of a single cluster and how the various states of the input columns make up the cluster.

The Cluster Discrimination tab shows the characteristics of a cluster in comparison with the characteristics of its complement.

Parallel Apriori algorithm for frequent pattern mining

Apriori is a frequent pattern mining algorithm for discovering association rules. Along with the FP-Growth algorithm, it is one of the most well-known algorithms for discovering frequent patterns. However, with current advances in the storage of very large databases and the tremendous growth in the number of transactions, sequential Apriori becomes a bottleneck because of its long running time. In this paper, our goal is to develop a new parallel version of Apriori that reduces the overall running time of the algorithm. Although Apriori is not known to be highly parallelizable, several attempts have been made to parallelize it in various ways, either by using parallel I/O hardware to optimize database scans or by distributing the workload across multiple processors. However, many of the parallel approaches suffer from noticeable latencies in synchronizing the results collected from each individual processor after a parallel iteration terminates. Our approach focuses on maximizing the workload executed in parallel and minimizing the synchronization point delays by adopting a parallel pre-computing scheme during generation of the superset. By applying our new approach, the running time of the algorithm is reduced by an order of magnitude compared to other parallel implementations of the same algorithm.


Apriori is a frequent pattern mining algorithm for discovering association rules, originally developed by Rakesh Agrawal and Ramakrishnan Srikant [4]. It operates on a list of transactions containing items (for example, products in a supermarket). Apriori mines frequent co-occurrences of items to discover relationships among different items. A set of items is called an itemset. Apriori uses a minimum support value as the main constraint to determine whether an itemset is frequent. In the first pass of the algorithm, it constructs the candidate 1-itemsets. The algorithm then generates the frequent 1-itemsets by pruning those candidate 1-itemsets whose support values are lower than the minimum support. After the algorithm finds all the frequent 1-itemsets, it joins the frequent 1-itemsets with each other to construct the candidate 2-itemsets and prunes the infrequent itemsets among them to create the frequent 2-itemsets. This process is repeated until no more candidate itemsets can be created.
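The join-and-prune loop described above can be sketched in Python. This is an illustrative implementation for small in-memory data sets, not the code used by Analysis Services or by the paper's implementation:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets.

    transactions: list of sets of items
    min_support: minimum number of transactions an itemset must occur in
    """
    # Pass 1: count candidate 1-itemsets, keep the frequent ones
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        items = sorted({i for s in frequent for i in s})
        # Join step: build candidate k-itemsets from the frequent items;
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        # One scan over the transactions counts all candidate supports
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result
```

The subset test `c <= t` relies on Python's set comparison operators, and the pruning condition is exactly the anti-monotone property that makes Apriori work: no superset of an infrequent itemset can be frequent.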

Apriori has a bottleneck when the number of transactions is very large, since it performs a full database scan in every iteration. In order to improve the scalability of Apriori, several data structures and methods have been proposed that can boost performance. These include:

Transaction Reduction: A transaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration.

Partitioning: A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min_sup (expressed as a fraction of the transactions), then the minimum support count for a partition is min_sup multiplied by the number of transactions in that partition. For each partition, all frequent itemsets within the partition are found; these are referred to as local frequent itemsets. Any itemset that is frequent in D must be locally frequent in at least one partition, so the local frequent itemsets form a complete candidate set. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.
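The two-phase partitioning idea can be sketched as follows. This is an illustrative Python sketch, assuming the minimum support is given as a fraction (`min_frac`) of the transaction count; the brute-force local miner stands in for whatever in-memory algorithm a real system would run per partition:

```python
from itertools import combinations

def local_frequent(partition, min_frac):
    """Phase I helper: all itemsets frequent within one partition."""
    min_count = max(1, int(min_frac * len(partition)))
    items = sorted({i for t in partition for i in t})
    found = set()
    # Brute-force enumeration is acceptable here: each partition fits in memory
    for k in range(1, len(items) + 1):
        frequent_k = set()
        for c in combinations(items, k):
            cs = frozenset(c)
            if sum(1 for t in partition if cs <= t) >= min_count:
                frequent_k.add(cs)
        if not frequent_k:   # no frequent k-itemsets => none of size k+1 either
            break
        found |= frequent_k
    return found

def partitioned_apriori(transactions, min_frac, n_partitions):
    """Two-scan mining: the union of local frequent itemsets is a complete
    candidate set, so a single second scan yields the global answer."""
    size = -(-len(transactions) // n_partitions)  # ceiling division
    partitions = [transactions[i:i + size]
                  for i in range(0, len(transactions), size)]
    candidates = set()
    for p in partitions:                 # Phase I: one scan, partition by partition
        candidates |= local_frequent(p, min_frac)
    min_count = min_frac * len(transactions)
    return {c: n for c in candidates    # Phase II: second scan, global counts
            if (n := sum(1 for t in transactions if c <= t)) >= min_count}
```

Phase II can only remove false positives; no globally frequent itemset can be missed, since it must be locally frequent somewhere.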

Sampling: The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. In this way, we trade some degree of accuracy for efficiency. The sample size of S is chosen such that the search for frequent itemsets in S can be done in main memory, so only one scan of the transactions in S is required overall.

Dynamic itemset counting: A dynamic itemset counting technique [1] was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent.

However, it seems that even with the above-mentioned approaches, the performance of Apriori decreases substantially when a large number of transactions is involved. Therefore, many parallel approaches have been proposed to increase the speed of the Apriori algorithm, including:

1. Partition the transactions into N partitions, where N is the number of processors; in each iteration, every processor computes local supports and then exchanges results with the other N-1 processors to generate the next superset. [2]

2. Use a master/slave scheme by selecting one processor to be the master and N-1 processors to be working processors. In this approach the input transactions are not partitioned; instead, the master processor generates unique itemsets and divides the list of itemsets equally among the N-1 processors, which compute their supports. Once the support computations complete, all N-1 processors send their results to the master, which computes the next superset and divides the work further. [3]

In this paper, a variation of the second approach is studied and implemented, in which an additional data structure is used to predict itemsets of the next iteration and cache them for later retrieval, thus reducing database scans. The second approach is selected because it provides high flexibility in how itemsets are distributed among cores, and mainly because of the similarity between its master/slave architecture and that of the proposed method.


Our approach focuses on the algorithm part of Apriori by trying to maximize the workload being executed in parallel and to minimize, as much as possible, the synchronization point delays by allowing a portion of the superset generation forming the candidates to be generated in parallel as well. Only the remaining part of the superset will be generated at the synchronization point.

The proposed algorithm is described in the following steps:

1. Select one processor to be the master; the other N-1 processors are slaves.

2. Create a hash table at the master; we refer to it as G for the remainder of this paper.

3. Initialize K, the itemset size, to 1.

4. Generate the 1-itemsets normally, using the master processor only.

5. At the master, divide the K-itemsets into N-1 groups and assign each group to exactly one of the available N-1 processors. Then, at each processor:

a. Look up the support of each local itemset in G; if it is not found, compute its support.

b. Prune itemsets whose support is less than the minimum support, where the minimum support is the threshold value used to decide whether to drop or keep an itemset. It is global to the entire system.

c. Generate the candidate K+1 itemsets from the local itemsets and find their support.

d. Store the results for the K+1 itemsets (each itemset and its support) in G.

e. Send the master processor a signal indicating the end of processing.

6. At the master processor, on each arrival of a finalization signal:

a. Generate the K+1 superset using the itemsets in G.

7. Increment K by 1.

8. Go to step 5 and repeat until there are no more itemsets to generate.
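The steps above can be sketched in Python. This is a simplified single-machine illustration, with threads standing in for the N-1 slave processors: workers count supports in parallel and reuse the shared cache G, while the master generates each superset. Access to G is not synchronized here; a real implementation would lock it or merge per-worker results at the master:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def count_support(itemsets, transactions, G):
    """Worker steps 5a and 5d: look each support up in the shared cache G,
    scanning the transactions only for itemsets not cached yet."""
    out = {}
    for s in itemsets:
        if s not in G:
            G[s] = sum(1 for t in transactions if s <= t)
        out[s] = G[s]
    return out

def parallel_apriori(transactions, min_support, n_workers=4):
    G = {}  # the shared hash table G from step 2
    items = {i for t in transactions for i in t}
    # Step 4: the master generates and counts the 1-itemsets itself
    frequent = {s for s, c in count_support(
        [frozenset([i]) for i in items], transactions, G).items()
        if c >= min_support}
    result = set(frequent)
    k = 2
    while frequent:
        # Step 6a (master): generate the k-superset from frequent (k-1)-itemsets
        universe = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(universe, k)
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, k - 1))]
        # Step 5: divide the candidates into groups, one per worker
        groups = [candidates[w::n_workers] for w in range(n_workers)]
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            parts = pool.map(count_support, groups,
                             [transactions] * n_workers, [G] * n_workers)
        counts = {}
        for part in parts:  # step 6: the master collects each worker's results
            counts.update(part)
        # Step 5b: prune candidates below the global minimum support
        frequent = {s for s, c in counts.items() if c >= min_support}
        result |= frequent
        k += 1
    return result
```

The cache G is what distinguishes this scheme from plain master/slave Apriori: a support computed in one iteration is never recomputed in a later one.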


The goal of this experiment was to study the performance of parallelizing Apriori on the CPU for large data sets. The Iris database, “perhaps the best known database to be found in the pattern recognition literature” [7], was used to train the system. The data set contains three classes of fifty instances each, where each class refers to a type of iris plant. Five attributes are present in this database: Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute, which can take three values: “Iris-setosa”, “Iris-virginica”, and “Iris-versicolour”. The data were preprocessed before running the algorithm by building new data structures so that they fit in memory. However, given the small number of records in the Iris database, the experiment would not yield solid results, so the records were cloned and randomly appended 1,000 times onto a new, larger Iris database of 50,000 total records. The experiment was run on an Intel 8-core machine, and the obtained results were recorded and compared to the performance statistics of state-of-the-art Apriori. By applying the parallel procedures described above, the CPU program was able to compete with state-of-the-art Apriori. Results showed that the proposed method is as fast as the state-of-the-art method but is better in bus utilization: 75% bus utilization was incurred by state-of-the-art Apriori versus 34% by our proposed method. This greatly reduced the total power consumption of the algorithm.


In this paper a parallel Apriori algorithm was implemented using a new approach: an intermediate data structure is used to compute and cache the k+1 superset for use in future iterations, facilitating computation of the superset in parallel and increasing the degree of parallelism. The implementation of the parallel technique showed results competitive in speed with state-of-the-art Apriori, but outperformed it in bus utilization by roughly 40 percentage points, which makes the algorithm more energy-efficient than the current best Apriori.

FBI Data-Mining Program: Total Information Awareness


Total Information Awareness (TIA), a project of the Defense Department’s Defense Advanced Research Projects Agency (DARPA), aims to build a centralized database containing private transactional data on all Americans, including “records on credit-card purchases, plane flights, e-mails, websites, and housing.” In the tradition of the Bush Administration’s policy of preemptive action, TIA’s goal is to create electronic tools to “better detect, classify, and identify potential foreign terrorists…to increase the probability that authorized agencies of the United States can preempt adverse actions.” The government establishes baseline patterns identifying what they see as suspicious behavior, such as buying one-way plane tickets or drastic changes in spending habits, and then conducts pattern-based electronic searches of huge amounts of information to find matches for their trends. The personal information is “mined” from private sector databases as well as government databases.

In September 2003, Congress gave in to public objection by denying funding to TIA or any successor program. Nevertheless, the Pentagon's Technology and Privacy Advisory Committee (TAPAC) reported that TIA-like activities could continue to be pursued outside the public's view. Since that time, it has been reported that aspects of TIA went underground and have become a fast-growing data-mining system billed as a tool for hunting terrorists, one that is also being used in hacker and domestic criminal investigations and contains tens of thousands of records from private corporate databases, including car-rental companies, large hotel chains, and at least one national department store.


TIA is both a meta-search engine–querying many data sources at once–and a tool that performs pattern and link analysis, uncovering hidden patterns and subtle relationships in data and inferring rules that allow for the prediction of future results. Subject-based "link analysis" draws on the FBI's collection data sets combined with public records on predicated subjects; it uses these data sets to find links between subjects, suspects, and addresses or other pieces of relevant information, and other persons, places, and things. This technique is currently being used on a limited basis by the FBI. TIA also uses "pattern analysis" in its queries, taking a predictive model or pattern of behavior and searching for that pattern in data sets. TIA included several subprojects for mining different sets of data.


Although the investment in TIA was considerable and much effort went into the data collection and mining process, the alleged ability of data miners to discover hidden "patterns" and "trends" among disparate data sets was naïve and lacked precision: because so little is known about what patterns indicate terrorist activity, such systems are likely to generate huge numbers of false leads. Privacy concerns about mined or analyzed personal data also include concerns about the quality and accuracy of the mined data; the use of the data for purposes other than those for which it was originally collected, without the consent of the individual; the protection of the data against unauthorized access, modification, or disclosure; and the right of individuals to know about the collection of personal information, how to access that information, and how to request a correction of inaccurate information. All of this led to a failure to develop an effective counter-terrorism prediction system and caused the complete termination of the TIA project.


ChiMerge discretization algorithm

ChiMerge is a simple algorithm that uses the chi-square statistic to discretize numeric attributes. It is a supervised, bottom-up data discretization method. It examines each pair of adjacent intervals to determine whether the class frequencies of the two intervals are significantly different, testing the hypothesis that the two adjacent intervals are independent of the class. If the hypothesis is confirmed, the intervals are merged into a single interval; if not, they remain separate.
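The pairwise independence test at the heart of ChiMerge can be sketched as follows (the worked implementation later in this post is VB.NET; this is a Python illustration, and `counts1`/`counts2` are hypothetical per-class frequency rows for two adjacent intervals):

```python
def chi_square(counts1, counts2):
    """Chi-square test statistic for two adjacent intervals.

    counts1/counts2: observed per-class frequencies of each interval.
    E is clamped to 0.5 when it falls below 0.5, as ChiMerge prescribes.
    """
    row_sums = [sum(counts1), sum(counts2)]
    col_sums = [a + b for a, b in zip(counts1, counts2)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate((counts1, counts2)):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:  # expected = 0 implies observed = 0 as well
                chi2 += (observed - expected) ** 2 / max(expected, 0.5)
    return chi2

# Identical class distributions -> independent of the class -> merge
print(chi_square([2, 2, 0], [1, 1, 0]))   # 0.0
# Opposite distributions -> keep the intervals separate
print(chi_square([4, 0], [0, 4]))         # 8.0
```

A low statistic means the class distribution does not change across the boundary, so the boundary carries no information and the intervals can be merged.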

In this problem we select one of the following as the attribute: sepal length, sepal width, petal length, or petal width. Three classes are available: Iris-setosa, Iris-versicolor, and Iris-virginica.

The ChiMerge algorithm is described below:

  1. Set the values for minintervals and maxintervals so as not to produce too large or too small a number of intervals. Choose minintervals and maxintervals such that:

2 ≤ minintervals < maxintervals

  2. Look up the threshold value to use, based on the significance level and the number of degrees of freedom. The threshold value can be found in statistical tables and depends on two factors:
  • The significance level: the higher the significance level, the higher the threshold value and the more likely the adjacent intervals will be merged.
  • The number of degrees of freedom = (2 − 1) × ([number of classes] − 1) = [number of classes] − 1

Here we have 3 classes, thus the number of degrees of freedom is 2.
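As a sanity check on the table lookup: for df = 2 the chi-square CDF is 1 − e^(−x/2), so the critical value at significance level p has the closed form −2 ln(1 − p). A small Python sketch (for other degrees of freedom a library routine such as scipy.stats.chi2.ppf would be needed):

```python
import math

def chi2_threshold_df2(significance):
    """Critical chi-square value for df = 2 at the given significance level."""
    # CDF for df = 2 is 1 - exp(-x/2); invert it for the threshold.
    return -2 * math.log(1 - significance)

# At the 0.95 significance level with df = 2, tables give 5.991
print(round(chi2_threshold_df2(0.95), 3))  # 5.991
```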

  3. A) Sort the attribute values into ascending numerical order.

B) Create a frequency table containing one row for each distinct attribute value and one column for each class. Enter the number of occurrences of each distinct value of the attribute for each possible class in the cells of the table. Then generate a set of distinct intervals: set each interval's lower bound to its own attribute value (inclusive) and its upper bound to the attribute value of the next interval (exclusive).

C) If the number of rows = minintervals, then stop; otherwise go to the next step.

D) For each pair of adjacent rows in the frequency table, calculate E, the expected frequency value for each combination, as the product of the row and column sums divided by the total number of occurrences in the two rows combined. The expected value is the frequency that would be expected to occur by chance under the assumption of independence. Calculate the value of (O − E)²/E (if E < 0.5, replace E by 0.5), where O is the observed frequency value for that combination, then add the values obtained to give chi² for that pair of adjacent rows. No chi² is calculated for the final interval because there is no adjacent interval after it.

E) ChiMerge selects the smallest value of chi² and compares it with the threshold. The two intervals are merged if chi² is less than the threshold or if the number of rows > maxintervals. If the condition is met, the two rows are merged: the attribute of the merged row is set to that of the first of the two constituent rows, and its upper interval bound to the upper bound of the second. Then go to step C; if the condition is not met, the algorithm stops.

The chi² values are then recalculated for the revised frequency table. ChiMerge proceeds iteratively in this way, merging two intervals at each stage, until the chi² values for all remaining pairs of intervals are greater than the threshold value and the number of intervals is no more than the maximum; at that point no further merging is possible and the discretization is complete.
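Steps A through E can be pieced together as a compact Python sketch over (value, class) pairs. This is a simplified illustration that merges the lowest-chi² pair until the threshold or a minimum interval count is reached, omitting the maxintervals bookkeeping described above:

```python
def chi_square(counts1, counts2):
    """Chi-square statistic for two adjacent intervals (E clamped at 0.5)."""
    row_sums = [sum(counts1), sum(counts2)]
    col_sums = [a + b for a, b in zip(counts1, counts2)]
    total = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate((counts1, counts2)):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / max(expected, 0.5)
    return chi2

def chi_merge(samples, threshold, min_intervals=2):
    """samples: (value, class_label) pairs; returns interval lower bounds."""
    labels = sorted({label for _, label in samples})
    # Steps A/B: one initial interval per distinct value, with class counts
    counts = {}
    for value, label in sorted(samples):
        row = counts.setdefault(value, [0] * len(labels))
        row[labels.index(label)] += 1
    bounds = sorted(counts)
    # Steps C-E: repeatedly merge the adjacent pair with the lowest chi2
    while len(bounds) > min_intervals:
        chi2s = [chi_square(counts[a], counts[b])
                 for a, b in zip(bounds, bounds[1:])]
        best = min(range(len(chi2s)), key=chi2s.__getitem__)
        if chi2s[best] >= threshold:
            break  # all remaining pairs are significantly different
        lo, hi = bounds[best], bounds[best + 1]
        counts[lo] = [x + y for x, y in zip(counts[lo], counts[hi])]
        del counts[hi]
        bounds.pop(best + 1)
    return bounds

# Two pure runs of classes collapse into two intervals
print(chi_merge([(1, "a"), (2, "a"), (3, "b"), (4, "b")], threshold=3.0))
# [1, 3]
```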

Check these ChiMerge PowerPoint slides, which visualize the above algorithm.

Now let's take the Iris data set, obtained from the UC Irvine Machine Learning Data Repository, as a data set to be discretized. We will perform data discretization for each of the four numerical attributes using the ChiMerge method with the stopping criterion max-interval = 6.


These attributes refer to Sepal Length, Sepal Width, Petal Length, Petal Width, and Class respectively. Let's create two classes that I'll call "IrisDataItem" and "IrisWorkItem". The former will be used to store the attribute values for each record as the data file is read, while the latter will be used for later processing and for storing the chi-square value of each distinct attribute value.

Public Class IrisDataItem

    Public Enum IRISClass As Byte
        IrisSetosa = 1
        IrisVersicolour = 2
        IrisVirginica = 3
    End Enum

    Public Property SepalLength As Single
    Public Property SepalWidth As Single
    Public Property PetalLength As Single
    Public Property PetalWidth As Single
    Public Property [Class] As IRISClass

End Class

Public Class IrisWorkItem
    Public Property Attribute As Decimal
    Public Property Frequency As Integer
    Public Property SetosaCount As Integer
    Public Property VersicolorCount As Integer
    Public Property VirginicaCount As Integer
    Public Property Start As Decimal
    Public Property [End] As Decimal
    Public Property ChiSquare As Decimal

    'Operator overloading for merging of two adjacent items
    Public Shared Operator ^(ByVal Item1 As IrisWorkItem, ByVal Item2 As IrisWorkItem) As IrisWorkItem
        Item1.SetosaCount += Item2.SetosaCount
        Item1.VirginicaCount += Item2.VirginicaCount
        Item1.VersicolorCount += Item2.VersicolorCount
        Item1.End = Item2.End
        Item2.Frequency = -1
        Return Item1
    End Operator

    'Operator overloading for calculating the ChiSquare for two items
    Public Shared Operator &(ByVal Item1 As IrisWorkItem, ByVal Item2 As IrisWorkItem) As Decimal
        Return ChiMerge.ChiSquare(Item1, Item2)
    End Operator
End Class

Our next task is to actually read the data file and create a List(Of IrisDataItem) object. This can easily be done by reading the file from disk, or you can embed the file in the application resources and read it from there. Here are a couple of functions that handle the two ways of reading the file:

    Private Function ReadDataFile(ByVal Path As String) As StringCollection
        If Not String.IsNullOrWhiteSpace(Path) Then
            Dim reader As New IO.StreamReader(Path)
            Dim data As New StringCollection
            While Not reader.EndOfStream
                data.Add(reader.ReadLine()) 'one record per line
            End While
            reader.Close()
            Return data
        End If
        Throw New Exception("Invalid IRIS File Path!")
    End Function

    Private Function ReadDataFileFromResource() As StringCollection
        Dim strFile As String = System.Text.Encoding.ASCII.GetString(My.Resources.iris)
        Dim reader As New IO.StringReader(strFile)
        Dim data As New StringCollection
        While reader.Peek >= 0 'Peek returns -1 at end of input
            data.Add(reader.ReadLine())
        End While
        Return data
    End Function

    Public Shared Function CreateIrisDataItems(ByRef StringItems As StringCollection) As List(Of IrisDataItem)
        If StringItems IsNot Nothing AndAlso StringItems.Count > 0 Then
            Dim items As New List(Of IrisDataItem)
            For Each i In StringItems
                If Not String.IsNullOrWhiteSpace(i) Then
                    Dim item As New IrisDataItem
                    Dim strItem() As String = i.Split(","c)
                    item.SepalLength = Single.Parse(strItem(0))
                    item.SepalWidth = Single.Parse(strItem(1))
                    item.PetalLength = Single.Parse(strItem(2))
                    item.PetalWidth = Single.Parse(strItem(3))
                    Select Case strItem(4)
                        Case "Iris-setosa"
                            item.Class = IrisDataItem.IRISClass.IrisSetosa
                        Case "Iris-versicolor"
                            item.Class = IrisDataItem.IRISClass.IrisVersicolour
                        Case "Iris-virginica"
                            item.Class = IrisDataItem.IRISClass.IrisVirginica
                    End Select
                    items.Add(item)
                End If
            Next
            Return items
        End If
        Return Nothing
    End Function

    Public Shared Function CreateIrisWorkItems(ByVal AttributeIndex As Integer, ByVal DataItems As List(Of IrisDataItem)) As List(Of IrisWorkItem)
        Dim workItems As New List(Of IrisWorkItem)
        DataItems.ForEach(Sub(dataItem)
                              Dim workItem As IrisWorkItem = Nothing
                              Dim value As Decimal
                              Select Case AttributeIndex
                                  Case 0
                                      workItem = workItems.Find(Function(w) w.Attribute = dataItem.SepalLength)
                                      value = dataItem.SepalLength
                                  Case 1
                                      workItem = workItems.Find(Function(w) w.Attribute = dataItem.SepalWidth)
                                      value = dataItem.SepalWidth
                                  Case 2
                                      workItem = workItems.Find(Function(w) w.Attribute = dataItem.PetalLength)
                                      value = dataItem.PetalLength
                                  Case 3
                                      workItem = workItems.Find(Function(w) w.Attribute = dataItem.PetalWidth)
                                      value = dataItem.PetalWidth
                              End Select
                              If workItem IsNot Nothing Then
                                  workItem.Frequency += 1
                              Else 'first occurrence of this attribute value
                                  workItem = New IrisWorkItem
                                  workItem.Frequency = 1
                                  workItem.Attribute = value
                                  workItems.Add(workItem)
                              End If
                              Select Case dataItem.Class
                                  Case IrisDataItem.IRISClass.IrisSetosa
                                      workItem.SetosaCount += 1
                                  Case IrisDataItem.IRISClass.IrisVersicolour
                                      workItem.VersicolorCount += 1
                                  Case IrisDataItem.IRISClass.IrisVirginica
                                      workItem.VirginicaCount += 1
                              End Select
                          End Sub)
        Return workItems.OrderBy(Function(w) w.Attribute).ToList
    End Function

Having the data ready in our hands, we can now proceed to implement the ChiSquare function, which is basically an implementation of the formula:

chi² = Σ (i = 1..2) Σ (j = 1..k) (Aij − Eij)² / Eij

Where:

k = number of classes

Aij = number of samples in ith interval, jth class

Eij = expected frequency of Aij = (Ri * Cj) / N

Ri = number of samples in ith interval

Cj = number of samples in jth class

N = total number of samples on the two intervals

If Eij < 0.5, set Eij to 0.5

    Public Shared Function ChiSquare(ByVal Item1 As IrisWorkItem, ByVal Item2 As IrisWorkItem) As Decimal
        ChiSquare = 0
        Dim table(,) As Integer = {{Item1.SetosaCount, Item1.VersicolorCount, Item1.VirginicaCount}, {Item2.SetosaCount, Item2.VersicolorCount, Item2.VirginicaCount}}
        Dim rowSums() As Decimal = {Item1.SetosaCount + Item1.VersicolorCount + Item1.VirginicaCount, Item2.SetosaCount + Item2.VersicolorCount + Item2.VirginicaCount}
        Dim columnSums() As Decimal = {Item1.SetosaCount + Item2.SetosaCount, Item1.VersicolorCount + Item2.VersicolorCount, Item1.VirginicaCount + Item2.VirginicaCount}
        Dim totalSum As Decimal = rowSums.Sum
        Dim E As Decimal = 0
        For i As Integer = 0 To table.GetLength(0) - 1
            For j As Integer = 0 To table.GetLength(1) - 1
                E = rowSums(i) * columnSums(j) / totalSum
                'E = 0 implies the observed value is 0 too, so the term is skipped
                If E > 0 Then ChiSquare += ((table(i, j) - E) ^ 2) / IIf(E < 0.5, 0.5, E)
            Next
        Next
    End Function

One last thing to do before we implement the ChiMerge algorithm is to set up the initial interval bounds. This is easily achieved by taking the attribute value itself as the lower bound of the interval and the next attribute value as the upper bound. Below is an easy way of doing this:

    Public Shared Sub CreateIntervalBounds(ByVal Items As List(Of IrisWorkItem))
        For i As Integer = 0 To Items.Count - 2
            Items(i).Start = Items(i).Attribute
            Items(i).End = Items(i + 1).Attribute
        Next
        Items(Items.Count - 1).Start = Items(Items.Count - 1).Attribute
        Items(Items.Count - 1).End = Items(Items.Count - 1).Attribute + 1
    End Sub

Finally, we are ready to implement the ChiMerge algorithm. One thing to note about the implementation below is the use of LINQ to Objects and lambda expressions to filter out the tuples that need to be dropped and merged, so the code will probably compile only with Visual Studio 2010 and .NET Framework 4.0. I haven't tested it on Visual Studio 2008 SP1, but if you do, let me know whether it compiles.

    Public Sub Compute(ByRef Items As List(Of IrisWorkItem))
        For i As Integer = 0 To Items.Count - 2
            Items(i).ChiSquare = Items(i) & Items(i + 1) 'calculate chiSquare
        Next
        Dim minChiSquare As Decimal = Items.Take(Items.Count - 1).Min(Function(x) x.ChiSquare)
        If minChiSquare < THRESHOLD Then
            Dim i As Integer = 0
            While i < Items.Count
                If Items(i).ChiSquare = minChiSquare Then
                    Items(i) ^= Items(i + 1) 'merge
                    Dim j As Integer = i
                    i += 1
                    While i < Items.Count - 1 AndAlso Items(i).ChiSquare = minChiSquare
                        Items(j) ^= Items(i + 1) 'merge
                        i += 1
                    End While
                End If
                i += 1
            End While
            Items.RemoveAll(Function(x) x.Frequency = -1)
            If Items.Count >= MIN_INTERVAL Then Compute(Items)
            If Items.Count > MAX_INTERVAL Then
                THRESHOLD += 0.5
            End If
        End If
    End Sub

Once you run the algorithm, you can compare the obtained intervals and split points with the ones found on the internet. Below are my final results compared to the results on