FBI Data-Mining Program: Total Information Awareness

Application:

Total Information Awareness (TIA), a project of the Defense Department’s Defense Advanced Research Projects Agency (DARPA), aims to build a centralized database containing private transactional data on all Americans, including “records on credit-card purchases, plane flights, e-mails, websites, and housing.” In the tradition of the Bush Administration’s policy of preemptive action, TIA’s goal is to create electronic tools to “better detect, classify, and identify potential foreign terrorists…to increase the probability that authorized agencies of the United States can preempt adverse actions.” The government establishes baseline patterns identifying what it considers suspicious behavior, such as buying one-way plane tickets or drastic changes in spending habits, and then conducts pattern-based electronic searches of huge amounts of information to find matches for those patterns. The personal information is “mined” from private-sector databases as well as government databases.

In September 2003, Congress gave in to public objections by denying funding to TIA or any successor program. Nevertheless, the Pentagon’s Technology and Privacy Advisory Committee (TAPAC) reported that TIA-like activities could continue to be pursued outside the public’s view. Since then, it has been reported that aspects of TIA went underground and have become a fast-growing data-mining system billed as a tool for hunting terrorists; it is also being used in hacker and domestic criminal investigations, and contains tens of thousands of records from private corporate databases, including car-rental companies, large hotel chains, and at least one national department store.

Techniques:

TIA is both a meta-search engine, querying many data sources at once, and a tool that performs pattern and link analysis: it uncovers hidden patterns and subtle relationships in data and infers rules that allow for the prediction of future results. Subject-based “link analysis” draws on the FBI’s collection data sets, combined with public records on predicated subjects, to find links between subjects, suspects, addresses or other pieces of relevant information, and other persons, places, and things. This technique is currently being used on a limited basis by the FBI. The system also performs “pattern analysis” in its queries, starting from a predefined predictive model of behavior and searching for that pattern in data sets. TIA included several sub-projects for mining different sets of data.
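Link analysis is easiest to see on a toy example. The sketch below is my own illustration in Python (the records, names, and attribute scheme are all invented, not from any real system): it builds a graph of people connected through shared attributes such as addresses and phone numbers, then walks outward from a predicated subject to surface indirectly connected people.

```python
# Toy illustration of subject-based link analysis (all data invented).
from collections import defaultdict, deque

# Each record links a person to an attribute (address, phone, flight, ...).
records = [
    ("alice", "addr:12 Oak St"),
    ("bob",   "addr:12 Oak St"),
    ("bob",   "phone:555-0101"),
    ("carol", "phone:555-0101"),
    ("dave",  "flight:UA-100"),
]

# Build an undirected graph: person <-> attribute.
graph = defaultdict(set)
for person, attr in records:
    graph[person].add(attr)
    graph[attr].add(person)

def linked_people(start, max_hops=4):
    """Breadth-first search: people reachable from `start` via shared attributes."""
    seen, queue, found = {start}, deque([(start, 0)]), set()
    while queue:
        node, hops = queue.popleft()
        if hops >= max_hops:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                if not nxt.startswith(("addr:", "phone:", "flight:")):
                    found.add(nxt)
                queue.append((nxt, hops + 1))
    return found

print(sorted(linked_people("alice")))  # ['bob', 'carol']; dave shares nothing
```

Here alice is linked to carol only indirectly, through bob and a shared phone number; this chain-of-attributes walk is the essence of the technique.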

Results:

Although the investment in TIA was considerable and substantial effort went into the data collection and mining process, the alleged ability of data-miners to discover hidden “patterns” and “trends” among disparate data sets was naïve and lacked precision: because so little is known about what patterns indicate terrorist activity, such systems are likely to generate huge numbers of false leads. In addition, privacy concerns about mined or analyzed personal data include concerns about the quality and accuracy of the mined data; the use of the data for purposes other than those for which they were originally collected, without the consent of the individual; the protection of the data against unauthorized access, modification, or disclosure; and the right of individuals to know about the collection of personal information, how to access that information, and how to request a correction of inaccurate information. All of this led to failure in developing an effective counter-terrorism prediction system and caused the complete termination of the TIA project.
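The false-lead problem comes down to base-rate arithmetic. The numbers below are hypothetical, chosen only to illustrate the scale: even a pattern-matcher with an optimistic 99% hit rate and a mere 1% false-positive rate, run against an entire population, buries the real targets under millions of false alarms.

```python
# Hypothetical base-rate arithmetic (all numbers invented for illustration):
# a rare behavior plus an imperfect classifier yields mostly false positives.
population = 300_000_000      # roughly the U.S. population
true_targets = 3_000          # hypothetical number of actual bad actors
true_positive_rate = 0.99     # flags 99% of real targets
false_positive_rate = 0.01    # wrongly flags 1% of innocent people

flagged_real = round(true_targets * true_positive_rate)
flagged_innocent = round((population - true_targets) * false_positive_rate)

print(flagged_real)       # 2970 real targets flagged
print(flagged_innocent)   # 2999970 innocent people flagged
precision = flagged_real / (flagged_real + flagged_innocent)
print(f"{precision:.4f}") # about 0.1% of flags point at a real target
```

Under these assumed rates, fewer than one flag in a thousand would be a genuine lead, which is exactly the precision problem the critics raised.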

References:

http://epic.org/privacy/profiling/tia/

http://en.wikipedia.org/wiki/Information_Awareness_Office

http://dissidentvoice.org/2009/10/fbi-data-mining-programs-resurrect-total-information-awareness/

Swap 2 variables without using temporary variable!

Perhaps the oldest technique most programmers use, whether beginners or professional coders, to swap 2 variables a and b is to declare a new temporary variable c, copy a to c, then assign b to a, and finally assign c to b. The following code should look familiar:

        Dim a As Integer = 10
        Dim b As Integer = 20
        Dim c As Integer = a
        a = b
        b = c

This technique is really simple and easy to understand, and perhaps this is the main reason for its widespread use. However, there is a minor cost to this technique: the allocation of a new variable (c), which incurs slightly more memory usage. Although it is just one extra variable, a four-byte Integer, it can still matter for memory-critical applications or on machines where memory is very limited (such as mobile devices). The first alternative technique is also simple and has the advantage of not allocating space for a new temporary variable. This method uses addition and subtraction in a clever way in order to keep track of a and b (one caveat: if a + b overflows the Integer range, VB.NET throws an OverflowException unless integer overflow checks are disabled):

        Dim a As Integer = 10
        Dim b As Integer = 20
        a = a + b
        b = a - b
        a = a - b

Let’s analyze this code line by line:

  1. a = 10
  2. b = 20
  3. a = 10 + 20 = 30
  4. b = 30 - 20 = 10
  5. a = 30 - 10 = 20
  6. Final result: a = 20 and b = 10

This is really a very nice and smart way of swapping. You can encapsulate it in a function that takes two arguments by reference so that you don’t have to remember these arithmetic steps each time, but I will leave this as an exercise to the reader :). Now let’s take a look at the next method, which involves multiplication and division instead of addition and subtraction:

        Dim a As Integer = 10
        Dim b As Integer = 20
        a = a * b
        b = a / b
        a = a / b

Again, let’s analyze this code line by line:

  1. a = 10
  2. b = 20
  3. a = 10 * 20 = 200
  4. b = 200 / 20 = 10
  5. a = 200 / 10 = 20
  6. Final result: a = 20 and b = 10

Note that this variant fails if either value is zero (the final division becomes a division by zero), and since VB.NET’s / operator performs floating-point division, the integer-division operator \ is the safer choice here.
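The same tricks can be sketched in a few lines of Python (my own illustration, not from the original post), which also makes the multiply/divide pitfall easy to demonstrate: it breaks when either value is zero.

```python
# Illustrative sketch of the two arithmetic swap tricks (not from the post).

def swap_add_sub(a, b):
    a = a + b
    b = a - b  # b now holds the original a
    a = a - b  # a now holds the original b
    return a, b

def swap_mul_div(a, b):
    a = a * b
    b = a // b  # b now holds the original a
    a = a // b  # a now holds the original b
    return a, b

print(swap_add_sub(10, 20))  # (20, 10)
print(swap_mul_div(10, 20))  # (20, 10)

# The multiply/divide trick breaks when either value is zero:
try:
    swap_mul_div(0, 20)
except ZeroDivisionError:
    print("mul/div swap fails when a value is zero")
```

Python integers do not overflow, so only the zero case bites here; in VB.NET the add/sub variant has the additional overflow caveat noted above.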

I created a console application containing all 3 methods, and surrounded each method with a very long loop to simulate a compute-intensive operation so I could measure the speed of each one using a Stopwatch. The complete code looks like this:

    Sub Main()
        Dim a As Integer = 10
        Dim b As Integer = 20
        Console.WriteLine("Original values")
        Console.WriteLine("a: {0}, b: {1}", a, b)
        Dim s As New Stopwatch
        s.Start()
        For i As Integer = 0 To 10000002
            a = a + b
            b = a - b
            a = a - b
        Next
        s.Stop()
        Console.WriteLine("Swap using addition subtraction")
        Console.WriteLine("a: {0}, b: {1}, Time: {2}", a, b, s.ElapsedMilliseconds)

        a = 10
        b = 20

        s.Restart()
        For i As Integer = 0 To 10000002
            Dim c As Integer = a
            a = b
            b = c
        Next

        s.Stop()
        Console.WriteLine("Swap with temporary variable")
        Console.WriteLine("a: {0}, b: {1}, Time: {2}", a, b, s.ElapsedMilliseconds)

        a = 10
        b = 20

        s.Restart()
        For i As Integer = 0 To 10000002
            a = a * b
            b = a / b
            a = a / b
        Next
        s.Stop()
        Console.WriteLine("Swap using multiplication division")
        Console.WriteLine("a: {0}, b: {1}, Time: {2}", a, b, s.ElapsedMilliseconds)

        Console.ReadKey()
    End Sub

Running this application produced interesting measurements, one in particular that I did not expect.

The method using a temporary variable outperformed the other two. I expected the multiplication/division method to be the worst because of the division overhead, but I also thought the addition/subtraction method would compete; instead, the temporary variable turned out to be about twice as fast. The explanation is that the “temporary variable” does not really cost an allocation at all: it is a local that the JIT compiler keeps in a CPU register or a stack slot, so no heap allocation ever happens. Three plain moves are also cheaper for the CPU than an arithmetic chain in which each operation depends on the result of the previous one.

Final thoughts:
If you are using the temporary variable method for swapping variables, keep using it :) it’s the right thing to do, unless you are a memory geek or developing applications for mobile devices, in which case perhaps you should switch to the addition/subtraction method.

Cross Product of arrays using LINQ

Cross product is usually a database operation on two tables, similar to a join operation, the only difference being that the cross product yields all the possible combinations between the two tables. Database management systems such as SQL Server and Oracle provide the cross product operation easily via the CROSS JOIN keyword. The problem in real-life applications appears when we want to do a cross product within the application logic. Suppose you have two arrays of numbers and you want to find all the possible combinations of the two arrays, then pick the combinations that meet your specific business needs. Fortunately, with the introduction of LINQ, cross product has become as simple as writing any other query. Let’s take a look at the example below: consider two arrays A and b having 4 values each:

        Dim A() As Integer = {1, 2, 3, 4}
        Dim b() As Integer = {5, 6, 7, 8}

Cross product is simply obtained using the following LINQ query followed by writing out the results to console:

        Dim crossProduct = From x In A, y In b Select x, y
        For Each i In crossProduct
            Console.WriteLine(i)
        Next

This will yield all 16 combinations, from { x = 1, y = 5 } up to { x = 4, y = 8 }, pairing every element of A with every element of b.
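For comparison, the same cross product can be expressed in Python with itertools.product (a sketch of my own, not part of the original post):

```python
# Cross product of two arrays using itertools.product, analogous to
# the LINQ query "From x In A, y In b Select x, y".
from itertools import product

A = [1, 2, 3, 4]
b = [5, 6, 7, 8]

cross = list(product(A, b))
print(len(cross))   # 16 pairs: 4 * 4
print(cross[0])     # (1, 5)
print(cross[-1])    # (4, 8)
```

Like the LINQ version, the pairs come out in row-major order: all pairs for x = 1 first, then x = 2, and so on.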

Sieve of Eratosthenes

The Sieve of Eratosthenes is an ancient algorithm for finding prime numbers, developed by the great mathematician Eratosthenes. Although the algorithm is over two thousand years old, it is still widely used and considered one of the most common algorithms for generating prime numbers. A couple of alternative sieves exist, such as the Sieve of Atkin and the Sieve of Sundaram; I will discuss these in future posts, but for the moment let’s dig into Eratosthenes’s algorithm:

The algorithm takes as input an integer N, and generates all primes from 2 up to N. It works as follows:

  • generate all numbers from 2 up to N in ascending order
  • since 2 is the first prime, walk through the list of numbers and eliminate every second number from it
  • grab the next integer that survived the elimination(which is 3), and eliminate every third number from the list
  • then grab the next integer that survived that elimination(which is 5) and repeat the same process until the square of the next grabbed number is greater than N

The Wikipedia article on the sieve includes a helpful step-by-step animation of this process.

It is easy to implement and test this algorithm in a few lines of code. Create a console application and replace the main method with the following one and we’re done 🙂 (Special greetings to Mohammad Selek who contributed the source code).

     Sub Main()
        Console.Write("Find primes up to: ")
        Dim s As String = Console.ReadLine()
        Dim upperLimit As Integer = Integer.Parse(s.Trim())
        ' prime(i) = 0 means "still considered prime", 1 means "eliminated"
        Dim prime(upperLimit) As Integer
        Dim c1, c2, c3 As Integer
        For c1 = 2 To upperLimit
            prime(c1) = 0
        Next
        prime(0) = 1
        prime(1) = 1
        ' only factors up to Sqrt(upperLimit) need to be sieved
        For c2 = 2 To CInt(Math.Sqrt(upperLimit)) + 1
            If prime(c2) = 0 Then
                c1 = c2
                c3 = 2 * c1
                ' eliminate every multiple of c1
                While c3 < upperLimit + 1
                    prime(c3) = 1
                    c3 = c3 + c1
                End While
            End If
        Next
        For c1 = 0 To upperLimit
            If prime(c1) = 0 Then
                Console.Write(c1 & "     ")
            End If
        Next
        Console.ReadKey()
    End Sub

For N = 1000 the program prints the 168 primes up to 1000, beginning 2, 3, 5, 7, 11, 13, …
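For readers who prefer to experiment outside VB, here is an illustrative Python translation of the sieve (my own sketch, not from the post); it marks composites starting from p*p, a small optimization over starting at 2*p:

```python
# Sieve of Eratosthenes: return all primes from 2 up to n.
import math

def sieve(n):
    is_composite = [False] * (n + 1)
    for p in range(2, math.isqrt(n) + 1):
        if not is_composite[p]:
            # eliminate every multiple of p; smaller multiples were
            # already eliminated by smaller prime factors
            for multiple in range(p * p, n + 1, p):
                is_composite[multiple] = True
    return [i for i in range(2, n + 1) if not is_composite[i]]

primes = sieve(1000)
print(len(primes))   # 168 primes up to 1000
print(primes[:6])    # [2, 3, 5, 7, 11, 13]
```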

Performance issues in 2D array traversal

So what’s so special about traversing a 2D array? From the point of view of a developer, it’s just a couple of nested loops over the width and height of the array, applying some computation to each element. For example, consider the problem of finding the sum of the numbers in a 2D array, which involves two nested loops that read the value of each number and add it to a global sum. Another problem that heavily involves 2D arrays is matrix multiplication using the naive algorithm, which requires nesting 3 loops (faster algorithms exist, such as the well-known Strassen algorithm, and asymptotically faster ones have been developed since). In this post, however, we will focus on computing the sum of a 2D array, to simplify the problem and show the performance penalties that can be incurred if the developer is not aware of how the programming language stores arrays. Without further ado, let’s take a look at the following example. As a first step we will generate a random 2D integer array of 13000 rows and 13000 columns, a total of 169000000 integers, filling the array with random values using the Rnd() function:

    Private Function GenerateRandomArray() As Integer(,)
        Dim ar(12999, 12999) As Integer
        For i As Integer = 0 To ar.GetLength(0) - 1
            For j As Integer = 0 To ar.GetLength(1) - 1
                ar(i, j) = Rnd() * 1000
            Next
        Next
        Return ar
    End Function

We proceed to compute the sum of the array, traversing the rows in the outer loop and the columns in the inner loop. The implementation is simple, as shown below:

    Private Function RowMajorSum(ByVal ar(,) As Integer) As Long
        Dim sum As Long = 0
        For i As Integer = 0 To ar.GetLength(0) - 1
            For j As Integer = 0 To ar.GetLength(1) - 1
                sum += ar(i, j)
            Next
        Next
        Return sum
    End Function

Running this function on my personal laptop showed that the sum is computed correctly. I also used a Stopwatch to record the time the function took to complete:

        Dim time As New Stopwatch
        time.Start()
        Dim sum1 As Long = RowMajorSum(ar)
        time.Stop()
        Label1.Text = "Time:" & time.ElapsedMilliseconds & ", Sum:" & sum1

Label1 displayed the following:

Time:2109, Sum:84513231237

We will now create a second function which does the same as the first one, but with the order of traversal reversed, so that the outer loop iterates over the columns and the inner loop iterates over the rows:

    Private Function ColumnMajorSum(ByVal ar(,) As Integer) As Long
        Dim sum As Long = 0
        For i As Integer = 0 To ar.GetLength(1) - 1
            For j As Integer = 0 To ar.GetLength(0) - 1
                sum += ar(j, i)
            Next
        Next
        Return sum
    End Function

Running this function on the same machine again produces the following:

    Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click
        Dim time As New Stopwatch
        time.Start()
        Dim sum2 As Long = ColumnMajorSum(ar)
        time.Stop()
        Label1.Text = "Time:" & time.ElapsedMilliseconds & ", Sum:" & sum2
    End Sub
Time:3402, Sum:84513231237

As you can see, the second function took almost twice the time, although the idea of computing the sum is the same. What really happened under the hood of this program turns out to be quite interesting. In fact, because we are using VB.Net, the RowMajor function was faster than the ColumnMajor one. Had we used Fortran or Matlab to do the same thing, the results would be the opposite!
The reason is related to how the numbers are stored in memory. In the .Net CLR, arrays are stored in row-major order: each row is laid out contiguously in linear memory. So for example, consider a 2x3 array whose first row is (1 2 3) and whose second row is (4 5 6).

Languages that adopt the Row-Major order method will have the numbers laid out like this:

 1  2  3  4  5  6

While languages that adopt the Column-Major order method will have the same numbers laid out like this:

1  4  2  5  3  6

So how does this relate to the performance difference we got earlier? In the RowMajorSum method, every time we read an item from the array, the value being read and some neighboring values are fetched into the cache. In practice, cache line size is typically 64 or 128 bytes. On my machine the cache line size is 64 bytes, so 16 integers are retrieved from main memory and cached; when the next number is read, no trip to main memory is required, since it is served from the cache, and cache access is much faster than main memory access. So for our example, roughly 13000*13000/16 main-memory accesses are required to finish the sum computation. In the ColumnMajorSum method, the cache behaves in the same way, but unfortunately the next number being read will almost never be in the cache, because the numbers are laid out contiguously in rows; this method ends up going to main memory for nearly every value, so the required main-memory accesses in this case are about 13000*13000.
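The shape of the experiment can be sketched in Python (my own illustration; note that in pure Python, with its list-of-lists representation and per-element pointer chasing, the cache effect is far weaker than in compiled .Net code, so treat this as an outline of the experiment rather than a faithful reproduction):

```python
# Row-major vs column-major traversal of a 2D list (illustrative sizes).
import time

N = 1000  # much smaller than the post's 13000x13000, so it runs quickly
grid = [[(i * N + j) % 1000 for j in range(N)] for i in range(N)]

def row_major_sum(a):
    total = 0
    for row in a:              # walk each row contiguously
        for v in row:
            total += v
    return total

def column_major_sum(a):
    total = 0
    for j in range(len(a[0])): # walk down each column: strided access
        for i in range(len(a)):
            total += a[i][j]
    return total

t0 = time.perf_counter(); s1 = row_major_sum(grid); t1 = time.perf_counter()
s2 = column_major_sum(grid); t2 = time.perf_counter()
print(s1 == s2)  # True: both traversal orders compute the same sum
print(f"row-major {t1 - t0:.4f}s, column-major {t2 - t1:.4f}s")
```

In a compiled language with true 2D arrays (VB.Net, C, Fortran), the gap between the two orders is much more dramatic, for exactly the cache-line reasons described above.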

In conclusion, I want to highlight that 2D array traversal matters, and choosing which dimension goes on the outer loop and which on the inner loop is critical. Developers need to take a closer look at the type of application they are working on and have a deeper knowledge of the programming language they are using. In a future post, I will explain some tips and tricks to overcome cache problems and how to take better advantage of the cache.

Listen for Removable Device events

Handling device events in the .net framework is not an easy task and is really cumbersome for many developers. Application developers often need to listen for hardware events, such as adding or removing a USB flash drive or disconnecting an external hard disk, and run some application logic based on that event. For example, consider an application that runs on a server, where the administrator wants to keep track of what Plug & Play devices are being attached to the server on a daily basis. We may also want to track the name of the attached device, its size, and probably the files being transferred to and from the device.

It seems frustrating and hard to implement using the ordinary .net libraries available, so we will resort to using WMI to query the operating system for this information and create our own RemovableDevice wrapper.

We need to add a reference to System.Management to get the WMI APIs. The key to this whole feature lies in the ManagementEventWatcher class: we fire a WQL query against Win32_VolumeChangeEvent, a Win32 event class that reports changes to volumes and storage devices attached to the machine, then handle the watcher’s EventArrived event to distinguish the event types and raise our own custom events accordingly. Below is the full implementation of our RemovableDevice wrapper:


Imports System.Management

Public Class RemovableDevice

    Public Enum VolumeEventType As UInteger
        ConfigurationChange = 1
        DeviceArrival = 2
        DeviceRemoval = 3
        Docking = 4
        Any = 5
    End Enum

    Private Shared WithEvents MediaConnectWatcher As System.Management.ManagementEventWatcher
    Public Shared Event OnAnyChange(ByVal EventType As VolumeEventType, ByVal DriveName As String, ByVal TimeCreated As Date)
    Public Shared Event OnConfigurationChange(ByVal DriveName As String, ByVal TimeCreated As Date)
    Public Shared Event OnArrival(ByVal DriveName As String, ByVal TimeCreated As Date)
    Public Shared Event OnRemoval(ByVal DriveName As String, ByVal TimeCreated As Date)
    Public Shared Event OnDock(ByVal DriveName As String, ByVal TimeCreated As Date)
    Private Shared mEventType As VolumeEventType

    Public Shared Sub ListenForChanges(Optional ByVal EventType As VolumeEventType = VolumeEventType.Any)
        mEventType = EventType
        Dim query2 As String
        If EventType = 5 Then
            query2 = "SELECT * FROM Win32_VolumeChangeEvent"
        Else
            query2 = "SELECT * FROM Win32_VolumeChangeEvent WHERE EventType=" & EventType
        End If
        ' Dim query2 As String = "SELECT * FROM __InstanceCreationEvent WITHIN 5 WHERE TargetInstance ISA ""Win32_USBHub"""
        MediaConnectWatcher = New System.Management.ManagementEventWatcher(query2)
        MediaConnectWatcher.Start()
    End Sub

    Private Shared Sub Arrived(ByVal sender As Object, ByVal e As System.Management.EventArrivedEventArgs) Handles MediaConnectWatcher.EventArrived
        ' TIME_CREATED is a FILETIME (100-ns intervals since 1601, UTC),
        ' so FromFileTimeUtc is the right conversion, not FromBinary.
        Dim timeCreated As Date = Date.FromFileTimeUtc(CLng(e.NewEvent("TIME_CREATED")))
        Select Case e.NewEvent("EventType")
            Case 1
                RaiseEvent OnConfigurationChange(e.NewEvent("DriveName"), timeCreated)
            Case 2
                RaiseEvent OnArrival(e.NewEvent("DriveName"), timeCreated)
            Case 3
                RaiseEvent OnRemoval(e.NewEvent("DriveName"), timeCreated)
            Case 4
                RaiseEvent OnDock(e.NewEvent("DriveName"), timeCreated)
        End Select
        If mEventType = VolumeEventType.Any Then
            RaiseEvent OnAnyChange(e.NewEvent("EventType"), e.NewEvent("DriveName"), timeCreated)
        End If
    End Sub
End Class

The code below calls the shared ListenForChanges method to listen for any of the 5 standard volume events in Windows, but each event is handled separately. Note that because the events are declared Shared, they are wired up with AddHandler rather than a WithEvents/Handles clause:

        RemovableDevice.ListenForChanges(RemovableDevice.VolumeEventType.Any)
        AddHandler RemovableDevice.OnAnyChange, AddressOf Device_OnAnyChange
        AddHandler RemovableDevice.OnArrival, AddressOf Device_OnArrival
        AddHandler RemovableDevice.OnRemoval, AddressOf Device_OnRemoval
        AddHandler RemovableDevice.OnDock, AddressOf Device_OnDock
        AddHandler RemovableDevice.OnConfigurationChange, AddressOf Device_OnConfigurationChange

    Private Sub Device_OnAnyChange(ByVal EventType As RemovableDevice.VolumeEventType, ByVal DriveName As String, ByVal TimeCreated As Date)
        MsgBox(EventType & DriveName & TimeCreated)
    End Sub

    Private Sub Device_OnArrival(ByVal DriveName As String, ByVal TimeCreated As Date)
        MsgBox("I have arrived")
    End Sub

    Private Sub Device_OnRemoval(ByVal DriveName As String, ByVal TimeCreated As Date)
        MsgBox("I was removed")
    End Sub

    Private Sub Device_OnDock(ByVal DriveName As String, ByVal TimeCreated As Date)
        MsgBox("I am docked")
    End Sub

    Private Sub Device_OnConfigurationChange(ByVal DriveName As String, ByVal TimeCreated As Date)
        MsgBox("My config has changed")
    End Sub

Finding the Majority element in an array

In its simplest terms, the majority element is the element that occurs the most within a given array of numbers, where “the most” means that the number of occurrences of the element must be more than half the size of the array. In mathematical terms, given an unsorted array of n integers, an element is a majority element if it occurs at least ⌊n/2⌋ + 1 times.

The majority element is an important problem in computer science and is used as a basis for more complex algorithms. We also encounter majorities in our daily life. For example, the ownership of a large company is determined by shareholders holding more than half the total shares. Another example is an election, where the winner is determined by obtaining more than half the votes.

In this article, we will look at a fast algorithm that finds the majority element in linear time, using two passes over the array, i.e. O(n) + O(n). The technique is essentially the majority vote algorithm attributed to Boyer and Moore (1981); a similar presentation was given by Rengakrishnan Subramanian in December 2001.

The algorithm approaches the problem in two steps. In the first step we call the method FindElement, which takes an array of integers as input and returns the element that is “most likely” to be the majority. In the second step, the algorithm calls FindMajority, which takes the same array along with the candidate element we determined earlier, and returns the element if it turns out to be the majority, or Nothing if it is not.

The FindElement algorithm works as follows:

1. Declare two variables; we’ll call the first one count and the second one canbeMajority, which stores the candidate majority element.

2. Assume that the first number in the array is the majority and assign its value to canbeMajority.

3. Traverse the input array from start to end, applying the following rules:

  • If the next element in the array is equal to canbeMajority, then count is incremented by 1.
  • If the next element in the array is not equal to canbeMajority, then count is decremented by 1.
  • If at the next element, the count is equal to zero, then count is set to one and this element is considered as canbeMajority

Here is the implementation of FindElement and FindMajority in vb.net:

    Private Function FindElement(ByRef arr() As Integer) As Integer
        Dim count As Integer = 0
        Dim element As Integer = 0
        For i As Integer = 0 To arr.Length - 1
            If count = 0 Then
                count = 1
                element = arr(i)
            ElseIf arr(i) = element Then
                count += 1
            Else
                count -= 1
            End If
        Next
        Return element
    End Function

    Private Function FindMajority(ByRef arr() As Integer) As Integer?
        Dim element As Integer = FindElement(arr)
        Dim count As Integer = 0
        For i As Integer = 0 To arr.Length - 1
            If arr(i) = element Then count += 1
        Next
        ' A nullable Integer return type is used here: with a plain Integer,
        ' "Return Nothing" would silently return 0, which is ambiguous when
        ' 0 is itself an element of the array.
        If count > arr.Length \ 2 Then Return element Else Return Nothing
    End Function
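The same two-pass idea, commonly known as the Boyer-Moore majority vote, can be sketched in Python (my own illustration, not from the post):

```python
# Boyer-Moore majority vote: candidate pass, then verification pass.

def find_candidate(arr):
    count, candidate = 0, None
    for x in arr:
        if count == 0:
            candidate, count = x, 1   # adopt a new candidate
        elif x == candidate:
            count += 1                # candidate survives
        else:
            count -= 1                # candidate is challenged
    return candidate

def find_majority(arr):
    candidate = find_candidate(arr)
    # second pass verifies the candidate really occurs more than n/2 times
    if arr.count(candidate) > len(arr) // 2:
        return candidate
    return None

print(find_majority([3, 1, 3, 3, 2, 3, 3]))  # 3
print(find_majority([1, 2, 3]))              # None (no majority)
```

The verification pass is essential: the first pass always produces some candidate, but only a genuine majority element is guaranteed to survive it.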

I have created a simple application that reads or generates an arbitrary array of numbers and finds its majority element, but I will leave the implementation of the application as an exercise to the reader :).