Performance issues in 2D array traversal
November 16, 2010 1 Comment
So whats so special about traversing a 2D array. From the view of a developer, it’s just a couple of nested loops over the width and height of the array and applying some computation over each element. For example, consider the problem of finding the sum of numbers in a 2D array which involves creating two nested loops to read the value of each number and adding it to a global sum. Another problem that heavily involves using 2D arrays is matrix multiplication using the naive matrix multiplication algorithm which requires nesting of 3 loops. Although faster algorithms were developed for matrix multiplication, such as the Strassen algorithm which is the fastest known algorithm until today. However, in this post, we will focus on computing the sum of a 2D arrays to simplify the problem and show the performance drawbacks that might be incured if the developer is not aware about the programming language used. Without any further due, lets take a look at the following example. As a first step we will generate a random 2D integer array of 13000 rows and 13000 rows, a total of 169000000 integers and filling the array with random values using the Rnd() function:
Private Function GenerateRandomArray() As Integer(,) Dim ar(12999, 12999) As Integer For i As Integer = 0 To ar.GetLength(0) - 1 For j As Integer = 0 To ar.GetLength(1) - 1 ar(i, j) = Rnd() * 1000 Next Next Return ar End Function
We proceed to compute the sum of the array traversing over the rows in the outer loop and the columns in the inner loop. The implementation is simple as shown below:
Private Function RowMajorSum(ByVal ar(,) As Integer) As Long Dim sum As Long = 0 For i As Integer = 0 To ar.GetLength(0) - 1 For j As Integer = 0 To ar.GetLength(1) - 1 sum += ar(i, j) Next Next Return sum End Function
Running this function on my personal laptop showed that the sum is computed correctly. I also used a stop watch to record the time it took for the function to complete:
Dim time As New Stopwatch time.Start() Dim sum1 As Long = RowMajorSum(ar) time.Stop() Label1.Text = "Time:" & time.ElapsedMilliseconds & ", Sum:" & sum1
Label1 displayed the following:
We will now create a second function which does the same as the first one, but now the order of traversal will be reversed so that the outer loop iterates over the columns and the inner loop iterated over the rows:
Private Function ColumnMajorSum(ByVal ar(,) As Integer) As Long Dim sum As Long = 0 For i As Integer = 0 To ar.GetLength(1) - 1 For j As Integer = 0 To ar.GetLength(0) - 1 sum += ar(j, i) Next Next Return sum End Function
Running this function on the same machine again produces the following:
Private Sub Button2_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles Button2.Click Dim time As New Stopwatch time.Start() Dim sum2 As Long = ColumnMajorSum(ar) time.Stop() Label1.Text = "Time:" & time.ElapsedMilliseconds & ", Sum:" & sum2 End Sub
As you can see, the second function took almost twice the time! although the idea of computing the sum is the same. So what really happened under the hoods of this program turns out to be quiet interesting. In fact, because we are using Vb.net, the RowMajor function was faster than the ColumnMajor one. Had we used Fortran or Matlab to do the same thing, the results would be the opposite!
The reason is related to how the numbers are stored in memory. In the .Net CLR, arrays are stored in Row-Major order, that is they are laid contiguously in linear memory. So for example, if we have the following array:
Languages that adopt the Row-Major order method will have the numbers laid out like this:
1 2 3 4 5 6
While languages that adopt the Column-Major order method will have the same number laid out like this:
1 4 2 5 3 6
So how does this relate to the performance difference we got earlier. In the RowMajorSum method, every time we read an item from the array, the value being read and some neighboring values are retrieved into the cache. In reality, cache line size varies between 64 bytes and 128bytes. On my machine the cache line size is 64 bytes, so 16 integers are retrieved from main memory and cached so that when the next number is retrieved, no trip is required to main memory, instead it will be retrieved from the cache and we know that cache access is much faster than main memory access. So for our example, 13000*13000/16 memory accesses are required to finalize the sum computation. In the ColumnMajorSum method, the cache behaves in the same way, but unfortunately the next number being read will never exist on the cache because the layout of the numbers in memory are contiguously laid out in rows so this method will end up going every time to main memory to read the next value so the required memory accesses in this case is about 13000*13000 .
As a conclusion, I want to highlight on the issue that 2D array traversal is an important issue and choosing which vectors to be on the outer and inner loop is critical. Developers need to take a closer look on the type of application they are working on and have a deeper knowledge of the programming language they are using. In a future post, i will explain some tips and tricks to overcome cache problems and explain how to better take advantage of the cache.