Finding Adjacent Prime Numbers Using LINQ

Suppose you have a range of numbers defined by a lower bound L and an upper bound U. The problem is to find two pairs of prime numbers: the pair of consecutive primes with the smallest distance between them, and the pair with the largest distance. For example, given the range of integers from 1 to 20, the two most adjacent primes are 2 and 3, having a distance of 1, while the least adjacent primes are 7 and 11, having a distance of 4.

Given a list of integers from 1 to 10,000, can you tell the least adjacent prime numbers? (The answer is 9551 and 9587.) 🙂

Writing an algorithm that finds adjacent primes can be tricky, but it is also short and straightforward. Below is a naive implementation. I will not explain the code in detail, but basically it sieves the range: it loops from 2 up to the square root of U and flags every composite number, leaving the primes unflagged. A LINQ query then computes the distance between each prime and the next one, and two further queries order by distance ascending and descending in order to select the top result from each.

    Sub Main()
        While True
            Dim primeNumbers As New List(Of Integer)

            Console.WriteLine("----------Prime Distance----------")
            Console.Write("Enter L: ")
            Dim lowerLimit As Integer = Integer.Parse(Console.ReadLine().Trim())
            Console.Write("Enter U: ")
            Dim upperLimit As Integer = Integer.Parse(Console.ReadLine().Trim())

            ' prime(i) = 0 marks i as prime, 1 as composite
            Dim prime(upperLimit) As Integer
            prime(0) = 1
            prime(1) = 1

            ' Sieve of Eratosthenes: flag the multiples of every prime up to Sqrt(U)
            For c2 As Integer = 2 To CInt(Math.Sqrt(upperLimit)) + 1
                If prime(c2) = 0 Then
                    Dim c3 As Integer = 2 * c2
                    While c3 <= upperLimit
                        prime(c3) = 1
                        c3 += c2
                    End While
                End If
            Next

            ' Collect the primes that fall within [L, U]
            For c1 As Integer = 2 To upperLimit
                If prime(c1) = 0 Then primeNumbers.Add(c1)
            Next
            primeNumbers = primeNumbers.Where(Function(x) x >= lowerLimit).ToList()

            ' Pair each prime with every larger prime and compute the distance,
            ' then keep only the pair formed with the next consecutive prime
            Dim ordered = From a In primeNumbers Order By a Ascending Select a
            Dim query = From a In ordered, b In ordered Where b > a Select P1 = a, P2 = b, Distance = Math.Abs(a - b)
            query = query.GroupBy(Function(x) x.P1).Select(Function(x) x.First()) ' distinct: nearest successor per prime

            Dim minDistance = (From a In query Order By a.Distance Ascending Select a).Take(1).SingleOrDefault()
            Dim maxDistance = (From a In query Order By a.Distance Descending Select a).Take(1).SingleOrDefault()
            Console.WriteLine("{0},{1} are the closest primes having distance {2}", minDistance.P1, minDistance.P2, minDistance.Distance)
            Console.WriteLine("{0},{1} are the most distant primes having distance {2}", maxDistance.P1, maxDistance.P2, maxDistance.Distance)
        End While
    End Sub
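For comparison, the same sieve-and-pair idea fits in a few lines of Python (a minimal sketch, not a translation of the VB.NET code above); running it for 1 to 10,000 confirms the 9551 and 9587 answer:

```python
def primes_up_to(n):
    # Sieve of Eratosthenes: flag every composite, keep the rest
    composite = [False] * (n + 1)
    composite[0] = composite[1] = True
    for p in range(2, int(n ** 0.5) + 1):
        if not composite[p]:
            for m in range(p * p, n + 1, p):
                composite[m] = True
    return [i for i in range(2, n + 1) if not composite[i]]

def adjacent_primes(lower, upper):
    primes = [p for p in primes_up_to(upper) if p >= lower]
    pairs = list(zip(primes, primes[1:]))          # consecutive prime pairs
    closest = min(pairs, key=lambda t: t[1] - t[0])
    farthest = max(pairs, key=lambda t: t[1] - t[0])
    return closest, farthest
```

Calling `adjacent_primes(1, 20)` gives (2, 3) and (7, 11), matching the example above.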

Adjacent Primes Simulation



SQL Query Optimization and Performance Guide


Windows Settings

Storage on 2 Physical disks – standard solution

Storage on 3 Physical disks – extended solution

Storage on 4 physical disks – optimal solution

Storage on external hard disk or flash memory

Database Auto growth

Index Management

Stored procedures

Cursors

Query optimization

Scheduled Maintenance Plan

Check disk usage by top tables

Windows Settings

  • Adjust performance for background services
  • Configure the size of the Windows paging file (virtual memory) to be twice the size of physical memory
  • Turn off the system protection feature for all disks except C
  • Disable unneeded SQL services
  • Configure a weekly defragmentation schedule for all disks

Storage on 2 Physical disks – standard solution

  • Store log file on C
  • Store data file on the second disk

Storage on 3 Physical disks – extended solution

  • Store data file on second disk
  • Store log file on third disk
  • Store the Windows paging file (virtual memory) on C

Storage on 4 physical disks – optimal solution

  • Store windows paging file on C
  • Store primary data file on second disk
  • Store log file on third disk
  • Create Secondary data file on fourth disk to store indexes

    When creating indexes, select the secondary file group

Storage on external hard disk or flash memory

  • Not allowed
  • This will disable all optimizations performed by SQL Server
  • This will limit read/write speed to 25 Mbit/s for an external hard disk, and slower for flash memory

Database Auto growth

  • Do not set the database to grow by a percentage. Use a static value instead (100 MB is a good option)

Index Management

  • Create a non-clustered index only for tables whose rate of SELECT is much larger than their rate of INSERT and UPDATE
  • Do not create indexes for all table columns
  • For indexes created on base tables (for example Person, Admission, AdmissionRequestDetail), set the index fill factor to 70 instead of 80 (the default option). This will enhance INSERT performance.
  • Use the following procedure to analyze index fragmentation. You should reorganize indexes when the external fragmentation value for the index is between 10 and 15 and the internal fragmentation value is between 60 and 75. Otherwise, you should rebuild indexes.



    SELECT OBJECT_NAME(dt.object_id) AS TableName, si.name AS IndexName,
           dt.avg_fragmentation_in_percent AS ExternalFragmentation,
           dt.avg_page_space_used_in_percent AS InternalFragmentation
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'DETAILED') dt
    INNER JOIN sys.indexes si ON si.object_id = dt.object_id AND
        si.index_id = dt.index_id
    WHERE dt.index_id <> 0
        AND dt.avg_fragmentation_in_percent > 10
        AND dt.avg_page_space_used_in_percent < 75
    ORDER BY avg_fragmentation_in_percent



  • Create non-clustered indexes only for columns that appear in:
    • Where clause
    • Order By clause
    • Join clause
    • Distinct
    • All foreign keys
  • Avoid indexing small tables
  • Create an indexed view for columns used in the where clause of a LINQ query when the index is not created on the table, but do not create the index on both the table and the view.
  • Do not use the Year(), Month() or Day() functions on a column in the where clause, even if the column has an index. These and other similar functions disable the index and cause a full table scan, so change the query to make use of the index. For example, a filter such as Year(AdmissionDate) = 2015 can be rewritten as the range AdmissionDate >= '2015-01-01' AND AdmissionDate < '2016-01-01'.

  • Use the index usage report to check whether there are unused indexes on the table in order to delete them. If the number of seeks is 0, the index can be deleted.

Stored procedures

  • Do not name a procedure sp_something. The sp_ prefix is reserved for system procedures, so SQL Server first searches the master database, which causes a delay when the procedure executes.
  • Use SET NOCOUNT ON at the top of the procedure to avoid an additional round trip to the server
  • Try to avoid using EXEC with dynamic SQL statements


Cursors

  • Do not use a cursor to iterate over a set of records. Creating and using a cursor is very expensive and resource-consuming. Instead, use a while loop with a defined upper bound. (This option is not available for SQL 2000.) Example:





    declare @i int = 1
    declare @count int
    select @count = count(*) from SubDepartment

    while (@i <= @count)
    begin
        -- process the record at position @i here
        set @i = @i + 1
    end





Query optimization

  • Do not use OR in the where clause. Instead, use UNION ALL to implement the OR functionality, but only if there is an index on the column used in the OR operator. If there is no index, this optimization will slow down performance by doing multiple table scans; if there is an index, it will speed up the query. Example:

    -- using OR:
    select * from Person where person.Sex=0 or person.Sex=1

    -- using UNION ALL:
    select * from Person where person.Sex=0
    union all
    select * from Person where person.Sex=1


  • Avoid using inline queries; use joins instead
  • Avoid using inline functions in a query. Instead, create a precomputed column that stores the function's result for selection in the query. This may not be applicable in all cases.
  • Queries containing NOT IN perform poorly because the optimizer needs nested table scans to evaluate them. This can be avoided by using EXISTS or NOT EXISTS
  • When you have the choice between IN and BETWEEN in a query, always use BETWEEN. This will speed up the query
  • When the where clause contains multiple conditions separated by the AND operator, order them by selectivity: the condition that eliminates the most records should appear first, so the remaining conditions are evaluated on as few rows as possible.
  • For queries that require immediate feedback to the user, use the FAST n option to return the first n records immediately while the query continues to fetch the remaining records. This option can be used in procedures and views, but it does not apply to LINQ queries because .ToList() ignores it. Example:

    OPTION(FAST 100)

Scheduled Maintenance Plan

  • Use a daily maintenance plan to perform daily cleanup, update statistics, check integrity, rebuild indexes, reorganize indexes, shrink database files and perform a full backup.
  • Do not include system databases in the plan. Select the option "User databases only"
  • Make sure the SQL Server Agent startup option is set to Automatic
  • Schedule the plan to run daily at night; 1 AM is a good option.
  • The plan below is standard and can be used in most cases.

Check disk usage by top tables

  • Monitor disk usage by most used tables. These will be the target for potential indexes.
  • Use this approach to check if the application is performing repeated unnecessary reads

My presentation on Automatic Loop Parallelization for multicore systems

This is a PowerPoint presentation I made recently on automatic loop parallelization. It explains a new approach to performing automatic loop parallelization efficiently. Below is the download link and a summary of each slide. Enjoy!






With the advance of multi-core and many-core systems, developers can now make use of parallel programming, and this has made it essential for developing better programs.

To take advantage of multicore platforms, sequential programs are being replaced by parallel programs.

Developing parallel applications is difficult and prone to errors. Therefore, automatic parallelization becomes important.


Loops are the most important target for parallelization.

We chose loops because they are the most important factor that affects the performance of any program.

Most programs spend a lot of their running time iterating through one or more compute intensive loops.

According to the 90/10 rule, programs spend about 90% of their running time in roughly 10% of the code, and that 10% is typically a loop.

In this thesis we focus on Automatic Loop Parallelization.


Available loop parallelization techniques are not efficient because they have:

Long running times

Heavy communication among processors

Large memory usage


In this thesis, we propose a new loop parallelization algorithm that:

Extracts parallelism among different loop iterations and also inside statements of the same iteration.

And Incurs less processor, memory and time overheads.



For example, we can see from the two figures that parallel loops:

utilize the available architecture effectively by distributing the workload among the processors, and take less time to execute.


Our algorithm takes as an input:

A loop whose array references are unknown except at runtime.

And an access array that defines, for every loop iteration, the indices B(i) and C(i) of the array elements it accesses.

The output is a parallel schedule that defines the execution order of all the statements of the loop. Statements within the same set are run concurrently, whereas statements belonging to different sets are run one after the other.


In the course of loop execution, 3 types of dependencies can exist among the loop statements.

Read from slide


Read from slide



How to obtain a full parallel schedule of the loop?

Using dependency information, we build a list of sets containing independent items.

Independent items in each set run in parallel.


Many algorithms have been developed for parallelizing loops including:

The first one is


The Zhu and Yew algorithm was the first approach for parallelizing loops.

It consists of 2 phases:

The registration phase where a parallel schedule is constructed.

And the execution phase where the obtained schedule is executed.


The algorithm stores in memory:

For every array element, a key value.

And for every loop iteration, a D value that indicates whether the iteration has been assigned to a set in the schedule.


In the first step, the keys are initialized to infinity.

In the second step, we set each key to the number of the iteration where the element was first accessed.

In the final step, we check whether the key is equal to the iteration number and, if so, add the iteration to the set.

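The three steps above can be sketched in Python at iteration granularity (a simplified model: `access[i]` is an assumed input format listing the array elements iteration i touches; reads and writes are not distinguished, which matches the algorithm's limitation of not allowing concurrent reads):

```python
def zhu_yew_schedule(access):
    # access[i]: array elements touched by iteration i
    n = len(access)
    done = [False] * n          # the D values: has the iteration been scheduled?
    schedule = []
    while not all(done):
        # Registration: each element's key becomes the earliest pending
        # iteration that accesses it (keys conceptually start at infinity).
        key = {}
        for i in range(n):
            if not done[i]:
                for e in access[i]:
                    key.setdefault(e, i)
        # An iteration joins the current set only if it owns the key
        # of every element it touches.
        current = [i for i in range(n)
                   if not done[i] and all(key[e] == i for e in access[i])]
        for i in current:
            done[i] = True
        schedule.append(current)
    return schedule
```

Each pass over all iterations produces one set, which is why the algorithm can need many rounds and memory accesses, as noted below.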

We take as an example this access table.

In the first round of the algorithm,

the keys are initialized to infinity and the D values are initialized to 1.

We fill the keys of all the array elements.

After passing across all the loop iterations and comparing the keys to the iteration numbers, we add iterations 1, 3 and 7 to the first set.

We set the D values of iterations 1, 3 and 7 to 0.

This process is repeated until all iterations are assigned to sets.


The main advantage of the Zhu and Yew algorithm is that it is simple.

However, it performs many memory accesses and rounds.

It doesn't introduce parallelization within statements of the same iteration.

It doesn't allow concurrent reads.


The Midkiff and Padua algorithm is similar to the Zhu and Yew algorithm, with one improvement: it allows concurrent reads.

It consists of 2 phases: the registration phase and the execution phase.


The algorithm stores 2 keys for each array element:

The integer key stores the number of the iteration that performs the first write to the array element.

The bit key stores true if there are any reads before the first write of the array element.

The algorithm also stores a D array to identify the iterations that are done.


First, the integer keys are initialized to N+1.

Second, we set the integer key to the number of the iteration where the element is first written.

Third, we set the element's bit key to true if there are any read operations before the first write.

Finally, we check 4 conditions to determine whether the iteration is added to the set:

Read from slide


We take as an example this access table.

In the first round, we compute the integer and bit key values of all array elements. We then pass over all loop iterations and check whether any iteration meets the 4 conditions. Iterations 1, 4 and 7 are added to the first set, and we set their D values to 1, which means that they are done.

Repeating this process produces the final result sets.


The main advantage of this algorithm is that it allows concurrent reads. However, it performs many memory accesses and rounds. Also, it doesn’t allow for intra-parallelization.


Leung and Zahorjan algorithm introduced a new idea based on parallelizing the inspection phase.

The iterations are divided among the available processors and each processor constructs its own sub schedule for its assigned iterations.

Finally the sub schedules of all the processors are concatenated together to form the final schedule.


For example, assume we have 15 iterations in a loop to be divided among 3 processors.

Each processor is assigned 5 iterations.

The sub schedules which are constructed by each processor are grouped together to form the final schedule where the last set of the sub schedule of processor 1 is followed by the first set of the sub schedule constructed by processor 2 and the last set of the sub schedule of processor 2 is followed by the first set of the sub schedule constructed by processor 3.
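The concatenation step can be sketched in a few lines of Python (the sub-schedule contents are hypothetical; each processor contributes its list of iteration sets in processor order):

```python
def concat_subschedules(subschedules):
    # subschedules: for each processor (in order), its list of iteration sets
    final = []
    for sub in subschedules:
        final.extend(sub)       # sets are simply appended, never merged
    return final

# Hypothetical sub-schedules for the 15-iteration, 3-processor example:
p1 = [[1, 2, 4], [3, 5]]
p2 = [[6, 8], [7, 9, 10]]
p3 = [[11, 14], [12, 13, 15]]
```

Because sets from different processors are never merged, iterations placed on different processors end up in different final sets even when they are independent, which is the loss of parallelism discussed next.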


The algorithm is faster, but

it decreases the degree of parallelism: when multiple independent iterations are assigned to different processors, they are placed in different sets of the final parallel schedule instead of being placed together in the same set to run concurrently.



The Chen, Yew, and Torrellas algorithm builds a ticket table for every dependence chain.

The ticket table stores the sequence numbers for accessing the array elements in the correct order.

It consists of the inspection phase and the execution phase.


Assume we have this dependence chain for a specific array element.

The iterations are divided among the available processors, and each processor builds its own ticket table.

The first processor can mark its iterations correctly because it is the first processor: processor 1 can easily assign 0, 1 and 2, respectively.

The following processors can't mark their iterations correctly because they don't know their order within the preceding processors, so we set a '?' at the head of each of their chains.

The key value of each array element in the dependence chain following the head is set to the number of all previous elements.

The processors then communicate globally, each sending the last value of the dependence chain that it contains: processor 1 sends a 2 to processor 2, and processor 2 sends a ? to processor 3.


For each iteration, the element is added to the set if its key is equal to the ticket value; otherwise, the key is incremented by 1.




The main advantages of this algorithm are that it allows for intra-parallelization and that it performs the inspection phase in parallel. However, it incurs a huge communication overhead among the processors to maintain a correct ticket table. Also, it doesn't allow concurrent reads.


The Xu and Chaudhary algorithm is a timestamp algorithm for parallelizing loops. It allows for concurrent reads.

This algorithm assigns a 2-dimensional value A(i, j) to every array element, where i denotes the sequence number of the array element and j denotes the read-group size in the case of a read operation.


A(i,j) denotes the array element of the current iteration and A(m,n) refers to its direct predecessor.

L-closure and R-closure are used to indicate the beginning and the end of a read group.

The u value denotes an undefined value.

r denotes the processor number of the current iteration.

g(r) represents the total number of elements contained in all the previous processors.

h(r) denotes the number of references in the same block r.


If the current operation is the head of the dependence chain, then we set its timestamp value to 1 if it is on the first processor, or to u otherwise.

The following timestamp values are assigned based on the type of the current operation and of its previous operation: a read preceded by a write, a read preceded by a read, a write preceded by a read, or a write preceded by a write.


If we have this access table.

We take as an example the dependence chains of array elements 3 and 10.

After applying the timestamp rules we obtain this table.


In the second step of the algorithm we modify the timestamp values according to these rules and depending whether this operation is a read or write.


After applying the rules of the second step, we obtain the following table where we notice that every read operation has its j value equal to the read group size.


In the last step of the algorithm, and for each dependence chain, we assign the time values based on these rules: if we have a write, time is incremented by 1; if we have a read, time is incremented by 1/readgroupsize.


After applying the rules of the last step, we obtain the times shown here. The decimal values are then rounded up so that consecutive reads have the same time value and are thus run in parallel.


The main advantages of this algorithm are that it allows for concurrent reads, it performs the inspection phase in parallel, and it allows intra-parallelization. However, it incurs the largest CPU, memory and time overheads.


The Kao, Yang, and Tseng algorithm is a technique for parallelizing loops that builds a table recording, for each iteration, the iteration number where a given element is read and written.

For every loop iteration i:

The array element A(B(i)) is written at this iteration: W(A(B(i))) = i

The array element A(C(i)) is read at this iteration: R(A(C(i))) = i

The wavefront is WF(i) = WF(max[W(A(B(i))), R(A(B(i))), W(A(C(i)))]) + 1, using the W and R values recorded before iteration i.


We take this access table as an example.

In iteration 1, A(3) is written and A(1) is read, so we store 1 in w(3) and r(1). The wavefront is initialized to 1.

In iteration 2, all the values are copied forward and we set w(7) and r(3) to 2. The maximum of w(7), r(7) and w(3) is 1, so we set wf(2) = wf(1) + 1 = 2.

We follow the same steps until we obtain the final parallel schedule.
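The wavefront bookkeeping can be sketched in Python (a simplified model assuming one write B(i) and one read C(i) per iteration, with 1-based iteration numbers; wf(0) = 0 stands for "no dependency"):

```python
def kao_wavefront(B, C):
    # Iteration i writes A(B[i-1]) and reads A(C[i-1]).
    W, R = {}, {}        # element -> last iteration that wrote / read it
    wf = {0: 0}          # wavefront number per iteration
    for i in range(1, len(B) + 1):
        b, c = B[i - 1], C[i - 1]
        # The dependency is the latest earlier iteration that wrote or read
        # the element we write, or wrote the element we read (read-after-read
        # is excluded, which is what allows concurrent reads).
        dep = max(W.get(b, 0), R.get(b, 0), W.get(c, 0))
        wf[i] = wf[dep] + 1
        W[b], R[c] = i, i
    del wf[0]
    return wf
```

For the example above (iteration 1 writes A(3) and reads A(1), iteration 2 writes A(7) and reads A(3)), the sketch reproduces wf(1) = 1 and wf(2) = 2.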




This algorithm allows concurrent reads, but it incurs a memory overhead and doesn't allow intra-parallelization.


We propose a new approach for parallelizing loops based on extracting independent sets.

Similar to the algorithms in the related work section, our proposed approach consists of 2 phases:

The inspector phase, where a parallel schedule is constructed.

The executor phase, where the parallel schedule is executed.


The key idea behind our algorithm is that if statement S depends on previous statements, then S is not allowed to proceed until the previous statements are executed.

And to make sure that S is executed after the statements that it depends on:

S is assigned a larger sequence number.


The algorithm takes as an input a loop over an array whose references are indexed data determined only at runtime. The access array defines the accessed array elements for every loop iteration.

The algorithm produces as an output a parallel schedule that defines the execution order of all the loop statements for all the iterations.


The inspection phase consists of 2 steps:

Sequence numbering: every statement is assigned the sequence number of the statement it depends on, incremented by 1, except in the case of consecutive reads, which are assigned the same sequence number in order to allow concurrent reads.

Dependency storage: the index of the current statement and the type of the operation are stored in the dependency table entry for the array element it uses, in order to keep track of dependencies.


In step 1 of the inspection phase we use a sequence array for ordering the execution of the loop statements.

Assuming we have a loop of size N and the loop body contains 2 statements (R & W), we create a sequence array having a size of 2N.


If a statement S1 is dependent on another statement S2 then S1 is not allowed to execute before S2 terminates.

S1 is assigned a sequence number greater than S2 by 1.

For example, if statement of index 9 is dependent on statement of index 2 then S(9)=S(2) + 1


In step 2 of the inspection phase a dependency array is used to identify the dependent statements of a given loop.

Assuming we have an input array of size M, we create a new array for storing dependencies having a size of 2M.

I stores the index of the last statement that used the array element.

T is used to store the type of the last operation that used the array element if it is a Read or Write.


The algorithm of the inspection phase proceeds as follows:

We loop through all the elements of the access table AT where each element is identified by its index k.

Read from slide.


We take as an example this access table.

The second table shows the index of each cell of the access table.

For simplicity we take the dependence chain of array element 10.


The first element of the dependence chain of A(10), whose index is 4, has no previous dependencies, so it is assigned a sequence number of 1, and the current index and operation type are stored in the I and T values of its dependency entry.

Each subsequent statement is assigned the sequence number of its previous statement incremented by 1, except in the case of consecutive reads:

S(9) = S(4) + 1

S(14) = S(18) = S(20) = S(12) (concurrent reads)




The final sequence number is shown and the statements are distributed among the sets in an increasing order of sequence number. Statements having the same sequence number are placed in the same set.
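The sequence-numbering rules above can be sketched in Python (a simplified sketch assuming the loop statements are flattened into one list of (element, operation) pairs; the input format and names are illustrative):

```python
def build_schedule(ops):
    # ops[k]: (array element, 'R' or 'W') for the k-th loop statement
    last = {}            # dependency table: element -> (op type, sequence number)
    seq = [0] * len(ops)
    for k, (elem, op) in enumerate(ops):
        if elem not in last:
            seq[k] = 1                     # head of its dependence chain
        else:
            prev_op, prev_seq = last[elem]
            if op == 'R' and prev_op == 'R':
                seq[k] = prev_seq          # consecutive reads share a set
            else:
                seq[k] = prev_seq + 1      # wait for the statement we depend on
        last[elem] = (op, seq[k])
    # Statements with the same sequence number form one independent set.
    sets = {}
    for k, s in enumerate(seq):
        sets.setdefault(s, []).append(k)
    return [sets[s] for s in sorted(sets)]
```

A single pass over the access table suffices, which is the source of the time savings reported in the evaluation.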


In the execution phase of the algorithm, items of the same set are run concurrently; when one set finishes its execution, the next set is allowed to execute. This continues until all of the sets have been executed.
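The execution phase can be sketched with a thread pool and a barrier between sets (the `record` driver below is an illustrative stand-in for the real loop-statement bodies):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def execute_schedule(schedule, run_statement):
    # Run every statement of a set concurrently; a set starts only after
    # the previous set has completely finished (list() waits for all results).
    with ThreadPoolExecutor() as pool:
        for s in schedule:
            list(pool.map(run_statement, s))

# Illustrative driver: record the order in which statements execute.
executed = []
_lock = threading.Lock()

def record(stmt):
    with _lock:
        executed.append(stmt)

execute_schedule([[0, 3], [1, 2], [4]], record)
```

Within a set the order is arbitrary, but every statement of one set completes before the next set begins.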


We implemented all of the algorithms discussed in the related work section using

We used 3 loops found in the Perfect Club Benchmark for testing.

The experiments were intended to show the difference in performance between the proposed approach and the state-of-the-art algorithms in terms of:

Time taken to build the parallel schedule

CPU Consumption

Memory Usage

Thread Utilization


The Perfect Club Benchmark's name is an acronym: PERFECT stands for PERFormance Evaluation by Cost-effective Transformations.

It was created at the Center for Supercomputing Research and Development (CSRD) at the University of Illinois at Urbana-Champaign in order to focus performance evaluation at the applications level.


The three Loops were used by the other algorithms for testing.

The first loop is a nested loop of size 250 by 250 and its array size is 1000.

The second and third loops are single loops of size 8000 and of array size 24000.


All of the algorithms were implemented and executed on the same PC, which has an Intel Core i7 processor running at 2.3 GHz, 8 GB of memory and a 64-bit operating system.


We implemented our approach using the language. We added 3 classes:

AccessList class for storing the access table, LastOperationType class for storing the dependency table, and parallellist class that corresponds to each set of the final schedule containing the independent items.


We filled the access table at runtime.


The build schedule algorithm is shown here.


We measured the time, CPU, memory and thread consumption for building the parallel schedule with our proposed algorithm and with each algorithm discussed in the related work part.


The time plot graph shows that our algorithm takes the least time for building the schedule.


The CPU plot graph shows that our algorithm, the Midkiff and Padua algorithm, and the Zhu and Yew algorithm incur the least CPU overheads.


The memory plot graph shows that our algorithm, along with the Chen, Yew, and Torrellas algorithm, incurs the least memory overheads.


Threads are best utilized by our proposed algorithm because the number of hardware threads is equal to the number of software threads so there is no context switching.




Compared with the other algorithms, our proposed algorithm is the fastest in building the parallel schedule, it incurs the least CPU and memory overheads, and it is the best at utilizing the threads.


The limitation of our proposed approach is that the loop body uses only one array, similar to the algorithms discussed in the related work part, which also use only one array in the loop body.

In the case of many arrays that exhibit cross dependencies among each other, more logic needs to be added to the algorithm.


In conclusion

We described a tool that performs automatic loop parallelization.

We evaluated the effectiveness of our tool by testing it against 6 algorithms, using 3 loops obtained from the Perfect Club Benchmark.

Compared to state-of-the-art loop parallelization techniques, the proposed algorithm achieved a 980% (9.8X) speedup and a 10% reduction in memory usage.


The proposed algorithm has four main advantages:

More parallelism by allowing the statements inside an iteration to be executed in parallel.

Less time overhead for finding independent sets.

Less processor overhead.

Less memory overhead.


As part of our future work we intend to:

Conduct an empirical evaluation of our implementation against more algorithms.

Use arrays that exhibit cross dependencies.

Use loops from other benchmarks.

Provide implementation of the proposed approach in other languages.