March 27, 2011 1 Comment
- Working With Shared-Memory Multicore.
- Shared-Memory and Distributed-Memory Systems.
- Parallel Programming and Multicore Programming.
- Hardware Threads and Software Threads.
- Amdahl’s Law.
- Gustafson’s Law.
- Working with Lightweight Concurrency.
- Creating Successful Task-Based Designs.
- Designing With Concurrency in Mind.
- Interleaved Concurrency, Concurrency, and Parallelism.
- Minimizing Critical Sections.
Working With Shared-Memory Multicore
Most machines today have at least a dual-core processor. However, quad–core and octal-core processors, with four and eight cores, respectively, are quite popular on servers, advanced workstations, and even on high-end mobile computers. Modern processors offer new multicore architectures. Thus, it is very important to prepare the software designs and the code to exploit these architectures. The different kinds of applications generated with C# 2010 and .NET Framework 4 run on one or many CPUs. Each of these processors can have a different number of cores, capable of executing multiple instructions at the same time.
Multicore processor can be simply described as many interconnected processors in a single package. All the cores have access to the main memory, as illustrated in figure below. Thus, this architecture is known as shared–memory multicore. Sharing memory in this way can easily lead to a performance bottleneck.
Multicore processors have many different complex architectures, designed to offer more parallel-execution capabilities, improve overall throughput, and reduce potential bottlenecks. At the same time, multicore processors try to reduce power consumption and generate less heat. Therefore, many modern processors can increase or reduce the frequency for each core according to their workload, and they can even sleep cores when they are not in use. Windows 7 and Windows Server 2008 R2 support a new feature called Core Parking. When many cores aren’t in use and this feature is active, these operating systems put the remaining cores to sleep. When these cores are necessary, the operating systems wake the sleeping cores.
Modern processors work with dynamic frequencies for each of their cores. Because the cores don’t work with a fixed frequency, it is difficult to predict the performance for a sequence of instructions.
For example, Intel Turbo Boost Technology increases the frequency of the active cores. The process of increasing the frequency for a core is also known as overclocking.
If a single core is under a heavy workload, this technology will allow it to run at higher frequencies when the other cores are idle. If many cores are under heavy workloads, they will run at higher frequencies but not as high as the one achieved by the single core. The processor cannot keep all the cores overclocked for a long time, because it consumes more power and its temperature increases faster. The average clock frequency for all the cores under heavy workloads is going to be lower than the one achieved for the single core. Therefore, under certain situations, some code can run at higher frequencies than other code, which can make measuring real performance gains a challenge.
Shared-Memory and Distributed-Memory Systems
Distributed-memory computer systems are composed of many processors with their own private memory, as illustrated in the below figure. Each processor can be in a different computer, with different types of communication channels between them. Examples of communication channels are wired and wireless networks. If a job running in one of the processors requires remote data, it has to communicate with the corresponding remote microprocessor through the communication channel. One of the most popular communications protocols used to program parallel applications to run on distributed-memory computer systems is Message Passing Interface (MPI). It is possible to use MPI to take advantage of shared-memory multicore with C# and .NET Framework. However, MPI’s main focus is to help developing applications run on clusters. Thus, it adds a big overhead that isn’t necessary in shared-memory multicore, where all the cores can access the memory without the need to send messages.
The figure below shows a distributed-memory computer system with three machines. Each machine has a quad-core processor, and shared-memory architecture for these cores. This way, the private memory for each microprocessor acts as a shared memory for its four cores. A distributed-memory system forces you to think about the distribution of the data, because each message to retrieve remote data can introduce an important latency. Because you can add new machines (nodes) to increase the number of processors for the system, distributed-memory systems can offer great scalability.
Parallel Programming and Multicore Programming
Traditional sequential code, where instructions run one after the other, doesn’t take advantage of multiple cores because the serial instructions run on only one of the available cores. Sequential code written with C# or VB 2010 won’t take advantage of multiple cores if it doesn’t use the new features offered by .NET Framework 4 to split the work into many cores. There isn’t an automatic parallelization of existing sequential code.
Parallel programming is a form of programming in which the code takes advantage of the parallel execution possibilities offered by the underlying hardware. Parallel programming runs many instructions at the same time.
Multicore programming is a form of programming in which the code takes advantage of the multiple execution cores to run many instructions in parallel. Multicore and multiprocessor computers offer more than one processing core in a single machine. Hence, the goal is to “do more with less” meaning that the goal is to do more work in less time by distributing the work to be done in the available cores.
Modern microprocessors can also execute the same instruction on multiple data, a technique known as Single Instruction, Multiple Data or SIMD. This way, you can take advantage of these vector processors to reduce the time needed to execute certain algorithms.
Hardware Threads and Software Threads
A multicore processor has more than one physical core. A physical core is a real independent processing unit that makes it possible to run multiple instructions at the same time, in parallel. In order to take advantage of multiple physical cores, it is necessary to run many processes or to run more than one thread in a single process, creating multithreaded code. However, each physical core can offer more than one hardware thread, also known as a logical core or logical processor. Microprocessors with Intel Hyper-Threading Technology (HT or HTT) offer many architectural states per physical core. For example, many processors with four physical cores with HT duplicate the architectural states per physical core and offer eight hardware threads. This technique is known as simultaneous multithreading (SMT) and it uses the additional architectural states to optimize and increase the parallel execution at the microprocessor’s instruction level. SMT isn’t restricted to just two hardware threads per physical core; for example, you could have four hardware threads per core. This doesn’t mean that each hardware thread represents a physical core. SMT can offer performance improvements for multithreaded code under certain scenarios.
Each running program in Windows is a process. Each process creates and runs one or more threads, known as software threads to differentiate them from the previously explained hardware threads.
A process has at least one thread, the main thread. An operating system scheduler shares out the available processing resources fairly between all the processes and threads it has to run. Windows scheduler assigns processing time to each software thread. When Windows scheduler runs on a multicore processor, it has to assign time from a hardware thread, supported by a physical core, to each software thread that needs to run instructions. As an analogy, you can think of each hardware thread as a swim lane and a software thread as a swimmer.
Windows recognizes each hardware thread as a schedulable logical processor. Each logical processor can run code for a software thread. A process that runs code in multiple software threads can take advantage of hardware threads and physical cores to run instructions in parallel. The figure below shows software threads running on hardware threads and on physical cores. Windows scheduler can decide to reassign one software thread to another hardware thread to load-balance the work done by each hardware thread.
Because there are usually many other software threads waiting for processing time, load balancing will make it possible for these other threads to run their instructions by organizing the available resources. The figure below shows Windows Task Manager displaying eight hardware threads (logical cores and their workloads). Load balancing refers to the practice of distributing work from software threads among hardware threads so that the workload is fairly shared across all the hardware threads. However, achieving perfect load balance depends on the parallelism within the application, the workload, the number of software threads, the available hardware threads, and the load-balancing policy.
Windows runs hundreds of software threads by assigning chunks of processing time to each available hardware thread. You can use Windows Resource Monitor to view the number of software threads for a specific process in the Overview tab. The CPU panel displays the image name for each process and the number of associated software threads in the Threads column, as shown in the figure below where the vlc.exe process has 32 software threads.
Core Parking is a Windows kernel power manager and kernel scheduler technology designed to improve the energy efficiency of multicore systems. It constantly tracks the relative workloads of every hardware thread relative to all the others and can decide to put some of them into sleep mode. Core Parking dynamically scales the number of hardware threads that are in use based on workload. When the workload for one of the hardware threads is lower than a certain threshold value, the Core Parking algorithm will try to reduce the number of hardware threads that are in use by parking some of the hardware threads in the system. In order to make this algorithm efficient, the kernel scheduler gives preference to unparked hardware threads when it schedules software threads. The kernel scheduler will try to let the parked hardware threads become idle, and this will allow them to transition into a lower-power idle state.
Core Parking tries to intelligently schedule work between threads that are running on multiple hardware threads in the same physical core on systems with processors that include HT. This scheduling decision decreases power consumption. Windows Server 2008 R2 supports the complete Core Parking technology. However, Windows 7 also uses the Core Parking algorithm and infrastructure to balance processor performance between hardware threads with processors that include HT. The figure below shows Windows Resource Monitor displaying the activity of eight hardware threads, with four of them parked.
Regardless of the number of parked hardware threads, the number of hardware threads returned by
.NET Framework 4 functions will be the total number, not just the unparked ones. Core Parking technology doesn’t limit the number of hardware threads available to run software threads in a process. Under certain workloads, a system with eight hardware threads can turn itself into a system with two hardware threads when it is under a light workload, and then increase and spin up reserve hardware threads as needed. In some cases, Core Parking can introduce an additional latency to schedule many software threads that try to run code in parallel. Therefore, it is very important to consider the resultant latency when measuring the parallel performance.
If you want to take advantage of multiple cores to run more instructions in less time, it is necessary to split the code in parallel sequences. However, most algorithms need to run some sequential code to coordinate the parallel execution. For example, it is necessary to start many pieces in parallel and then collect their results. The code that splits the work in parallel and collects the results could be sequential code that doesn’t take advantage of parallelism. If you concatenate many algorithms like this, the overall percentage of sequential code could increase and the performance benefits achieved may decrease. Gene Amdahl, a renowned computer architect, made observations regarding the maximum performance improvement that can be expected from a computer system when only a fraction of the system is improved. He used these observations to define Amdahl’s Law, which consists of the following formula that tries to predict the theoretical maximum performance improvement (known as speedup) using multiple processors. It can also be applied with parallelized algorithms that are going to run with multicore microprocessors.
Maximum speedup (in times) = 1 / ((1 – P) + (P/N))
- P is the portion of the code that runs completely in parallel.
- N is the number of available execution units (processors or physical cores).
According to this formula, if you have an algorithm in which only 50 percent (P = 0.50) of its total work is executed in parallel, the maximum speedup will be 1.33x on a microprocessor with two physical cores. The figure below illustrates an algorithm with 1,000 units of work split into 500 units of sequential work and 500 units of parallelized work. If the sequential version takes 1,000 seconds to complete, the new version with some parallelized code will take no less than 750 seconds.
Maximum speedup (in times) = 1 / ((1 – 0.50) + (0.50 / 2)) = 1.33x
The maximum speedup for the same algorithm on a microprocessor with eight physical cores will be a really modest 1.77x. Therefore, the additional physical cores will make the code take no less than 562.5 seconds.
Maximum speedup (in times) = 1 / ((1 – 0.50) + (0.50 / 8)) = 1.77x
The figure below shows the maximum speedup for the algorithm according to the number of physical cores, from 1 to 16. As we can see, the speedup isn’t linear, and it wastes processing power as the number of cores increases.
The figure below shows the same information using a new version of the algorithm in which 90 percent (P = 0.90) of its total work is executed in parallel. In fact, 90 percent of parallelism is a great achievement, but it results in a 6.40x speedup on a microprocessor with 16 physical cores.
Maximum speedup (in times) = 1 / ((1 – 0.90) + (0.90 / 16)) = 6.40x
John Gustafson noticed that Amdahl’s Law viewed the algorithms as fixed, while considering the changes in the hardware that runs them. Thus, he suggested a reevaluation of this law in 1988. He considers that speedup should be measured by scaling the problem to the number of processors and not by fixing the problem size. When the parallel-processing possibilities offered by the hardware increase, the problem workload scales. Gustafson’s Law provides the following formula with the focus on the problem size to measure the amount of work that can be performed in a fixed time:
Total work (in units) = S + (N × P)
- S represents the units of work that run with a sequential execution.
- P is the size of each unit of work that runs completely in parallel.
- N is the number of available execution units (processors or physical cores).
You can consider a problem composed of 50 units of work with a sequential execution. The problem can also schedule parallel work in 50 units of work for each available core. If you have a processor with two physical cores, the maximum amount of work is going to be 150 units.
Total work (in units) = 50 + (2 × 50) = 150 units of work
The figure below illustrates an algorithm with 50 units of work with a sequential execution and a parallelized section. The latter scales according to the number of physical cores. This way, the parallelized section can process scalable, parallelizable 50 units of work. The workload in the parallelized section increases when more cores are available. The algorithm can process more data in less time if there are enough additional units of work to process in the parallelized section. The same algorithm can run on a processor with eight physical cores. In this case, it will be capable of processing 450 units of work in the same amount of time required for the previous case:
Total work (in units) = 50 + (8 × 50) = 450 units of work
The figure below shows the speedup for the algorithm according to the number of physical cores, from 1 to 16. This speedup is possible provided there are enough units of work to process in parallel when the number of cores increases. As you can see, the speedup is better than the results offered by applying Amdahl’s Law.
The figure below shows the total amount of work according to the number of available physical cores, from 1 to 32.
The figure below illustrates many algorithms composed of several units of work with a sequential execution and parallelized sections. The parallelized sections scale as the number of available cores increases. The impact of the sequential sections decreases as more scalable parallelized sections run units of work. In this case, it is necessary to calculate the total units of work for both the sequential and parallelized sections and then apply them to the formula to find out the total work with eight physical cores:
Total sequential work (in units) = 25 + 150 + 100 + 150 = 425 units of work
Total parallel unit of work (in units) = 50 + 200 + 300 = 550 units of work
Total work (in units) = 425 + (8 × 550) = 4,825 units of work
A sequential execution would be capable of executing only 975 units of work in the same amount of time:
Total work with a sequential execution (in units) =
25 + 50 + 150 + 200 + 100 + 300 + 150 = 975 units of work
Working with Lightweight Concurrency
Unfortunately, neither Amdahl’s Law nor Gustafson’s Law takes into account the overhead introduced by parallelism. Nor do they consider the existence of patterns that allow the transformation of sequential parts into new algorithms that can take advantage of parallelism. It is very important to reduce the sequential code that has to run in applications to improve the usage of the parallel execution units.
In previous .NET Framework versions, if you wanted to run code in parallel in a C# application you had to create and manage multiple threads (software threads). Therefore, you had to write complex multithreaded code. Splitting algorithms into multiple threads, coordinating the different units of code, sharing information between them, and collecting the results are indeed complex programming jobs. As the number of logical cores increases, it becomes even more complex, because you need more threads to achieve better scalability. The multithreading model wasn’t designed to help developers tackle the multicore revolution. In fact, creating a new thread requires a lot of processor instructions and can introduce a lot of overhead for each algorithm that has to be split into parallelized threads. Many of the most useful structures and classes were not designed to be accessed by different threads, and, therefore, a lot of code had to be added to make this possible. This additional code distracts the developer from the main goal: achieving a performance improvement through parallel execution.
Because this multithreading model is too complex to handle the multicore revolution, it is known as heavyweight concurrency. It adds an important overhead. It requires adding too many lines of code to handle potential problems because of its lack of support of multithreaded access at the framework level, and it makes the code complex to understand.
The aforementioned problems associated with the multithreading model offered by previous .NET
Framework versions and the increasing number of logical cores offered in modern processors motivated the creation of new models to allow creating parallelized sections of code. The new model is known as lightweight concurrency, because it reduces the overall overhead needed to create and execute code in different logical cores. It doesn’t mean that it eliminates the overhead introduced by parallelism, but the model is prepared to work with modern multicore microprocessors. The heavyweight concurrency model was born in the multiprocessor era, when a computer could have many physical processors with one physical core in each. The lightweight concurrency model takes into account the new micro architectures in which many logical cores are supported by some physical cores. The lightweight concurrency model is not just about scheduling work in different logical cores. It also adds support of multithreaded access at the framework level, and it makes the code much simpler to understand. Most modern programming languages are moving to the lightweight concurrency model. Luckily, .NET Framework 4 is part of this transition. Thus, all the managed languages that can generate .NET applications can take advantage of the new model.
Creating Successful Task-Based Designs
Sometimes, you have to optimize an existing solution to take advantage of parallelism. In these cases, you have to understand an existing sequential design or a parallelized algorithm that offers a reduced scalability, and then you have to refactor it to achieve a performance improvement without introducing problems or generating different results. You can take a small part or the whole problem and create a task–based design, and then you can introduce parallelism. The same technique can be applied when you have to design a new solution. You can create successful task-based designs by following these steps:
- Split each problem into many subproblems and forget about sequential execution.
- Think about each subproblem as any of the following:
- Data that can be processed in parallel — Decompose data to achieve parallelism.
- Data flows that require many tasks and that could be processed with some kind of complex parallelism — Decompose data and tasks to achieve parallelism.
- Tasks that can run in parallel — decompose tasks to achieve parallelism.
- Organize your design to express parallelism.
- Determine the need for tasks to chain the different subproblems. Try to avoid dependencies as much as possible (minimizes locks).
- Design with concurrency and potential parallelism in mind.
- Analyze the execution plan for the parallelized problem considering current multicore microprocessors and future architectures. Prepare your design for higher scalability.
- Minimize critical sections as much as possible.
- Implement parallelism using task-based programming whenever possible.
- Tune and iterate.
The aforementioned steps don’t mean that all the subproblems are going to be parallelized tasks running in different threads. The design has to consider the possibility of parallelism and then, when it is time to code, you can decide the best option according to the performance and scalability goals. It is very important to think in parallel and split the work to be done into tasks. This way, you will be able to parallelize your code as needed. If you have a design prepared for a classic sequential execution, it is going to take a great effort to parallelize it by using task-based programming techniques.
Designing With Concurrency in Mind
When you design code to take advantage of multiple cores, it is very important to stop thinking that the code inside a C# application is running alone. C# is prepared for concurrent code, meaning that many pieces of code can run inside the same process simultaneously or with an interleaved execution. The same class method can be executed in concurrent code. If this method saves a state in a static variable and then uses this saved state later, many concurrent executions could yield unexpected and unpredictable results.
As previously explained, parallel programming for multicore microprocessors works with the shared-memory model. The data resides in the same shared memory, which could lead to unexpected results if the design doesn’t consider concurrency. It is a good practice to prepare each class and method to be able to run concurrently, without side effects. If you have classes, methods, or components that weren’t designed with concurrency in mind, you would have to test their designs before using them in parallelized code.
Each subproblem detected in the design process should be capable of running while the other subproblems are being executed concurrently. If you think that it is necessary to restrict concurrent code when a certain subproblem runs because it uses legacy classes, methods, or components, it should be made clear in the design documents. Once you begin working with parallelized code, it is very easy to incorporate other existing classes, methods, and components that create undesired side effects because they weren’t designed for concurrent execution.
Interleaved Concurrency, Concurrency, and Parallelism
The figure below illustrates the differences between interleaved concurrency and concurrency when there are two software threads and each one executes four instructions. The interleaved concurrency scenario executes one instruction for each thread, interleaving them, but the concurrency scenario runs two instructions in parallel, at the same time. The design has to be prepared for both scenarios.
Concurrency requires physically simultaneous processing to happen.
Parallelized code can run in many different concurrency and interleaved concurrency scenarios, even when it is executed in the same hardware configuration. Thus, one of the great challenges of a parallel design is to make sure that its execution with different possible valid orders and interleaves will lead to the correct result, otherwise known as correctness. If you need a specific order or certain parts of the code don’t have to run together, it is necessary to make sure that these parts don’t run concurrently. You cannot assume that they don’t run concurrently because you run it many times and it produces the expected results. When you design for concurrency and parallelism, you have to make sure that you consider correctness.
Minimizing Critical Sections
Both Amdahl’s Law and Gustafson’s Law recognized sequential work as an enemy of the overall performance in parallelized algorithms. The serial time between two parallelized sections that needs a sequential execution is known as a critical section. The figure below identifies four critical sections in one of the diagrams used to analyze Gustafson’s Law.
When you parallelize tasks, one of the most important goals in order to achieve the best performance is to minimize these critical sections. Most of the time, it is impossible to avoid some code that has to run with a sequential execution between two parallelized sections, because it is necessary to launch the parallel jobs and to collect results. However, optimizing the code in the critical sections and removing the unnecessary ones is even more important than the proper tuning of parallelized code.
When you face an execution plan with too many critical sections, remember Amdahl’s Law. If you cannot reduce them, try to find tasks that could run in parallel with the critical sections. For example, you can pre-fetch data that is going to be consumed by the next parallelized algorithm in parallel with a critical section to improve the overall performance offered by the solution. It is very important that you consider the capabilities offered by modern multicore hardware to avoid thinking you have just one single execution unit.