No matter how well any piece of code performs, the architecture of the server it runs on affects its performance profile. Ideally the impact is minimal, but for a threaded or latency-sensitive application that is unlikely unless the architecture has been taken into account.
Unfortunately, and especially with Intel’s most recent multi-core processors, processor architecture needs to be taken into account to get the best performance from an application. If the application being deployed requires two cores, or is expected to perform better across two cores, the choice of cores can affect its performance.
This is because of the cache layout of the latest Intel processors. On the Intel Clovertown and Harpertown Xeon processors, the L2 cache is not shared across all cores. A single processor has four L1 caches, one per core, and two L2 caches, each shared between a pair of cores. To add a further level of complexity, which cores are paired on each L2 cache also differs between architectures.
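On Linux, the pairing can usually be read from sysfs rather than looked up per processor model. The following is a minimal sketch in C, assuming the kernel exposes the cache directories under /sys/devices/system/cpu; the helper names read_first_line and shared_cache_cpus are illustrative and not part of any library. Which cache index corresponds to L2 varies between systems, so the sketch reads the level file rather than assuming a fixed index.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read the first line of a small text file into buf, without the newline.
 * Returns 0 on success, -1 if the file cannot be opened or is empty. */
int read_first_line(const char *path, char *buf, size_t len)
{
    FILE *f;

    assert(buf != NULL && len > 0);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(buf, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    buf[strcspn(buf, "\n")] = '\0';
    return 0;
}

/* For one CPU and one sysfs cache index, fetch the cache level ("1", "2", ...)
 * and the list of CPUs sharing that cache (for example "0,2").
 * Returns 0 on success, -1 if either sysfs file is unavailable. */
int shared_cache_cpus(int cpu, int index,
                      char *level, size_t level_len,
                      char *cpus, size_t cpus_len)
{
    char path[128];

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cache/index%d/level", cpu, index);
    if (read_first_line(path, level, level_len) != 0)
        return -1;

    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/cache/index%d/shared_cpu_list",
             cpu, index);
    return read_first_line(path, cpus, cpus_len);
}
```

Calling shared_cache_cpus for each index of a CPU until it returns -1, and keeping the entry whose level is "2", gives the CPUs that share an L2 cache with that CPU.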
There are two types of CPU affinity. The first, soft affinity (also called natural affinity), is the tendency of the scheduler to try to keep processes on the same CPU as long as possible. It is merely an attempt; if it is ever infeasible, the process is migrated to another processor. The O(1) scheduler in 2.6 exhibits excellent natural affinity. At the opposite end is the 2.4 scheduler, which has poor CPU affinity. This behavior results in the ping-pong effect: the scheduler bounces processes between multiple processors each time they are scheduled and rescheduled. It should be noted that Red Hat back-ported the O(1) scheduler into RHEL3 (in addition to many other changes) and that the 2.4 kernel in that release is really a mix of 2.4, late 2.5 and early 2.6 kernel sources.
Hard affinity, on the other hand, is what the CPU affinity system call provides. It is a requirement, and processes must adhere to a specified hard affinity. If a process is bound to CPU 1, for example, then it can run only on CPU 1.
The first benefit of CPU affinity is optimizing cache performance. The scheduler tries hard to keep tasks on the same processor, but in some performance-critical situations, for example a highly threaded application, it makes sense to enforce the affinity as a hard requirement. Multiprocessing computers try to keep the processor caches valid. In simplified terms, a modified line of data can be valid in only one processor’s cache at a time; otherwise, the caches would fall out of sync. Consequently, whenever a processor modifies a line of data in its local cache, all the other processors in the system also caching it must invalidate their copies, and this invalidation is costly. The real performance penalty, however, comes when processes bounce between processors: they constantly cause cache invalidations, and the data they want is never in the cache when they need it. Thus, cache miss rates grow very large. CPU affinity protects against this and improves cache performance.
A second benefit of CPU affinity arises when multiple threads access the same data: it can make sense to bind them all to the same processor. Doing so ensures that the threads do not contend over the data across processors and cause cache misses. This does diminish the performance gained from multithreading on SMP; if the threads are inherently serialized, however, the improved cache hit rate can offset the loss.
The third benefit is found in real-time or otherwise time-sensitive applications. In this approach, all the system processes are bound to a subset of the processors on the system, and the application is bound to the remaining processors. For example, in a dual-processor system, the application would be bound to one processor and all other processes to the other. This ensures that the application receives the full attention of its processor.
There are two ways to implement CPU affinity: within the source code of the application itself, using the sched_setaffinity system call, or from the command line, using the taskset tool.
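As an illustration of the in-source approach, the following is a minimal sketch in C, assuming Linux with glibc, where sched_setaffinity, sched_getaffinity and the CPU_* macros are available once _GNU_SOURCE is defined. The function names pin_self_to_cpu, first_allowed_cpu and allowed_cpu_count are hypothetical helpers, not part of any API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stdio.h>

/* Bind the calling process to a single CPU (zero-based CPU number).
 * A pid of 0 in sched_setaffinity means "the caller".
 * Returns 0 on success, -1 on failure with errno set. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;

    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Return the lowest-numbered CPU the caller may run on, or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            return cpu;
    }
    return -1;
}

/* Return how many CPUs are in the caller's affinity mask, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;

    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

An application would typically call pin_self_to_cpu early in startup, before creating threads, and check the return value; reading the mask back with sched_getaffinity confirms that the binding took effect.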
Under Linux it is straightforward to bind an application to one or more cores with the taskset command. Once you know the processor type you are using, and therefore the allocation you require, taskset can be used either to start the application bound to the correct cores or to rebind an already running application. For example:
taskset -c 2,6 <application>
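The above taskset command is for a Harpertown-based system and binds the application to the third and seventh cores (taskset numbers CPUs from 0, so each CPU number is the core number minus 1). Any user can set the affinity of their own processes; changing the affinity of a process that belongs to another user requires root (more precisely, the CAP_SYS_NICE capability).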
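When run without -c, taskset takes a hexadecimal bitmask instead of a CPU list, with bit 0 standing for CPU 0. The off-by-one between core numbers counted from one and the bits of that mask can be worked through with a short sketch in C; mask_from_cores is a hypothetical helper written for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Build an affinity bitmask from core numbers counted from 1.
 * Core n maps to zero-based CPU n-1, which is bit n-1 of the mask. */
unsigned long mask_from_cores(const int *cores, int n)
{
    unsigned long mask = 0;

    for (int i = 0; i < n; i++) {
        /* Each core number must fit in the mask. */
        assert(cores[i] >= 1 && cores[i] <= (int)(8 * sizeof mask));
        mask |= 1UL << (cores[i] - 1);
    }
    return mask;
}
```

For the third and seventh cores this gives bits 2 and 6, that is 4 + 64 = 0x44, so taskset 0x44 <application> requests the same binding as the CPU list 2,6.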
taskset can also be run on an existing application to change its processor binding(s) if required as follows:
taskset -c -p 2,6 <pid>
To verify that a taskset binding has worked, or to check the binding of an already running application, run the following as any user:
taskset -c -p <pid>
This will return the core(s) being used by the process.
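The same information is also published by the kernel in the Cpus_allowed_list field of /proc/<pid>/status, which is convenient when the check has to be made from inside a program. The following is a minimal sketch in C, assuming a Linux /proc filesystem; status_field is a hypothetical helper that takes the path explicitly so that it can be pointed at any status file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the line starting with key (for example "Cpus_allowed_list:") in a
 * /proc/<pid>/status style file and copy its value into out, without the
 * leading whitespace or trailing newline.
 * Returns 0 if the field was found, -1 otherwise. */
int status_field(const char *path, const char *key, char *out, size_t len)
{
    char line[512];
    size_t key_len = strlen(key);
    FILE *f;

    assert(out != NULL && len > 0);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;

    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, key, key_len) == 0) {
            const char *val = line + key_len;
            val += strspn(val, " \t");        /* skip the separator */
            snprintf(out, len, "%s", val);
            out[strcspn(out, "\n")] = '\0';   /* drop the newline */
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}
```

Called with the path /proc/self/status and the key "Cpus_allowed_list:", it reports the binding of the calling process in the same list notation that taskset -c uses.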