CPU Affinity and taskset

No matter how performant the code is, the architecture of the server it runs on will have an impact on its performance profile. Ideally that impact is minimal, but if the application is threaded and/or latency sensitive this is unlikely unless the architecture has been taken into account.

Processor Architecture

Unfortunately, especially with Intel’s most recent multi-core processors, the processor architecture needs to be taken into account to get the best performance from an application. If the application being deployed requires two cores, or is expected to perform better across two cores, which cores are used can have a significant bearing on the application’s performance.

This is because of the architectural approach taken on the latest Intel processors. On the Intel Clovertown and Harpertown Xeon processors, the L2 cache is not shared across all cores. Within a single processor there are four L1 caches (one per core) and two L2 caches, each shared between a pair of cores. To add a further level of complexity, which cores are paired on each L2 cache also differs between the two architectures.
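
The L2 pairing for a particular machine can be checked on the box itself rather than inferred from the model name. Here is a minimal sketch in C, assuming a kernel that exposes the cache topology under /sys/devices/system/cpu/cpuN/cache/ (index2 is normally the L2 cache on these processors; older kernels may not export these files):

#include <stdio.h>

int main(void)
{
    char path[128], list[64];
    int cpu;

    /* Walk the CPUs until sysfs runs out of entries, printing which
     * CPUs share an L2 cache with each one. */
    for (cpu = 0; cpu < 64; cpu++) {
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cache/index2/shared_cpu_list",
                 cpu);
        f = fopen(path, "r");
        if (!f)
            break;  /* no such CPU, or no cache information exported */
        if (fgets(list, sizeof(list), f))
            printf("cpu%d shares its L2 with cpus %s", cpu, list);
        fclose(f);
    }
    return 0;
}

The output makes the pairing obvious, and the same information can be read straight from the shell by cat'ing those files.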

CPU Affinity Overview

There are two types of CPU affinity. The first, soft affinity (also called natural affinity), is the tendency of the scheduler to try to keep processes on the same CPU for as long as possible. It is merely an attempt; if it is ever infeasible, the process is migrated to another processor. The O(1) scheduler in 2.6 exhibits excellent natural affinity. At the opposite end is the 2.4 scheduler, which has poor CPU affinity. This behavior results in the ping-pong effect: the scheduler bounces processes between multiple processors each time they are scheduled and rescheduled. It should be noted that Red Hat back-ported the O(1) scheduler into RHEL3 (in addition to many other changes), so the 2.4 kernel in that release is really a mix of 2.4, late 2.5 and early 2.6 kernel sources.

Hard affinity, on the other hand, is what the CPU affinity system call provides. It is a requirement, and processes must adhere to a specified hard affinity. If a process is bound to CPU 1, for example, then it can run only on CPU 1.

CPU Affinity Benefits

The first benefit of CPU affinity is optimized cache performance. The scheduler tries hard to keep tasks on the same processor, but in some performance-critical situations, e.g. a highly threaded application, it makes sense to enforce the affinity as a hard requirement. Multiprocessor machines try to keep the processor caches valid. Data can be kept in only one processor’s cache at a time; otherwise, the caches may grow out of sync. Consequently, whenever a processor adds a line of data to its local cache, all the other processors in the system that are also caching it must invalidate that data, and this invalidation is costly. The real performance penalty, however, comes when processes bounce between processors: they constantly cause cache invalidations, and the data they want is never in the cache when they need it. Thus, cache miss rates grow very large. CPU affinity protects against this and improves cache performance.

A second benefit of CPU affinity arises when multiple threads are accessing the same data: it can make sense to bind them all to the same processor. Doing so guarantees that the threads do not contend over data and cause cache misses. This does diminish the performance gained from multithreading on SMP; if the threads are inherently serialized, however, the improved cache hit rate can outweigh that loss.
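
As a rough illustration, here is a minimal sketch that pins two threads working on the same counter onto one core using pthread_setaffinity_np(), a glibc/NPTL extension (the choice of CPU 2 is arbitrary):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static long shared_counter;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Both threads serialize on the same lock and data, so keeping them on
 * one core keeps the cache line local instead of bouncing it around. */
static void *worker(void *arg)
{
    int i;

    for (i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        shared_counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(2, &set);   /* pin both threads to CPU 2 */

    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_setaffinity_np(t1, sizeof(set), &set);
    pthread_setaffinity_np(t2, sizeof(set), &set);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", shared_counter);
    return 0;
}

Compile with gcc -pthread. Note the threads may run briefly on other CPUs before the affinity calls take effect, which is fine for a sketch but worth knowing.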

The third benefit is found in real-time or otherwise time-sensitive applications. In this approach, all the system processes are bound to a subset of the processors on the system and the application is bound to the remaining processors. In a dual-processor system, for example, the application would be bound to one processor and all other processes to the other, ensuring that the application receives the full attention of that processor.

Implementing CPU Affinity under Linux

There are two methods of implementing CPU affinity: within the source code of the application itself, using the sched_setaffinity() system call (sched_getaffinity() reads the current mask back), or from the command line with the taskset tool.
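
Taking the programmatic route first, here is a minimal sketch of binding the calling process to CPUs 2 and 6 (the same cores used in the taskset examples below) via sched_setaffinity(); the CPU numbers are only an example, and the CPU_* macros require _GNU_SOURCE:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(2, &set);
    CPU_SET(6, &set);

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) == -1) {
        perror("sched_setaffinity");
        return 1;
    }

    /* Read the mask back to confirm the binding took effect. */
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0) {
        int cpu;
        for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
            if (CPU_ISSET(cpu, &set))
                printf("allowed to run on cpu %d\n", cpu);
    }

    /* ... the rest of the application now runs only on CPUs 2 and 6 ... */
    return 0;
}

The affinity mask is inherited across fork() and exec(), so setting it early in main() also covers any worker processes the application later spawns.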

Using taskset to assign CPU affinity

Under Linux it is straightforward to bind an application to one or more cores via the taskset command. Once you know the processor type you are using, and therefore the core allocation you require, taskset can be used either to start the application bound to the correct cores or to rebind an already running application. For example:

taskset -c 2,6 <application>

The above taskset command is for a Harpertown-based system and binds the application to the third and seventh cores (taskset numbers CPUs from 0, hence CPUs 2 and 6). To change the affinity of a process owned by another user, taskset needs to be run as root; binding processes you own does not require it.

taskset can also be run against an existing application to change its processor binding(s), if required, as follows:

taskset -c -p 2,6 <pid>

To verify that a taskset binding has worked, or to check the binding profile of an already running application, run the following as any user:

taskset -c -p <pid>

This will return the core(s) the process is allowed to run on.



8 thoughts on “CPU Affinity and taskset”

  1. Very interesting article, but I have a question.

    Citation: “For example in a dual-processor system, the application would be bound to one processor, and all other processes are bound to the other. This ensures that the application receives the full attention of the processor.”

    Is it really possible to do that? And if yes, how?

    thank you

    1. Hi,

      As long as you spend quite some time tasksetting processes as required, it is possible. However, using CPU isolation is a better way to completely dedicate a core to an application.

      1. Hi,

        my problem is how to isolate a CPU.
        I tried to exclusively assign a process to a core, but there are some kernel threads (like migration/#) that don’t move to another core.

        any suggestions?

      2. Hi,

        Kernel threads are going to be an issue. It might be possible to use CPU isolation to achieve this; I’m not sure, as that wasn’t something I looked at when I tested CPU isolation on RHEL4. Red Hat’s MRG allows you to move kernel threads around AFAIK, and therefore so would a realtime-patched kernel. Again, I have not tried that.

        If you are using RHEL4 there is a patch for rc.sysinit (see below) to get CPU isolation to work (called default_affinity); see
        https://bugzilla.redhat.com/show_bug.cgi?id=240981 for more detail.

        This will allow default_affinity= to be set to a bitmask (lowest bit = cpu 0, and so on).

        Examples:
        default_affinity=0x1 (cpu 0 only)
        default_affinity=0x3 (cpu 0 and 1 only)
        default_affinity=0x00002 (cpu 1 only)

        Note that "/usr/bin/taskset" has to be accessible before the filesystems are mounted. If you have a separate /usr, copy it to /bin and change the path in the hack.

        --- rc.sysinit.orig 2007-05-23 10:37:03.000000000 +0100
        +++ rc.sysinit 2007-05-23 11:29:01.000000000 +0100
        @@ -127,11 +127,23 @@

        touch /dev/.in_sysinit >/dev/null 2>&1

        -[ -x /sbin/start_udev ] && /sbin/start_udev

        # Only read this once.
        cmdline=$(cat /proc/cmdline)

        +# Set default affinity
        +if [ -f /usr/bin/taskset ]; then
        + if strstr "$cmdline" default_affinity= ; then
        + for arg in $cmdline ; do
        + if [ "${arg##default_affinity=}" != "${arg}" ]; then
        + /usr/bin/taskset -p ${arg##default_affinity=} 1
        + fi
        + done
        + fi
        +fi
        +
        +
        +[ -x /sbin/start_udev ] && /sbin/start_udev
        +
        # Initialize hardware
        if [ -f /proc/sys/kernel/modprobe ]; then
        if ! strstr "$cmdline" nomodules && [ -f /proc/modules ] ; then

        I’ve used this patch (note this patch is not mine) and it works great on a rhel4_u6 system.

        Cheers,
        Jon

      3. Hi,

        this is a good method, but I have to do it in kernel mode, without any user-mode app.

        I implemented a kernel module that executes a kernel thread on a specified core (exclusively).
        For now, I used the cpus_allowed field of task_struct.
        The problem is that I can’t take tasklist_lock, because the lock isn’t exported with EXPORT_SYMBOL. Is there an alternative method to lock the task list?

        Thank you for your support.
