
TUTORIAL Compiling with ICC


ICC: Fast optimisation strategies

Biagio Lucini feeds his need for speed with the Intel C Compiler.

LAST TIME If you need a refresher on compiling from source, or would like a closer look at GCC, last month’s GCC tutorial should do the trick. We covered compiler switches for optimisation, generation of assembly code, debugging and much more. If you missed the issue, call 0870 8374722 or +44 1858 438794 for overseas orders. You might also enjoy the GCC 4.0 preview in LXF66.



LXF68 JULY 2005

When we talk about compilers in Linux we generally think of the GCC suite, and for very good reasons. However, GCC is not the only available option. If you’re concerned about performance and your code is written in C or C++, 99% of the time the answer will be to resort to the Intel C/C++ Compiler, or ICC. This is a relatively young player, but in the four years since its release in 2001 it has proved hard to beat in most (if not all) benchmarks. Even so, to obtain fast code you need to know how to use ICC properly. In this tutorial, I will walk you through the compilation options that affect the performance of generated code and discuss ICC’s compatibility with GCC. We’re using the latest release of ICC, which at the time of writing is version 8.1.

ICC vs GCC

By and large, GCC is a great compiler. So why should you want to use ICC? Well, in fact there are several reasons. First and foremost is the performance issue I mentioned: in almost all the benchmarks executed so far, ICC manages to lead on GCC (and any other C compiler available for Linux), and often by a remarkable margin. True, GCC is quickly closing the gap, but in cases where the execution time really matters, you won’t care about its potential: you’ll just want speed now. What’s really amazing is that to get maximum performance from ICC only three or four compiler options need to be tuned, whereas GCC needs lots of fine-tuning to make it run at its best.

Which takes us directly to the second reason: ease of use. Just compare the man pages of ICC and GCC, and you’ll see that where GCC’s page offers a wealth of options that taken individually have little impact on performance, the ICC page offers options that group different transformations together and produce measurable benefits.

The third reason to look at ICC, which is related to the previous one, is documentation. All the ICC options are carefully documented – the same does not always apply to GCC.

Fourth in our loose list is built-in parallelism: if you invest several million pounds in the most powerful multi-CPU machine on the market, you also want a compiler that is able to squeeze every single CPU cycle out of it. While ICC supports parallelism natively through the OpenMP standard, there is nothing similar at present for GCC.

Last but not least is support. Granted, there are several companies behind GCC that can offer valid support options for mission-critical applications, but no individual or company really controls GCC or can steer its direction. Of course, that’s exactly how it should be in the open source world, but in business people who take strategic decisions like to know that there is a solid company behind a software product and that this company can guarantee 24/7 support.

Given those points, our earlier question can be turned on its head: why should you want to use GCC at all? Again, there are good reasons: first, GCC is free as in speech or as in freedom, while ICC might be free for non-commercial use, but only as in beer – and it’s quite pricey if you don’t qualify for the non-commercial licence. Also, GCC runs on almost every architecture and every operating system, while ICC only runs on Linux and Windows, and on i386-compatible (albeit including the new EM64T and AMD64) and Itanium architectures. And GCC is certainly going through an exciting phase of development that’s bound to improve many aspects of the compiler, while the developers of ICC seem to be more concerned with stabilising its existing features.

Console or IDE?

See the Installing ICC boxes on page 90 for help with installing the compiler. When you have everything in place, you should be able to execute icc -help. If you see a list of available options, you are ready to use ICC (by the way, keep this in mind as a quick way of getting a reference for the various options). If this


The man page for ICC, a comprehensive, easy-to-read reference for all supported options.

QUICK TIP To work, ICC needs to set some internal variables to the appropriate values. Because those variables are undefined when a new shell is called, you would have to enter a line like the following each time you opened a terminal:

    . /opt/intel/intel_cc_80/bin/

You can avoid this by adding these lines to your ~/.bashrc:

    # ICC initialisation. Check whether ICCDIR is already in the PATH.
    # If not, execute iccvars.sh
    ICCDIR=/opt/intel_cc_80/bin
    if ! echo "$PATH" | grep -q "$ICCDIR" ; then
        . ${ICCDIR}/
    fi

Adapt it by changing the path in the line starting ICCDIR.

does not work, you will need to go back through the installation procedure to identify what has not worked and rectify it.

In its early days, ICC was exclusively a command line tool. In the most recent releases, though, Intel has bundled with it a customised version of the Eclipse platform with the C/C++ plugin, allowing you to fully manage the development process from Eclipse. Owing to space constraints, for this tutorial I made the difficult choice not to cover development in Eclipse, but you should find that the command line is quicker to exploit and overall more instructive than its graphical counterpart. I’m sure some people may object; if you’re one of them, email us and see if you can make us change our minds.

ICC integrates seamlessly with the underlying command line environment. In particular, following the Unix tradition, all of its secrets are explained on its well-organised man page. Whenever you have a problem with the compiler, man icc will show you the way.

It’s straightforward to use ICC for simple tasks. Executing

    icc <options> file1.c file2.c file3.c ... -o myapp

builds the application myapp out of the source files file1.c, file2.c, file3.c and so on. <options> are flags that direct the compiler during the process of generating the code. None, one or more options can be used, provided that they are compatible with each other. -o myapp is just one of those directives: it instructs the compiler to call the resulting executable myapp instead of a.out, which is the default.

It is possible to stop before the linking process and produce the object files file1.o, file2.o and so on by using the -c switch, thus:

    icc -c <other options> file1.c file2.c file3.c ...

Stopping before linking enables you to choose different compiler flags for different source files. If we want to stop at the assembly level we just need to replace the -c option with -S.

It is often instructive to compare the assembly generated by ICC with that generated by GCC, to see how a potential performance gain or loss can be traced back to the choice of compiler. This of course requires that the developer is able to understand assembly instructions, which are not as easy to read as high-level languages such as C or C++. It is even possible to edit the assembly code and create the object files from the modified assembly files. This would give more control over the allocation of memory and registers, prefetching of variables and so on. But this lies in the lofty realms of very advanced usage, and won’t be needed for general use.

Optimisation options

In all compiler benchmarks there is a certain degree of arbitrariness. But whatever your benchmark criteria and methods, the results are pretty definitive: ICC is fast. Although the default options of ICC are tuned to guarantee good performance, you can improve it further with a little work. The first option to look at is the classic -On, where n is an integer between 0 and 3. If you read last month’s GCC tutorial you should be familiar with the meaning of the various -O switches. Here they are quickly again:

∆ -O0 Disables all optimisations.
∆ -O1 Turns on all optimisations that do not have any significant impact either in terms of the size of the code or in terms of the compilation time.
∆ -O2 Enables all optimisations that are expected to result in an increase in performance and do not restructure the code.
∆ -O3 Performs more aggressive optimisations, reshaping the layout of memory and reworking the flow of instructions.

-O2 is the default, and the recommended level of optimisation for many applications. -O1 can improve performance for applications with many lines of source code and many branches whose execution time is not dominated by the code inside loops. Finally, -O3 will only benefit applications that make extensive use of floating point operations; for any other kind of app, the increased size of the executable and the longer compilation time would outweigh the limited impact on performance. As usual, deciding which option is best strictly depends on the code, so try them out.

In some cases, it helps to refine the optimisation using other options. Among them are the following:

∆ -Os Optimises for speed, but disables those optimisations that produce little performance gain and have a large impact on the size of the code.
∆ -Ob<n> Controls the inline expansion. n can be equal to 0 (no inlining), 1 (only functions declared with the __inline keyword are inlined: this is the default) or 2 (all functions can be inlined).

In numerically-critical applications, it may be important to ensure that precision is preserved during all intermediate steps. For this, you should use -mp. But floating point precision can

INTEL VERSIONS There is a version of ICC for every architecture supported by Intel at hardware level. Those are Itanium (referred to as i64), i386 (i32) excluding the Extended Memory 64 Technology, and Extended Memory 64 Technology itself (i32e). These different versions work in exactly the same way, except for a handful of default options. In any case the ICC installation package will provide two compilers: ICC (which should be used for C programs) and ICPC (for C++ programs), which differ only in the included headers and the linked libraries. I’ve referred to ICC in this tutorial. Finally, distributed in a separate package (and priced separately) is the Intel Fortran Compiler (IFC), which again is available for the three supported architectures and shares many optimisation options with ICC.


slow compiler speed. So keep in mind that the -mp option slows down the code by a substantial margin. In most cases -mp1, which guarantees improved precision with a minor impact on speed, would be a better choice.

INSTALLING ICC RPM-based distributions ICC is commercial software and as such is unlikely to come bundled with your distribution. But installation is straightforward if you have an RPM-based distro. ICC comes in the form of a tar.gz file, primarily made up of three RPM packages: the compiler itself; substitute headers, in compliance with the GPL licence; and the integrated debugger. To install ICC, it is just a matter of running the script and answering the questions that come up. Beware, though: the script will check for the existence of a valid licence, so you must have one before installing the package.

Processor dispatch

The ICC user guide, a recommended reference for going beyond this tutorial.

There are enough differences between the processors supported by ICC to justify the need for differently optimised code. The supported processors fall into two groups: the Itanium family and the evolution of the classic i386. This latter group ranges from the increasingly rare original Pentium all the way up to the new EM64T, which despite being binary compatible is very different from the Pentium and Celeron that many of us have on our desks right now – EM64T processors operate on 64-bit data, while the non-EM64T ones work at 32-bit. This wide spectrum of options requires different optimisation strategies. Recognising this, ICC provides a switch that allows you to tune the code for a given architecture. These are the optimising switches for the i32/i32e version of the compiler:

∆ -tpp5 Pentium processor.
∆ -tpp6 Pentium Pro/II/III processors.
∆ -tpp7 Pentium IV processor.

And for the ia64:

∆ -tpp1 Itanium processor family.
∆ -tpp2 Itanium II processor.

While producing optimal code for a given processor, the -tpp switch generates code that is general enough to run on all supported processors within the bounds of binary compatibility. Code that runs on just one category of processors can be built with the switch -x followed by a label (without blanks). Permitted labels are:

∆ K Pentium III and compatible processors.
∆ W Pentium IV and compatible processors.
∆ N Also Pentium IV, but enables further optimisations in addition to processor-specific ones.
∆ B As N, but more suited to the Pentium Mobile.
∆ P As N, but also supports the Streaming SIMD Extensions 3 (SSE3) as found on all EM64T processors.

Trying to run code compiled with the -xP option on a Pentium IV that does not support the SSE3 instruction set is bound to cause a runtime error. You’ll know you’ve done it when you receive an ‘Illegal instruction’ exit message.

If your code will mainly be run on a Pentium IV that supports SSE3 but might occasionally be run on processors that don’t support it, the code should be compiled with the -axP switch. -ax acts like -x and

INSTALLING ICC Non-RPM systems Although the basic ICC compiler is contained in an RPM package, it is not the end of the world if your system does not have RPM installed, although in this case installation is not officially supported. You can use Alien, a valuable application that converts different package types, but with ICC being a complex package it will probably be a frustrating experience. A better solution is to use rpm2cpio piped into cpio:

    rpm2cpio intel-icc8-8.1-028.i386.rpm | cpio -id

This will create the directory subtree ./opt/intel_cc_80. You can take the intel_cc_80 directory and move it to the desired final location, such as /opt:

    mv opt/intel_cc_80 /opt

We will assume that the compiler tree will be located at /opt. If not, just replace /opt with the path of your choice. There is still some work to do before you can use the compiler. First, copy the licence file (which should be obtained from Intel) to /opt/intel_cc_80/licenses (make sure that this is readable by all potential users, and not just by root). Then you must edit the scripts located in /opt/intel_cc_80/bin. In particular, make sure that the string <INSTALLDIR> is replaced by /opt/intel_cc_80. This must be done for the files icc, icpc and (or iccvars.csh if the default shell is tcsh): just open them with your favourite text editor and perform a global replacement. Now you should be ready to use ICC.



accepts the same labels. The difference between the two is that -ax produces two versions of the code: one is highly optimised for the specified target and the other is generic enough to run on all compatible architectures. These two versions are packaged into a single executable, and the decision on which should be run is made at runtime, after autodetection of the processor type. The advantage of -ax is that you don’t get the ‘Illegal instruction’ error message; the disadvantage is that because a decision must be made at runtime there is some performance loss.

Despite the fact that Intel obviously does not want to advertise this, ICC also produces the fastest available code on AMD-branded CPUs. To know which optimisation options are best for AMD CPUs we must look at the instruction set supported by the processor. Execute cat /proc/cpuinfo and look at the flags line. If it contains the string sse2 (as is the case for the Opteron and the Athlon 64), from the point of view of ICC your processor is equivalent to a Pentium IV (without SSE3 of course). If it contains sse but not sse2, it is like a Pentium III. With only mmx, ICC considers your processor to be equivalent to a Pentium II or Pentium Pro.

Advanced optimisation

Loading shared libraries produces slim executables and is an elegant solution, but it can have an impact on performance and cause some portability problems. If these are an issue for you, I’d recommend you use the -static compiler flag.

An inappropriate layout of the data in memory can cause performance degradation – to prevent this, structures should be aligned on natural boundaries, which means that an 8-byte-wide object should start at a position in memory whose offset is a multiple of eight. In order to prevent alignment-related problems, we can use the options -align and -Zp16. -align reorders arrays and variables to satisfy alignment requirements, while -Zp16 constrains the alignment on the maximal supported boundary (which can waste some space in memory, but is generally safe). Besides 16, the values allowed in conjunction with -Zp are 1, 2, 4 and 8.

Usually, optimising strategies act within procedures. ‘Interprocedural optimisation’ is a technique that enables those strategies to span several subroutines, either within files (the -ip option) or even crossing the files’ physical boundaries (the -ipo option). At the moment, interprocedural optimisation is one of the features that sets ICC apart from GCC, which does not currently support it (although the TREE-SSA optimisation framework, on which GCC 4.0 is based, is likely to speed up the adoption of this technique).

Another of the killer features of ICC is feedback-based optimisation. Generic optimisation strategies can have different degrees of success. The problem is that once you decide on an optimisation strategy, you have to stick to it everywhere in the code, regardless of whether it is beneficial in each single case where it can be performed.


Suppose, for instance, that we decide to inline the functions or to unroll the loops. These operations will be performed irrespective of whether they are good strategies for a given single function or for a given single loop. This means that though broadly speaking the code is accelerated, locally it can suffer from performance degradation. The answer to this problem is a multi-stage compilation with the options -prof_gen and -prof_use. The first step is to compile with the -prof_gen option. Next, we run the code on a typical input set or on multiple typical input sets of data. The code compiled with -prof_gen will run slower, because it is instrumented to record its own behaviour and many of the optimisation options are disabled. This run gathers information about the different parts of the code and allows ICC to identify where the program spends most of its running time. After the test run is performed, we recompile the program with -prof_use. In this phase, the gathered information is used to optimise the code, with particular attention paid to the critical regions. In my experience, this process gains up to a further 20% in speed over code that’s already carefully optimised.

Going parallel

Nowadays desktop users can afford hardware that just a few years ago was outside their budget. That includes symmetric multiprocessing systems (or SMPs for short), which are characterised by several CPUs on the same board. There are also special processors that pack more than one processing unit (physical multi-core CPUs) or use special techniques to simulate more than one CPU (such as the Hyperthreading technology, which in fact provides a logical multi-core CPU). The Linux OS is perfectly capable of managing the presence of several CPUs by efficiently scheduling processes and distributing them evenly across the available computational units. But if you want a single process to access several CPUs simultaneously, you’ll have to write the source code in a special way – unless you have an autoparallelising compiler.

And guess what? Unlike GCC, ICC is an autoparallelising compiler. This means that whenever it detects that it’s safe to execute two independent statements of the main application at the same time, it marks them for scheduling on different processors. In order for this magic to happen, we have to pass the -parallel flag to the compiler. The resulting code is known as multithreaded or parallel. It’s also possible to use special instructions understood by the compiler to specify which regions of the code should be parallelised, amending possible flaws in the analysis performed by the autoparalleliser.

There are different high-level extensions to the standard languages for parallelism. ICC supports the OpenMP standard out of the box, which is particularly efficient on SMP systems (as opposed to systems where cooperating processors are hosted by different physical machines, such as Beowulf clusters). When a program has been written following the OpenMP specifications, it must be compiled with the -openmp flag for exploiting multiprocessing or -openmp_stubs on single processor machines.

GCC compatibility

If you have decided to use ICC on Linux, you must keep in mind that the underlying operating system – not just the kernel, but the apps and the libraries too – will have been compiled with GCC. So it’s not unlikely that you will be forced to link binaries compiled with two different compilers. The good news is that this is perfectly possible and generally trouble-free, thanks to the effort put in by Intel to provide a compiler that is fully compatible with GCC. In fact, it is possible to specify the release of GCC that you want to have compatibility with. This is done through the option -gcc-version=nnn, where nnn is 320 (for GCC 3.2.x), 330 (for GCC 3.3.x) or 340 (for GCC 3.4.x). Compatibility with versions of GCC older than 3.2.0 is not supported. In the case of C++, it is also possible to instruct ICC to include headers provided by g++ (the C++ compiler included with GCC) via the option -cxxlib-gcc.

For compatibility reasons, ICC also provides flags that are identical in meaning and use to some of GCC’s flags. This is the case for the -march and -mcpu options, supported by ICC as equivalents to -x and -tpp, respectively. Some open source apps just assume that you will use GCC to compile them and some developers mercilessly use GNU extensions to the C/C++ languages and GCC-defined macros. In many cases this won’t stop you from using ICC, which implements many built-in functions and macros typical of GCC. The -no-gcc flag (which should not be used with -cxxlib-gcc) will remove GCC constructs.

One exception to this happy ICC–GCC compatibility is the Linux kernel, which requires a wrapper layer and a patchset to be compiled with ICC. But this is one of the very few cases where it’s not beneficial to use ICC. You may be pleased to hear that Intel engineers are working ceaselessly to make it possible to build the kernel with ICC out of the box.

Part of the ICC documentation is an easy-to-follow tutorial on performance issues.

Fast, faster, fastest

The key point of this tutorial was to illustrate the most useful compiler flags of ICC for producing code that will in all likelihood be faster by a considerable margin than identical code obtained with GCC. With the usual caveat that there are no universal rules, we can sum up with three compilation commands referring to two different situations. If you want to head straight for speed, try

    icc -O3 -tpp7 -xW -align -Zp16 -ipo -static myapp.c

where I’ve assumed that you have a Pentium IV – if not, you should be able to use what you’ve learned in this tutorial to adapt this line for other processors. If you have some time on your hands and want to spend it gaining even more speed, first use

    icc -O3 -tpp7 -xW -align -Zp16 -static -prof_gen myapp.c

and then

    icc -O3 -tpp7 -xW -align -Zp16 -ipo -static -prof_use myapp.c

This last command produces the fastest executable, which can outperform by 10-20% the already fast executable obtained without the use of profiling. Further performance improvements on multiprocessor machines can be obtained with the -parallel option or with the -openmp flag, the latter requiring specially written code.

Enjoy using ICC. It gives you an amazing amount of control over how applications run on your machine, and its speed and precision really are exciting. LXF


Linux Format 68 tut_icc