# THE PARALLEL UNIVERSE





# Intel<sup>®</sup> Parallel Studio XE 2013: 10 Feature Highlights for Accelerated Performance

by James Reinders






### **Boost performance and accuracy**

This downloadable CodeBook provides "how-to" guidance and a comprehensive resource toolkit to help you efficiently produce fast, scalable, reliable applications throughout the development lifecycle.

#### Look for guidance and techniques for C++ and Fortran developers:

- Tools and techniques across the development lifecycle
- Technical guides, white papers, articles, and blogs
- Features for accelerated performance
- And much more

#### DOWNLOAD THE FREE CODEBOOK NOW



© 2012, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. \*Other names and brands may be claimed as the property of others.

# CONTENTS

| Article | Summary |
|---|---|
| **Letter from the Editor:** Putting Intel<sup>®</sup> Parallel Studio XE 2013 to Work for the "New Normal," BY JAMES REINDERS | |
| Intel<sup>®</sup> Parallel Studio XE 2013: 10 Feature Highlights for Accelerated Performance, BY JAMES REINDERS | Get up to speed fast on the components and new feature sets in the Intel<sup>®</sup> Parallel Studio XE 2013 suite, and consider the potential for your applications. |
| Using Intel<sup>®</sup> Software Development Tools to Analyze HMMER, BY WALTER SHANDS | Explore techniques for developing applications like HMMER for the latest generation of multicore processors, from thread and memory error checking to performance and code optimization. |
| Pointer Checker: Easily Catch Out-of-Bounds Memory Access, BY KITTUR GANESH | Pointer Checker is designed to catch any out-of-bounds memory accesses before memory corruption occurs. Find out how to use Pointer Checker effectively, and to balance the trade-offs of security and runtime. |
| New Parallel Programming Features in Intel<sup>®</sup> (Visual) Fortran Composer XE, BY STEVE LIONEL | This overview of two new features, DO CONCURRENT and coarrays, brings insight into achieving excellent parallelism results with Fortran. |
| Using the Intel<sup>®</sup> Math Kernel Library and Intel<sup>®</sup> Compiler to Obtain Run-to-Run Numerical Reproducible Results, BY TODD ROSENQUIST AND SHANE STORY | How do you balance demands for accelerated performance with reproducible results and runtime consistency? These techniques can help you generate reproducible results within applications under a manageable set of constraints. |


© 2012, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Core, Cilk, VTune, vPro, Xeon and Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. \*Other names and brands may be claimed as the property of others. The Parallel Universe is a free quarterly magazine.




James Reinders explores the development capabilities of the mature parallelism tool suite, Intel® Parallel Studio XE 2013.


#### LETTER FROM THE EDITOR

### PUTTING INTEL® PARALLEL STUDIO XE 2013 TO WORK FOR THE "NEW NORMAL"

#### Has parallelism changed everything or nothing?

On one hand, parallelism is everywhere and parallel programming is the "new normal." On the other hand, writing, debugging, and tuning an application remains our work to do as programmers. Did we just raise the bar to create the new normal? Perhaps. We expect more from our programming and, in turn, we need more from our tools.

At Intel, we have invested heavily to support this new normal with a wealth of new capabilities in the latest edition of Intel<sup>®</sup> Parallel Studio XE.

The launch of Intel<sup>®</sup> Parallel Studio XE 2013 updates a mature toolset for application development with the support we need for the new normal, spanning many aspects of software development.

In this issue, we look at some of the top new features and capabilities of Intel Parallel Studio XE 2013. The 10 Feature Highlights article covers efforts that are significant new capabilities in their own right. Each could fill a whole issue with interesting examples and tales of how it works. We've selected three to explore in more depth in this issue. One article covers Pointer Checker, and one discusses Fortran capabilities. A third covers a new capability for run-to-run (and processor-to-processor) numerically reproducible results. It helps deal with the inherently nonassociative nature of floating-point arithmetic through a new, unequaled set of options in the latest Intel Parallel Studio XE.

We also have a real-world use case, covered in *Using Intel<sup>®</sup> Software Development Tools to Analyze HMMER*. This study makes use of event-based sampling analysis in Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE and the optimization features of the Intel<sup>®</sup> Composer XE compiler to build and analyze hmmsearch and hmmbuild, components of SPECint<sup>\*</sup>.

I think you'll find this issue full of exciting new capabilities. Maybe the end result is a "new normal"—but it is an exciting new place to be.

#### **James Reinders**

Director of Parallel Programming Evangelism at Intel Corporation. James is a co-author of a new book *Structured Parallel Programming* from Morgan Kaufmann, 2012. His other books include *Intel® Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism*, available in English, Japanese, Chinese, and Korean.

### **Intel<sup>®</sup> Parallel Studio XE 2013:** 10 Feature Highlights for Accelerated Performance

#### by James Reinders, Director of Parallel Programming Evangelism

Intel<sup>®</sup> Parallel Studio XE 2013 not only delivers the latest optimizations and new processor support, but it also includes a number of highly innovative features that are likely to surprise and delight you.

The suite plugs seamlessly into Microsoft Visual Studio\* and the GNU toolchain, thereby preserving investments in your development environment of choice.

#### 1. Processor Support Updated to Include the Latest Intel<sup>®</sup> Processors

New support includes AVX2, TSX, and FMA3. This extends our support to both the newly released 3rd Generation Intel® Core™ vPro™ processor (codenamed Ivy Bridge) microarchitecture and the forthcoming Haswell microarchitecture. You can take advantage of the latest performance enhancements in the newest Intel® products while preserving compatibility with prior Intel and compatible processors.

#### 2. Support for Intel<sup>®</sup> Many Integrated Core (Intel<sup>®</sup> MIC) Architecture

Used for more than a year on prototype and preproduction systems, support for Intel® MIC architecture is now available in our products. No additional new tools are needed for the first Intel® Xeon Phi™ coprocessor (codenamed Knights Corner). Instead, we have integrated this support in tools you already know and use. The power of these familiar tools is now available to help generate, debug, and optimize code for the Intel® MIC architecture.

#### 3. Advanced Numerical Reproducibility Capabilities

This was the new feature most praised by beta testers. An innovative new "Conditional Numerical Reproducibility" capability offers unique controls over nonassociative floating-point operations, allowing run-to-run and processor-to-processor reproducibility options, often with very low performance penalties. Increased options for floating-point arithmetic reproducibility with Intel® Math Kernel Library, special Intel support in OpenMP\*, and new capabilities in Intel® Threading Building Blocks open up new possibilities.
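The nonassociativity mentioned above is easy to demonstrate. The following is a minimal sketch in plain C; the function names are illustrative and are not part of any Intel library. It adds the same values in two different orders, which is what can happen when a parallel reduction divides its work differently from one run to the next.

```c
#include <stddef.h>

/* Add the values from first to last. */
double sum_forward(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Add the same values from last to first. */
double sum_backward(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = n; i > 0; i--)
        s += x[i - 1];
    return s;
}
```

For the input {1e20, -1e20, 1.0}, sum_forward returns 1.0 and sum_backward returns 0.0: in the backward order the 1.0 is absorbed when it is added to -1e20, because it is smaller than the spacing between representable doubles at that magnitude.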

#### 4. Additional Profiling Data and Easier to Use

Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE offers new and powerful bandwidth and memory access analysis to reduce time spent puzzling over cryptic performance data.

#### **5. Pointer Checker**

A new compiler-based diagnostic tool allows you to find code that accesses memory addresses beyond the allocated addresses. This helps with security hardening and finding difficult memory corruption issues.

With Intel Parallel Studio XE 2013, accelerated application performance is often just a recompilation away. Rebuild with the latest compilers and link in the latest libraries to benefit from the latest processors. I have chosen 10 features to highlight from this powerful Intel tool suite.

#### 6. New Threading Assistant: Intel® Advisor XE

Intel® Advisor XE assists in producing scalable, maintainable C, C++, C#, and Fortran code. It simplifies adding parallelism to threaded or unthreaded applications, and allows you to evaluate alternatives before investing in implementation.

#### 7. Fortran Standards Support

Intel<sup>®</sup> Fortran supports widely used features of the Fortran 2003 standard and key parts of the 2008 standard, including coarrays. As a leader, Intel is committed to supporting Fortran with our products. Of course, we maintain a rich backward compatibility with decades of Fortran support including VAX Fortran\*, Compaq Visual Fortran\*, Fortran 95, Fortran 90, Fortran 77, and Fortran 66, as well as library support for BLAS, LAPACK, ScaLAPACK, sparse solvers, fast Fourier transforms, vector math, and more.

#### 8. C++ Performance Guide

Everyone can appreciate the new C++ Performance Guide, featuring a quick five-step process for increasing performance.

#### 9. C and C++ Standards Support

Outstanding support for C and C++ is now accompanied by leading support for many of the new C++11 and C11 features. We also maintain our extensive support for prior standards including C99, and industry-leading support for IEEE 754-2008 Decimal Floating-Point Arithmetic.

#### 10. Find and Eliminate Errors with Intel® Inspector XE

Intel® Inspector XE provides an efficient way to increase your application reliability to ensure performance in C, C++, C#, and Fortran. The new heap growth analysis feature offers an important new way to find memory leaks.




### The power of this suite stems from four key components:

#### 1. Optimized C++ and Fortran Compilers and Libraries:

Intel® Composer XE is a highly optimizing performance-oriented developer tool that includes Intel® C++ and Fortran compilers, and threading, math, multimedia, and signal processing performance libraries. Intel® Cilk™ Plus, Intel® Threading Building Blocks, and OpenMP\* support provide parallelism models to make it easier to take advantage of today's and tomorrow's high-performance computing systems. Industry-leading Intel® Math Kernel Library and Intel® Integrated Performance Primitives include a wealth of routines to improve performance and reduce development time.

#### 2. Innovative Threading Assistant for Linux\* and Windows\*:

Intel® Advisor XE is a threading assistant for C, C++, C#, and Fortran developers. It helps find regions with the greatest performance potential from parallelism and highlights critical synchronization issues. With Advisor XE, you can evaluate alternatives before investing in implementation, estimate the speed-up, identify correctness issues and select the options with the best return on investment. The "magic" here is in the ability to evaluate approaches before committing to coding and debugging. This is a remarkable tool when considering how to add parallelism into your code.

#### 3. Optimize Serial and Parallel Performance:

Intel® VTune™ Amplifier XE is the premier performance and thread profiler to tune application performance. Use it to profile C, C++, C#, Fortran, assembly, and Java code, and receive rich performance data for hotspots, threading, locks and waits, DirectX\*, bandwidth, and more.

#### 4. Deliver More Reliable Applications:

Intel® Inspector XE 2013 is an easy-to-use memory and threading error detector for serial and parallel applications on Windows\* and Linux\*. Static analysis for C, C++, and Fortran developers is included in Intel® Studio XE products. The ability to pinpoint active and latent problems before shipping an application to customers is strongly supported by this acclaimed and unique Intel capability.




#### **TECHNICAL SPECIFICATIONS AT A GLANCE**

| Specification | Details |
|---|---|
| Processor Support | Validated for use with multiple generations of Intel and compatible processors including, but not limited to: Intel® Xeon® processors, Intel® Core™ processors, and Intel® Xeon Phi™ coprocessors. |
| Operating Systems | Windows\* and Linux\*. Compiler and library components are also available as Apple OS X\* add-ons for Apple's Xcode\* development environment. |
| Development Tools and Environments | Compatible with compilers from vendors that follow platform standards (e.g., Microsoft, GNU, Intel). Can be integrated with the GNU toolchain, Microsoft Visual Studio\* 2008 and 2010, and next-generation tools. |
| Programming Languages | Extensive support for C, C++, and Fortran development. Additional support included for programs that also include Java or .NET languages such as C#. |
| Support | All product updates, Intel® Premier Support services, and Intel® Support Forums are included for one year. Intel Premier Support gives you access to confidential support, technical notes, application notes, and the latest documentation. |
| Community | Join the Intel<sup>®</sup> Support Forums community to learn, contribute, or just browse: http://software.intel.com/en-us/forums |
| System Requirements | For details on hardware and software requirements: www.intel.com/software/products/systemrequirements/ |



# Using Intel® Software Development Tools to Analyze HMMER

by Walter Shands, Software Development Engineer

This paper highlights the features of Intel® Parallel Studio XE 2013 by using them to build and analyze HMMER (http://hmmer.janelia.org/). HMMER is a set of applications that includes hmmsearch and hmmbuild, both of which are components of SPECint. We use event-based sampling analysis in Intel® VTune™ Amplifier XE to find out which code paths, context switches, or periods of threading inactivity cause performance problems in hmmsearch. We also use the code optimization features of the Intel® Composer XE compiler to improve the performance of hmmsearch on Intel® Xeon® E5 processors. In addition, we show how to use Intel® Inspector XE to locate memory and threading errors introduced into hmmsearch.

hmmsearch searches a protein sequence database for homologs of protein sequences using profiles called hidden Markov models. In our runs, globins4.hmm contains the profiles and uniprot\_trembl.fasta is a 10 GB sequence database.

hmmsearch is available in an MPI version, but we restricted our experiments to the non-MPI flavor. We ran hmmsearch on a computer with an 8-core, hyperthreaded Intel® Xeon® E5-2680 processor at 2.7 GHz with 23.4 GB of memory. We built the application with both GCC and the Intel® C compiler, in each case using the settings provided by the configure script. The initial GCC default switches were:

> gcc -std=gnu99 -O3 -fomit-frame-pointer -malign-double -fstrict-aliasing -pthread -msse2

The application requires support for the SSE2 instruction set at a minimum to support an algorithm optimized using intrinsics oriented toward SSE2.

The default Intel® compiler flags were:

> icc -O3 -ansi-alias -pthread

A challenge in porting applications from one compiler to another is making sure that there is support for the compiler options you use to build your application. The Intel C compiler supports many of the options that are valid on other compilers you may be using, such as GCC. The compiler generates object files that are compatible with GCC-generated object files, so you can compile part of your application using the Intel compiler and the rest using GCC.

The **-fomit-frame-pointer** option is set automatically when you specify **-O1**, **-O2**, or **-O3** with the Intel C compiler, so there is no need to include it. The **-malign-double** option aligns double, long double, and long long types for better performance on systems based on IA-32 architecture, and it is also available in the Intel C compiler.

We started the application with this command line:

./hmmsearch globins4.hmm ../../uniprot\_trembl.fasta

The next step is to locate the hotspots in the application using Intel VTune Amplifier XE. This profiler tool uses low overhead techniques to quickly find multicore performance bottlenecks, without needing to know the processor architecture or assembly code. Note that we do not need to add code to the application to collect data.

To view source code lines of hmmsearch in VTune Amplifier XE, we need to include symbols in the release build, so we add the **-g** flag. We added the **-fno-inline-functions** flag as well; this allows us to see all of the code in question in the VTune Amplifier XE source view.

The VTune Amplifier XE hotspots analysis shows where most of the CPU activity is occurring in the application and the amount of CPU activity on the threads over time. (Figure 1)

The VTune Amplifier XE hotspots view tells us that the function consuming the most CPU time is p7\_MSVFilter, and double-clicking on the function name displays the SSE intrinsics calls used in optimizing the performance of the function. The assembly view shows us that the Intel compiler utilized vector instructions, but is not taking advantage of the 256-bit registers or AVX instructions on the Intel Xeon processor. (Figure 2)

It's possible that we could compile the original C code for p7\_MSVFilter with the Intel compiler and help the compiler vectorize the function for the instruction set available on the target machine, so that the function is not limited to using 128-bit registers.
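The trade-off described above can be sketched in C. This is only an illustration with hypothetical function names, not code from HMMER: the same byte-wise operation (a saturating subtract followed by a maximum, the kind of work done by the vpsubusb and vpmaxub instructions in the assembly view) is written once as a plain loop and once with SSE2 intrinsics. The intrinsic version is tied to 16 bytes per step, while the plain loop leaves the choice of vector width to the compiler.

```c
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>   /* SSE2 intrinsics on 128-bit XMM registers */
#endif

/* Plain C: out[i] = max(a[i], b[i] - c[i] clamped at 0) on unsigned bytes.
   The compiler may vectorize this loop for whatever width the target offers. */
void max_sat_sub_plain(const uint8_t *a, const uint8_t *b, const uint8_t *c,
                       uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint8_t d = (b[i] > c[i]) ? (uint8_t)(b[i] - c[i]) : 0;
        out[i] = (a[i] > d) ? a[i] : d;
    }
}

/* Intrinsic version: always 16 bytes (128 bits) per iteration when SSE2 is
   available, regardless of wider registers on the target. Without SSE2 it
   simply falls through to the plain loop. */
void max_sat_sub_intrin(const uint8_t *a, const uint8_t *b, const uint8_t *c,
                        uint8_t *out, size_t n)
{
    size_t i = 0;
#ifdef __SSE2__
    for (; i + 16 <= n; i += 16) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i));
        __m128i vd = _mm_subs_epu8(vb, vc);              /* saturating subtract */
        _mm_storeu_si128((__m128i *)(out + i), _mm_max_epu8(va, vd));
    }
#endif
    /* Handle the remaining elements (or everything, without SSE2). */
    max_sat_sub_plain(a + i, b + i, c + i, out + i, n - i);
}
```

Both functions produce the same output; the difference is only in who decides the vector width, the programmer or the compiler.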

The thread timeline view shows that there is not much CPU time used in the worker threads, but a large amount is used in one thread. This turns out to be the thread that is reading the sequence database file. (Figure 3)

"To achieve more significant performance gains, the problem of serialization of the application due to the file read has to be solved."



Figure 1

Figure 2. [Screenshot: VTune Amplifier XE source and assembly views for p7\_MSVFilter in msvfilter.c, showing instructions such as vpsubusb and vpmaxub operating on XMM registers.]



Figure 3

By default, the application creates one thread per hardware thread on the machine, plus one. On a hyperthreaded machine, that is the number of hyperthreads plus one: here, 8 cores × 2 hyperthreads + 1 = 17 threads. If we use the hmmsearch --cpu 4 flag to limit the run to five threads, VTune Amplifier XE shows that the application scales well, unlike the situation with 17 threads. (Figure 4)

Evidence of this is the 67.418-second runtime with 17 threads, which is worse than the 62.561-second runtime with --cpu 4 (four worker threads, five threads in total).

We can see that the top thread is the one reading the data file by filtering the results by thread in the five-thread hotspot display. (Figure 5)

Figure 4. [Screenshot: Intel VTune Amplifier XE 2013 hotspots view grouped by function and call stack. p7\_MSVFilter tops the list at about 131.3 s of CPU time, followed by sqascii\_ReadBlock at about 47.1 s.]

Figure 5. [Screenshot: Intel VTune Amplifier XE 2013 hotspots view grouped by thread, function, and call stack. In the thread reading the data file, sqascii\_ReadBlock accounts for 47.093 s of 61.394 s (76.7%), followed by header\_fasta at 11.311 s.]


For more information regarding performance and optimization choices in Intel<sup>®</sup> software products, visit http://software.intel.com/en-us/articles/optimization-notice

If we use the VTune Amplifier XE Locks and Waits feature on the run with 17 threads, it shows a large number of transitions, indicated by yellow lines, from the thread reading the sequence database file to the worker threads. (Figure 6)

hmmsearch uses a producer-consumer model. A producer thread (labeled Thread (0xa0) in the graphic) puts data to be processed on a queue. Worker threads (labeled pipeline\_thread in the graphic) remove that data when the producer thread signals them with a broadcast message, which results in a thread transition from the producer thread to the worker thread.
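The pattern described above can be sketched with POSIX threads. This is a minimal illustration, not HMMER's implementation: the work\_queue type, pipeline\_worker, and run\_pipeline are hypothetical names, and a "block" is reduced to a counter. The producer queues one block at a time and wakes every worker with pthread\_cond\_broadcast; only one worker can take each block, so the others find an empty queue and go back to waiting. The sketch counts those empty wakeups (a spurious wakeup would be counted the same way).

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  ready;      /* broadcast when a block is queued or input ends */
    int  queued;                /* blocks waiting on the queue */
    int  done;                  /* producer has no more input */
    long processed;             /* blocks taken by workers */
    long empty_wakeups;         /* wakeups that found nothing to take */
} work_queue;

static void *pipeline_worker(void *arg)
{
    work_queue *q = arg;
    pthread_mutex_lock(&q->lock);
    for (;;) {
        while (q->queued == 0 && !q->done) {
            pthread_cond_wait(&q->ready, &q->lock);
            if (q->queued == 0 && !q->done)
                q->empty_wakeups++;     /* another worker got the block first */
        }
        if (q->queued == 0)
            break;                      /* input finished and queue drained */
        q->queued--;
        q->processed++;
        pthread_mutex_unlock(&q->lock);
        /* A real worker would process the block here, outside the lock. */
        pthread_mutex_lock(&q->lock);
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}

/* Producer: queue `blocks` blocks, broadcasting to all workers after each one.
   Returns the number of blocks processed, or -1 on invalid input or allocation failure. */
long run_pipeline(int workers, int blocks, long *empty_wakeups)
{
    if (workers <= 0 || blocks < 0)
        return -1;
    pthread_t *tids = malloc((size_t)workers * sizeof *tids);
    if (tids == NULL)
        return -1;

    work_queue q;
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.ready, NULL);
    q.queued = 0;
    q.done = 0;
    q.processed = 0;
    q.empty_wakeups = 0;

    int started = 0;
    while (started < workers &&
           pthread_create(&tids[started], NULL, pipeline_worker, &q) == 0)
        started++;

    for (int b = 0; b < blocks; b++) {
        pthread_mutex_lock(&q.lock);
        q.queued++;
        pthread_cond_broadcast(&q.ready);   /* wakes every waiting worker */
        pthread_mutex_unlock(&q.lock);
    }
    pthread_mutex_lock(&q.lock);
    q.done = 1;                             /* tell workers to exit once drained */
    pthread_cond_broadcast(&q.ready);
    pthread_mutex_unlock(&q.lock);

    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    long processed = q.processed;
    if (empty_wakeups != NULL)
        *empty_wakeups = q.empty_wakeups;
    pthread_cond_destroy(&q.ready);
    pthread_mutex_destroy(&q.lock);
    free(tids);
    return processed;
}
```

Calling run\_pipeline with different worker counts and comparing the empty-wakeup totals gives a rough feel for the effect; the exact counts depend on scheduling and vary from run to run.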

By zooming in, we can see that the amount of thread running time (dark green) is less than thread waiting time (light green), indicating lost time to do productive work. (Figure 7)

Compare this with a zoom-in on the thread view for hmmsearch using only four threads. Note that thread transitions from the thread reading the data file, the top thread, typically result in productive work to the worker thread. (Figure 8)

However, when using 17 threads in hmmsearch, many thread transitions do not result in work being done. (**Figure 9**)

Zooming in even closer on the 17-thread case, we can see these thread transitions are the result of a pthread\_cond\_broadcast call that tells the worker threads that a block of data is ready on the work queue to be processed. Only one thread at a time can grab the block of data, so the other threads must wait again. (Figure 10)

When only five threads are used, only about two threads are waiting to get a block of data to process, and only one thread goes unsatisfied. (**Figure 11**)

All of this indicates that with more than four threads, the hmmsearch pipeline threads become starved for data. In other words, the thread reading the data file cannot provide data fast enough to keep up with computation in the worker threads.

From our analysis using VTune Amplifier XE, we know that the most time-consuming code is the MSV algorithm, which has been optimized with SSE intrinsics in p7\_MSVFilter in the file msvfilter.c. The intrinsic-optimized code also contains some optimizations over and above vectorization, so it will be faster than the generic C version.

















To see if the Intel compiler can effectively vectorize code that is not optimized with intrinsics, we compiled the application to use the unoptimized C code in the function p7\_GMSV in the file generic\_msv.c. VTune Amplifier XE again shows that the MSV algorithm is the hotspot. (Figure 12)

VTune Amplifier XE also shows that the most time-consuming part of the MSV algorithm is a single loop that is not taking advantage of AVX instructions or YMM registers on the Intel Xeon processor. (Figure 13)

The runtime of hmmsearch using this code is about four minutes and 30 seconds.

```
# CPU time: 4137.39u 5.02s 01:09:02.41 Elapsed: 00:04:30.08
```

If we use the -opt-report flag for the Intel compiler, it will tell us what inlining, loop, memory, vectorization, and parallelization optimizations have been done for each function. For the p7\_GMSV function, it tells us the loop was not vectorized.

generic\_msv.c(80:7-80:7):VEC:p7\_GMSV: loop was not vectorized: existence of vector dependence

By restructuring the code, we can enable the compiler to vectorize the loop and generate code that takes advantage of Intel Xeon architecture. The optimization report from the compiler indicates that the two loops resulting from the restructuring were vectorized:

```
generic_msv.c(88:7-88:7):VEC:p7_GMSV: LOOP WAS VECTORIZED
generic_msv.c(108:7-108:7):VEC:p7_GMSV: LOOP WAS VECTORIZED
```
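To illustrate the kind of restructuring involved, here is a simplified sketch of splitting one loop into two. It is not the HMMER source: the flat arrays, the helper max\_f, and the function names row\_fused and row\_split are assumptions made for the example, and whether a given compiler vectorizes either version depends on the compiler and its flags. The fused loop updates a running maximum through a pointer on every iteration, so the compiler must assume that the store may overlap the arrays. The split version declares the arrays non-overlapping with restrict and keeps the running maximum in a local variable.

```c
#include <stddef.h>

static inline float max_f(float a, float b) { return a > b ? a : b; }

/* Before: the element update and the running maximum share one loop,
   and the maximum is read and written through a pointer each iteration. */
void row_fused(float *mmx_cur, const float *mmx_prev, const float *msc,
               float xb, float *xe, size_t M)
{
    for (size_t k = 1; k <= M; k++) {
        mmx_cur[k] = msc[k] + max_f(mmx_prev[k - 1], xb);
        *xe = max_f(*xe, mmx_cur[k]);
    }
}

/* After: two simple loops. The first is purely element-wise; the second
   is a max reduction into a local scalar. The caller must guarantee that
   the three arrays do not overlap (that is what restrict promises). */
void row_split(float *restrict mmx_cur, const float *restrict mmx_prev,
               const float *restrict msc, float xb, float *xe, size_t M)
{
    for (size_t k = 1; k <= M; k++)
        mmx_cur[k] = msc[k] + max_f(mmx_prev[k - 1], xb);

    float best = *xe;
    for (size_t k = 1; k <= M; k++)
        best = max_f(best, mmx_cur[k]);
    *xe = best;
}
```

Both functions compute the same values; only the loop structure differs. The compiler's optimization report remains the way to confirm whether each loop was actually vectorized.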

In addition, the VTune Amplifier XE assembly view shows that AVX instructions are being used along with the larger YMM registers. (Figure 14)

The resulting runtime of the application is close to half of the original runtime: about two minutes and 28 seconds instead of four minutes and 30 seconds, a speedup of roughly 1.8x.

```
# CPU time: 2207.74u 4.96s 00:36:52.69 Elapsed: 00:02:28.16
```

We can use Intel Inspector XE to check hmmsearch for threading and memory errors. It gives detailed insight into application memory and threading behavior to improve application reliability, and its powerful thread checker and debugger make it easier to find latent errors on the executed code path. Intel Inspector XE also finds intermittent and nondeterministic errors, even if the error-causing timing scenario does not happen.

#### Figure 10

VTune Amplifier XE thread timeline showing a transition from Thread (0xa055) to pipeline\_thread (0xa074) at 33.770s on Condition Variable 0x4c3005d1. The wait is at esl\_workqueue.c line 395 and the signal at esl\_workqueue.c line 312; the wait shown starts at 33.770s and lasts 1.817ms.

#### Figure 11

VTune Amplifier XE Locks and Waits view grouped by Sync Object / Function / Call Stack. Condition Variable 0x4c3005d1 shows a wait count of 38; the selected transition runs from Thread (0x9f70) to pipeline\_thread (0x9f86) at 25.103s, with the wait at esl\_workqueue.c line 395 and the signal at esl\_workqueue.c line 312. The wait shown starts at 25.100s and lasts 2.965ms.

"The Intel® C compiler and libraries create faster code, Intel® VTune® Amplifier XE finds bottlenecks, and Intel® Inspector XE pinpoints memory and threading errors before they happen. All this is of critical importance when developing applications like HMMER."

#### Figure 12

Intel VTune Amplifier XE 2013 Hotspots view: p7\_GMSV in generic\_msv.c is the top hotspot with 473.677s of CPU time, followed by p7\_GViterbi with 22.401s.

#### Figure 13

VTune Amplifier XE source and assembly view of the loop starting at line 80 of generic\_msv.c. Source lines 82 and 83 account for most of the CPU time, and the assembly uses scalar instructions (vmaxss, vaddss, vmovss) on XMM registers.



Intel Inspector XE finds memory leaks, corruption, and inconsistent memory API usage, as well as data races, deadlocks, and memory accesses between threads.

As with Intel VTune Amplifier XE, we don't need to create a special build or add code to the application to collect data.

Because there is significant overhead in detecting memory and threading bugs, we launch hmmsearch using a smaller sequence database file, as well as an application option that reduces the number of threads.

When we run Intel Inspector XE in the Detect Memory Problems mode, a few uninitialized memory accesses are exposed. (Figure 15)

Right-clicking on a line in the Detect Memory Problems pane brings up a description of an uninitialized memory access problem: (Figure 16)

Intel Inspector XE running in Locate Deadlocks and Data Races mode did not detect any issues. (Figure 17)

To increase application performance, we can take advantage of Intel<sup>®</sup> Cilk<sup>®</sup> Plus in the Intel compiler. Cilk Plus is an extension to C and C++ that offers a quick, easy, and reliable way to improve the performance of programs on multicore processors. It is an open standard and will soon be available in GCC 4.7. Cilk Plus, included in the Intel<sup>®</sup> C/C++ compiler, allows you to improve performance by adding parallelism to new or existing C or C++ programs using only three keywords: cilk\_for, cilk\_spawn, and cilk\_sync.
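As a brief illustration of the three keywords, the sketch below uses them on two generic examples that are unrelated to hmmsearch; the functions fib and scale are hypothetical. It assumes the compiler defines the \_\_cilk macro when Cilk Plus is enabled. When it does not, the fallback macros turn the keywords into plain serial C (the serial elision), so the code still builds and gives the same results.

```c
#if defined(__cilk)
#include <cilk/cilk.h>      /* provides cilk_spawn, cilk_sync, cilk_for */
#else
/* Serial elision: without Cilk Plus support the keywords reduce to
   ordinary sequential C, which must produce the same results. */
#define cilk_spawn
#define cilk_sync
#define cilk_for for
#endif

/* cilk_spawn lets fib(n - 1) run in parallel with the caller;
   cilk_sync waits for the spawned call before x is used. */
long fib(int n)
{
    if (n < 2)
        return n;
    long x = cilk_spawn fib(n - 1);
    long y = fib(n - 2);
    cilk_sync;
    return x + y;
}

/* cilk_for allows the iterations of a loop to be divided among workers.
   Each iteration writes a different element, so there is no data race. */
void scale(float *out, const float *in, float factor, int n)
{
    cilk_for (int i = 0; i < n; i++)
        out[i] = in[i] * factor;
}
```

For example, fib(10) returns 55 whether the program is built with or without Cilk Plus. The parallel and serial versions are meant to compute identical results, which makes it easy to check correctness before measuring any speedup.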

#### Figure 14

VTune Amplifier XE source and assembly view of generic\_msv.c after restructuring. The assembly now includes AVX instructions such as vaddps, vinsertf128, and vextractf128 operating on YMM registers.

#### Figure 15

Intel Inspector XE memory analysis results for hmmsearch: 9 items in the hmmsearch module, comprising uninitialized memory accesses, uninitialized partial memory accesses, and one memory leak. Source files listed include esl\_sq.c, hmmsearch.c, p7\_hmm.c, p7\_oprofile.c, p7\_tophits.c, and p7\_trace.c.

Help: Uninitialized Partial Memory Access

#### Uninitialized Partial Memory Access

Occurs when a read instruction references a block of memory (2 bytes or more) where part of the block is uninitialized.

| ID | Code Location   | Description |
|----|-----------------|-------------|
| 1  | Allocation site | If present, represents the location and associated call stack from which the memory block containing the offending address was allocated. |
| 2  | Read            | Represents the instruction and associated call stack responsible for the partial uninitialized access. |

If no allocation or deallocation is associated with this problem, the memory address might be in stack space.

The help example declares a `struct person` with the members `unsigned char age;`, `char firstInitial;`, `char middleInitial;`, and `char lastInitial;`.
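To make the definition concrete, here is a minimal C sketch of the pattern, assuming a four-byte record like the `person` example in the help text. The function names (`person_new_partial`, `person_new_zeroed`, `person_copy`) are hypothetical and exist only for illustration; they are not part of the help example or of HMMER.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Four one-byte members, so the struct is a 4-byte block with no padding. */
struct person {
    unsigned char age;
    char firstInitial;
    char middleInitial;
    char lastInitial;
};

/* Sets only one of the four bytes. A later block copy of this struct also
   reads the three unset bytes, which is the kind of read a memory checker
   reports as a partial uninitialized access. */
struct person *person_new_partial(unsigned char age) {
    struct person *p = malloc(sizeof *p);
    if (p != NULL) {
        p->age = age;
    }
    return p;
}

/* One possible fix: zero the whole block at allocation so that every byte
   is defined before anything reads it. */
struct person *person_new_zeroed(unsigned char age) {
    struct person *p = calloc(1, sizeof *p);
    if (p != NULL) {
        p->age = age;
    }
    return p;
}

/* Block copy: reads all sizeof(struct person) bytes of src at once. */
void person_copy(struct person *dst, const struct person *src) {
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}
```

Passing a record from `person_new_partial` to `person_copy` reproduces the pattern described above, while a record from `person_new_zeroed` copies cleanly. Zeroing at allocation is only one option; initializing each member explicitly works equally well.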

[Screenshot: Intel Inspector XE, Locate Deadlocks and Data Races analysis, Summary tab. Result: "No Problems Detected. Intel Inspector XE 2011 detected no problems at this analysis scope. If this result is unexpected, try rerunning the target using an analysis type with a wider scope."]

#### Figure 17



We use Cilk Plus to replace the code that manages threads, mutexes, condition variables, and the work queue, with the added benefit of better scheduling. However, we must still synchronize threads on the data file read, which serializes a portion of the application.

The Intel VTune Amplifier XE Hotspots view of an hmmsearch run shows that the CPUs are not fully utilized, because of the synchronization from mutexes around the code that reads the sequence database file. Even so, the Cilk Plus implementation runs in 58.272 seconds, compared with 67.418 seconds for the original. (Figure 18)
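As a quick check on those two timings, the implied speedup and the relative reduction in runtime work out as follows. This is back-of-the-envelope arithmetic on the numbers quoted above, not a figure reported by the tools, and the symbols $T_{\text{orig}}$ and $T_{\text{Cilk}}$ are notation introduced here.

$$
S = \frac{T_{\text{orig}}}{T_{\text{Cilk}}} = \frac{67.418}{58.272} \approx 1.157,
\qquad
\frac{T_{\text{orig}} - T_{\text{Cilk}}}{T_{\text{orig}}} = \frac{9.146}{67.418} \approx 13.6\%
$$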

If we run a VTune Amplifier XE locks and waits analysis, we can see that there are still many thread transitions. (Figure 19)

Zooming into the thread pane of the locks and waits analysis shows that the thread transitions are between worker threads and that they involve the mutex protecting the file read, which each worker thread now carries out itself. (Figure 20)
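The pattern described here, where every worker fetches its own input under one shared mutex, can be sketched in plain C with POSIX threads. This is an illustration under stated assumptions, not the hmmsearch or Cilk Plus code: `shared_input`, `worker_arg`, `read_next_block`, `worker_main`, and the 4096-byte block size are all hypothetical, and the per-block "work" is reduced to counting bytes.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>

enum { BLOCK_SIZE = 4096 };  /* assumed block size, for illustration only */

/* Input state shared by all workers. */
struct shared_input {
    pthread_mutex_t lock;  /* guards fp and blocks_read */
    FILE *fp;              /* database file, opened by the caller */
    long blocks_read;      /* number of non-empty blocks handed out */
};

/* Per-worker argument and result. */
struct worker_arg {
    struct shared_input *in;
    long bytes_processed;
};

/* Read the next block while holding the mutex. This is the serialized
   section: only one worker can be in here at a time, so the others wait. */
static size_t read_next_block(struct shared_input *in, char *buf, size_t cap) {
    size_t n;
    pthread_mutex_lock(&in->lock);
    n = fread(buf, 1, cap, in->fp);
    if (n > 0) {
        in->blocks_read++;
    }
    pthread_mutex_unlock(&in->lock);
    return n;
}

/* Worker body; the signature matches a pthread start routine. Each worker
   pulls its own input, then processes it outside the lock. */
void *worker_main(void *p) {
    struct worker_arg *arg = p;
    char buf[BLOCK_SIZE];
    size_t n;
    while ((n = read_next_block(arg->in, buf, sizeof buf)) > 0) {
        /* Placeholder for the real per-block computation. */
        arg->bytes_processed += (long)n;
    }
    return NULL;
}
```

In a threaded program, each worker would be started with `pthread_create(&tid, NULL, worker_main, &arg)` after `pthread_mutex_init` has been called on the shared lock. The shorter the per-block computation is relative to the read, the more time workers spend queued on the mutex, which is the kind of contention a locks and waits analysis surfaces.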

#### Figure 18





[Screenshot: VTune Amplifier XE locks and waits timeline, zoomed in around 15.2 seconds, showing the main thread (0x11e18) and 15 Cilk worker threads above a Thread Concurrency band. The legend marks Running, Waits, and Transitions. A tooltip on one transition lists its sync object, source file and line, and a signal source at line 1542; the corresponding wait starts at about 15.205 seconds.]

#### Figure 20

[Screenshot: VTune Amplifier XE top-down tree view with hardware event counts and pipeline-slot metrics for two functions.]

| Function / Call Stack | CPU_CLK_UNHALTED | INST_RETIRED.ANY  | CPI Rate | Retired Pipeline Slots | Cancelled Pipeline Slots | Back-end Bound Pipeline Slots | Front-end Bound Pipeline Slots | Module    |
|-----------------------|------------------|-------------------|----------|------------------------|--------------------------|-------------------------------|--------------------------------|-----------|
| p7_MSVFilter          | 391,664,000,000  | 1,629,496,000,000 | 0.283    | 0.797                  | 0.005                    | 0.094                         | 0.308                          | hmmsearch |
| sqascii_ReadBlock     | 105,490,000,000  | 257,152,000,000   | 0.410    | 0.522                  | 0.036                    | 0.273                         | 0.548                          | hmmsearch |

The metric tooltip visible in the screenshot explains that long-latency operations like divides and memory operations can leave fewer pipeline slots retired per cycle than the machine is capable of supporting, as can too many operations being directed to a single execution port (for example, more multiply operations per cycle than the execution unit can support).
                                                 | enachine is capable a<br>execution port (for e<br>rdivare Event Count :<br>CPU_CLK_LININA_TED<br>READ > 6.05.).) | I supporting. This<br>sample, more mult<br>where Handware ().                                                  | opportunity cost<br>toly operations an<br>went Type is IDQ | results in slower execution<br>riving in the back and per<br>UOPS NOT DELIVERED ( | cycle B     | ntroduce bubbl<br>atoncy operatio<br>an the executo<br>randware them | es in the pipeline to<br>one like divides and<br>on unit can support<br>at Count where Har | Ememory operate<br>Q.<br>Indware Event Typ | e a UOPS 15545                              | done concerning useful<br>, as can be many oper<br>0.40° ) / (Clediteds w<br>re Event Type is | rations<br>here<br>,<br>uneing<br>landware |

One of the other powerful features of Cilk Plus is the C/C++ language extension for array notations. This Intel-specific language extension provides data parallel array notations, which enable compiler parallelization and vectorization with less reliance on alias and dependence analysis.

To achieve more significant performance gains, the serialization of the application caused by the file read has to be solved. Reading the data into memory prior to computation is not realistic when using the uniprot\_trembl.fasta data file, because we would exceed the memory capacity of our machine, although if enough memory were available it would speed up subsequent computations using the same data.

Further performance gains can be achieved by taking advantage of Intel compiler options. Since the Intel compiler's default instruction set is SSE2 and the target machine is an Intel Xeon processor, it is a good idea to take advantage of AVX instructions and their larger registers by using the **-xHost** switch, which generates code for the highest instruction set supported on the compilation host.

Another important compiler option is **-ipo**, which enables interprocedural optimization between files. This is also called multifile interprocedural optimization (multifile IPO) or whole program optimization (WPO). When you specify this option, the compiler performs inline function expansion for calls to functions defined in separate files.

To find out what would help the Intel compiler vectorize or parallelize loops, we can use the **-guide** flag, which provides a report without producing objects or executables. The guided auto-parallelization feature of the Intel compiler is a tool that offers selective advice, resulting in better performance of serially coded applications. The advice typically falls under three broad categories: source code modification, use of pragmas, and addition of compiler options.

Here is one of the suggestions after using the option in hmmsearch:

esl\_vectorops.c(161): remark #30536: (LOOP) Add -fargument-noalias option for better type-based disambiguation analysis by the compiler, if appropriate (the option will apply for the entire compilation). This will improve optimizations such as vectorization for the loop at line 161.

Adding the **-parallel** switch allows the Intel compiler to detect simply structured loops that may be executed in parallel, and automatically generates multithreaded code for them. If you use guided auto-parallelization options along with **-parallel**, the compiler may suggest advice on further parallelizing opportunities in your application:

msvfilter.c(106): remark #30525: (PAR) Insert a "#pragma loop count min(1024)" statement right before the loop at line 106 to parallelize the loop. [VERIFY] Make sure that the loop has a minimum of 1024 iterations.

We can also use the VTune Amplifier XE hardware event counter collection to get insight into bottlenecks in application code affecting performance. VTune Amplifier XE highlights collected data indicative of performance problems that should be investigated. Here is one example of an hmmsearch run. (Figure 21)

#### Conclusion

Intel® Software Development Tools help you boost application performance and increase the code quality, security, and reliability needed by high performance computing and enterprise applications. The Intel C compiler and libraries create faster code, Intel VTune Amplifier XE finds bottlenecks, and Intel Inspector XE pinpoints memory and threading errors before they happen. All this is of critical importance when developing applications like HMMER for the latest generation of multicore processors.

# Learn how Intel<sup>®</sup> Advisor XE can help improve parallelization productivity.

**BY RAVI VEMURI** 

What do space exploration, oil and natural gas exploration, Hollywood movies, and military operations have in common? Modeling, simulation, exploration, storyboarding, and reconnaissance are some of the phrases that come to mind. They are intended to reduce the cost of wrong choices, failures, and missteps, and help projects succeed and be more productive.

Software parallelization can likewise benefit from parallelism reconnaissance, in which code is evaluated for suitability for parallelization. Until now, there has been limited tool support for this. However, Intel® Advisor XE 2013 changes this and helps the world of parallelization leapfrog forward. Intel® Advisor XE is the newest component of the Intel® Parallel Studio XE suite of products.

Software parallelization is potentially destabilizing to code, risky, expensive, and complex. Current trial-and-error approaches are not productive, and there is considerable risk of dead ends. Embarking on code parallelization based on measured data (for example, hotspots) is perhaps better, but is likewise mostly hit or miss. Code may or may not scale well. Stability issues due to incorrect parallelization may also lurk and surface long after the code is productized, and become costly to fix.

Intel<sup>®</sup> Advisor XE is built to help you find where to add parallelism to your code. Use it to discover the parallel performance (scalability) and code/data sharing issues (correctness) of possible parallel code regions. It lets you model several different regions within your program at once for parallel scalability and correctness. The results help you make judicious choices about which regions of code to not parallelize (to avoid dead ends), and which regions of code to actually parallelize to reap the multicore performance benefits.

Using this methodology helps you fix data sharing issues *before they happen*. Even as you prepare the code for parallelization by fixing the correctness issues, you can continue to use your existing test frameworks to validate your program—as it remains functionally unchanged and correct.

Use of Intel<sup>®</sup> Advisor XE in your parallelization efforts is very likely to reduce risk and increase the reward. Moreover, the tool empowers everyone in the software organization with the skill to productively parallelize, instead of the current situation where just the architects and senior engineers have this capability.

You can see how exciting the potential is for your applications. Please explore the product in greater detail at the **Intel® Advisor XE product page**, and let us know what you think.

# POINTER CHECKER: Easily Catch Out-of-Bounds Memory Accesses

by Kittur Ganesh, Technical Consulting Engineer

This article introduces a powerful new feature called Pointer Checker, which precisely and easily isolates elusive bugs in programs. Found in the Intel® C++ Composer XE 2013 product, its integration into the compiler adds powerful functionality in a way that slides seamlessly into build systems. Clever implementation and powerful error reporting provide precise information about latent program defects. We are excited that during beta testing of this new feature, customers reported that this tool found numerous defects.

Although C/C++ pointers have well-defined semantics, many applications can still make out-of-bounds memory accesses that go undetected, risking data corruption and increasing vulnerability to malicious attacks. The Pointer Checker provides full checking of all memory accesses through pointers. A Pointer Checker-enabled application will therefore catch out-of-bounds memory accesses before memory corruption occurs.

With the advent of multicore processors, there is a need to program for data and thread parallelism where data is frequently created, stored, shared, and accessed in memory through pointers. The C and C++ languages define good semantics for memory access through pointers, but they also permit the use of these pointers without any restrictions, so there is no built-in protection against reading or writing most user data in memory. You can perform any number of arbitrary operations on pointers, and the resulting out-of-bounds (OOB) memory accesses often go undetected; because they modify data unintentionally, the severe errors they cause often appear random. Although pointers have well-defined lower and upper bounds, languages (and therefore compilers) typically don't enforce bounds checking due to performance concerns. This paves the way for buffer overflows and overruns in various parts of the application code, which cause data corruption, erratic program behavior, and breaches of system security, and which are the basis for many software vulnerabilities to malicious attacks.

The launch of Intel<sup>®</sup> Parallel Studio XE 2013 brings a key new feature: the Pointer Checker, which performs bounds checking—providing full checking of all memory accesses through pointers—and identifies any out-of-bounds access in Pointer Checker-enabled code. This article presents a comprehensive overview and usage model of Pointer Checker, enabling you to quickly get started using this key debugging feature on your critical applications.

#### **Overview**

The Pointer Checker is a key feature of Intel Parallel Studio XE 2013. The main functionality of Pointer Checker is to find buffer overflows or overruns occurring in applications developed in high-level C and C++ languages on Windows\* or Linux\* operating systems. A buffer overflow or a buffer overrun is an anomaly where a program, while writing data to a buffer, overruns the buffer's boundary and overwrites adjacent memory. This is a special case of violation of memory safety. For example, consider an array as the buffer as shown in the short code snippet in **Figure 1**.

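```
char *buf = (char *)malloc(5);   /* valid indices are 0..4 */
for (int i = 0; i <= 5; i++) {   /* intentional bug: i == 5 is out of bounds */
  buf[i] = 'A' + i;              /* the last store writes past the end of buf */
}
```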

#### Figure 1

A buffer overflow occurs when you try to put more items in the array than what the array can hold. It occurs generally from writing or a store operation. On the other hand, a buffer overrun occurs when you are iterating over the buffer and keep reading past the end of the array. It generally occurs from reading or a load operation. Additionally, simple coding errors are often very hard to locate and rectify. For example, pointers are invariably masked by casting to a void pointer and then recasting to other pointers, making it very difficult to identify the cause of errors in the application. As mentioned earlier, since a pointer has a well-defined lower and upper bound, Pointer Checker performs bounds checking for all memory accesses through pointers ensuring that a pointer is within bounds before its use for either a read or a write operation.

The Pointer Checker feature can be enabled via *compile time switches*. When you build your application with the Pointer Checker-enabled option, it will identify and report all out-of-bounds memory accesses occurring in the application, including subscripted array accesses. In addition, the Pointer Checker can detect *dangling pointers*, meaning pointers that point to memory that has been freed. When you build your application with the dangling pointer detection-enabled option, using a dangling pointer in an indirect access will also cause the Pointer Checker to report an out-of-bounds error. Another useful feature of Pointer Checker is bounds checking for *arrays without dimensions*, which is especially important because applications integrate many modules written by different developers, who often share data through extern declarations.

#### SEE FULL ARTICLE



#### Minimize frustration and maximize tuning effort with Amdahl's Law

**BY SHANNON CEPEDA**

I recently had a question from a customer who had introduced a successful optimization to a hot function in his application, but did not see as much improvement in the overall application as he expected. This is a fairly common occurrence in the iterative process of performance tuning. Usually it happens for one of two reasons.

 Introducing an improvement in one area resulted in inefficiencies somewhere else. This is par for the course with performance tuning, and part of the reason why the process is **iterative**. It can be hard to anticipate whether a code change you are making in one function will decrease performance somewhere else down the road, and so landing in this situation from time to time is unavoidable. Although you may not be able to always prevent it, using good documentation practices and a tool like Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE to quantify performance changes can help you see when it is happening...

SEE THE REST OF SHANNON'S BLOG:

#### Visit Go-Parallel.com


Browse other blogs exploring a range of related subjects at Go Parallel: Translating Multicore Power into Application Performance.



# New Parallel Programming Features in Intel® (Visual) Fortran Composer XE



by Steve Lionel, Developer Products Division

**Fortran programmers** have been doing parallel processing for many years using methods outside the Fortran standard such as auto-parallelization, OpenMP\* and MPI. Fortran 2008, approved as an international standard in late 2010, brought parallel programming into the language for the first time with not one, but two language features. (Of course, you can't be parallel with just one.)

This article provides a brief overview of these two new features, DO CONCURRENT and coarrays. The former is pretty easy to get one's head around; the latter is not.

#### DO CONCURRENT

Back in the early 1990s, an attempt was made at extending the Fortran 90 language for high performance computing and parallel processing. Called High Performance Fortran or HPF, it attempted to build on Fortran 90's array syntax in a way that permitted array operations to be done in parallel. HPF introduced the FORALL and WHERE constructs, PURE procedures with no side effects, and a number of intrinsic procedures for operations such as scatter/gather. While HPF was not widely adopted, some pieces of it were incorporated into the Fortran 95 standard, approved in 1997.

Of the various HPF features that persisted, none was perhaps more misunderstood than FORALL. Here's an example of a FORALL construct:

REAL :: A(10, 10), B(10, 10) = 1.0

FORALL (I = 1:10, J = 1:10, B(I, J) /= 0)
A(I, J) = REAL(I + J + 2)
B(I, J) = A(I, J) + B(I, J) \* REAL(I \* J)
END FORALL

The parenthesized list after the FORALL keyword is called the *forall-header*. It has one or more *forall-triplets* that specify the range of values taken on by the index-name. In this example we have two forall index names, I and J, which are each specified to take values from 1 to 10. The increment, if not specified by the third element in the triplet, is 1, just as in Fortran 90 array notation. The last part is the mask expression that determines the conditions under which the FORALL construct body (the two assignments) is executed.

Many Fortran programmers looked at FORALL and saw a loop, or in this case, two nested loops, perhaps with an IF at the top that skips to the next iteration if the expression is false. But FORALL is **not** a loop construct; it is a "masked array assignment." If you try to think of this as a loop, you might expect each iteration to execute both assignments, and that these could be done in parallel. But that's not how FORALL was defined. Instead, the first assignment is executed completely, across all combinations of all the index names, filtered by the mask. Then, the second assignment is executed completely, again across all combinations and filtered by the mask. Inside a FORALL construct, an assignment statement may reference functions if they are PURE, but the only statement types allowed in a FORALL are assignment statements, WHERE constructs, or other FORALLs.

FORALL was a noble experiment, but the rules were too restrictive to be amenable to doing the assignments in parallel and it did not meet the needs of the Fortran community. So, Fortran 2008 brings what I call "FORALL Done Right": DO CONCURRENT.

A DO CONCURRENT construct looks like a blend of traditional DO and FORALL. In fact, the beginning of a DO CONCURRENT uses the FORALL header syntax. For example:

```
DO CONCURRENT (I=1:N)
T = A(I) + B(I)
C(I) = T + SQRT(T)
END DO
```

As with FORALL, the mask is optional. If present, it reduces the set of active combinations of the index names to those where the mask expression is true. Unlike FORALL, each execution of the range (the body) of a DO CONCURRENT is an iteration, and the iterations are executed independently for all the active index combinations.

There are some restrictions on what you can have in a DO CONCURRENT. For example, you can't RETURN or GO TO out of the construct, and you can't reference a variable that is defined or made to be undefined by another iteration. You can even do I/O in a DO CONCURRENT, so long as a record written by one iteration is not read by another. As with FORALL, any procedure called from within the construct must be PURE (which guarantees that it has no side effects). Note that it is the programmer's responsibility to ensure that there are no dependencies between loop iterations—the compiler is not required to check these for you.

DO CONCURRENT is supported as of Intel<sup>®</sup> [Visual] Fortran Composer XE 2011 and the compiler will attempt to execute the construct in parallel if you have enabled auto-parallelization (/Qparallel or -parallel). However, there is no guarantee that any particular DO CONCURRENT will be run in parallel, and, of course, the order in which the iterations run is unpredictable. As a side effect, use of DO CONCURRENT can also help with automatic vectorization, as you are guaranteeing that there are no loop-carried dependencies.

#### Coarrays

If you are an MPI programmer, you know the basic drill: collect some data, call MPI\_SEND to send it to a copy of your program running on another "node," and then use MPI\_RECV to get results back. (This is a simplification, of course.) Wouldn't it be nice to be able to "reach out and touch" the other copies of your program using normal Fortran syntax, and not have to worry about adding calls to move data around?

Coarray Fortran, first proposed in the 1990s as an extension of Fortran 90 called F-- (F minus minus), provides simple syntax for adding parallelism to a Fortran program. (The syntax is simple, though the definition and implementation are not.) It was implemented by Cray for its T3E and X1 supercomputers in the early 2000s, and was added, in a modified and somewhat reduced form, into the Fortran 2008 standard. Intel released the first full implementation of Fortran 2008's coarrays for mainstream computers in the Intel® (Visual) Fortran Composer XE 2011 release for Linux\* and Windows.\*

The fundamental concepts of Coarray Fortran are these:

- Image: Multiple copies of your application run in parallel; each is called an *image*.
- Coarray: Variables become coarrays when they are given the CODIMENSION attribute. Somewhat confusingly, scalars can also be coarrays – the standard defines a coarray as any entity with a non-zero corank, and these can be scalars or arrays. Codimensions (and coindices) are denoted with square brackets [].

Coarrays are split up across all the images of your application, so that a portion of each coarray resides in the local memory of an individual image. This property is associated with the Partitioned Global Address Space (PGAS) parallel programming concept. Here, coarrays exist in a shared "address space," but image-specific segments are individually addressable. Let's look at a simple example.

We will declare an array A with dimension 10x20 and with one codimension:

`real, dimension(10,20), codimension[*] :: A`

It helps if you think of codimensions as additional dimensions, and indeed the Fortran standard limits the sum of the number of dimensions and codimensions to fifteen. (Intel® Fortran supports 31 as an extension.) The last upper *cobound* in the codimension must be **\***; at runtime this takes on the value of the number of images. If when run there are eight images, the cobounds of A are 1:8.

As with dimensions, you can have multiple codimensions with lower and upper bounds, and as with dimensions, only the last one may have \* as an upper bound. So we might have:

`integer, codimension[4,2:6,3:*] :: B`

When you reference a coarray, you can do so with or without the coindices, which are enclosed in square brackets. If no coindices are present, you are referencing your image's piece of the coarray. If the coindices are present, you are specifying the coindex of the image you want.

Now, at this point you might be asking what happens if there aren't enough images to fill up the coindices. Referencing a coindex that maps to a nonexistent image is an error, just as an out-of-bounds reference to a regular array would be. Unlike a regular array, however, the "shape" of a coarray may be ragged. Using the B example above, 20 images are needed to fill in each "layer" of the coarray. If there are, say, 39 images, there is a coindex [3,6,4], but not [4,6,4]. (Remember that Fortran does things in column-major order, where the left subscript varies the fastest.)

Intrinsic procedures are provided to allow you to find the number of images, index of your own image, and the cobounds of any coarray.

What makes coarrays so nice is that they are integrated thoroughly into the Fortran language. You can use a coarray in most places where a regular variable is allowed, such as:

- Expressions and assignments
- Arguments to procedure calls
- I/O statements

This makes a program using coarrays look clean. For example, consider a Jacobian solver that breaks up the problem into blocks of data. Most of the calculation involves an image's local block, but at the edges of each block it needs to consider values from "halo cells," those on the edge of adjacent images' chunks. Here's what such code might look like using coarrays (**Figure 1**):

my\_subgrid( 0, 1:my\_M) = my\_subgrid( my\_N, 1:my\_M)[my\_north\_P,me\_Q]

my\_subgrid( my\_N+1, 1:my\_M) = my\_subgrid( 1, 1:my\_M)[my\_south\_P,me\_Q]

my\_subgrid( 1:my\_N, my\_M+1) = my\_subgrid( 1:my\_N, 1 )[me\_P, my\_east\_Q]

my\_subgrid( 1:my\_N, 0 ) = my\_subgrid( 1:my\_N, my\_M )[me\_P, my\_west\_Q]

#### Figure 1

Fortran defines additional coarray behaviors that ease programming. For example:

- You can have ALLOCATABLE coarrays (and in fact this is the most common usage), where every allocation is a synchronization point, to make sure that all images have allocated their coarrays consistently and completely.
- All images can do I/O. Normally, each has its own set of unit numbers, but the language says that "standard output" (unit 6) is preconnected on all images. While an implementation is not required to "merge the streams," Intel Fortran does, so all standard output writes get displayed on the console where the image is run. "Standard input" (unit 5) is preconnected on image 1 only.
- Every image has an implicit synchronization point at its start and again at its end.

The language provides several methods of synchronization among images. The SYNC ALL statement causes all images to wait until all of them have executed that SYNC ALL the same number of times. SYNC MEMORY makes sure that all memory updates have completed before continuing. SYNC IMAGES is like SYNC ALL, but you restrict the synchronization to a specified set of images.

There are also locks, declared using the LOCK\_TYPE defined in intrinsic module ISO\_FORTRAN\_ENV, and LOCK and UNLOCK operations on these. Lastly, there is the ability to do atomic (uninterrupted) reads and writes of integer and logical variables through the ATOMIC\_DEFINE and ATOMIC\_REF intrinsic procedures (these last are newly supported as of Intel<sup>®</sup> Fortran Composer XE 2013).

With Intel® [Visual] Fortran Composer XE you get coarray support in a "shared memory" mode, running on a single system. To build a coarray program, just add the -coarray (or /Qcoarray) compiler option and then run the executable as normal. No special configuration is required. Adding support for a "distributed memory" model across a cluster requires that you also have a license for Intel® Cluster Studio (in addition to having a cluster). Yes, this applies to Windows clusters too. (Support for distributed-memory coarray applications on Windows was added in Update 6 of Intel® Visual Fortran Composer XE 2011.)

For further reading about Fortran 2008, including coarrays and DO CONCURRENT, you can refer to the following documents from the Fortran standards committee:



### Using the Intel<sup>®</sup> Math Kernel Library (Intel<sup>®</sup> MKL) and Intel<sup>®</sup> Compilers to Obtain Run-to-Run Numerical Reproducible Results

by Todd Rosenquist, Technical Consulting Engineer, Intel<sup>®</sup> Math Kernel Library and Shane Story, Manager of Intel<sup>®</sup> MKL Technical Strategy

Floating-point applications from Hollywood to Wall Street have long faced the challenge of providing both great performance and exactly the same results from run to run, or in other words, reproducible results. While the main factor causing a lack of reproducible results is the non-associativity of most floating-point operations, there are other contributing factors such as runtime-selectable optimized code paths, non-deterministic threading and parallelism, array alignment, and even the underlying hardware floating-point control settings.

In this article for Intel<sup>®</sup> software tool users and programmers, we outline how to use the Intel<sup>®</sup> Math Kernel Library (Intel<sup>®</sup> MKL) and Intel<sup>®</sup> compiler features to balance performance with the reproducible results applications require. These new reproducibility controls in Intel<sup>®</sup> Parallel Studio XE 2013 help make consistent results from run to run possible:

#### INTEL® SOFTWARE TOOLS REPRODUCIBILITY CONTROLS

Intel® MKL 11.0

- mkl\_cbwr\_set()
- MKL\_CBWR (environment variable)

Intel<sup>®</sup> Composer XE 2013

- -fp-model or /fp
- KMP\_DETERMINISTIC\_REDUCTION=yes

After many years of seeing software performance increase with processor clock speed, the last half-decade has seen the flattening of clock rates and the increasing availability of multicore systems. With each successive generation of microprocessors, improvement in software performance requires the use of newly added instructions to exploit the capabilities of the processor, as well as threaded algorithms designed to leverage the growing number of computational cores. To keep up with these changes, many developers turn to software tools. Optimizing compilers exploit opportunities for instruction and data-level parallelism and can automatically thread computationally intensive portions of a program. Software libraries provide tools to thread your code or allow you to extract parallelism automatically through calls to highly optimized, threaded functions. Many software programmers have adopted and use these high performance tools to extract greater levels of performance. In doing so, the likelihood of generating inconsistent results from run to run has grown.

Let's consider two scenarios. Artists in animation studios work every day with advanced modeling tools that allow them to move their actors through a virtual world. These modeling tools include physics engines that can simulate the real-world behavior of clothes, hair, or fluids, and therefore will naturally use floating-point models similar to those used in science and engineering applications. While accuracy and precision may not always be the first concern, especially in early stages of the process, getting the same results can be of the utmost importance. If a cloak follows a slightly different trajectory each time the artist runs through a multi-second sequence, the artist has lost some control over the creative process. Which trajectory will be used when the scene goes through further rendering and post-processing steps? The problem would be compounded by the fact that a single scene may have many such models that may interact to produce completely unpredictable results.

A second scenario involves mathematicians on Wall Street who develop algorithms for various applications from options pricing to risk analysis. In this field, getting results quickly means money—and sometimes a lot of money. The "quants" who develop these algorithms are faced with a balancing act between getting the answer quickly and the simulation time required to provide the most reliable answer. An increase in the performance of an algorithm can mean a decision sooner or a better decision in the same amount of time—a win in either case. However, optimized floating-point calculations that are a part of these models can often introduce rounding error. This means that if an earlier decision must be revisited and the model run again, it is possible that the result might be slightly different. The uncertainty can result in questions or issues later that programmers would prefer to avoid.

These are just two of many scenarios<sup>1</sup> encountered over the last few years by users of the Intel® Math Kernel Library (Intel® MKL), a popular library of highly optimized parallel floating-point math functions that customers in many application areas have used successfully for over 15 years. For application programmers who demand reproducible results, there have been no guarantees, only the limited option of running a sequential version of the library.

### "Floating-point applications from Hollywood to Wall Street have long faced the challenge of providing both great performance and exactly the same results from run to run, or in other words, reproducible results."

So, what exactly is the reproducibility problem? The issue is rooted in the way floating-point numbers are represented, the order in which they are operated on by the computer, and the rounding errors that may be introduced. It is a well-known fact that for general floating-point numbers represented in an IEEE single or double precision format<sup>2</sup>, the mathematical associative property does not in general hold.<sup>3</sup> In simpler terms, (a + b) + c may not equal a + (b + c).

It may help to consider a specific example. With pencil and paper, $2^{-63} + 1 + (-1) = 2^{-63}$. If we instead do this same computation on a computer using double precision floating-point numbers, we get $(2^{-63} + 1) + (-1) \approx 1 + (-1) = 0$, since $(2^{-63} + 1)$ rounds to 1, or possibly $2^{-63} + (1 + (-1)) = 2^{-63} + 0 = 2^{-63}$ through a slight modification in the order of operations. Clearly 0 does not equal $2^{-63}$, so the order of operations influences not only how and when rounding occurs but also the final computed result. Compilers typically refer to this ordering ambiguity as re-association.
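The two groupings can be checked directly. The following is a minimal sketch in Python, whose `float` type is an IEEE double on common platforms; the variable names are illustrative only.

```python
# Python floats are IEEE 754 doubles on common platforms.
tiny = 2.0 ** -63   # far smaller than 2**-53, half the gap between 1.0 and the next double

# Grouping 1: the tiny term is absorbed when added to 1.0, then 1.0 is cancelled.
left = (tiny + 1.0) + (-1.0)

# Grouping 2: 1.0 and -1.0 cancel exactly first, so the tiny term survives.
right = tiny + (1.0 + (-1.0))

print(left)            # 0.0
print(right == tiny)   # True
print(left == right)   # False
```

Both expressions are mathematically identical; only the grouping differs, and that alone changes the result.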

Introducing application-level parallelism further increases the likelihood of producing nonreproducible results. The reason is a direct carryover from the order of operations argument just described. Whenever work is distributed among multiple threads or processes, any change in the order of operations within a computational dependency chain may result in a difference not only in the intermediate results, but also in the final computed results. Straightforward array element sum and product reduction operations are simple examples when the array elements have been distributed across multiple threads; partial sums or products are computed and then combined across threads into a single value. Any change in how the arrays are distributed, or the order in which a thread-specific sum or product is combined with another, may influence the final reduced sum or product. More broadly, how to handle parallelism in a consistent and predictable way falls under the category of deterministic parallelism.<sup>4</sup>
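The effect of distributing a sum reduction can be sketched without real threads. The example below is an illustration under stated assumptions, not how any particular threading runtime works: `chunked_sum` and `sequential_sum` are hypothetical helpers that split an array into contiguous chunks (one per simulated worker), sum each chunk, and combine the partial sums in worker order. An explicit loop is used instead of the built-in `sum`, since newer Python versions may apply extra-accuracy techniques there.

```python
def sequential_sum(values):
    """Add values strictly left to right."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, num_workers):
    """Split values into contiguous chunks, sum each, then combine in worker order."""
    chunk = -(-len(values) // num_workers)  # ceiling division
    partials = [sequential_sum(values[i:i + chunk])
                for i in range(0, len(values), chunk)]
    return sequential_sum(partials)

e = 2.0 ** -53               # half the gap between 1.0 and the next double
data = [1.0, e, e, e]

one_worker = chunked_sum(data, 1)    # ((1 + e) + e) + e: each e is rounded away
two_workers = chunked_sum(data, 2)   # (1 + e) + (e + e): the pair e + e survives

print(one_worker == two_workers)     # False
```

The same four inputs give two different sums purely because the number of workers changed how the additions were grouped.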

When you consider that a typical application may do millions of floating-point operations, it becomes readily apparent how the order of operations influences the final computed results.

#### Intel Math Kernel Library

Intel MKL 11.0 introduces Conditional Numerical Reproducibility functions to help users obtain reproducible floating-point results from Intel MKL functions under certain conditions.<sup>5</sup> When using these new features, Intel MKL functions are designed to return the same floating-point results from run to run, subject to the following limitations:

- Input and output arrays in function calls must be aligned on 16-, 32-, or 64-byte boundaries on systems that support SSE, AVX1, or AVX2 instructions, respectively.
- The number of threads must remain the same from run to run for the results to be consistent.
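The alignment requirement can be illustrated with a small sketch. It uses Python's standard `ctypes` module only to show the over-allocate-and-offset idea; `ALIGNMENT_BYTES` and `aligned_double_array` are hypothetical names, and a C or Fortran program calling the library would typically use an aligned allocator instead.

```python
import ctypes

# Alignment per instruction set, as listed in the limitations above.
ALIGNMENT_BYTES = {"SSE": 16, "AVX1": 32, "AVX2": 64}

def aligned_double_array(count, alignment):
    """Return a ctypes double array whose first element is on an `alignment`-byte boundary."""
    nbytes = count * ctypes.sizeof(ctypes.c_double)
    raw = ctypes.create_string_buffer(nbytes + alignment)   # over-allocate
    offset = (-ctypes.addressof(raw)) % alignment            # bytes to skip to the next boundary
    # The returned array shares memory with `raw`, starting at the aligned offset.
    return (ctypes.c_double * count).from_buffer(raw, offset)

arr = aligned_double_array(1000, ALIGNMENT_BYTES["AVX2"])
print(ctypes.addressof(arr) % ALIGNMENT_BYTES["AVX2"])  # 0
```

Aligning to the largest boundary (64 bytes) also satisfies the 16- and 32-byte cases, which keeps the allocation strategy the same across systems.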

The application-related factors within a single executable program that affect the order in which floating-point operations are computed include code path selection based on runtime processor dispatching, data array alignment, variation in number of threads, threaded algorithms, and internal floating-point control settings. Up until now, users were unable to control the library's runtime dispatching and how its functions were internally threaded. However, they were able to manage the number of threads, check the floating-point settings, and take steps to align memory when it is allocated.<sup>6</sup>

Intel MKL does runtime processor dispatching in order to identify the appropriate internal code paths to traverse for the Intel MKL functions called by the application. The code paths chosen may differ across a wide range of Intel® processors and IA-compatible processors, and may provide varying levels of performance. For example, an Intel MKL function running on an Intel® Pentium® 4 processor may run an SSE2-based code path. On a more recent Intel® Xeon® processor supporting Intel® Advanced Vector Extensions (AVX) that same library function may dispatch to a different code path that uses AVX instructions. This is because each unique code path has been optimized to match the features available on the underlying processor. This feature-based approach to optimization, by its very nature, amplifies the reproducibility challenges already described. If any of the internal floating-point operations are done in a different order, or are re-associated, then the computed results may differ.
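To see why a different code path can change a result, consider a toy model of vectorized summation. This is only a sketch under assumptions and does not reflect Intel MKL's actual internals: `lane_sum` and `CODE_PATHS` are hypothetical, and the model relies only on the fact that a 128-bit register holds two doubles while a 256-bit register holds four.

```python
def lane_sum(values, lanes):
    """Accumulate values into `lanes` interleaved partial sums, then add the lanes in order."""
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v      # element i lands in lane i mod width
    total = 0.0
    for p in partial:                # horizontal reduction, lane 0 first (an assumption)
        total += p
    return total

# Hypothetical dispatch table: doubles per vector register for each code path.
CODE_PATHS = {"sse2": 2, "avx": 4}

e = 2.0 ** -53
data = [1.0, e, e, e]

results = {name: lane_sum(data, width) for name, width in CODE_PATHS.items()}
print(results["sse2"] == results["avx"])  # False
```

With two lanes, two of the small terms are added to each other before meeting 1.0 and survive; with four lanes, each small term meets the running total alone and is rounded away.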


### **RESOURCES AND SITES OF INTEREST**




#### Go Parallel

**The mission** of Go Parallel is to assist developers in their efforts toward "Translating Multicore Power into Application Performance." Robust and full of helpful information, the site is a valuable clearinghouse of multicore-related blogs, news, videos, feature stories, and other useful resources.

#### "What If" Experimental Software

What if you could experiment with Intel's advanced research and technology implementations that are still under development? And then what if your feedback helped influence a future product? It's possible here. Test drive emerging tools, collaborate with peers, and share your thoughts via the What If blogs and support forums.

#### Intel<sup>®</sup> Software Network

**Check out a range** of resources on a wide variety of software topics for a multitude of developer communities ranging from manageability to parallel programming to virtualization and visual computing. This content-rich collection includes Intel<sup>®</sup> Software Network TV, popular blogs, videos, tools, and downloads.

#### Step Inside the Latest Software

See these products in use, with video overviews that provide an inside look into the latest Intel<sup>®</sup> software. You can see software features firsthand, such as memory check, thread check, hotspot analysis, locks and waits analysis, and more.

Intel<sup>®</sup> Inspector XE

Intel<sup>®</sup> VTune<sup>™</sup> Amplifier XE

#### Intel<sup>®</sup> Software Evaluation Center

**The Intel<sup>®</sup> Software Evaluation Center** makes 30-day evaluation versions of Intel® Software Development Products available for free download. For high performance computing products, you can get free support during the evaluation period by creating an Intel® Premier Support account after requesting the evaluation license, or via Intel® Software Network Forums. For evaluating Intel® Parallel Studio, you can access free support through Intel® Software Network Forums only.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

#### Sign up for future issues | Share with a friend

The Parallel Universe is a free quarterly magazine. Click here to sign up for future issue alerts and to share the magazine with friends.



# Announcing Intel® Parallel Studio XE 2013: Your performance toolset.

Get the mature toolset with an incomparable breadth and depth of features for developers—and accelerate application performance. Intel® Parallel Studio XE combines industry-leading compilers, performance and parallel libraries, error checking and performance profiling tools for C/C++ and Fortran.

#### DISCOVER THE PERFORMANCE IMPACT FOR YOUR APPLICATIONS




©2012, Intel Corporation. All rights reserved. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. \*Other names and brands may be claimed as the property of others.


# INTEL® SOFTWARE ADRENALINE

### Introducing Intel<sup>®</sup> Software Adrenaline

Enter a world of doers, dreamers, and software industry luminaries.

This new magazine puts you on the forefront of the software industry. Roadmaps, R&D, industry pioneers and game changers, news, and more.

# FREE SUBSCRIPTION softwareadrenaline.intel.com


