## Introduction

On modern CPUs it is possible to carry out identical floating point operations (e.g. addition) on multiple arguments at once provided the arguments are stored consecutively in memory. These are often called SIMD instructions (Single Instruction, Multiple Data). The first instruction sets that offered this possibility were the Intel SSE and SSE2 instructions. These allowed you to operate on 4 single precision numbers (SSE) or 2 double precision numbers (SSE2) at once. To do this, the CPU would load the arguments into special 128-bit wide registers (called xmm registers) and operate on those. In 2011 Intel shipped the first CPUs that supported the AVX instruction set. These instructions doubled the capacity by introducing 256-bit wide ymm registers allowing the processor to work on 8 single precision or 4 double precision numbers simultaneously. At that point the number of registers had already been doubled from 8 to 16 by AMD. In 2015 Intel will ship the first CPUs that support the AVX512 instruction set which again doubles this capability. It provides 512-bit wide zmm registers allowing the processor to work on 16 single precision or 8 double precision numbers simultaneously. The number of registers will be doubled again from 16 to 32. Wikipedia provides a more detailed description of the SSE/SSE2, AVX/AVX2, and AVX512 instruction sets.

## Vectorized Math Routines

These instruction sets make it interesting to create versions of transcendental functions that operate on multiple arguments simultaneously. This will greatly speed up the execution of these routines by taking advantage of the implied parallelism of the SIMD instructions. A typical example where you want to use such a routine is when you want to calculate Boltzmann factors for an updated electron temperature for every frequency in the mesh. This task is carried out in tfidle(). The scalar code for that would look something like this (note that the original code was more optimized than this, we use this to keep the example simple):

for( long i=0; i < rfield.nflux_with_check; ++i ) rfield.ContBoltz[i] = exp(-rfield.anu(i)/phycon.te_ryd);

The vectorized math routines require both the arguments and the results to be stored consecutively in memory, so in order to use a vectorized version of the exp() function, we need to create an array of arguments first. After that we can call the vectorized version of exp() called vexp():

for( long i=0; i < rfield.nflux_with_check; ++i ) rfield.vexp_arg[i] = -rfield.anu(i)/phycon.te_ryd; vexp(rfield.vexp_arg, rfield.ContBoltz, 0, rfield.nflux_with_check);

Note that there will be some extra overhead due to the fact that we need to allocate and initialize an array of arguments. Normally the speed of the vectorized math array will more than make up for that. You should however realize that allocating arrays can cost a substantial amount of CPU time (the malloc() routine is amazingly slow). Below we will discuss some strategies for dealing with that.

In general the vectorized version of func() will be called vfunc() and will be provided both in single and double precision versions (they are overloaded, so both versions will be called vfunc()). In Cloudy we rarely use single precision math functions. This is a remnant of old standards which only provided the double precision math functions. Nearly always the single precision version of a given math function will be faster (since less digits precision need to be produced). This is certainly the case for the vectorized version since it has the added advantage that it can calculate twice as many values simultaneously. So you are strongly encouraged to use single-precision versions when you can get away with the reduced precision and/or reduced exponent range.

The vectorized math functions always use the following API:

vfunc(const double arg1[], const double arg2[], ..., double res1[], double res2[], ..., long ilo, long ihi); vfunc(const realnum arg1[], const realnum arg2[], ..., realnum res1[], realnum res2[], ..., long ilo, long ihi);

In most cases there will only be one argument and one result vector. The arguments ilo and ihi give the lower and upper array index to process, following the usual C++ convention. So the call above would be equivalent to the following scalar code:

for( long i=ilo; i < ihi; ++i ) res[i] = func(arg[i]);

or more general

for( long i=ilo; i < ihi; ++i ) res[i] = func(arg1[i], arg2[i], ...); for( long i=ilo; i < ihi; ++i ) func(arg[i], &res1[i], &res2[i], ...);

depending on what the exact syntax of func() is.

There is a second API that looks more similar to a scalar call of the math function:

vfunc(double res[], double arg0, double arg1, double arg2, double arg3, ...); vfunc(realnum res[], realnum arg0, realnum arg1, realnum arg2, realnum arg3, ...);

Note that the arguments are no longer arrays and can be located in arbitrary places in memory. For double precision arguments there are versions with 4 or 8 arguments, while for single precision arguments there are versions with 4, 8, or 16 arguments. If you want to call such a routine with a number of arguments that is not a multiple of 4, just pad it with safe dummy arguments, e.g.:

double y[4]; vexp(y, a, b, c, 0.);

where the last argument is just a dummy. Remember that this dummy argument doesn't cost us anything extra as all arguments are processed simultaneously! When there are two arguments the syntax is as follows:

double z[4]; vhypot(z, x0, y0, x1, y1, x2, y2, x3, y3);

These routines come in versions with 4x2 arguments for double precision, and 4x2 or 8x2 arguments for single precision.

The vectorized math routines in Cloudy only support the AVX/AVX2 and AVX512 instruction sets. No special code was created for SSE/SSE2 versions since at the time these routines were made, AVX-capable hardware was already on the market for more than 4 years and should be available widely. Support for AVX/AVX2 or AVX512 instructions will be detected automatically by the compiler and configure script. For legacy non-supporting platforms there will be an automatic fallback to scalar routines.

## Memory Alignment and Allocating Arrays

The SIMD instructions read 16 bytes (SSE/SSE2), 32 bytes (AVX/AVX2), or 64 bytes (AVX512) of memory at once. For most SIMD instructions the memory chunk they read needs to be aligned on the same boundary, otherwise a segfault will happen. So for an AVX instruction the memory address needs to be a multiple of 32 bytes, and for an AVX512 instruction even a multiple of 64 bytes. Standard memory allocation routines like malloc() or *new* do not guarantee that the pointers they return are suitably aligned. Usually they return pointers that are aligned on a 16-byte boundary. The vectorized math routines in Cloudy have code protecting against incorrectly aligned data, so in principle you do not need to worry about this. But this protection comes at a cost. In the worst case, the math routine needs to allocate intermediate arrays that are correctly aligned and copy the results over to the final array resulting in a performance penalty. So for optimal performance it is prudent to think about the alignment of the arrays that are passed to the vectorized math routine. There are two things that can happen:

- The input and output arrays have the same alignment (this means that the address of arg[i] and val[i] have the same value modulo 32 or 64), but arg[ilo] and val[ilo] are not aligned on a 32 or 64-byte boundary. This is solved using a technique called loop peeling and is a minor problem, especially if you are operating on large arrays. Usually no special action is required in this case.

- The input and output arrays have different alignment. This problem cannot be solved with loop peeling, so in this case the vectorized routines will allocate a temporary output array with the same alignment as the input array. When the calculations are done, the results will be copied from the temporary array to its final destination. This gives noticeable overhead and is best avoided, though this is still faster than using a normal scalar loop.

To avoid problems with different alignment of the input and output loop, several options are available.

#### avx_ptr

If the input and output arrays are temporary, the most convenient solution is to use an avx_ptr:

avx_ptr<double> arg(n), val(n); for( long i=0; i < n; ++i ) arg[i] = ...; vexp(arg.ptr0(), val.ptr0(), 0, n);

The avx_ptr class will hold a pointer to a correctly aligned array on a 32- or 64-byte boundary. The pointer is obtained from a static pool of arrays. The array will be automatically released once the avx_ptr variable goes out of scope. The static pool is used to avoid the considerable overhead from calling malloc() or *new*. The memory given to you by avx_ptr will **not** be initialized to zero, thus avoiding the overhead of initializing the memory twice: first to zero and then to the value you really want.

This pool only has a limited supply of arrays, hence **this class is designed for temporary arrays only**. Use an avx_ptr inside a routine where the arrays can be released quickly, much like you would use an array declared on the stack. If you need permanent storage, use one of the other solutions below. You can index an avx_ptr just like any other array and it comes with bounds checking using the -DBOUNDS_CHECK macro. The array as declared above can be indexed between 0 and n. The avx_ptr class also has a second constructor:

avx_ptr<double> arg(n1, n2), val(n1, n2); for( long i=n1; i < n2; ++i ) arg[i] = ...; vexp(arg.ptr0(), val.ptr0(), n1, n2);

This will allocate an array that you can index between n1 and n2, and arg[n1] will be correctly aligned on a 32- or 64-byte boundary. Indexing this version outside the range [n1, n2) is a bug. Only the memory between n1 and n2 is allocated.

#### vector_avx

The vector<> class allows you to overload the memory allocator (note that the valarray<> class does not!). You need to declare the array as you would normally, but with one extra argument, e.g.:

vector< realnum, allocator_avx<realnum> > arg(n);

This assures that the memory of the vector is correctly aligned on a 32- or 64-byte boundary. This can be used for allocating temporary or permanent storage, though for temporary storage an avx_ptr is usually preferred. Once we start using C++11, the following shorthand will be available to do the same thing:

vector_avx<realnum> arg(n);

Note that the vector class will initialize the array to zero before handing it to you. So only use this method when you plan to reuse the array several times to mitigate the overhead of the extra initialization. To pass the pointer to the internal storage of the vector<> to the math routine, you need to use get_ptr():

vexp(get_ptr(arg), get_ptr(val), ilo, ihi);

#### stack arrays

You can allocate correctly aligned arrays on the stack using:

ALIGNED(CD_ALIGN) arr[100];

This solution is suitable for allocating temporary arrays, though it should not be used for large arrays since that could lead to stack overflows. The memory allocated this way will not be initialized before it is passed on to the user.

## Dealing With Invalid Arguments

The vectorized math routines we implemented in Cloudy do not fully comply with the IEEE 754 standard. Full compliance would be complex and significantly slow down the vectorized routines. In this section we will describe the general approach we take in Cloudy. In the next section we will give full details for each of the individual math routines.

- We completely ignore error conditions due to underflow to zero or inexact results.
- If an invalid argument is passed to the routine (this is NaN or a finite number that would yield an invalid result, i.e. NaN or +/-Inf) we will throw a domain_error() exception as a rule. This exception is caught by the main program which will print an informative message.
- Gradual underflow slows down the routines, so is in general not supported. Subnormal numbers as input will be accepted, but unless we can treat them without loss of speed, we will treat them as if they had underflowed to zero. Subnormal numbers as output will normally be flushed to zero.

## Description of Individual Routines

The behavior of the math routines sometimes depends on whether the raising of a floating point exception flag is trapped (resulting in a crash of the program due to a floating point exception or FPE) or not (in which case the error condition is ignored and the program continues executing). When we need to discriminate between these two cases, a trapping environment will be marked TE below, and a non-trapping environment NTE. For brevity will we denote a domain_error() as DE. A signaling NaN will be marked SNaN and a quiet NaN as QNaN. For each routine the maximum absolute value of the relative error of the result in u.l.p. (unit of least precision) is given. These were determined by comparing te result with the linux glibc math library result.

#### vexp, vexp10, vexpm1

input | vexp<double> | vexp10<double> | vexpm1<double> | vexp<realnum> | vexp10<realnum> | vexpm1<realnum> | -------------------------------------------------------------------------------------------------------------------- SNaN | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | QNaN | DE | DE | DE | DE | DE | DE | +Inf | DE | DE | DE | DE | DE | DE | -Inf | 0. | 0. | -1. | 0.f | 0.f | -1.f | minarg | -708.396418532 | -307.652655569 | -37.4299477502 | -87.33655548f | -37.92978287f | -17.32868004f | x < minarg | 0. | 0. | -1. | 0.f | 0.f | -1.f | maxarg | +709.782712893 | +308.254715560 | +709.782712893 | +88.72283173f | +38.53183746f | +88.72283173f | x > maxarg | DE | DE | DE | DE | DE | DE | maxerr/ulp | 2.5 | 8.5 | 1.0 | 3.0 | 6.5 | 1.0 | -------------------------------------------------------------------------------------------------------------------- note: these routines flush subnormal results to zero.

#### vlog, vlog10, vlog1p

input | vlog<double> | vlog10<double> | vlog1p<double> | vlog<realnum> | vlog10<realnum> | vlog1p<realnum> | --------------------------------------------------------------------------------------------------------------------- SNaN | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | QNaN | DE | DE | DE | DE | DE | DE | +/-Inf | DE | DE | DE | DE | DE | DE | minarg | 0. | 0. | -1. | 0.f | 0.f | -1.f | x <= minarg | DE | DE | DE | DE | DE | DE | maxerr/ulp | 1.0 | 1.5 | 1.0 | 1.0 | 2.0 | 1.0 | --------------------------------------------------------------------------------------------------------------------- note: subnormal numbers as input are treated as zero and result in DE.

#### vsqrt, vhypot

input | vsqrt<double> | vhypot<double,double> | vsqrt<realnum> | vhypot<realnum,realnum> | ------------------------------------------------------------------------------------------------- SNaN | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | QNaN | DE | DE | DE | DE | +/-Inf | DE | DE | DE | DE | minarg | 0. | N/A | 0.f | N/A | x < minarg | DE | --- | DE | --- | maxerr/ulp | 2.0* | 2.0* | 2.0* | 2.0* | ------------------------------------------------------------------------------------------------- note 1: the sqrt() function can be implemented using either a hardware instruction or a software implementation. On some hardware the software implementation is faster. Both are supplied, but by default the hardware implementation is used. note 2: the software implementation of sqrt() will treat subnormal numbers as zero, for the hardware implementation the behavior depends on the settings in the hardware control register. note 3: some combinations of arguments to hypot() can result in an overflow (e.g., when both x and y are 0.9*DBL_MAX). This will result in: TE:FPE, NTE:QNaN. * maxerr/ulp pertains to the software implementation.

#### vasinh, vfast_asinh

input | vasinh<double> | vfast_asinh<double> | vasinh<realnum> | vfast_asinh<realnum> | --------------------------------------------------------------------------------------------- SNaN | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | TE:FPE, NTE:DE | QNaN | DE | DE | DE | DE | +/-Inf | DE | DE | DE | DE | minarg | N/A | 0. | N/A | 0.f | x < minarg | --- | DE | --- | DE | maxarg | N/A | 1.340780792994e+154 | N/A | 1.844674297e+19f | x > maxarg | --- | DE | --- | DE | maxerr/ulp | 1.5 | 1.0 | 2.0 | 1.0 | --------------------------------------------------------------------------------------------- note 1: vfast_asinh() is a faster version of vasinh() with a more restricted domain. Within the valid domain both routines will yield the same results. note 2: no other hyperbolic routines are implemented as they are not needed at this time.