The multi_arr class in Cloudy

Introduction and history

When Cloudy was converted from Fortran to C, we were faced with the problem of how to effectively implement multi-dimensional arrays that could easily be passed from one routine to another. What we came up with is the concept of allocating arrays of pointers to arrays of pointers, etc., to the data (for conciseness I will call that the ARPA method below). This exploits the fact that in C (as well as C++) the following two ways of accessing data are indistinguishable:

double *d;
double x = *(d+2);

and

double d[3];
double x = d[2];

This could then be generalized to multi-dimensional arrays in the ARPA method as follows:

double **p;
p = (double**)malloc(n*sizeof(double*));
for( i=0; i < n; ++i )
    p[i] = (double*)malloc((i+1)*sizeof(double)); /* note: the size may depend on i */
/* .... */
double x = p[i][j];

This would be interpreted by the compiler as

double x = *(*(p+i)+j);

This is fundamentally different from a two-dimensional array that was declared as such (I will call this the regular method):

double p[N][N];
/* .... */
double x = p[i][j];

Taken at face value this looks the same as the previous example, but under the hood it is really different. This is how the compiler would execute the latter statement:

double x = *(p+i*N+j);

Note that there is only one indirection in this statement! However, the size of such static arrays is fixed at compile time.

The ARPA method has several advantages. The arrays are more flexible: you can for instance allocate triangular matrices as shown in the example above. This then allows precise bounds checking. Accessing p[i][i+2] in a triangular matrix would be a bug, but only in the ARPA representation would Purify or valgrind notice this (unless i >= n-2, in which case it would be caught in the regular representation as well). You can also easily pass them as arguments to other routines. In the regular method, the compiler needs to know the value of n in order to calculate the offset when indexing the array. This value would then have to be passed along with the base pointer as a routine argument. This is not only cumbersome, but also dangerous. If you change the array dimensions, but forget to adjust the routine argument as well, the indexing will be wrong. Bounds checking may not catch that; in general there is no guarantee.

But there is also a big disadvantage to the ARPA method. The computer needs extra indirections to calculate the data address. This can lead to poor cache performance and added overhead, especially for high-dimensional arrays. To make matters worse, the compiler has no way of knowing that the arrays of pointers are never changed. We never intend to change them, but the compiler cannot make that assumption (after all it is perfectly legal in C/C++ to change them). If we write:

for( i=0; i < n; ++i )
    for( j=0; j < i+1; ++j )
        p[i][j] = some_routine(i,j);

and p is global data (as is usually the case in Cloudy), then the compiler has to assume that some_routine may modify the pointers in the ARPA structure (or even modify p itself!), so it has to do the whole indirection again after each call to some_routine. Initial work on the big H2 model (which has lots of big multi-dimensional arrays) has shown that the overhead involved in accessing the arrays is indeed substantial.

When we moved to C++, we had the opportunity to create an entirely new container class called multi_arr for allocating and accessing multi-dimensional arrays. It is designed to reduce the overhead of accessing the data and thus speed up the code, but it should also retain the advantages of the ARPA method of precise bounds checking and easy and secure passing as routine arguments. Below I will describe this class in more detail. It has the added advantage that it enables us to easily change the memory layout of the multi-dimensional array. Only a handful of interface routines in the class definition itself need to be changed, the rest of the code remains untouched. Inlining of these interface routines assures that this approach is nevertheless efficient. This allows us to experiment and find the optimal representation. It even allows us to use different representations on different platforms if that works better!

The design philosophy

The primary goal of the multi_arr class is to provide multi-dimensional arrays with an efficient means of accessing the data. To achieve this, two internal representations are currently supported called C_TYPE and ARPA_TYPE. In a C_TYPE array the memory layout is guaranteed to be the same as in a regular C/C++ multi-dimensional array. This means that the data is allocated as a single monolithic block and indexing is done in almost the same way as in regular arrays.

There are two ways you can do the indexing. Let's assume we have a 3-dimensional array s with dimensions d0 x d1 x d2, which we index as s[i1][i2][i3]. We can then define the strides s2 = d2 and s1 = d1*d2. Indexing can be done in the following two ways:

s[i1][i2][i3] = *(s + (i1*d1 + i2)*d2 + i3)

or

s[i1][i2][i3] = *(s + i1*s1 + i2*s2 + i3)

They have the exact same number of operations, but it turns out that the second method is faster (presumably because there is more freedom to reorganize and pipeline the expression due to the shallower parse tree). So the second method has been chosen to index C_TYPE multi_arr objects. Note that as the array strides are determined when the array object is created rather than at compile time, they must generally be fetched from memory when the array is indexed.

In an ARPA_TYPE array the method of accessing the data is unchanged from what was described above, i.e. we have an array of pointers to an array of pointers, etc..., to the data. However, the allocation is done differently. For each level of indirection the pointer array is allocated as a single block of memory. Also the data is allocated as a single block (but with a different layout than a C/C++ array!). This means that a 3-dimensional ARPA_TYPE multi_arr of dimensions 10x10x10 will only allocate 3 monolithic blocks of memory, as opposed to the 1 + 10 + 10*10 = 111 blocks in the old algorithm! It turns out that this greatly speeds up access to the array elements (presumably due to better cache performance). It even rivals the C_TYPE algorithm described above in speed! Since the ARPA_TYPE array still has the advantage of saving memory on arrays that are not square (e.g. the triangular matrix described above) it is the default layout on all platforms.

But this is only part of the story! You often do repetitive tasks on consecutive array elements. This can be done very efficiently if the compiler knows that elements s[i1][i2][i3] and s[i1][i2][i3+1] are actually adjacent in memory. If the compiler understands this, it can simply increment the pointer &s[i1][i2][i3], rather than calculating &s[i1][i2][i3+1] from scratch (which is computationally much more demanding). To the human reader this seems obvious, but it is actually surprisingly hard for a compiler to understand this. Even when the index formula shown above is inlined, the compiler may well miss this point. It is therefore important for the programmer to help the compiler and show explicitly in the code that the memory is adjacent. For this purpose random-access iterators have been implemented as part of the multi_arr class. They will be discussed in more detail below. They can significantly speed up your code, especially in tight loops that are executed many times over.

Modern processors are very much geared towards efficiently accessing contiguous memory locations and then performing simple repetitive operations on those data. This is called vector processing. Cloudy currently performs very poorly on this point. There is virtually no code (with the exception of the Lapack routines) that can exploit these optimizations. We are therefore missing out on a lot of potential of modern processors. Using multi_arr structures, and especially iterators, may well open up possibilities to vectorize the code. This is a point that needs to be researched in more detail in the future.

An additional advantage of the multi_arr class is that the dimensions of the array are stored as part of the multi_arr, so they are automatically passed from one routine to another; this eliminates bugs caused by accidentally passing the wrong dimensions from the caller to the callee.

These are the things to remember from this section:

1. The multi_arr class provides multi-dimensional arrays with an efficient means of accessing the data.

2. The iterators are an integral part of the design philosophy which allow even more efficient access and possibly enable vector optimizations in the future.

Declaring and allocating the array

With the multi_arr class you can allocate n-dimensional arrays, with n in the range 2 ... 6. It would be possible to support higher-dimensional arrays by extending the class definition, but currently there is no need for that. You can use it to allocate arrays of any arbitrary type. This obviously includes basic types like bool, long, float, and double, but also (possibly user-defined) structs and classes. The multi_arr class template has four arguments to define the internal representation. Normally only the first two are supplied. The 3rd and 4th argument have default values that are set by compiler switches. They determine the memory layout and whether bounds checking is enabled. These compiler switches will be discussed further on.

Let's look at an example of a declaration. Let's assume you want to use a 3-dimensional array of double precision floating point numbers. You would declare that as follows:

multi_arr<double,3> arr;

Another example involving a user-defined class is

multi_arr<transition,6> H2Lines;

This declares our intention to use a multi-dimensional array and provides an interface, but so far nothing has been allocated yet! The simplest way to allocate storage is as follows:

arr.alloc(10,12,20);

or even shorter

multi_arr<double,3> arr(10,12,20);

This allocates a 10 x 12 x 20 block of doubles. Now let's return to our previous example of the n x n triangular matrix we used above. How would we implement that? The format for defining that is very similar to the old scheme of doing things that we used in the ARPA method.

multi_arr<double,2> p;
p.reserve(n);         // declare the first dimension
for( i=0; i < n; ++i )
    p.reserve(i,i+1); // declare that p[i] contains i+1 elements
p.alloc();            // this actually allocates the memory

Note that the memory is not available until after the call to p.alloc(). The general format for n-dimensional arrays is

p.reserve(n);         // declare the first dimension
for( i=0; i < n; ++i )
{
    p.reserve(i,i+1); // declare that p[i] contains i+1 elements
    for( j=0; j < i+1; ++j )
    {
        p.reserve(i,j,20); // declare that p[i][j] contains 20 elements
        for( k=0; k < 20; ++k )
        /* etc .... */
    }
}
p.alloc();            // this actually allocates the memory

If you look carefully, you will see that the structure of the calls to p.reserve looks very similar to the old ARPA allocation scheme. In fact, if you consistently replace calls like

p[i][j] = (T**)MALLOC(n*sizeof(T*));

or

p[i][j] = new T*[n];

by

p.reserve(i,j,n);

you will get the new scheme of reserving space for the multi_arr. This similarity is by design: the multi_arr class wants to know exactly how much data you intend to use. There are two reasons for this. The first reason is saving memory. In the new multi_arr allocation scheme, the smallest amount of memory will be allocated that can hold all the requested data. For an ARPA_TYPE array this is exactly the sum of all the number of elements that were requested in the reserve calls. For a C_TYPE array it is the smallest rectangular block that can hold all the data. The second reason is bounds checking. We promised that bounds checking would be just as versatile as in the old ARPA scheme. In order to achieve that, the multi_arr needs to know exactly how many elements are intended for actual use. This implies that the multi_arr needs to retain the geometry information which leads to some memory overhead.

The memory allocated by a multi_arr will be automatically freed when the object goes out of scope. If you need to free the allocated memory explicitly, use the clear() method:

arr.clear(); // frees all the allocated memory

This will free everything, including all the memory that was allocated to store the geometry information. Hence, after the call to clear(), the multi_arr will be in a pristine state as if it had just been declared.

These are the things to remember from this section:

1. With multi_arr you can allocate n-dimensional arrays of any arbitrary type with n = 2 ... 6.

2. The memory is not physically allocated until after the call to the alloc() method.

3. Use the reserve() method to indicate precisely how many data elements are needed. This enables precise bounds checking and saves memory.

4. Allocated memory will automatically be freed when the multi_arr goes out of scope. Use clear() to explicitly deallocate the memory.

Indexing the array

You can access array elements using the familiar arr[i][j][k] notation:

double x = arr[i][j][k];
arr[i][j][k] = 2.*exp(-x);

If you have a multi_arr of a class, you can access the class members just like you would expect:

realnum E = H2Lines[0][0][0][0][0][0].EnergyK;
H2Lines[0][0][0][0][0][0].EnergyK = 1.f; // probably the wrong value....

Note that operator[] returns a reference to the data element, so you can take the address of this location:

double *p = &arr[i][j][k];  // works, however iterators are preferred!

Use this feature only when you have to supply raw pointers to routines (typically library routines, see also the section on library routines below)! In all other cases the use of iterators or const_iterators is preferred.

Note also that the C convention of getting a pointer to the first element of an array by omitting the index does not work:

double *p = arr[i][j];  // ERROR! This will not give you &arr[i][j][0]...

The compiler will generate an error message if you try this.

Using iterators

In the STL the notion of pointers to data elements has been replaced by iterators. To enable using standard STL algorithms on a multi_arr, this class tries to conform as much as possible to the STL standards. It therefore also supplies iterators and const_iterators (but not reverse_iterators) for pointer-like access to data elements. They behave very much like the standard iterators. So I will be quite brief here. If you feel unsure about any of this, please consult your favorite C++ book on the subject.

You can initialize an iterator using the ptr() method:

multi_arr<double,3>::iterator p = arr.ptr(1,2,0);

This notation is quite cumbersome, so it is common practice to use a typedef as shorthand:

typedef multi_arr<double,3>::iterator md3i;
   ...
md3i p = arr.ptr(1,2,0);

Many of the plausible combinations for fundamental types are already predefined in container_classes.h. They all have names looking like the one above. The name starts with an "m" (for multi_arr), followed by a single letter for the fundamental type ("b" for bool, "l" for long, "r" for realnum, and "d" for double), then a single digit for the dimension (i.e. a digit in the range 2 to 6), and finally the type of iterator ("i" for a normal iterator, and "ci" for a const_iterator).

Iterators provide all the functionality that you would expect of a pointer. You can e.g. use iterators for efficient access to data in a loop. Using iterators as shown below will generally speed up the code significantly since it avoids calculating the array index over and over inside the body of the loop. Especially in tight loops over arrays with high dimension this can become a significant overhead! This is the way to efficiently access data in a loop:

md3i p = arr.ptr(i,j,0);
for( long k=0; k < n; ++k )
  *p++ = func(k);

or alternatively

md3i p = arr.ptr(i,j,0);
for( long k=0; k < n; ++k )
  p[k] = func(k);

The memory layout of a multi_arr may change in future editions, so user code should not make any assumptions about the layout. The only exception is that the user may safely assume that for the default memory layout the last index runs over contiguous memory. This allows for efficient iterator access. The example above was OK since arr[i][j][0] through arr[i][j][n-1] are guaranteed to be adjacent. However the next example is not OK:

md3i p = arr.ptr(i,0,0);
for( long j=0; j < n; ++j )
  for( long k=0; k < n; ++k )
    *p++ = func(j,k); // WRONG!

This is wrong because arr[i][0][n-1] and arr[i][1][0] are not guaranteed to be adjacent in memory. Bounds checking will catch this mistake.

A const_iterator is also supplied for read-only access:

md3ci p = arr.ptr(i,j,0);
double sum = 0.;
for( long k=0; k < n; ++k )
{
  sum += p[k];
  p[k] = 0.; // ERROR, you cannot write to a const_iterator
}

Use const_iterators wherever possible to explicitly state that you do not want to change the contents of the array. Using const_iterators is mandatory when the multi_arr was passed as a const reference to the routine and altering its contents is not allowed.

You can also dereference an iterator like this:

multi_arr<transition,6>::const_iterator p = H2Lines.ptr(0,0,0,0,0,0);
realnum E = p->EnergyK;

Note that there are currently no begin() and end() methods available in multi_arr. These may be added in a future update.

Copying arrays

Bounds checking

Saving/restoring state

Passing multi-dimensional arrays to C/C++ library routines

Interactive debugging

The current generation of interactive debuggers has trouble parsing multi_arr variables. The work-around is as follows: you need to insert ".p_ptr<n>" just in front of the first square bracket, where <n> is the number of dimensions of the array.

As an example, the variable Transitions[ipISO][nelem][ipHi][ipLo].Emis->dampXvel should be written in the debugger watch window as Transitions.p_ptr4[ipISO][nelem][ipHi][ipLo].Emis->dampXvel to make it possible to interrogate its value.


Peter van Hoof, 20 May 2008