Peter hosted the meeting in mid-February 2007 at the Royal Observatory of Belgium.
We could run the grids and optimizations from a separate script or program. On the other hand, doing it in MPI should be quick and easy.
Vectorization would require us to break up large complicated loops into multiple simple loops. These could then be vectorized or parallelized or both.
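As an illustration of the loop splitting described above, here is a minimal sketch. The names (`fused_loop`, `split_loops`, `tau`, `atten`) are invented for this example and are not taken from the code. The first function mixes element-wise work with a running sum; the second does the same work as two simple loops, each of which a compiler or OpenMP could handle more easily.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical "complicated" loop: element-wise work and a running sum
// are interleaved in one loop body.
double fused_loop(const std::vector<double>& tau, std::vector<double>& atten)
{
    assert(tau.size() == atten.size());
    double total = 0.0;
    for (std::size_t i = 0; i < tau.size(); ++i) {
        atten[i] = std::exp(-tau[i]);
        total += atten[i];
    }
    return total;
}

// The same work split into two simple loops.
double split_loops(const std::vector<double>& tau, std::vector<double>& atten)
{
    assert(tau.size() == atten.size());

    // Loop 1: pure element-wise map with no dependence between iterations.
    // This is a candidate for vectorization or for "#pragma omp parallel for".
    for (std::size_t i = 0; i < tau.size(); ++i)
        atten[i] = std::exp(-tau[i]);

    // Loop 2: a plain sum. This is a candidate for an OpenMP reduction,
    // at the cost of a possibly different floating-point summation order.
    double total = 0.0;
    for (std::size_t i = 0; i < atten.size(); ++i)
        total += atten[i];

    return total;
}
```

The split version makes a second pass over `atten`, so whether it is actually faster depends on how well each loop vectorizes and on cache behaviour; it would need to be measured.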
Issues with MPI/OpenMP
Shared vs. distributed memory. The current implementation of OpenMP does not support distributed-memory clusters. Future versions may.
NUMA systems. OpenMP was never defined with NUMA systems in mind. Threads/CPUs share memory, and the physical allocation is determined by which CPU touches the memory first. Keeping that placement optimal for access speed is going to be a nightmare.
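One common way to work with the first-touch behaviour described above is to initialize an array in parallel, using the same static schedule as the loops that later work on it. The sketch below is only an illustration: `alloc_first_touch` and `add_constant` are hypothetical names. It also assumes that the operating system places pages on first touch and that the allocator does not write to the block beforehand, neither of which is guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Allocate n doubles and let each thread write the part it will work on later.
std::unique_ptr<double[]> alloc_first_touch(std::size_t n)
{
    // "new double[n]" without "()" leaves the elements uninitialized, so the
    // first write to each page happens in the parallel loop below
    // (assumption: the allocator itself does not touch the memory).
    std::unique_ptr<double[]> data(new double[n]);
    double* p = data.get();
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);

    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i)
        p[i] = 0.0;

    return data;
}

// Later compute loop: the same static schedule gives each thread the same
// index range it initialized, as long as the thread count is unchanged.
void add_constant(double* data, std::size_t n, double c)
{
    assert(data != nullptr || n == 0);
    const std::ptrdiff_t count = static_cast<std::ptrdiff_t>(n);

    #pragma omp parallel for schedule(static)
    for (std::ptrdiff_t i = 0; i < count; ++i)
        data[i] += c;
}
```

Without OpenMP enabled the pragmas are ignored and the code runs serially. Keeping the memory placement right also depends on threads staying on the same CPUs, which this sketch does not address.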
The first priority should be to remove redundancy from the code. For example, create a loop out of the various ionization solvers for each element. Another example would be malloc'ing multidimensional arrays. Create a single routine for allocating multidimensional arrays and have the rest of the code call that routine. Then change the allocation so that each multidimensional array becomes one large block.
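A single allocation routine for multidimensional arrays could look roughly like the sketch below. The class name `Array2D` and its interface are invented for illustration. It uses `std::vector` for storage, which is an assumption of this example rather than something decided at the meeting. Because callers only go through the class, the storage layout is already one large block and could be changed later without touching them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 2-D array stored as one contiguous, row-major block.
class Array2D {
public:
    Array2D(std::size_t nrow, std::size_t ncol, double init = 0.0)
        : nrow_(nrow), ncol_(ncol), data_(nrow * ncol, init) {}

    // Element access; the index arithmetic lives in exactly one place.
    double& operator()(std::size_t i, std::size_t j)
    {
        assert(i < nrow_ && j < ncol_);
        return data_[i * ncol_ + j];
    }
    const double& operator()(std::size_t i, std::size_t j) const
    {
        assert(i < nrow_ && j < ncol_);
        return data_[i * ncol_ + j];
    }

    // Raw pointer to the whole block, e.g. for passing to numerical libraries.
    double* data() { return data_.data(); }
    const double* data() const { return data_.data(); }

    std::size_t rows() const { return nrow_; }
    std::size_t cols() const { return ncol_; }

private:
    std::size_t nrow_;
    std::size_t ncol_;
    std::vector<double> data_;  // single allocation for all rows
};
```

Compared with an array of separately malloc'ed rows, this needs one allocation instead of one per row and keeps consecutive rows adjacent in memory.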
The way we determine opacities, ionization, etc., gets in the way of parallelizing over the frequency grid. So the best approach still seems to be to parallelize things like iso-electronic sequences? We can profile test suite runs and try to identify routines that are worth parallelizing, but even for the big H2 models the profile is fairly flat. The matrix solvers used only roughly 30%. This share could grow if other parts of the code started using basic BLAS routines to evaluate integrals, etc.
It seems worthwhile to insert SuperLU into the code, since that gives parallelization with little effort. But this is likely to give us only a 10-20% speedup (more for big Fe II models).
- How to handle exceptions in parallel sections?
- New input parser / new initialization scheme.
Initialization is intimately linked to the input parser and they are currently hard to disentangle. We should probably rewrite the input parser. It should parse the input and pass the input on in the form of data structs. Then the initialization should be done based on the information gathered. This could then make dynamic decisions about what elements to include, what reactions, what size of data structs, etc. We should aim to have clearly defined and uniform methods in each class that handle initialization, re-initialization and destruction. There should be multiple levels of initialization, once per core-load, when input is parsed, at the start of a new iteration, etc.
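The separation described above might be sketched as follows. Everything here is hypothetical: the `ParsedInput` struct, the `IonizationState` class, and the toy command syntax (`element <name>`, `iterate <n>`) are invented for the example and are not a proposal for the real input syntax. The parser only fills a plain data struct; the class then sizes its data from that struct and exposes uniform `init`, `reinit` and `destroy` methods.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Plain data produced by the parser; no initialization logic lives here.
struct ParsedInput {
    std::vector<std::string> elements;
    int max_iterations = 1;
};

// Parse a toy syntax ("element <name>" and "iterate <n>") into a struct.
ParsedInput parse_input(const std::string& text)
{
    ParsedInput parsed;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string key;
        if (!(fields >> key))
            continue;  // skip blank lines
        if (key == "element") {
            std::string name;
            if (fields >> name)
                parsed.elements.push_back(name);
        } else if (key == "iterate") {
            int n = 0;
            if (fields >> n)
                parsed.max_iterations = n;
        }
    }
    return parsed;
}

// Hypothetical class with uniform initialization methods.
class IonizationState {
public:
    // Called once after parsing: size the data from the parsed input.
    void init(const ParsedInput& input)
    {
        populations_.assign(input.elements.size(), 0.0);
    }
    // Called at the start of each iteration: reset values, keep sizes.
    void reinit() { populations_.assign(populations_.size(), 0.0); }
    // Release all storage.
    void destroy()
    {
        populations_.clear();
        populations_.shrink_to_fit();
    }

    double& population(std::size_t i)
    {
        assert(i < populations_.size());
        return populations_[i];
    }
    std::size_t size() const { return populations_.size(); }

private:
    std::vector<double> populations_;
};
```

In this arrangement the decision about which elements to include is made from the parsed data alone, so the parser never needs to know about the internal data structures.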
A new syntax is probably also necessary. How radical this change should be is not yet clear.
- Use of STL, malloc overhead.
- Performance issues (CPU cache use, memory vs speed).
- New / enhanced physics.
- Update Hazy.
We need to find a format that is accessible to everybody. It would make sense to have a text-based format so that you can generate meaningful differences between versions. A wiki may be one option; another would be LaTeX, but that would be a culture shock to Word users. Another possibility would be DocBook (which would be a culture shock for everybody). It appears doable to convert the current Word version of Hazy into DocBook, possibly via RTF.