Database format working document

Summary of meeting 2011 July, Belfast

Files should be readable by both our free-format methods and HPC Fortran. Space delimiters make the numbers easy to parse with our methods and with Fortran, and keep the files easy for a human to read. All files may contain embedded comments starting with a hash sign ("#"). Adding comments that describe each field is strongly encouraged, so that the files are self-documenting.
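The comment and delimiter rules above can be sketched in a few lines of Python. This is only an illustration: the function names `parse_line` and `read_records` are hypothetical, the sample values are made up, and the sketch assumes that a "#" never appears inside a data field.

```python
def parse_line(line):
    """Drop anything after the first '#' and split the rest on whitespace."""
    data = line.split("#", 1)[0]
    return data.split()


def read_records(lines):
    """Return the non-empty, comment-free records as lists of string fields."""
    records = []
    for line in lines:
        fields = parse_line(line)
        if fields:  # skip blank lines and comment-only lines
            records.append(fields)
    return records


# Made-up sample data, for illustration only.
sample = [
    "# index  energy  g",
    "1  0.0    1   # first level",
    "",
    "2  100.0  3",
]
records = read_records(sample)
```

Note that this sketch silently skips blank lines, so it could not be combined with a blank line used as an end of data sentinel.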

(How are these files linked together? Do they have a fixed naming scheme? Should there be a species file? In practice will they live in species-specific directories as in Chianti? Should we keep references separate from the comments area in all of the above? Maybe, e.g., REFER can act as a special kind of comment and we can store a vector of reference strings in the code? Maybe those should be collected in a species file instead so it's easier to avoid duplicates? Should a species file contain an optional nickname, e.g. "water" for H2O? --RLP)

(References: ADS codes work well in astrophysics, and are the primary basis for the Hazy bibliography;  DOI codes are more general.)

energy levels file

We will only store energy level data, and will derive line wavelengths from these (after correcting for the index of refraction). This means that we need experimental energies, since theoretical energies are uncertain by roughly 10 percent.
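As a rough sketch of that derivation: with level energies in wavenumbers (cm^-1), the vacuum wavelength in Angstroms is 10^8 divided by the energy difference, and dividing by the index of refraction gives the wavelength in the medium. The function names are hypothetical, and the index of refraction is left as a caller-supplied number because this document does not say which formula would be used for it.

```python
def vacuum_wavelength_angstrom(e_lower_cm, e_upper_cm):
    """Vacuum wavelength (Angstrom) from two level energies in cm^-1."""
    delta = e_upper_cm - e_lower_cm
    if delta <= 0.0:
        raise ValueError("upper level must lie above lower level")
    return 1.0e8 / delta  # 1 cm = 1e8 Angstrom


def medium_wavelength_angstrom(lambda_vac, n_refr):
    """Wavelength in a medium with index of refraction n_refr (assumed input)."""
    return lambda_vac / n_refr


# A 20000 cm^-1 energy separation corresponds to 5000 Angstrom in vacuum.
lam = vacuum_wavelength_angstrom(0.0, 20000.0)
```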

  • version number
  • fortran format string specifying the format in this file
  • level index number (starting from 1), level energy in wavenumbers, statistical weight, an optional string giving the level designation, all with fixed width for efficient random access (Should the level designation be completely free format? What about when we can list unambiguous quantum numbers? Is something like {n:0;v:0;J:0} worth considering? -RLP ) (Should we have optional autoionization probabilities here? -RLP)
  • an end of data sentinel (field of stars ****, empty line, or -1 as in Chianti)
  • comments area giving provenance of the data
  • (not currently implemented) possibly a wavelength, if it differs from the one implied by the energy separation
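A minimal reader for a file laid out as in the list above might look like the following. Several things here are assumptions made for the sake of the sketch: the sentinel is taken to be a line of stars, the provenance is taken to follow the sentinel, the Fortran format string is stored but not interpreted (fields are simply split on whitespace), and the names `Level` and `read_levels` and all sample values are hypothetical.

```python
from typing import NamedTuple


class Level(NamedTuple):
    index: int        # level index, starting from 1
    energy: float     # level energy in wavenumbers
    weight: float     # statistical weight
    designation: str  # optional label, empty if absent


def read_levels(lines):
    """Return (version, format_string, levels, provenance_lines)."""
    version, fmt, levels, provenance = None, None, [], []
    in_data = True
    for raw in lines:
        text = raw.strip()
        if not in_data:  # everything after the sentinel is provenance
            if text:
                provenance.append(text.lstrip("#").strip())
            continue
        if text.startswith("*"):  # assumed sentinel: field of stars
            in_data = False
            continue
        body = text.split("#", 1)[0].strip()
        if not body:
            continue
        if version is None:
            version = body
        elif fmt is None:
            fmt = body  # kept verbatim, not parsed in this sketch
        else:
            parts = body.split(None, 3)
            label = parts[3].strip() if len(parts) > 3 else ""
            levels.append(Level(int(parts[0]), float(parts[1]),
                                float(parts[2]), label))
    return version, fmt, levels, provenance


# Made-up sample file, for illustration only.
sample = [
    "1   # version",
    "(i4,f12.3,f6.1,1x,a12)   # format",
    "   1       0.000   1.0 1s2 1S",
    "   2     100.000   3.0",
    "****",
    "# illustrative values only",
]
version, fmt, levels, provenance = read_levels(sample)
```

A production reader would presumably honour the fixed-width format string instead, which is what makes efficient random access possible.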

transition probability file

We will keep the various contributors to Aul separate, if we have them, but will use the total if that is all we have.

  • version number
  • fortran format string specifying the format of this file
  • index for lower level, index for upper level, transition probability, string specifying type of transition (E1, M1, E2, 2E1, etc.). There can be several such lines for a given pair of lower and upper levels, one for each contributor to the total transition probability. All entries have fixed width. (Nickname for transition? -RLP)
  • an end of data sentinel (field of stars ****, empty line, or -1 as in Chianti)
  • comments area giving provenance of the data

The reading program would begin with a zero-initialized matrix of radiative transitions between each pair of indices. When a transition is encountered that probability will be added to the total accumulated so far. This way various contributors to the total probability are entered on separate lines.
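The accumulation step described above can be sketched as follows. The function name, the tuple layout of the records, and the sample values are hypothetical; the matrix is sized one larger than the number of levels so that the 1-based level indices from the file can be used directly.

```python
def accumulate_transitions(n_levels, records):
    """Sum all contributions to the transition probability per level pair.

    records: iterable of (lower, upper, probability, kind) tuples,
    with 1-based level indices as in the data file.
    """
    # Zero-initialized matrix; row and column 0 are unused padding.
    total = [[0.0] * (n_levels + 1) for _ in range(n_levels + 1)]
    for lower, upper, prob, _kind in records:
        total[upper][lower] += prob
    return total


# Made-up sample: two contributors for the 2 -> 1 transition.
records = [
    (1, 2, 1.0e8, "E1"),
    (1, 2, 2.5e-3, "M1"),
    (1, 3, 4.0, "E2"),
]
aul = accumulate_transitions(3, records)
```

This sketch discards the transition type after summing; keeping the contributors separate, as proposed above, would need an additional per-type structure.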

collision data file

We will work with thermally averaged collision strengths for now. (What does "for now" mean? Is the end goal rate coefficient tables? -RLP)

We will store these on a logarithmically spaced temperature grid that is specified as part of this file. (What does logarithmically-spaced mean? Why restrict this? -RLP)
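One possible reading of "logarithmically spaced" is a grid with equal steps in log10 of the temperature. The sketch below generates such a grid; the function name and the example limits are hypothetical and are not taken from any agreed specification.

```python
import math


def log_temperature_grid(t_min, t_max, n_points):
    """Temperatures with equal spacing in log10(T), end points included."""
    if n_points < 2 or t_min <= 0.0 or t_max <= t_min:
        raise ValueError("need n_points >= 2 and 0 < t_min < t_max")
    lo, hi = math.log10(t_min), math.log10(t_max)
    step = (hi - lo) / (n_points - 1)
    return [10.0 ** (lo + i * step) for i in range(n_points)]


# Four points per three decades: roughly 10, 100, 1000, 10000 K.
grid = log_temperature_grid(10.0, 1.0e4, 4)
```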

We will also use data from papers and will have to use the temperature grid they supply.

Cloudy is applied to the temperature range 2.7 K to 10^10 K and we need data over this full range. There is no need for the tabulated data to extend to temperatures so high that the ion will have a tiny abundance due to collisional ionization. There is no lower bound to the kinetic temperature that a photoionized plasma can have, so the data should extend to the lowest possible temperatures. If the required temperature is beyond the low or high end of the table we will use the last tabulated value.
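The rule for temperatures outside the table can be sketched as below. Only the clamping at the two ends comes from the text above; the interpolation inside the table (linear in log10 T) is an assumption made for this sketch, as are the function name and the sample numbers.

```python
import bisect
import math


def tabulated_value(temp, temps, values):
    """Look up a value at temp from a table sorted by increasing temperature."""
    if temp <= temps[0]:
        return values[0]   # below the table: use the first tabulated value
    if temp >= temps[-1]:
        return values[-1]  # above the table: use the last tabulated value
    i = bisect.bisect_right(temps, temp)
    x0, x1 = math.log10(temps[i - 1]), math.log10(temps[i])
    frac = (math.log10(temp) - x0) / (x1 - x0)
    # Assumed scheme: linear interpolation in log10(T).
    return values[i - 1] + frac * (values[i] - values[i - 1])


# Made-up table, for illustration only.
temps = [100.0, 1000.0, 10000.0]
values = [1.0, 2.0, 4.0]
low = tabulated_value(2.7, temps, values)
high = tabulated_value(1.0e10, temps, values)
```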

For ions we will nearly always have electron collision strengths, and will have proton collision strengths for some transitions. For a very few there are collisions with alpha particles. For molecules, H2, H0, He, and electrons play these roles. For hydrogen-deficient environments we may also consider CO, but most likely no data exist for that. Entries in all caps are keywords that should be typed explicitly in the data file.

  • version
  • TEMP number of temperatures, followed by the temperatures
  • COLLIDER type of collider, e.g. electrons or protons
  • type of data, collision strength or deexcitation rate
  • lower and upper level index, followed by collision strengths or rates
  • end of data sentinel, field of stars ****, empty line, or -1
  • TEMP number of temperatures, followed by the temperatures (optional; followed by another block of data as above, typically from a different source)
  • repeat type of collider to sentinel for as many colliders as we have (Can we have multiple tables from different sources for the same collider? -RLP)
  • provenance
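To make the block structure above concrete, here is a rough reader for one possible layout. Much of it is assumed rather than specified: the temperatures are taken to sit on the TEMP line itself, the data type is taken to be a free-text line with no keyword, the sentinel is taken to be a line of stars, and provenance is taken to live in comments, which this sketch discards. The function name and sample values are hypothetical.

```python
def read_collisions(lines):
    """Return (version, blocks); each block is a dict for one data table."""
    version, blocks = None, []
    temps, collider, kind, current = [], None, None, None
    for raw in lines:
        body = raw.split("#", 1)[0].strip()
        if not body:
            continue
        tokens = body.split()
        if version is None:
            version = body
        elif tokens[0] == "TEMP":
            n = int(tokens[1])
            temps = [float(t) for t in tokens[2:2 + n]]
            current = None
        elif tokens[0] == "COLLIDER":
            collider, current = tokens[1], None
        elif tokens[0].startswith("*"):  # assumed sentinel ends the table
            current = None
        elif tokens[0].isdigit():        # data line: lower, upper, values
            if current is None:
                current = {"collider": collider, "kind": kind,
                           "temps": temps, "rates": {}}
                blocks.append(current)
            pair = (int(tokens[0]), int(tokens[1]))
            current["rates"][pair] = [float(v) for v in tokens[2:]]
        else:                            # assumed: free-text data type line
            kind, current = body, None
    return version, blocks


# Made-up sample file, for illustration only.
sample = [
    "1",
    "TEMP 3 100. 1000. 10000.",
    "COLLIDER electrons",
    "collision strength",
    "1 2 0.5 0.6 0.7",
    "1 3 0.1 0.2 0.3",
    "****",
    "COLLIDER protons",
    "deexcitation rate",
    "1 2 1e-9 2e-9 3e-9",
    "****",
    "# provenance: illustrative values only",
]
version, blocks = read_collisions(sample)
```

Because each table becomes its own block, several tables for the same collider can coexist in this sketch; how to choose between them is left open.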

(A higher level question: should the database be versioned separately from Cloudy? - RLP)


Desiderata

There are various requirements for a data file format, e.g.

  1. Human readable
  2. Concise
  3. Accessible by common visualization tools (e.g. Excel, gnuplot)
  4. Human editable
  5. Computer readable
    1. in required languages (C, Fortran, ...)
    2. using portable coding without license constraints
  6. Computer editable without causing unintentional changes (e.g. due to rounding errors while processing)
  7. Automatically convertible to and from other required formats
  8. Data portable between systems (e.g. big-endian/little-endian incompatibility)
  9. Fast to read in data in full
  10. Fast to randomly access data
  11. No unnecessary data fragmentation for large (e.g. 1 MB) block size HPC file systems
  12. Managed data provenance
  13. Forward and backward compatibility

Note that some of these are obviously mutually incompatible (e.g. editable and provenanced), so choices must be made based on likely usage patterns. If a suitable interconversion tool is available, it may be possible to combine the benefits of distinct file formats.

Probably the best means of ensuring forward and backward compatibility is to define an underlying format which does not depend on the details of the data to be included. E.g. using -1 as a sentinel artificially prevents the format from including negative numbers, while a blank line makes it difficult for a human to be certain of the structure of hierarchical data sets. Examples of text file formats which achieve this are XML (although that is verbose), YAML and JSON.
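As a small illustration of that point, the sketch below writes a hierarchical record to JSON and reads it back with Python's standard `json` module. The field names and values are invented for the example. The structure is carried by the brackets, so no sentinel is needed and a value of -1 is just data.

```python
import json

# Made-up record; the key names are hypothetical.
levels = {
    "version": 1,
    "levels": [
        {"index": 1, "energy": 0.0, "g": 1, "label": "ground"},
        {"index": 2, "energy": -1.0, "g": 3, "label": ""},
    ],
}

text = json.dumps(levels, indent=2)  # human-readable text form
restored = json.loads(text)          # parsed back without any sentinel
```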

Data provenance is probably best achieved by the use of SHA hash keys to sign data, in a similar manner to modern DVCS packages.
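A minimal sketch of such a hash, using Python's standard `hashlib`, is given below. The choice of SHA-256 and the decision to hash only the data, after removing comments and normalizing whitespace, are assumptions of this sketch, and the function name is hypothetical.

```python
import hashlib


def data_digest(lines):
    """SHA-256 hex digest of the data fields, ignoring comments and spacing."""
    h = hashlib.sha256()
    for line in lines:
        body = line.split("#", 1)[0].strip()
        if body:
            # Normalize runs of whitespace so reformatting does not matter.
            h.update(" ".join(body.split()).encode("utf-8"))
            h.update(b"\n")
    return h.hexdigest()


# Made-up data: same numbers, different comments and spacing.
a = data_digest(["1 0.0 1  # ground", "2 100.0 3"])
b = data_digest(["1   0.0   1", "# a comment line", "2 100.0 3"])
c = data_digest(["1 0.0 1", "2 100.5 3"])
```

With this choice, editing a comment leaves the digest unchanged, while changing any number alters it. Hashing the raw bytes of the file instead would also protect the comments, at the cost of flagging purely cosmetic edits.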

Current Implementation

The implementation of these ideas led to the Stout format.

