Checksum Data Files?

I'm just thinking out loud while going through recent changes to the trunk. I wonder if it would be a good idea to abandon magic numbers in data files in favor of checksums. Can it be done in a platform-independent way? Pseudocode for MD5 can be found on Wikipedia. Thoughts?

Ryan Porter [2007-01-12]

I'm sure it can -- Perl does it, it seems likely that most of the software linked to from that wiki is portable, and searching Google for "gnu md5" turns up more source. Probably the biggest issue is making sure we use code with a compatible license...

Robin Williams [2007-01-12]

The "magic numbers" are really version numbers, not magic numbers in the sense used on Unix or today's Mac, where they indicate what program is to execute a file.

One example is he_transprob.dat - its magic number is 60725, showing that it is from 2006 July 26. If the path were misset to an old version of the code, or if the user did not have the correct atomic data, they might get the file for 06.02. Its magic number is 31218. The error message would say that we were expecting a file from 2006 and got one from 2003. That would be a major clue to what was misset.
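To make the idea concrete, here is a minimal Python sketch of such a version check. It assumes the magic number is a date written as YYMMDD with the leading zero dropped and that all years fall in the 2000s; the names magic_year and check_magic are hypothetical and are not taken from the Cloudy source.

```python
def magic_year(magic: int) -> int:
    """Return the year encoded in a YYMMDD-style magic number.

    Assumption: the two-digit year always refers to the 2000s.
    """
    return 2000 + magic // 10000


def check_magic(filename: str, found: int, expected: int) -> None:
    """Raise an error that names both years if the version numbers differ."""
    if found != expected:
        raise ValueError(
            f"{filename}: expected a data file from {magic_year(expected)} "
            f"(magic number {expected}) but got one from {magic_year(found)} "
            f"(magic number {found}); the data path may be set incorrectly."
        )
```

With these definitions, check_magic("he_transprob.dat", found=31218, expected=60725) raises an error that mentions both 2006 and 2003, which is the kind of clue described above.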

Checksums would guard against data corruption. Is that a real worry? I have never dealt with a query from someone with corrupt data, but every year I get queries from someone who cannot figure out the path/atomic data setup.

The downside to a checksum is that we lose the simple, clear statement of the file's date and what was expected. A naïve user (most Cloudy users!) would be mystified by an error saying that we expected 7a34e60... and got a6c210d...

If there were data file corruption then most likely problems would be encountered in parsing the file. Most data files end with a second copy of the magic number so corruption would probably affect that read and it would be caught.
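The leading and trailing check could look roughly like the Python sketch below. The file layout is an assumption made for illustration only: the first and last non-blank lines are each taken to start with the magic number as their first token. The function name is hypothetical.

```python
from typing import List


def check_leading_and_trailing_magic(lines: List[str], expected: int) -> List[str]:
    """Verify the magic number at both ends and return the lines in between.

    Assumed layout (hypothetical): the first and last non-blank lines
    each begin with the magic number.
    """
    content = [line for line in lines if line.strip()]
    if len(content) < 2:
        raise ValueError("file too short to hold both magic numbers")
    for label, line in (("leading", content[0]), ("trailing", content[-1])):
        token = line.split()[0]
        try:
            found = int(token)
        except ValueError:
            raise ValueError(
                f"{label} magic number is not an integer: {token!r}"
            ) from None
        if found != expected:
            raise ValueError(
                f"{label} magic number is {found}, expected {expected}"
            )
    return content[1:-1]
```

A truncated or mangled file would usually fail the trailing comparison even when the leading one passes, which is the point made above.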

I may be missing something here, but what is the downside to version numbers, and benefit of checksums?

Gary Ferland [2007-01-13]

As Gary already said, the main purpose of the magic number is to assure that we have a data file that is compatible with what the code expects. We would still need that, even with an MD5 sum in place.

There are some hairy issues here. The EOL character depends on OS, so we would have to carefully work around that to assure that everybody gets the same checksum. That probably means reading the file line by line and calculating the checksum that way. In order not to upset the code too much we would probably have to read the file twice, first calculate the checksum and then do the normal IO. On UNIX systems that is not a big deal, the second time around the file would still be in buffer cache. I am not sure if Windows is that smart.
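As a rough sketch of the line-by-line approach, the Python function below strips whatever line ending the platform left behind and feeds each line to MD5 with a single "\n" appended, so LF and CRLF copies of the same file give the same sum. It uses the standard hashlib module; the function name is hypothetical, and old Mac files that use a bare CR as the line ending are not handled.

```python
import hashlib


def md5_ignoring_eol(path: str) -> str:
    """Return the MD5 hex digest of a text file, independent of LF vs CRLF."""
    digest = hashlib.md5()
    # Binary mode avoids any platform-specific newline translation.
    with open(path, "rb") as handle:
        for raw_line in handle:
            # Drop the trailing CR/LF characters, then add a canonical LF.
            digest.update(raw_line.rstrip(b"\r\n") + b"\n")
    return digest.hexdigest()
```

One side effect of this choice is that a missing newline at the very end of the file does not change the sum. The file is read once here for the checksum, and the normal IO would then read it a second time, as described above.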

Finding an MD5 implementation should not be hard, we can probably rip something out of FreeBSD.

But the real question is whether we need this. Corruption can happen (disks can slowly fail and corrupt files, bad memory can corrupt IO buffers when writing files, files can get mangled, etc). There recently was a case of a corrupted source file on the discussion board. But we usually have quite a few sanity checks in place, so it seems quite unlikely that any corruption would slip through unnoticed (assuming that the asserts are not disabled!). But then a wise man named Murphy once said that everything that can go wrong, will go wrong...

We have a fairly hefty amount of data files. Checksumming that on every load could be a noticeable overhead. If we wanted to play it safe, we would need this. But I am not sure how many problems we are realistically going to solve with this...

Peter van Hoof [2007-01-13]

My thinking was that checksums would be BOTH a magic number AND a guarantee of file integrity. You would not need a separate check that the file is compatible with what the code expects because the checksum can do that just as easily as a magic number can. It's just a different number. At any rate, I admit it might be a stretch to argue that we need this.

Ryan Porter [2007-01-13]

The problem with using checksums only is that it would not work with Cloudy-generated data files (the binary atmosphere files and the grain opacity files). It is unpredictable what the checksum would be, so you lose all version control. So you would then need both a magic number and a checksum.

Peter van Hoof [2007-01-14]