Changes between Version 1 and Version 2 of DataParser

2017-10-20T16:17:20Z (8 months ago)

Add a few sections


== Basic assumptions ==
The data parser reads the file line by line. It allows comments to be embedded in the file. All comments must start with the hash symbol (#). Any text after the hash symbol up to the end of the line is considered part of the comment and will be ignored. If a line has a hash symbol in the first column, it will be completely ignored and automatically skipped when you read the next line. Comments may also be appended at the end of a line containing data. Note that a line with a comment starting after the first column, but containing only whitespace leading up to the hash symbol, will be considered a blank line. Such a line may be treated either as a comment or as an end-of-file (EOF) marker, depending on how the file was opened (see the next section for more details).
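The comment rules above can be illustrated with a minimal, standalone sketch. The helpers strip_comment() and is_comment_line() are hypothetical names invented for this example; they are not part of the DataParser class and only mimic the behaviour described in the text.

```cpp
#include <string>

// Hypothetical helper (not part of DataParser): drop everything from the
// first hash symbol up to the end of the line.
std::string strip_comment(const std::string& line)
{
    const std::string::size_type pos = line.find('#');
    return (pos == std::string::npos) ? line : line.substr(0, pos);
}

// Hypothetical helper: a line with a hash symbol in the first column is a
// pure comment line and would be skipped entirely.
bool is_comment_line(const std::string& line)
{
    return !line.empty() && line[0] == '#';
}
```

With these definitions, strip_comment( "1.5 2.5 # two values" ) leaves only the data part of the line, while is_comment_line() is true for "# header" but false for a line where the hash symbol is indented.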
The parser assumes that all data items (which we will call tokens) are separated by whitespace. The next token is retrieved from the line by skipping leading whitespace and then copying non-whitespace characters until the next whitespace character is found. Our definition of whitespace matches that of the isspace() routine provided by the system. This includes the normal space character and the horizontal tab (\t), which are both commonly used as separators. Other common separators (such as the comma) are not supported.
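The tokenization rule can be sketched as follows. The function next_token() is a hypothetical helper written for this illustration, assuming only the behaviour described above (skip whitespace, then copy until the next whitespace); the real parser may be implemented differently.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not part of DataParser): return the next
// whitespace-delimited token of line, starting the search at pos.
// pos is advanced past the token; an empty string means no token is left.
std::string next_token(const std::string& line, std::string::size_type& pos)
{
    // skip leading whitespace, as defined by isspace()
    while( pos < line.size() && std::isspace(static_cast<unsigned char>(line[pos])) )
        ++pos;
    const std::string::size_type start = pos;
    // copy non-whitespace characters until the next whitespace character
    while( pos < line.size() && !std::isspace(static_cast<unsigned char>(line[pos])) )
        ++pos;
    return line.substr(start, pos - start);
}
```

Note that a comma is not a separator under this rule, so the text "3.5,4" comes back as a single token.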
The carriage return character (\r) is automatically stripped from the line in case an MSDOS file is read on a *NIX system. In fact, anything after \r until the end of the line, marked by a newline (\n), will be deleted as well. For a properly formatted MSDOS file there should be no text between \r and \n.
The parser tries to enforce strict checking of the contents of the file. This includes checking that all the text that was parsed is actually used in the conversion to a data item. As an example, if you are reading a long variable and the next token contains the text "2.05", it will store the value 2 in the variable. But it will subsequently detect that the remainder ".05" was not used in the conversion and abort (complaining that it found trailing junk). This has two advantages: you can detect data being stored in the wrong type of variable (as in this example, where a realnum or double should have been used instead of a long) and you can detect corruption in files (which may lead to text looking like "2.54e-025e-02", which would also be detected).
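The idea of detecting trailing junk can be sketched with the standard strtol() and strtod() routines, which report where the conversion stopped. The helpers parse_long_strict() and parse_double_strict() are hypothetical and return false where the text above says the real parser would abort; this is an illustration of the technique, not the parser's actual code.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of DataParser): succeed only if the whole
// token was consumed by the conversion to long.
bool parse_long_strict(const std::string& token, long& value)
{
    errno = 0;
    char* end = nullptr;
    const long result = std::strtol(token.c_str(), &end, 10);
    // nothing converted, trailing junk left over, or value out of range
    if( end == token.c_str() || *end != '\0' || errno == ERANGE )
        return false;
    value = result;
    return true;
}

// Same check for a floating point token.
bool parse_double_strict(const std::string& token, double& value)
{
    errno = 0;
    char* end = nullptr;
    const double result = std::strtod(token.c_str(), &end);
    if( end == token.c_str() || *end != '\0' || errno == ERANGE )
        return false;
    value = result;
    return true;
}
```

With this sketch, "2.05" is rejected as a long because ".05" is left over, but accepted as a double. The corrupted text "2.54e-025e-02" is rejected as a double because the conversion stops before the second "e-02".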
Text tokens are always treated as case sensitive when they are parsed. This is done for performance reasons.
== Opening a data file ==
You need to open the data file before you can read it. This is usually done in the constructor, as shown below:
{{{
DataParser d( "sample.dat", ES_NONE );
}}}
Opening the file is done using open_data(), so the search path applies, just as it would if you had called open_data() directly. In our example, the file "sample.dat" will be searched for along the search path. If it cannot be found, the code will abort, just as open_data() would have done. There is also a mandatory second parameter in the constructor, which indicates the style for including EOF markers in the file. There are three possible choices:
 * ES_NONE: there are no special in-file EOF markers.
 * ES_STARS_ONLY: a field of stars (***) is an in-file EOF marker.
 * ES_STARS_AND_BLANKS: both a blank line and a field of stars are in-file EOF markers.
An in-file EOF marker implies that all lines following that line are considered free-format comments and should not be parsed. A field of stars is a line containing at least 3 stars starting in the first column. A blank line is a line containing only whitespace plus optionally a comment that starts ''after'' the first column. In the {{{ES_NONE}}} case, a field of stars has no special meaning. In the {{{ES_NONE}}} and {{{ES_STARS_ONLY}}} cases, a blank line is considered a comment and is automatically skipped.
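The definitions of a field of stars and a blank line can be written down as a small classification sketch. Everything here is hypothetical and standalone: the enum MarkerStyle merely mirrors the three ES_* choices, and the function names are invented for this example. Treating a completely empty line as blank is an assumption of the sketch.

```cpp
#include <cctype>
#include <string>

// Hypothetical stand-in for the three EOF marker styles described above.
enum class MarkerStyle { None, StarsOnly, StarsAndBlanks };

// At least 3 stars starting in the first column.
bool is_field_of_stars(const std::string& line)
{
    return line.compare(0, 3, "***") == 0;
}

// Only whitespace, optionally followed by a comment that starts after the
// first column. An empty line counts as blank (assumption of this sketch).
bool is_blank_line(const std::string& line)
{
    for( std::string::size_type i = 0; i < line.size(); ++i )
    {
        if( line[i] == '#' )
            return i > 0; // a hash in the first column is a comment line
        if( !std::isspace(static_cast<unsigned char>(line[i])) )
            return false;
    }
    return true;
}

// Decide whether a line acts as an in-file EOF marker for a given style.
bool is_eof_marker(const std::string& line, MarkerStyle style)
{
    switch( style )
    {
    case MarkerStyle::StarsOnly:
        return is_field_of_stars(line);
    case MarkerStyle::StarsAndBlanks:
        return is_field_of_stars(line) || is_blank_line(line);
    default:
        return false; // MarkerStyle::None: nothing is special
    }
}
```

For example, the line "   # note" is an EOF marker only under MarkerStyle::StarsAndBlanks, while "*****" is an EOF marker under both styles that recognize stars and under neither when the style is MarkerStyle::None.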
The !DataParser class also has an open() method for opening the file. It can be used as follows:
{{{
DataParser d;
d.open( "sample.dat", ES_NONE );
}}}
The parameters have exactly the same meaning as before. Normally this is not needed, but it can be handy if you need to parse multiple files. You can then reuse the variable, as in the example below:
{{{
DataParser d;
d.open( "sample.dat", ES_NONE );
// ... parse data ...
d.open( "sample2.dat", ES_NONE );
}}}
Note that there is no need to close the file. The second call to open() will automatically close the first file. The same happens when the variable d goes out of scope: the destructor will close the file.
In rare cases you may need to parse a file that could potentially be absent. This can be done as follows:
{{{
DataParser d( "sample.dat", ES_NONE, AS_TRY );
if( !d.isOpen() )
{
    // ... do some error handling ...
}
else
{
    // ... parse data ...
}
}}}
The parameter AS_TRY tells open_data() not to abort if opening the file fails (and not even to print an error message). The method isOpen() can then be used to test whether opening the file was successful and to react accordingly. The open() method also accepts this optional third parameter.
In even rarer cases, you may need to close the file before the variable d goes out of scope. The only plausible use case I can think of is deleting a temporary file after parsing it. Once the file has been closed, it can be removed with the call:
{{{
remove( "tempfile.dat" );
}}}
== Reading lines of data ==
After you have opened the file, you can start reading it. Parsing is done line by line, so you first need to read in a line:
{{{
DataParser d( "sample.dat", ES_NONE );
d.getline();
}}}
The getline() method returns a boolean which indicates whether there are more lines to read. This makes it very convenient to parse a file using the following loop:
{{{
while( d.getline() )
{
    // ... parse one line ...
}
}}}
The getline() method will automatically skip comment lines and also automatically strip any comment at the end of the line. This means you need not (and should not) worry about comments in data files while parsing.
If you are using in-file EOF markers, you will need some extra code:
{{{
while( d.getline() )
{
    if( d.lgEOFMarker() )
        break;
    // ... parse one line ...
}
}}}
The reason for this is that in-file EOF markers are not automatically handled by getline(), so you need to add code yourself to break out of the loop. This allows additional checks to be done, as in the Stout files, where the mandatory presence of a field of stars is enforced. This could not be done if getline() took over the task of lgEOFMarker() itself.
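The kind of additional check mentioned above, enforcing that a field of stars is actually present, can be sketched with plain standard-library streams. The function read_until_stars() is hypothetical and does not use the DataParser class; it only shows why the caller needs to see the marker itself in order to distinguish it from the physical end of the file.

```cpp
#include <istream>
#include <sstream>   // only needed for the usage example below
#include <string>
#include <vector>

// Hypothetical helper (not part of DataParser): collect data lines until a
// field of stars is found. Returns false if the physical end of the file is
// reached without ever seeing the marker.
bool read_until_stars(std::istream& in, std::vector<std::string>& lines)
{
    std::string line;
    while( std::getline(in, line) )
    {
        if( line.compare(0, 3, "***") == 0 )
            return true;          // in-file EOF marker found
        lines.push_back(line);    // ... parse one line ...
    }
    return false;                 // mandatory marker is missing
}
```

A caller could feed it a std::istringstream holding "1 2\n3 4\n*****\nfree text\n" and would get two data lines back together with a return value of true. For input without the stars the return value is false, which is the situation a mandatory-marker check would report as an error.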
== Checking the magic number ==
For most data files, the first task after opening the file is to check whether the magic numbers are correct. Since this is such a common task, a special method has been created to do this:
{{{
DataParser d( "sample.dat", ES_NONE );
d.getline();
static const long yr=2007, mon=11, day=18;
d.checkMagic( yr, mon, day );
}}}
The checkMagic() method will check the numbers in the file against those supplied as arguments. Four versions of checkMagic() exist, with one, two, three, or four parameters, all of type long. Internally the code will read one, two, three, or four tokens of type long from the line and then compare them to the parameters that were supplied. If any one of those doesn't match (or if reading failed), an informative error message will be printed and the code will abort.
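The comparison performed for the three-parameter case can be sketched as below. The function magic_matches() is a hypothetical standalone stand-in: it returns false where the text says checkMagic() would print an error and abort, and it uses plain stream extraction, which is less strict than the token conversion described under the basic assumptions.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for a three-parameter magic number check (not the
// real checkMagic()): read three values of type long from the start of the
// line and compare them to the expected numbers.
bool magic_matches(const std::string& line, long yr, long mon, long day)
{
    std::istringstream iss(line);
    long y = 0, m = 0, d = 0;
    if( !(iss >> y >> m >> d) )
        return false;   // reading failed, e.g. fewer than three numbers
    return y == yr && m == mon && d == day;
}
```

As in the description above, the sketch does not care whether more data follow the magic numbers on the same line, so "2007 11 18 more data" still matches 2007, 11, 18.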
Note that you need to call getline() first. For versatility, the checkMagic() method does not assume that the data are on a new line (i.e., it does not call getline() itself), nor does it assume that there are no more data after the magic numbers (i.e., it does not call checkEOL() after parsing the magic numbers -- checkEOL() is discussed below). But it does assume that all magic numbers are on the same line.
== Reading the data ==