How to manipulate climate data files

Many dataloggers dump their collected data directly into Microsoft Excel. Users of dataloggers which don't do this often transfer the data into Excel anyway, since it has become the standard tool for scientific data reduction.

However, there is an entire alternative ecosystem of ancient and free programs which can vastly speed up the ordering, display and analysis of climate data. These programs were developed in the early days of computing, for the unix operating system of the room-filling computers of that era. The unix way of doing things has since been reinvigorated by the appearance of the linux operating system, which runs on the smallest computers and costs nothing. All the programs described here will also run on MSWindows, but there they need to be installed separately (they come ready to use with most linux distributions).

It may be easier to get started by manipulating climate data in Excel. But if you routinely have to inspect and report climate records, as a person responsible for the climate in a museum or any other building, it will soon become much quicker to use programs which are initially puzzling but which can be automated and chained to provide very rapid manipulation and display of climate data from varied sources.

The starting point is to get the data into plain text format, which is the only durable and universal format. Most datalogger manufacturers' programmers assume that data from their particular device will never need to be combined with data from other sources, so the data are stored in strange binary formats readable only by the manufacturer's software, and often only by the current version of it. I leave it to the reader to extract these data into plain text, because the process is unique to each datalogger's software.

Converting date formats

Once you have the data in plain text format, let us first tackle a common problem: synchronising the date-time format, either to allow merging of data from two sources, or to concatenate data from one source which has been collected by two computers set to different date formats.

The following record fragment is in US style date format: month/day/year.

08/23/06 09:00 23.4 78 
08/23/06 10:00 23.9 75

Having data sets in various formats risks ambiguity, since for nearly half the days of each month the US and European formats are both valid but indicate different dates: 08/04/06, for example, means the 4th of August 2006 in US format but the 8th of April in European format. Let's suppose you want to convert these data into the European date format, with a four digit year, in order to merge them with data from your European meteorological office, to see how the outside weather influences the indoor climate.

The awk programming language will sort this out with a one-liner. Awk is a language designed specifically to manipulate data arranged in rows with repeating patterns. It reads each line of a file in turn and operates on that line with a series of instructions which you write. Awk is a text-only program: no mouse is required; you talk to it in text commands typed into a terminal, or written into a file for repeated use. The terminal is that black rectangle on your screen into which you type instructions. Yes, there is a terminal program in Windows. It is well hidden and not very versatile, but it will do.

>awk '{print substr($1,4,3) substr($1,1,3) \
 "20" substr($1,7,2), $2, $3, $4}' \
 americandata.txt > europeandata.txt

The word 'awk' calls the program. It reads the first line of the file 'americandata.txt' and divides it into fields separated by spaces. The first field is labelled $1, and so on. Then it operates on these fields, following the instructions enclosed in '{...}'. First it prints the day, which occupies the fourth to sixth characters of the first field: '23/'. Then it prints the month, then the literal "20" followed immediately by the two digit year. Finally it prints the remaining fields unaltered. The output is sent, using the '>' symbol, to the new file europeandata.txt; the original file is not changed. Note that the '\' indicates a line continuation, used here to break the command for clarity.

The output from this one line instruction is:

23/08/2006 09:00 23.4 78
23/08/2006 10:00 23.9 75

This is not a robust script, because it won't work when the month or day is a single digit: the date field then does not have a consistent length. That difficulty can be fixed quite easily by instructing awk to use the '/' symbol as the field separator. This lumps the two digit year, the time and all the climate data into a single field, but that doesn't matter, because that part is copied unchanged.

>awk -F"/" '{print $2"/"$1"/"$1"20"$3}' \
americandata.txt > europeandata.txt

The '-F' is an option which instructs the program to use the following '/' as the field separator, instead of the default, which is any run of spaces or tabs.

Sometimes months are written out in full. This is not a portable format because the month names differ from language to language. A simple awk script will replace the month name by a digit:

#convert full month to digit
#example line: 01 April 2005 01:00	16.0 
/\#/ {print $0}
/January/ {print $1,1,$3,$4,$5}
/February/ {print $1,2,$3,$4,$5}
/March/ {print $1,3,$3,$4,$5}
/April/ {print $1,4,$3,$4,$5}
/May/ {print $1,5,$3,$4,$5}
/June/ {print $1,6,$3,$4,$5}
/July/ {print $1,7,$3,$4,$5}
/August/ {print $1,8,$3,$4,$5}
/September/ {print $1,9,$3,$4,$5}
/October/ {print $1,10,$3,$4,$5}
/November/ {print $1,11,$3,$4,$5}
/December/ {print $1,12,$3,$4,$5}

What happens here is that each month name in turn is tested for, by enclosing the word between /.../. The corresponding print statement substitutes the number of the month where the word would go.

Some files have the time in am-pm format. Here one needs to test for the format and add 12 to the hour if there is a 'pm' in the line, for example:

...
/pm/ {print .... $2+12, ....} 
...
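A fuller sketch, assuming (purely for illustration) input lines like '08/23/06 9:00 pm 23.4 78', with the time in the second field and the am/pm marker in the third:

# convert 'h:mm' with an am/pm marker to 24-hour time
/pm/ {
   split($2, hm, ":")
   if (hm[1] < 12) hm[1] += 12    # 12 pm is noon: leave it alone
   printf("%s %02d:%02d %s %s\n", $1, hm[1], hm[2], $4, $5)
   }
/am/ {
   split($2, hm, ":")
   if (hm[1] == 12) hm[1] = 0     # 12 am is midnight: becomes 00
   printf("%s %02d:%02d %s %s\n", $1, hm[1], hm[2], $4, $5)
   }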

If you have many different formats to deal with, you can chain simple scripts to work in turn on the data files, so you don't have to write all possibilities into one monster script, which will be difficult to test.
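For example, three of the scripts described on this page could be chained with unix pipes (the script and file names here are just illustrative; the '-f' option, explained below, tells awk to read its program from a file):

>awk -f month2digit.awk rawdata.txt | awk -f ampm.awk | \
 awk -f date2iso.awk > cleandata.txt

Each script reads the output of the previous one, so no intermediate files need be kept.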

The ISO standard date-time format

It would be better for durability and portability of the data set to transform the date-time into ISO format. This standard deserves to be far more widely used, but humans don't easily change habitual ways of expressing weight, length and date, even when the new version is much more rational and easier to deal with. The ISO format is logical and has other advantages which will become apparent later. Here is a date in compact ISO format: 20090226T1416. This is not easily readable, so ISO permits some formatting: 2009-02-26T14:16. The reason for using 'T' instead of a space to separate the time portion is that the single field thus defined is easy to sort, and easy to merge with other files, using the single date-time field as the synchronising link. It is more difficult to synchronise on two fields.
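One advantage shows immediately: because the text ordering of the ISO date-time matches its time ordering, the ordinary unix 'sort' program will put records into time order without any special date handling. A one-line example (the file names are just illustrative):

>sort isodata.txt > isodata_sorted.txt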

The awk script for transforming the date-time into ISO date-time is listed below.

 BEGIN{FS="[/\t :]"} 
{
printf("%4d-%02d-%02dT%02d:%02d",$3,$2,$1,$4,$5)
for (j=6; j <= NF; j++){
   printf(" % 5.1f",$j)
   }
printf("\n")
}

This is too long a script for convenient typing into the command line, so it has been written as a file, called for example, 'date2iso.awk', which can be called from the command line thus:

>awk -f date2iso.awk inputdatafile.txt > outputdatafile.txt

The '-f' option instructs awk to take its instructions from the filename which follows.

Let us parse this script carefully because it introduces some fundamental capabilities.

The BEGIN{...} construction sets starting conditions, before the data file is processed. In this case there is only one instruction. It sets the field separator FS to any of several symbols: '/', tab, space or colon. The square brackets identify this as a list of alternatives: encountering any one of these symbols will start a new field. Note that the hyphen is not listed as a separator, though it is frequently used in dates. This is because it may also be present as a minus sign in the climate data. Dates with hyphen separators need extra processing, described later.
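The effect of this list of separators can be checked with a throwaway command, here fed a single line of data through a pipe:

>echo "2/2/2009 8:15 -0.3 98" | awk 'BEGIN{FS="[/\t :]"} {print NF, $3}'
7 2009

The line has been split into seven fields, of which the third is the year.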

The second pair of curly brackets {...} tells awk what to do with every line of the input file, after the line has been split into its separate fields.

The next line prints the first five fields, which constitute the date and time information, in a different order and with a different format, to construct the ISO date-time string without spaces. The format definition is the sequence starting "%4d...". Each '%' symbol is followed by the format of the corresponding variable, and the variables are listed after the closing quote of the format description. So %4d prints the year ($3) as four digits, immediately followed by a hyphen, then the month in the format %02d, which makes a two digit month with a leading zero for single digit months. The '-' and the 'T' are literal characters: they are printed, not interpreted. The grammar of the format instruction can be found on the internet by searching for 'printf format'. This format grammar is used in many programming languages with very little variation, so knowing it may be useful in other contexts.

After the date and time there will be a number of fields containing climate data. This script doesn't need to know how many of these there are because awk finds out for itself and holds the number in the 'NF' variable. So the script iterates through the fields, using the idiom 'for(...){...}', printing them all in the same format. The format string specifies one decimal place precision and assigns 5 spaces to the value, to give space for a minus sign (that is specified by the leading space after the %).

Finally, a newline character is inserted to end the row.
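The effect of the climate-value format can likewise be tried out in isolation:

>awk 'BEGIN{printf("[% 5.1f][% 5.1f]\n", -10.3, 0.7)}'
[-10.3][  0.7]

The minus sign occupies the position which the space flag reserves in front of positive values, so the digits stay aligned.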

An example input file is:

2/2/2009 8:15 -0.3 98
2/2/2009 9:30 -10.3 67
10/2/2009 10:30 0.7 55
14/10/2009 13:30 7.4 45

The output from this script is:

2009-02-02T08:15  -0.3  98.0
2009-02-02T09:30 -10.3  67.0
2009-02-10T10:30   0.7  55.0
2009-10-14T13:30   7.4  45.0

The output is in the ISO standard format and is also nicely aligned for easy scanning by eye.

You can store this with confidence that it will be machine readable by anyone, in any nation, far into the future. But you should add by hand a short description at the top:

# output from demo showing
# awk manipulation of climate data files
2009-02-02T08:15  -0.3  98.0
2009-02-02T09:30 -10.3  67.0
2009-02-10T10:30   0.7  55.0
2009-10-14T13:30   7.4  45.0

The leading hash sign is the unix convention for indicating a comment line. Awk gives it no special treatment, however, so the script must be told to pass such lines through unaltered. The script fragment for handling these comments is:

...
/\#/ {print $0; next}
...

This will read a line containing '#' and print it exactly as written: the '$0' specifies the entire line. The 'next' instruction then makes awk skip the remaining rules and read the next line of input, so the comment is not also processed as if it were data.

Many dates in data files have a hyphen separating year, month and day. This is a problem because the '-' can also appear in the same line as a minus sign before a temperature value, for example. So the hyphen cannot be used indiscriminately as a field separator. One has to look specifically for a hyphen in the date field, usually the first field. The code for doing this is

$1 ~ /-/

followed by what to do if the hyphen is found. This is rather cryptic code, but one soon gets used to the few strange idioms, and there is a very good reference book listed at the end of this page.

 BEGIN{FS="[/\t :]"} 

$1 ~ /-/ { #date field has hyphens
   n = split($1,dc,"-")
   printf("%4d-%02d-%02dT%02d:%02d",dc[3],dc[2],dc[1],$2,$3)
   for (j=4; j <= NF; j++){
      printf(" % 5.1f",$j)
      }
   printf("\n")
   }

# no hyphen in date field
$1 !~ /-/ {printf("%4d-%02d-%02dT%02d:%02d",$3,$2,$1,$4,$5)
   for (j=6; j <= NF; j++){
      printf(" % 5.1f",$j)
      }
printf("\n")
}

A useful function used in this script is 'split', which takes the first field, which is the date, and splits it at the hyphens to make three sub-fields, which are stored in the array dc. They can be retrieved separately as dc[1],dc[2] and dc[3]. These are then printed with a modified printf statement. If there is no hyphen in the date field, the program drops through to the next test, which is for 'not a hyphen', and follows the same instructions as in the previous code example.
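'split' can also be tried out on its own; this throwaway command (the date is an arbitrary example) shows the returned count and the three sub-fields:

>awk 'BEGIN{n = split("2-2-2009", dc, "-"); print n, dc[1], dc[2], dc[3]}'
3 2 2 2009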

Dealing with different sampling frequencies

A problem that arises when trying to merge data from different files, for example indoor data with outdoor climate measurements, is different sampling intervals in the data files, and maybe also irregular times, such as 11:07, instead of round numbers of minutes. An awk script that reduces everything to hourly data, with the minute set to exactly zero, helps with merging data from diverse sources.

The definition of hourly data is ambiguous: most dataloggers will average more frequent measurements over the set interval between data storage times, but some will wake up every ten minutes, say, and record a single spot reading. In the first case the data refer to a time earlier than that recorded by the logger. In the second case the time is exactly right, but the logger will miss variations occurring while it was asleep. Then there is the sudden change of time when summer time starts and ends. The following script averages every record in the hour before the stated time, which corresponds to what most dataloggers do.

# awk script for reducing more frequent data to hourly average
# assumes input like:
# 2009-02-05T08:37 -5.6 65
# where the first field is the ISO date-time,
#   the second field is the temperature, the third the RH
#
# The program accumulates the data values until it hits a minute
# that is less than or equal to the previous minute. This signals
# a new hour. It prints the new hour with exactly zero minutes and
# then the average climate values from the previous hour.
# Note that already-hourly input data are put forward one hour,
# and that records after the last complete hour are not printed.

BEGIN {
   prev_minute = -1   # less than any real minute, so the first
                      # record starts a new accumulation
   rowcount = 0
   sum_t = 0
   sum_rh = 0
   }

{
thisminute = substr($1,15,2) + 0   # adding 0 forces a numeric comparison

if(thisminute <= prev_minute){
   # time to average and print
   av_t = sum_t/rowcount
   av_rh = sum_rh/rowcount
   #print ISO date time without minute
   printf("%s", substr($1,1,14))
   #print zero minute then climate values
   printf("00 % 5.1f % 5.1f\n", av_t, av_rh)
   # start accumulating the new hour with the current row
   sum_t = $2
   sum_rh = $3
   rowcount = 1
   }

if(thisminute > prev_minute) {
   sum_t += $2
   sum_rh += $3
   rowcount += 1
   }
prev_minute = thisminute
}

Given this input:

2009-02-02T08:00  -0.7  99.0
2009-02-02T08:15  -0.3  98.0
2009-02-02T08:45  -0.1  96.0
2009-02-02T09:00   0.7  92.0
2009-02-02T09:15   3.3  88.0
2009-02-02T09:30  10.3  77.0
2009-02-02T09:45  12.7  62.0
2009-02-02T10:00  13.5  57.0
2009-02-02T10:15  15.3  67.0
2009-02-10T10:45  45.2  77.0
2009-11-10T11:00  38.3  82.0
2009-12-02T11:15  32.6  89.0
2009-12-15T11:30  30.4  92.0
2009-12-15T11:45  30.4  94.0
2009-12-15T12:00  28.1  44.0
2009-12-15T13:00  22.6  34.0
2009-12-15T14:00  12.4  84.0

the output is:

2009-02-02T09:00  -0.4  97.7
2009-02-02T10:00   6.8  79.8
2009-11-10T11:00  24.7  67.0
2009-12-15T12:00  32.9  89.2
2009-12-15T13:00  28.1  44.0
2009-12-15T14:00  22.6  34.0

Notice that if the input is already hourly, the time gets put forward one hour - which is convenient for dealing with summer time. If you don't want this, you need to write a third test: if(thisminute == prev_minute){print $0} (note the repeated equals sign). This passes the row directly from input to output. Then you must remove the '=' from the first 'if' test in the script.
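The modified tests would then look like this (a sketch, keeping the variable names of the script above; the bodies of the first and last tests are unchanged):

if(thisminute < prev_minute){
   # average and print the finished hour, as before
   }
if(thisminute == prev_minute){
   print $0   # already hourly: pass the row through unchanged
   }
if(thisminute > prev_minute){
   # accumulate into the running sums, as before
   }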

Merging data from two source files

The next job in this sequence of manipulations is to merge this indoor data (we suppose) with outside climate data from the local met office. You may need to write a script to convert the met office data to ISO date-time first; most met offices don't use the ISO format! After this, you could use awk to merge the files, but it is much simpler to use the unix utility program 'join'.

>join hourly_inside hourly_outside > alldata

The outside data file is:

2009-02-02T09:00  7.3  43
2009-02-02T10:00  6.4  56
2009-11-10T11:00  4.3  52
2009-12-15T12:00  2.1  66
2009-12-15T13:00  5.7  72
2009-12-15T14:00  8.9  85

and the joined data file:

2009-02-02T09:00  -0.4  97.7  7.3  43
2009-02-02T10:00   6.8  79.8  6.4  56
2009-11-10T11:00  24.7  67.0  4.3  52
2009-12-15T12:00  32.9  89.2  2.1  66
2009-12-15T13:00  28.1  44.0  5.7  72
2009-12-15T14:00  22.6  34.0  8.9  85

That assumes that there are no gaps or inconsistencies in the two files.

Let us try removing one hour (12) from the inside data file, which is more likely than a defect in the met office data.

Join gives this result (with a warning):

2009-02-02T09:00  -0.4  97.7  7.3  43
2009-02-02T10:00   6.8  79.8  6.4  56
2009-11-10T11:00  24.7  67.0  4.3  52
2009-12-15T13:00  28.1  44.0  5.7  72
2009-12-15T14:00  22.6  34.0  8.9  85

So join copes well with imperfect data: it takes the first file as the master, appends the data from the second file wherever the first field matches, and simply drops lines which have no partner. Note that both files must be sorted on the join field; this is another advantage of the ISO date-time, since plain text sorting puts it into time order.

At this point you will be ready to graph the combined indoor and outdoor data using gnuplot, as described on the plotting with gnuplot page in this treatise on data logging.

After all this exposure to mystical code conventions you will be asking: what's so great about this complicated stuff? I can do it all in Excel! Maybe. I haven't tried. The value of 'awk' and 'join' and 'gnuplot', and many other utility programs that work from the command line, is that they can easily be automated. If you receive regular or irregular batches of climate data files which you are required to scan for climatic threats, you can spend a few days putting together a group of programs which allow all the data reduction and display to be automated, so you just bring up the latest graph on your browser, without any handwork on the original files.
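As a sketch of such an automation (every file and script name here is illustrative), a short shell script can run the whole chain described on this page:

#!/bin/sh
# convert the latest logger dump to ISO date-time,
# reduce it to hourly averages, merge with the outside data,
# then plot the result with gnuplot
awk -f date2iso.awk latest_dump.txt | awk -f hourly.awk > hourly_inside
join hourly_inside hourly_outside > alldata
gnuplot plot_alldata.gp

Run by hand, or automatically by the unix scheduler 'cron', this turns the whole data reduction into a single command.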

With linux, the necessary programs for data manipulation and automation are built into the distribution. However, many students of climatic processes are locked into institutional software policies, usually based on MSWindows and often restricted to specific programs. The unix utilities are available for Windows in a package called 'cygwin'. Alternatively, awk (I recommend the variant called gawk) and gnuplot can be installed directly on Windows. One can also run linux from a live DVD. This will not affect the computer in any way, but it is slower and less versatile. Finally, linux can be co-installed on a Windows PC, allowing booting either operating system on startup.

References

The standard reference book for the awk programming language is: Dale Dougherty and Arnold Robbins, 'sed & awk', 2nd edition, O'Reilly, 1997. ISBN: 978-1-56592-225-9

The ISO date-time format:
http://en.wikipedia.org/wiki/ISO_8601

 

This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 License.