Monday, November 16, 2015

Computing zonal and weighted statistics from NetCDF data with NCO

This post is a sequel to Concatenate and average NetCDFs where I introduced NCO (NetCDF operators), how to install NCO, and provided examples for the essential operations of record and ensemble concatenation and averaging of variables across multiple NetCDF files. If you are new to NCO I suggest familiarizing yourself with that post and experimenting with some NetCDF files as well as reading the NCO user manual. This post explains how to calculate weighted and non-weighted zonal statistics on variables over different dimensions (e.g. temporal and spatial averages) using the ncwa command line tool. Including several examples as well as information regarding open-source tools for visualizing NetCDF variables and NCO commands for accessing NetCDF metadata. These tools and examples will aid you in your work flow as they did for me!


Calculate zonal means using ncwa

In the previous NCO post we used the record averager (ncra) command to average variables over multiple files over their record dimension, typically time. But what if we need to average or use other aggregating statistics on a variable over space, time, or any other combination of dimensions we desire? The NetCDF weighted averager is just the right tool for the task.

Let's say we have 12 NetCDF files that each hold monthly model output for the water year 1992, in this case they contain simulation results from the Community Earth System Model:

$ ls
model_out_1991-10.nc  model_out_1992-02.nc  model_out_1992-06.nc
model_out_1991-11.nc  model_out_1992-03.nc  model_out_1992-07.nc
model_out_1991-12.nc  model_out_1992-04.nc  model_out_1992-08.nc
model_out_1992-01.nc  model_out_1992-05.nc  model_out_1992-09.nc

Side Note: visualizing data in NetCDFs!


Any NetCDF file can be viewed quickly with the free tool ncview which you can easily get through the aptitude repository:

$ sudo apt-get install ncview

Using ncview is simple

$ ncview NetCDF_File_To_view

Alternatively you could use the NASA's free Panoply software to view and create custom high quality figures from your NetCDF files. Panoply has a lot of options and is great for publication plots but I prefer ncview for taking quick views of a NetCDF file on the fly while doing analysis. The figures below were created with Panoply.



For example, here is a plot of simulated ground temperature in the model_out_1992-06.nc file (June ground temperature).

Now, before we do any weighted averaging lets concatenate the monthly output files along the record dimension (time) using ncrcat to make one NetCDF file TG_concated.nc which contains a time series of the monthly data. For simplicity let's only consider the variable ground temperature (TG) shown above. We can specify the variable we want to subset using the -v option with ncrcat. Alternatively we could have averaged the variables across their record dimension using ncra but then we would no longer have the time dimension in the output file.

ncrcat -O -v TG model_out_*.nc TG_concated.nc

The resulting file now only contains the TG variable, -O just says to overwrite the output file if it already exists. We can get useful information about the ground temperature variable with NCO as well using ncinfo:

$ ncinfo -v TG TG_concated.nc 
<type 'netCDF4._netCDF4.Variable'>
float32 TG(time, lat, lon)
    long_name: ground temperature
    units: K
    cell_methods: time: mean
    _FillValue: 1e+36
    missing_value: 1e+36
unlimited dimensions: time
current shape = (12, 192, 288)
filling off

We can see TG is a 3D variable in time, latitude, and longitude with a corresponding array shape of (12, 192, 288) (months, latitude nodes, longitude nodes). We can leverage this information if we need to aggregate and reduce the dimensionality for plotting or analysis.


Side Note: viewing metadata of NetCDFs!


You might be wondering how to view meta data of NetCDF files-- there are several different ways: 1) ncinfo netcdf_file will print most basic and important information of the contents such as all variables, dimensions, file history, etc.; 2) ncdump -h netcdf_file provides more detailed info for each variable; 3) ncks -m or -M netcdf_file will produce output pertaining to dimensions, -m is useful for finding length and units of dimensions and you can pass it variables as well with -v. The -M option will print out information regarding global attributes. Experiment with these as they are all super useful!



As we viewed using ncinfo our variable of interest is 3 dimensional in time, latitude, and longitude so we can average over these dimensions using ncwa (NetCDF weighted averager) thus reducing its dimensionality, actually the default behaviour of ncwa is to average over all dimensions thus reducing the variable to a scalar. For example just calling ncwa on TG_concated.nc:

$ ncwa -O TG_concated.nc out.nc
$ ncks -v TG out.nc
TG: type NC_FLOAT, 0 dimensions, 5 attributes, chunked? no, compressed? no, packed? no
TG size (RAM) = 1*sizeof(NC_FLOAT) = 1*4 = 4 bytes
TG attribute 0: long_name, size = 18 NC_CHAR, value = ground temperature
TG attribute 1: units, size = 1 NC_CHAR, value = K
TG attribute 2: cell_methods, size = 31 NC_CHAR, value = time: mean time, lat, lon: mean
TG attribute 3: _FillValue, size = 1 NC_FLOAT, value = 1e+36
TG attribute 4: missing_value, size = 1 NC_FLOAT, value = 1e+36

TG = 267.074 K

The resulting scalar value for TG in the output file of 267.074 Kelvin is the global average ground temperature over the 12 months. To avoid averaging over all dimensions you must use the -a option followed by the dimensions you want to average over. If you used ncwa -a followed by the record dimension of the input file this would be equivalent to the default behaviour of the NCO record averager ncra. By specifying dimensions with -a we can get the latitudinal zonal mean ground temperature as a function of time by only averaging over the longitudinal axis:

$ ncwa -O -v TG -a lon,time TG_concated.nc lat_zonal_mean.nc
$ ncinfo -v TG lat_zonal_mean.nc 
<type 'netCDF4._netCDF4.Variable'>
float32 TG(lat)
    long_name: ground temperature
    units: K
    cell_methods: time: mean time, lon: mean
    _FillValue: 1e+36
    missing_value: 1e+36
unlimited dimensions: 
current shape = (192,)
filling off

The new shape of lat_zonal_mean.nc is 192 which is the size of the latitude array in the original file. What we have created is a file with ground temperature that has been averaged over 12 months and over every longitude cell resulting in a singular value for each latitude node. This output file is useful for making a latitudinal zonal mean plot of mean ground temperature:

It is not an error that the line breaks around -60 degrees latitude, this is because there is no land in this stretch of the world and remember we are viewing ground temperature.

The NCO weighted average operator ncwa also supports sub-setting the domain of operation over one or more dimensions using the syntax ncwa -d dim,[min],[max],[stride] -d dim,[min],[max],[stride] -d dim,[min],[max],[stride] in.nc out.nc. The resulting output file will have the weighted average only for the desired dimensional domain. If stride is not given then all data between min and max are included, a stride of 2 will select every other value of the dimension (e.g. every other day if dim is time), 3 every third and so on. You need to include one of the values or both from the min and max parameters, using a comma for an open ended range. There is no limit on how many dimensions and ranges of dimensions you can subset over. For example, say we wanted to calculate the latitudinal zonal mean as before but only for latitudes above the equator and say we wanted to see how the zonal average evolved through time:

$ ncwa -O -v TG -a lon -d lat,0.,90. TG_concated.nc lat_zonal_mean_0_to_90_degrees.nc

The resulting output contains the latitudinal zonal mean over the twelve months of data, the resulting 2D plot shows the mean ground temperature over the latitudinal axis (x axis) versus time (y axis).

The global monthly temperature from the model is out of phase because the order of the concatenated files that we started with was in water year order, that is October through September. I chose not to interpolate the monthly data in the plot, as a result you can see 12 distinct bands for each month starting with October. That is not to say that the concatenation did not keep track of the dates correctly- it did, but this example shows that you have flexibility on which date you would like the time series to start from.

Masking with logical conditionals

In addition to being able to subset our domain of operation by dimensions, ncwa supports using logical conditional statements for selection based on a variable's value and a relational operator ($=, \ne, \gt, \lt, \ge, \le$). There are two methods to do this: the long hard way:) ncwa -m mask_variable -M mask_value -T Fortran_relational in.nc out.nc; or the short simple way:) ncwa -B 'mask_variable relational mask_value' in.nc out.nc. Where -m proceeds the variable used for the masking; -T proceeds a Fortran relational operator ($eq, ne, gt, lt, ge, le$) which correspond to ($=, \ne, \gt, \lt, \ge, \le$) in Fortran; -M proceeds the masking value; and if you use the -B option the valid relational operators are ==, !=, >, <, >=, <=. You should make sure to enclose the conditional expression after -B in quotes.

For example lets say we wanted to look at two variables from the model: soil evaporation rate (QSOIL) and plant transpiration rate (QVEGT). With ncwa we could calculate the evaporation for locations where the transpiration is high or vice versa just to see how this NCO operation works. This will require three steps: first we will concatenate the monthly files this time keeping these two variables, second we will calculate the global mean transpiration, and last we will calculate the mean global mean evaporation for locations where transpiration is above the global mean and plot our results.

$ ncrcat -O -v QSOIL,QVEGT model_out_199*.nc ET_concated.nc
$ ncwa -O ET_concated.nc global_avg_ET.nc
$ ncks -v QVEGT global_avg_ET.nc QVEGT: type NC_FLOAT, 0 dimensions, 5 attributes, chunked? no, compressed? no, packed? no
QVEGT size (RAM) = 1*sizeof(NC_FLOAT) = 1*4 = 4 bytes
QVEGT attribute 0: long_name, size = 20 NC_CHAR, value = canopy transpiration
QVEGT attribute 1: units, size = 4 NC_CHAR, value = mm/s
QVEGT attribute 2: cell_methods, size = 31 NC_CHAR, value = time: mean time, lat, lon: mean
QVEGT attribute 3: _FillValue, size = 1 NC_FLOAT, value = 1e+36
QVEGT attribute 4: missing_value, size = 1 NC_FLOAT, value = 1e+36

QVEGT = 9.58266e-06 mm/s

$ ncwa -O -a time -B 'QVEGT > 9.58266e-06' ET_concated.nc out.nc

The resulting plot of soil evaporation (QSOIL) in the output file looks like:

The logical indexing resulting from -B 'QVEGT > 9.58266e-06' appears to have masked the worlds largest deserts in the output file. This is an obvious result considering there is a dearth vegetation in these regions; although this example does not clearly illustrate the relationship between evaporation and transpiration (which I expect would be a dynamic relationship seasonally as well), with a well thought-out plan the masking option can be quite useful for scientific data analysis.

Weighted averaging

You might be wondering by now how to calculate weighted averages with ncwa, after all ncwa stands for NetCDF weighted averager! Well it's simple the only extra switch that is required is -w followed by the variable to use for the weights. Sometimes it is useful to include the -N option which tells ncwa to only use the numerator of the weights, i.e. integrating the variable over the selected weighting variable such as grid cell area. It is often that variables (at least from model output) will be area-weighted to begin with, if so -w area without -N will produce the same output as would not weighting by area in the first place. However if you use -N -w area -v var ncwa will calculate the area$\times$var product in each cell without ever dividing by the sum of the areas thus integrating the variable over area. Here are two examples to clarify.

For the first example we will calculate the annual mean leaf area index (LAI) as simulated by the Community Earth System Model. For comparison we will calculate the same mean weighted by simulated plant transpiration (QVEGT). The result from the latter will cause grid cells that have high LAI and high QVEGT to stand out while lessening the value of LAI where QVEGT is below average.

$ ncrcat -O model_out_199*.nc concated.nc
$ ncwa -O -a time concated.nc not_weighted.nc
$ ncwa -O -w QVEGT -a time concated.nc QVEGT_weighted.nc

Here is the corresponding plot for the non-weighted mean LAI:

And here is the corresponding plot for the mean LAI weighted by QVEGT:

There are clear differences between the weighted and non-weighted LAI maps, weighting allows us to visualize the locations where both vegetation is dense and where vegetation is transpiring. One observation I made by viewing both plots is there are large regions in northern North America and Asia that transpire quite a bit yet they do not correspond to the highest LAI, most of this region is home to the Boreal or Taiga coniferous forests. Perhaps the Boreal forests are transpiring significantly but are not as dense as say the tropical rainforests of the world. We could of seen the impact of transpiration on LAI more clearly by simply differencing these two plots however I wanted to keep the interpretation of the weighting operation as simple as possible.

Just to clarify the use of the -N option this example shows how to integrate a variable over grid cell area and time. This is straightforward.

$ ncwa -O -N -w area -a time concated.nc Area_integrated.nc

Now let's take a look at the familiar LAI variable after it has been integrated with respect to grid cell area and time:

By integrating with respect to grid cell area (km$^2$) and time the values of LAI are now very huge and not meaningful, to be precise the values are simply the product of LAI and the grid cell area summed over the time length-12 months: $$\mbox{Values for each grid cell above}~=~ \sum_{n=1}^{12} LAI_{i,j}\times GCA_{i,j}$$ where $GCA_{i,j}$ is the grid cell area (km$^2$) for the cell at the $i^{th}$ latitude index and the $j^{th}$ longitude index on the model grid which is 1$^\circ$ resolution. We see that by doing this the values near the equator increased and as we move towards the poles the LAI values decreased as compared to the non-weighted plot of LAI above. This is simply a result of the spherical grid that the model utilizes which has much higher grid cell areas near the equator.

The -N option is equivalent to -y ttl which tells ncwa to take the sum as opposed to the mean over the specified dimensions after -a and since area is not a dimension in the files we are working with we needed to list area as the weight variable. Note, integrating some variables may or may not make sense depending on what you are trying to accomplish and remember some variables may have already been area weighted and take special care to understand what units you are dealing with. For example QVEGT above has units of mm/s. If you weighted QVEGT by grid cell area (which happens to be in $km^2$) and summed over time as shown above for LAI you would get units of $12\cdot mm\cdot km^2/s$. You could then calculate the average volumetric flux by dividing by 12 and converting the $km^2$ to $mm^2$.

Computing other aggregating statistics besides averaging

The ncwa averages by default but also allows additional aggregating statistic calculations by stating them after the -y option. These methods also work for ncra, ncea, and nces. The ncra (NetCDF record averager) is a less sophisticated version of ncwa, you can find more about it and ncea (NetCDF ensemble averager) here. As mentioned the -N option is the equivalent to -y ttl and tells ncwa to sum over the specified dimensions listed after the -a option. Here is a list of all the statistics that ncwa, ncra, ncea, and nces support with the operation keyword followed by a colon and then its description:

  • avg : mean value
  • ttl : sum of values
  • sqravg : square of mean
  • sqrt : square root of the mean
  • avgsqr : mean of sum of squares
  • max : maximum value
  • min : minimum value
  • mabs : maximum absolute value
  • mebs : mean of absolute value
  • mibs : minimum absolute value
  • rms : root-mean-square (normalized by N)
  • rmssdn : root-mean-square (normalized by N-1)

For example,

$ ncwa -O -v QVEGT -a time -d lat,-60.,15. -d lon,280.,330. -y max concated.nc out.nc

Will calculate the max of the monthly simulated QVEGT for the rectangular region around South America.


Final remarks

As we saw the NCO command line tool for weighted averaging (ncwa) is powerful and flexible, it is not limited to weighted averaging but can do a range of aggregating statistics, with or without weights, including subsetting and logical indexing or masking. This makes ncwa one of my favorite NCO tools because it can do so many jobs. Once you get familiar with it I think you will prefer using it over opening NetCDF files in your language of choice and doing the operations there. This is because the NCO syntax is succinct and the inner workings are optimized for speed and efficient memory use. Once you couple NCO with a scripting language of your choice it becomes even more powerful. If you enjoyed this post or have suggestions please comment like subscribe and keep visiting! Cheers and happy NetCDF-ing :)