Saturday, September 26, 2015

Concatenate and average NetCDFs

Network Common Data Form (NetCDF) files are commonly used to hold multidimensional scientific data. The NetCDF project was started in the 1980s and is maintained by Unidata. If you work with NetCDF files routinely, you probably already have a method for merging or aggregating data across multiple files. If you are new to NetCDF, this lesson is a good starting point for the basic commands that aggregate data across multiple files. Learning the NetCDF Operators (NCO) should save you the time of developing your own methods from scratch. This is an introduction to a few basic NCO operators with examples. Everything in this post should work on Debian-based Linux distributions and, with slight modifications, on any system.


NetCDF Operators (NCO) toolkit

  • Command line tools for essential data wrangling tasks on netCDF, HDF, and DAP files
  • Fast optimized algorithms, Open Source
  • Flexible syntax, shell wildcard expansion and extended regex compatibility
  • Can seem like a lot to learn at first, so start simple

Get NCO for Debian systems

The newest full distribution is available as a compressed tarfile here. I installed NCO from the Debian package repository with apt-get and recommend that route:

sudo apt-get update
sudo apt-get install nco
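After installing, it can be worth confirming that the operators used in this post are actually on your PATH. The loop below is a minimal sketch that relies only on the standard shell built-in command -v; the list of operator names is taken from this post, and the variable names (checked, missing) are just illustrative.

```shell
# Report which of the NCO operators used in this post are on the PATH.
checked=0
missing=""
for op in ncrcat ncecat ncra ncea ncks ncwa; do
    checked=$((checked + 1))
    if command -v "$op" >/dev/null 2>&1; then
        echo "found:   $op"
    else
        echo "missing: $op"
        missing="$missing $op"
    fi
done

if [ -z "$missing" ]; then
    echo "All $checked operators are installed."
else
    echo "Not installed:$missing"
fi
```

If anything is reported missing, rerun the apt-get install step or check which package your distribution ships the operators in.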

Concatenating


Concatenate data that spans multiple files using ncrcat

When working with NetCDF files you will often have a sequence of files that belong to the same record, typically time series data that spans multiple files. For example, you may have temperature data from a station that records every 15 minutes, with one year of data per file. Let's say you have a NetCDF file of this data for every year from 1990 to 2010:

$ ls
temp_1990.nc
temp_1991.nc
temp_1992.nc
temp_1993.nc
temp_1994.nc
temp_1995.nc
temp_1996.nc
temp_1997.nc
temp_1998.nc
temp_1999.nc
temp_2000.nc
temp_2001.nc
temp_2002.nc
temp_2003.nc
temp_2004.nc
temp_2005.nc
temp_2006.nc
temp_2007.nc
temp_2008.nc
temp_2009.nc
temp_2010.nc

We want to concatenate these files so that the time series values from each file flow together in one file. This is done with the ncrcat command, which concatenates multiple files along a record dimension, commonly time. All NCO command line operations follow the general syntax: operator -options [option_params] input_file(s) output_file(s). We can run ncrcat on all the temperature files using the standard shell * wildcard, which matches any sequence of characters:

ncrcat temp_*.nc temp_full_record.nc

The resulting temp_full_record.nc NetCDF file will contain the complete time series of every data variable along the record dimension. The alternative to the * glob wildcard is to list each input file one by one, separated by spaces. Example:

ncrcat temp_2000.nc temp_2001.nc temp_2002.nc temp_2000-2002.nc
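One thing to watch with wildcards: the shell expands temp_*.nc before ncrcat ever runs, so an output file left over from an earlier run (such as temp_full_record.nc) also matches and would be read as input. The sketch below demonstrates this with empty placeholder files in a scratch directory, so it needs no NCO at all; the directory and variable names are made up for the demo.

```shell
# Demo with empty placeholder files in a scratch directory (no NCO needed).
demo=$(mktemp -d)
for year in 1990 1991 1992; do
    touch "$demo/temp_$year.nc"
done
touch "$demo/temp_full_record.nc"   # output left over from an earlier run

# The broad pattern also matches the old output file.
set -- "$demo"/temp_*.nc
broad_count=$#

# One character class per digit matches only the yearly files.
set -- "$demo"/temp_[0-9][0-9][0-9][0-9].nc
yearly_count=$#

echo "temp_*.nc matches $broad_count files; the yearly pattern matches $yearly_count"
rm -r "$demo"
```

With real data, the tighter pattern would be used as ncrcat temp_[0-9][0-9][0-9][0-9].nc temp_full_record.nc. Writing the output under a name the input pattern cannot match works just as well.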

Sometimes you will be working with very large files or a large number of files, and to save resources and time you may only want to concatenate one or a few variables. This is easily done with the -v or --variable (long name) option of ncrcat. For example, if the temp NetCDF files contain variables such as humidity, tmax, tmin, dewpoint, solrad, and windspeed, and we are only interested in humidity and windspeed, then the following command concatenates just these two variables while keeping any associated dimensions or coordinates (e.g. time, latitude, longitude, vertical layers, etc.).

ncrcat -v humidity,windspeed temp_*.nc humidity_wind.nc

Note that the variables listed after the -v option must be comma delimited without any white space. Another useful option that complements -v is the exclude option, --exclude or -x, which excludes the specified variables from the output file while keeping all others found in the input files.
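As a sketch of the exclude option, the command below keeps everything except solrad and dewpoint. The variable names come from the example above, the output file name is hypothetical, and the guard simply skips the command when NCO or the input files are not present.

```shell
# Variable names follow the example in the text; substitute your own.
drop="solrad,dewpoint"
out="no_solrad_dewpoint.nc"   # hypothetical output name

# -x inverts the -v list: every variable except those named is kept.
if command -v ncrcat >/dev/null 2>&1 && ls temp_*.nc >/dev/null 2>&1; then
    ncrcat -x -v "$drop" temp_*.nc "$out"
else
    echo "skipped: ncrcat -x -v $drop temp_*.nc $out"
fi
```

The output name deliberately does not start with temp_, so a later temp_*.nc pattern will not pick it up as input.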

Ensemble Concatenation using ncecat

Another useful NCO operator, ncecat, concatenates multiple NetCDF files that have the same variables and record dimensions of the same length. For example, you may have output from five climate model realizations (from the same model) that all ran for the same time period and all have the same dimensions, coordinates, and variables. If you want all of them in one file with the variables side by side, use ncecat. The drawback is that ncecat requires every file to contain the same variables, with the same length and dimensions for each variable (names of variables and dimensions as well as values of dimensions must match exactly). In this example we have output files from five model realizations:

$ ls
scenario_00.nc
scenario_01.nc
scenario_02.nc
scenario_03.nc
scenario_04.nc

As with ncrcat above, we can place the variables from each of these files in one file. This time ncecat places the same variables from each file side by side, because each file covers the same record dimension (time is a common record dimension). Do not use ncrcat here: it would append the files end to end along the record dimension instead of keeping the realizations separate. To concatenate the data from scenarios 02 through 04 inclusive, we could use ncecat with the [] glob expansion:

ncecat scenario_0[234].nc scenarios_02_to_04.nc
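The [] pattern matches exactly one character from the set between the brackets, and the set can also be written as a range. The sketch below checks both forms against empty placeholder files in a scratch directory (no NCO needed); the directory and variable names are made up for the demo.

```shell
# Placeholder files standing in for the five model realizations.
demo=$(mktemp -d)
for n in 00 01 02 03 04; do
    touch "$demo/scenario_$n.nc"
done

# Explicit list of characters: 2, 3, or 4.
set -- "$demo"/scenario_0[234].nc
list_count=$#

# Equivalent range form.
set -- "$demo"/scenario_0[2-4].nc
range_count=$#

echo "list form: $list_count files, range form: $range_count files"
rm -r "$demo"
```

Both patterns select scenario_02.nc, scenario_03.nc, and scenario_04.nc, so either can be passed to ncecat.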

Averaging


Average data that spans multiple files using ncra

To average multiple NetCDF files that contain the same variables over the same record dimension (usually time), ncra can be used in much the same way as ncrcat. The two are easy to remember from their abbreviations: ncrcat is the "NetCDF record concatenator" and ncra is the "NetCDF record averager". Using the same twenty-one temperature files shown above, to average all variables in these files over the record 1990-2010 simply run ncra on all the input files:

ncra temp_*.nc mean_temp_1990-2010.nc

A useful option, -d or the long version --dimension, lets us take a subset (hyperslab) of the data based on a range along a dimension. For example, you may want to know the 1990-2010 average of the recorded data variables (humidity, tmax, tmin, dewpoint, solrad, and windspeed) for all points north of 45 degrees latitude. This can be done via:

ncra -d lat,45.,90. temp_*.nc mean_temp_north_of_45.nc

The general syntax for subsetting is -d dim,[min],[max]. Note that there are no spaces between the dimension name (lat) and the min (45.) and max (90.) latitude values. The decimal points matter: NCO treats values written with a decimal point as coordinate values and plain integers as dimension indices. The output file from the example above will hold the record average (usually over time) of each variable, restricted to the spatial coordinates north of 45 degrees latitude. Do not confuse this with averaging all the variable values north of 45 degrees latitude into a single number.

The concatenating operators ncrcat and ncecat accept this option too. For subsets that combine two or more dimensions, e.g. before averaging over latitude and longitude, one approach is to subset first with the NetCDF kitchen sink (ncks) operator using the syntax ncks -d dim,[min],[max],[stride] -d dim,[min],[max],[stride] -d dim,[min],[max],[stride] in.nc out.nc. The resulting output file will have the desired dimensional domain. If stride is not given, all data between min and max are included; a stride of 2 selects every other value of the dimension (e.g. every other day if dim is time and the data are daily), 3 every third, and so on. With ncks there is no limit on how many dimensions and ranges of dimensions you can subset over. See the multislab section (3.9) in the NCO user's guide for more info. In this specific case you would then use ncwa (the weighted averager) to average over the subset dimensions.
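The two-step approach described above can be sketched as follows. The file names (in.nc, box.nc, box_mean.nc), the lon dimension name, and the longitude bounds are all assumptions for illustration; only lat and its bounds come from the example in this post. The guard skips the commands when NCO or the input file is not present.

```shell
# Hypothetical names: in.nc, a lon dimension in degrees east, and the bounds.
lat_range="lat,45.,90."      # decimal points: coordinate values, not indices
lon_range="lon,230.,260."

if command -v ncks >/dev/null 2>&1 && [ -f in.nc ]; then
    # Step 1: cut out the latitude/longitude box.
    ncks -d "$lat_range" -d "$lon_range" in.nc box.nc
    # Step 2: average over both spatial dimensions with ncwa.
    ncwa -a lat,lon box.nc box_mean.nc
else
    echo "skipped: ncks -d $lat_range -d $lon_range in.nc box.nc"
    echo "skipped: ncwa -a lat,lon box.nc box_mean.nc"
fi
```

Without a weighting variable, ncwa gives every grid cell equal weight, which is not an area-weighted mean on a regular latitude/longitude grid. The -w option of ncwa is the place to look if you need weights.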

Ensemble averaging using ncea

Recall the ensemble concatenation example, where ncecat combined data sharing the same record dimension from five model realizations into one NetCDF file. In a similar way, ncea performs an ensemble average across the model output files (newer NCO releases name this operator nces and keep ncea as a synonym). We can take an ensemble average of humidity across the five climate scenarios; the output would be something like a single array of humidity with the same dimensions as in each input file. Remember that humidity is a variable in the NetCDF files that were output from the climate model.

ncea -v humidity scenario_*.nc ensemble_avg_humidity.nc

Final remarks

These NCO operators are useful, fast, and free, so use them! This post went over the basics of concatenating and averaging data from multiple NetCDF files and some of the most important options. It is a good start, but there are more operators, and even the ones described here offer more advanced options and functionality. For example, you will often want to average over dimensions other than time (which is the typical record dimension), and for that the NetCDF weighted averager (ncwa) is the tool you want. To learn more about advanced averaging methods using NCO, including subsetting, conditional masking, and weighted statistics, check out this post.


Useful links

NCO's homepage: http://nco.sourceforge.net/

NCO User's Manual: http://nco.sourceforge.net/nco.html
