Monday, November 16, 2015

Computing zonal and weighted statistics from NetCDF data with NCO

This post is a sequel to Concatenate and average NetCDFs, where I introduced NCO (NetCDF Operators), explained how to install it, and provided examples of the essential operations of record and ensemble concatenation and of averaging variables across multiple NetCDF files. If you are new to NCO I suggest familiarizing yourself with that post and experimenting with some NetCDF files, as well as reading the NCO user manual. This post explains how to calculate weighted and non-weighted zonal statistics on variables over different dimensions (e.g. temporal and spatial averages) using the ncwa command line tool. It includes several examples as well as information on open-source tools for visualizing NetCDF variables and on commands for accessing NetCDF metadata. These tools and examples will aid your workflow as they did mine!


Calculate zonal means using ncwa

In the previous NCO post we used the record averager (ncra) to average variables across multiple files along their record dimension, typically time. But what if we need to average, or apply other aggregating statistics to, a variable over space, time, or any other combination of dimensions? The NetCDF weighted averager (ncwa) is just the right tool for the task.

Let's say we have 12 NetCDF files, each holding monthly model output for the 1992 water year; in this case they contain simulation results from the Community Earth System Model:

$ ls
model_out_1991-10.nc  model_out_1992-02.nc  model_out_1992-06.nc
model_out_1991-11.nc  model_out_1992-03.nc  model_out_1992-07.nc
model_out_1991-12.nc  model_out_1992-04.nc  model_out_1992-08.nc
model_out_1992-01.nc  model_out_1992-05.nc  model_out_1992-09.nc

Side Note: visualizing data in NetCDFs!


Any NetCDF file can be viewed quickly with the free tool ncview, which you can easily install from the apt repository:

$ sudo apt-get install ncview

Using ncview is simple:

$ ncview NetCDF_File_To_view

Alternatively you could use NASA's free Panoply software to view and create custom, high-quality figures from your NetCDF files. Panoply has a lot of options and is great for publication plots, but I prefer ncview for taking quick looks at a NetCDF file on the fly while doing analysis. The figures below were created with Panoply.



For example, here is a plot of simulated ground temperature in the model_out_1992-06.nc file (June ground temperature).

Now, before we do any weighted averaging let's concatenate the monthly output files along the record dimension (time) using ncrcat to make one NetCDF file, TG_concated.nc, which contains a time series of the monthly data. For simplicity let's only consider the ground temperature variable (TG) shown above. We can specify the variable we want to subset using the -v option with ncrcat. Alternatively we could have averaged the variables across their record dimension using ncra, but then we would no longer have the monthly time series in the output file.

ncrcat -O -v TG model_out_*.nc TG_concated.nc

The resulting file contains only the TG variable; -O simply tells NCO to overwrite the output file if it already exists. We can also get useful information about the ground temperature variable using ncinfo (a utility that ships with the netCDF4 Python package):

$ ncinfo -v TG TG_concated.nc 
<type 'netCDF4._netCDF4.Variable'>
float32 TG(time, lat, lon)
    long_name: ground temperature
    units: K
    cell_methods: time: mean
    _FillValue: 1e+36
    missing_value: 1e+36
unlimited dimensions: time
current shape = (12, 192, 288)
filling off

We can see TG is a 3D variable in time, latitude, and longitude with a corresponding array shape of (12, 192, 288) (months, latitude nodes, longitude nodes). We can leverage this information if we need to aggregate and reduce the dimensionality for plotting or analysis.


Side Note: viewing metadata of NetCDFs!


You might be wondering how to view the metadata of NetCDF files. There are several ways: 1) ncinfo netcdf_file prints the most basic and important information about the contents, such as all variables, dimensions, file history, etc.; 2) ncdump -h netcdf_file provides more detailed info for each variable; 3) ncks -m netcdf_file or ncks -M netcdf_file produces output pertaining to dimensions: -m is useful for finding the length and units of dimensions (and you can pass it variables as well with -v), while -M prints information regarding global attributes. Experiment with these, as they are all super useful!
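For instance, here is a quick sketch using the TG_concated.nc file created above (any NetCDF file would work): the first command prints the header, the second prints the dimension lengths and units for the TG variable only, and the third prints the global attributes such as the file history.

$ ncdump -h TG_concated.nc
$ ncks -m -v TG TG_concated.nc
$ ncks -M TG_concated.nc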



As we saw with ncinfo, our variable of interest is three-dimensional in time, latitude, and longitude, so we can average over these dimensions using ncwa (the NetCDF weighted averager) to reduce its dimensionality. In fact, the default behaviour of ncwa is to average over all dimensions, reducing the variable to a scalar. For example, just calling ncwa on TG_concated.nc:

$ ncwa -O TG_concated.nc out.nc
$ ncks -v TG out.nc
TG: type NC_FLOAT, 0 dimensions, 5 attributes, chunked? no, compressed? no, packed? no
TG size (RAM) = 1*sizeof(NC_FLOAT) = 1*4 = 4 bytes
TG attribute 0: long_name, size = 18 NC_CHAR, value = ground temperature
TG attribute 1: units, size = 1 NC_CHAR, value = K
TG attribute 2: cell_methods, size = 31 NC_CHAR, value = time: mean time, lat, lon: mean
TG attribute 3: _FillValue, size = 1 NC_FLOAT, value = 1e+36
TG attribute 4: missing_value, size = 1 NC_FLOAT, value = 1e+36

TG = 267.074 K

The resulting scalar value of 267.074 Kelvin for TG in the output file is the average ground temperature over all grid cells and all 12 months (note that this simple average is not weighted by grid cell area). To avoid averaging over all dimensions you must use the -a option followed by the dimensions you want to average over. Using ncwa -a with just the record dimension of the input file is equivalent to the default behaviour of the NCO record averager ncra. By specifying dimensions with -a we can get the latitudinal zonal mean ground temperature by averaging over only the longitude and time axes:

$ ncwa -O -v TG -a lon,time TG_concated.nc lat_zonal_mean.nc
$ ncinfo -v TG lat_zonal_mean.nc 
<type 'netCDF4._netCDF4.Variable'>
float32 TG(lat)
    long_name: ground temperature
    units: K
    cell_methods: time: mean time, lon: mean
    _FillValue: 1e+36
    missing_value: 1e+36
unlimited dimensions: 
current shape = (192,)
filling off

The new shape of TG in lat_zonal_mean.nc is (192,), which is the size of the latitude array in the original file. What we have created is a file in which ground temperature has been averaged over the 12 months and over every longitude cell, resulting in a single value for each latitude node. This output file is useful for making a latitudinal zonal mean plot of mean ground temperature:

It is not an error that the line breaks around -60 degrees latitude; there is no land in this stretch of the world, and remember we are viewing ground temperature.

The NCO weighted average operator ncwa also supports subsetting the domain of operation over one or more dimensions using the syntax ncwa -d dim,[min],[max],[stride] -d dim,[min],[max],[stride] in.nc out.nc. The resulting output file will contain the weighted average only over the desired dimensional domain. If stride is not given then all data between min and max are included; a stride of 2 selects every other value of the dimension (e.g. every other day if dim is time), 3 every third, and so on. You must include at least one of the min and max values; leave one side empty, keeping the comma, for an open-ended range (see the sketch after the next example). There is no limit on how many dimensions and ranges of dimensions you can subset over. For example, say we wanted to calculate the latitudinal zonal mean as before but only for latitudes above the equator, and say we also wanted to see how the zonal average evolves through time:

$ ncwa -O -v TG -a lon -d lat,0.,90. TG_concated.nc lat_zonal_mean_0_to_90_degrees.nc

The resulting output contains the latitudinal zonal mean for each of the twelve months of data; the corresponding 2D plot shows the mean ground temperature along the latitudinal axis (x axis) versus time (y axis).

The global monthly temperature from the model appears out of phase because the concatenated files we started with were in water-year order, that is October through September. I chose not to interpolate the monthly data in the plot, so you can see 12 distinct bands, one for each month starting with October. That is not to say the concatenation did not keep track of the dates correctly (it did), but this example shows that you have flexibility in which date the time series starts from.
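Returning to the -d option, here is a minimal sketch of the open-ended range and stride syntax described above (the output file name is arbitrary, and this assumes ncwa accepts the stride argument as described). The command averages TG over longitude for every latitude from 30 degrees north to the top of the grid, using only every other month:

$ ncwa -O -v TG -a lon -d lat,30.0, -d time,,,2 TG_concated.nc lat_zonal_mean_subset.nc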

Masking with logical conditionals

In addition to subsetting the domain of operation by dimensions, ncwa supports logical conditional statements for selection based on a variable's value and a relational operator ($=, \ne, \gt, \lt, \ge, \le$). There are two ways to do this: the long hard way, ncwa -m mask_variable -M mask_value -T Fortran_relational in.nc out.nc, or the short simple way, ncwa -B 'mask_variable relational mask_value' in.nc out.nc. In the long form, -m precedes the variable used for masking, -T precedes a Fortran-style relational operator (eq, ne, gt, lt, ge, le, the Fortran equivalents of $=, \ne, \gt, \lt, \ge, \le$), and -M precedes the masking value. If you use the -B option the valid relational operators are ==, !=, >, <, >=, <=, and you should make sure to enclose the conditional expression after -B in quotes.
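As a quick sketch of the two forms side by side, using the flags described above (the output names are arbitrary and 273.15 K is just an example threshold), either command below should average TG only over the grid cells and months where the ground temperature is above freezing:

$ ncwa -O -m TG -M 273.15 -T gt TG_concated.nc above_freezing_long.nc
$ ncwa -O -B 'TG > 273.15' TG_concated.nc above_freezing_short.nc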

For a fuller example, let's say we want to look at two variables from the model: soil evaporation rate (QSOIL) and plant transpiration rate (QVEGT). With ncwa we could calculate the evaporation for locations where the transpiration is high (or vice versa), just to see how this NCO operation works. This requires three steps: first we concatenate the monthly files, this time keeping these two variables; second we calculate the global mean transpiration; and last we calculate the global mean evaporation for locations where transpiration is above the global mean and plot the results.

$ ncrcat -O -v QSOIL,QVEGT model_out_199*.nc ET_concated.nc
$ ncwa -O ET_concated.nc global_avg_ET.nc
$ ncks -v QVEGT global_avg_ET.nc
QVEGT: type NC_FLOAT, 0 dimensions, 5 attributes, chunked? no, compressed? no, packed? no
QVEGT size (RAM) = 1*sizeof(NC_FLOAT) = 1*4 = 4 bytes
QVEGT attribute 0: long_name, size = 20 NC_CHAR, value = canopy transpiration
QVEGT attribute 1: units, size = 4 NC_CHAR, value = mm/s
QVEGT attribute 2: cell_methods, size = 31 NC_CHAR, value = time: mean time, lat, lon: mean
QVEGT attribute 3: _FillValue, size = 1 NC_FLOAT, value = 1e+36
QVEGT attribute 4: missing_value, size = 1 NC_FLOAT, value = 1e+36

QVEGT = 9.58266e-06 mm/s

$ ncwa -O -a time -B 'QVEGT > 9.58266e-06' ET_concated.nc out.nc

The resulting plot of soil evaporation (QSOIL) in the output file looks like:

The logical indexing resulting from -B 'QVEGT > 9.58266e-06' appears to have masked out the world's largest deserts in the output file. This is an obvious result considering the dearth of vegetation in these regions; and although this example does not clearly illustrate the relationship between evaporation and transpiration (which I expect is also dynamic seasonally), with a well-thought-out plan the masking option can be quite useful for scientific data analysis.

Weighted averaging

You might be wondering by now how to calculate weighted averages with ncwa; after all, ncwa stands for NetCDF weighted averager! Well, it's simple: the only extra switch required is -w followed by the variable to use for the weights. Sometimes it is also useful to include the -N option, which tells ncwa to use only the numerator of the weighted average, i.e. integrating the variable over the selected weighting variable such as grid cell area. Variables (at least from model output) are often area-weighted to begin with; if so, -w area without -N will produce the same output as not weighting by area in the first place. However, if you use -N -w area -v var, ncwa will calculate the area$\times$var product in each cell without ever dividing by the sum of the areas, thus integrating the variable over area. Here are two examples to clarify.

For the first example we will calculate the annual mean leaf area index (LAI) as simulated by the Community Earth System Model. For comparison we will calculate the same mean weighted by simulated plant transpiration (QVEGT). The latter will cause grid cells with both high LAI and high QVEGT to stand out while lessening the value of LAI where QVEGT is below average.

$ ncrcat -O model_out_199*.nc concated.nc
$ ncwa -O -a time concated.nc not_weighted.nc
$ ncwa -O -w QVEGT -a time concated.nc QVEGT_weighted.nc

Here is the corresponding plot for the non-weighted mean LAI:

And here is the corresponding plot for the mean LAI weighted by QVEGT:

There are clear differences between the weighted and non-weighted LAI maps; weighting lets us visualize the locations where vegetation is both dense and actively transpiring. One observation I made by viewing both plots is that there are large regions in northern North America and Asia that transpire quite a bit yet do not correspond to the highest LAI; most of this region is home to the boreal (taiga) coniferous forests. Perhaps the boreal forests are transpiring significantly but are not as dense as, say, the tropical rainforests of the world. We could have seen the impact of the transpiration weighting more clearly by simply differencing these two maps (see the sketch below), however I wanted to keep the interpretation of the weighting operation as simple as possible.
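If you did want that difference map, the NCO binary operator ncdiff can produce it in one line. Here is a minimal sketch using the two files created above (the output name is arbitrary); ncdiff subtracts the variables in the second file from those in the first:

$ ncdiff -O QVEGT_weighted.nc not_weighted.nc LAI_weighting_difference.nc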

To clarify the use of the -N option, this example shows how to integrate a variable over grid cell area and time. This is straightforward:

$ ncwa -O -N -w area -a time concated.nc Area_integrated.nc

Now let's take a look at the familiar LAI variable after it has been integrated with respect to grid cell area and time:

By integrating with respect to grid cell area (km$^2$) and time, the values of LAI become very large and are no longer physically meaningful; to be precise, the values are simply the product of LAI and the grid cell area summed over the 12-month record: $$\mbox{Value for each grid cell above}~=~ \sum_{n=1}^{12} LAI_{i,j,n}\times GCA_{i,j}$$ where $GCA_{i,j}$ is the grid cell area (km$^2$) for the cell at the $i^{th}$ latitude index and the $j^{th}$ longitude index on the model grid, which has 1$^\circ$ resolution. We see that the values near the equator increased and the values toward the poles decreased compared to the non-weighted plot of LAI above. This is simply a result of the spherical grid the model uses, which has much larger grid cell areas near the equator.

The -N option is equivalent to -y ttl, which tells ncwa to take the sum, as opposed to the mean, over the dimensions specified after -a; and since area is not a dimension in the files we are working with, we needed to list area as the weight variable. Note that integrating some variables may or may not make sense depending on what you are trying to accomplish; remember that some variables may already be area-weighted, and take special care to understand the units you are dealing with. For example, QVEGT above has units of mm/s. If you weighted QVEGT by grid cell area (which happens to be in km$^2$) and summed over time as shown above for LAI, you would get values with units of mm$\cdot$km$^2$/s summed over the 12 months. You could then calculate the average volumetric flux by dividing by 12 and converting the km$^2$ to mm$^2$.
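If the -N and -y ttl equivalence described above holds for your version of NCO, the following sketch should reproduce the area-integrated file created earlier (the output name is arbitrary):

$ ncwa -O -y ttl -w area -a time concated.nc Area_integrated_ttl.nc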

Computing other aggregating statistics besides averaging

ncwa averages by default but also supports other aggregating statistics, specified after the -y option. These methods also work for ncra, ncea, and nces. The ncra (NetCDF record averager) is a less sophisticated version of ncwa; you can find more about it and ncea (NetCDF ensemble averager) here. As mentioned, the -N option is equivalent to -y ttl and tells ncwa to sum over the dimensions listed after the -a option. Here is a list of all the statistics that ncwa, ncra, ncea, and nces support, with each operation keyword followed by a colon and its description:

  • avg : mean value
  • ttl : sum of values
  • sqravg : square of mean
  • sqrt : square root of the mean
  • avgsqr : mean of sum of squares
  • max : maximum value
  • min : minimum value
  • mabs : maximum absolute value
  • mebs : mean of absolute value
  • mibs : minimum absolute value
  • rms : root-mean-square (normalized by N)
  • rmssdn : root-mean-square (normalized by N-1)

For example,

$ ncwa -O -v QVEGT -a time -d lat,-60.,15. -d lon,280.,330. -y max concated.nc out.nc

This will calculate the maximum of the monthly simulated QVEGT at each grid cell within a rectangular region around South America.


Final remarks

As we saw, the NCO command line tool for weighted averaging (ncwa) is powerful and flexible; it is not limited to weighted averaging but can compute a range of aggregating statistics, with or without weights, including subsetting and logical indexing or masking. This makes ncwa one of my favorite NCO tools because it can do so many jobs. Once you get familiar with it I think you will prefer it over opening NetCDF files in your language of choice and doing the operations there, because the NCO syntax is succinct and the inner workings are optimized for speed and efficient memory use. Once you couple NCO with a scripting language of your choice it becomes even more powerful. If you enjoyed this post or have suggestions, please comment, like, subscribe, and keep visiting! Cheers and happy NetCDF-ing :)

Tuesday, October 27, 2015

Scale Independent Hydrologic Metrics


Scale independent hydrologic metrics are quantities that capture watershed hydrologic responses and do not depend on the size of the watershed; put another way, they are useful ways to measure hydrologic response that do not scale with watershed area. Even these metrics, however, generally break down at very fine or very large scales (e.g. the pore scale or continental scale). Scale independent metrics are useful for relating hydrologically similar watersheds of different sizes to one another and for identifying fundamental hydrologic properties of a watershed. For example, the relation between storage and discharge of a watershed can be described using scale-invariant recession metrics. I put together a table of four basic and well known scale independent hydrologic metrics that can be calculated for a watershed using commonly available data such as streamflow, precipitation, and temperature.


Runoff ratio: $\frac{\overline{Q}}{\overline{P}}$
$\overline{Q}$ and $\overline{P}$ are the long-term average streamflow and precipitation for the watershed. The runoff ratio represents the fraction of precipitation that has contributed to streamflow in a watershed, assuming the only hydrologic input is direct precipitation. Depending on the time period used to calculate it, the runoff ratio can be an indicator of surface properties in a watershed (impervious to highly permeable) or, over longer periods, of losses to evapotranspiration (see the worked example after this table).

Snowfall/precipitation-day ratio: $\frac{N_{S}}{N_{P}}$
$N_{S}$ is the number of days in a given year with snowfall and $N_P$ is the number of days with precipitation. These are sometimes measured using a threshold amount of snow or precipitation, e.g. the number of days with snowfall above 1 cm or rainfall above 1 mm; $N_S$ may also be calculated as the number of days when the temperature was below freezing. Also called the snow day ratio, this metric quantifies the dominance of snow as the major form of precipitation for a watershed.

Streamflow elasticity: median$\left(\frac{dQ/\overline{Q}}{dP/\overline{P}}\right)$
$\overline{Q}$ and $\overline{P}$ are the long-term average streamflow and precipitation for the catchment. The derivatives of $Q$ and $P$ are generally not available because these variables are not measured continuously, so they are estimated by differencing each day's value from the previous day's; alternatively the difference can be calculated by subtracting the annual mean value from each day. Streamflow elasticity indicates how sensitive the watershed is to precipitation, that is, for a given change in $P$, how much $Q$ will change. A high value indicates an elastic or sensitive watershed whereas a low value indicates an inelastic or insensitive watershed.

Baseflow index: $\frac{\overline{Q_{bf}}}{\overline{Q_{sf}}}$
$\overline{Q_{bf}}$ is the long-term average of the baseflow component of streamflow, usually estimated using a digital filter, and $\overline{Q_{sf}}$ is the long-term average streamflow for the catchment. The baseflow index (BFI) represents the long-term average proportion of the baseflow component (i.e. groundwater discharge) of total streamflow in a stream. A BFI close to one means that nearly all flow is from slow subsurface discharge to the stream, whereas a BFI near zero means that the stream gets most of its input from fast overland flow or preferential flow from precipitation events. The BFI can thus be related to land cover, soil/aquifer permeability, and other watershed characteristics.
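As a quick worked example of the first metric (the numbers here are made up for illustration): for a watershed with long-term average precipitation $\overline{P} = 800$ mm/yr and long-term average streamflow, expressed as a depth over the watershed area, of $\overline{Q} = 240$ mm/yr, the runoff ratio is $$\frac{\overline{Q}}{\overline{P}} = \frac{240~\mbox{mm/yr}}{800~\mbox{mm/yr}} = 0.3,$$ meaning roughly 30% of precipitation leaves the watershed as streamflow and the remaining 70% is lost over the long term, mostly to evapotranspiration.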

The concept of scale independent metrics is important in hydrology; they can give insight into fundamental hydrologic processes in a watershed. This is an abbreviated list of straightforward metrics that can easily be calculated for a gaged watershed. Using scale independent metrics to investigate hydrologic change and variability is valuable because the metrics relate to key hydrologic processes or important watershed characteristics, and they are also commonly used to classify hydrologically similar watersheds. Can you think of other scale independent metrics? Feel free to post your thoughts in the comments.

Wednesday, October 7, 2015

Why you should learn Python's multiprocessing module

Python multiprocessing basics

This post is not an in-depth guide to multiprocessing in Python, or even a brief intro; rather it is intended to give you motivation to bother learning it. When I recently experimented with Python's cross-platform multiprocessing module I was pleasantly surprised at how easy it was to use, and I was able to quickly parallelize iterative tasks. For example, I have a script that runs the same text parsing on files in multiple directories. Using a simple tool from the multiprocessing module allowed me to easily run the text processing in multiple directories simultaneously. This saved a lot of time, since the alternative was looping over the list of directories and running the same command in each directory one at a time. You might have heard that Python is not a good language for parallel processing due to the Global Interpreter Lock, but the multiprocessing module sidesteps the lock by using separate processes instead of threads.



Simple example using Pool

The multiprocessing module has several objects that may be useful for you; one that stands out is the Pool class. Pool allows you to utilize multiple processors to run a set of independent tasks quite efficiently (with a small amount of code) in parallel. To use Pool, first create a pool object and then simply call its map method, which maps a Python collection (e.g. a list, or the keys of a dictionary) to a single-parameter function. In other words, map applies the function you pass it to every element of the collection you also pass it as input. Check it out,

import multiprocessing as mp
import time

a_list = range(8)

def f(x):
    time.sleep(5)
    print x


pool = mp.Pool(processes=8)
pool.map(f,a_list)

When run produces,

$ python mp.py 
2
1
0
3
4
5
6
7

In this arbitrary and simple example the last two lines of code do all the work: they create the pool object and apply the function "f" to each element in "a_list" across 8 worker processes. The Pool map method is analogous to the built-in Python map function. Also notice that f simply prints its input x, but the printed output is not in the original order of a_list. This is the expected result because eight processes are running in parallel and print whenever they happen to finish (the list of return values from pool.map does come back in input order, but side effects like printing do not). This means that the tasks you assign to the pool should be completely independent from one another.

You can easily extend this example to fit more complicated workflows, and as you can see it is incredibly simple. Chances are you have access to a multi-core or hyper-threaded processor that you are under-utilizing. So don't be scared anymore: this Python module allows anyone to use multiprocessing in a very simple way, and any time you invest in learning it will reward you by saving runtime later. Cheers

Saturday, September 26, 2015

Concatenate and average NetCDFs


Network Common Data Form (NetCDF) files are commonly used to hold multidimensional scientific data; the NetCDF project was started in the 1980s and is maintained by Unidata. If you work with NetCDF files on a routine basis you probably already have a method for merging or aggregating data across multiple files. Conversely, if you are new to NetCDF files this lesson is a good starting point for basic commands that aid in aggregating data across multiple files. Ultimately, learning NCO should save you a lot of time over developing your own methods for working with NetCDF files from scratch. This is an introduction to a few basic NCO operators with examples. Everything in this post should work on Debian-based Linux distributions and, with slight modifications, on any system.


NetCDF Operators (NCO) toolkit

  • Command line tools for essential data wrangling tasks on netCDF, HDF, and DAP files
  • Fast optimized algorithms, Open Source
  • Flexible syntax, shell wildcard expansion and extended regex compatibility
  • Can seem like a lot to learn at first- start simple

Get NCO for Debian systems

Get the newest full distribution from a compressed tarfile here. I installed via the apt repository and recommend it:

sudo apt-get update
sudo apt-get install nco

Concatenating


Concatenate data that extends multiple files using ncrcat

When working with netCDF files you will often have a sequence of files that correspond to the same record, commonly time series data that spans multiple files. For example, you may have temperature data from a station that records every 15 minutes, with each file holding one year of data. Let's say you have a netCDF file of this data for every year from 1990 to 2010:

$ ls
temp_1990.nc
temp_1991.nc
temp_1992.nc
temp_1993.nc
temp_1994.nc
temp_1995.nc
temp_1996.nc
temp_1997.nc
temp_1998.nc
temp_1999.nc
temp_2000.nc
temp_2001.nc
temp_2002.nc
temp_2003.nc
temp_2004.nc
temp_2005.nc
temp_2006.nc
temp_2007.nc
temp_2008.nc
temp_2009.nc
temp_2010.nc

To concatenate these files so that the time series values from each file flow together in one file, use the ncrcat command, which concatenates multiple files along a record dimension, commonly time. All NCO command line operations follow the general syntax: operator -options [option_params] input_file(s) output_file(s). We can run ncrcat to concatenate all the temperature files using the standard shell * wildcard, which matches any sequence of characters.

ncrcat temp_*.nc temp_full_record.nc

The resulting temp_full_record.nc NetCDF file will contain a complete time series of all data variables indexed to their record dimension. The alternative to using the * glob expansion wildcard is to list each input file, space delimited. For example:

ncrcat temp_2000.nc temp_2001.nc temp_2002.nc temp_2000-2002.nc

Sometimes you may be working with very large files, or a large number of files, and to save resources and time you may want to concatenate only one or a few variables from the files. This is easily done using the -v or --variable (long name) option of ncrcat. For example, if the temp NetCDF files contained variables such as humidity, tmax, tmin, dewpoint, solrad, and windspeed and we are only interested in humidity and windspeed, then the following command would concatenate just these two variables while maintaining any associated dimensions or coordinates (e.g. time, latitude, longitude, vertical layers, etc.).

ncrcat -v humidity,windspeed temp_*.nc humidity_wind.nc

Note that the variables listed after the -v option must be comma delimited without any whitespace. Another useful option that complements -v is the exclude option, --exclude or -x, which will exclude the specified variables from the output file while keeping all others found in the input files, as shown below.
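For instance, a minimal sketch that keeps everything except those two variables (the output file name is arbitrary):

ncrcat -x -v humidity,windspeed temp_*.nc all_but_humidity_wind.nc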

Ensemble Concatenation using ncecat

Another useful NCO operator, ncecat, will concatenate multiple NetCDF files that have the same length record dimension and the same variables. For example, you may have output from five climate model realizations (from the same model) that all ran for the same time period and all have the same dimensions, coordinates, and variables. If you want all of them in one file, with the variables from each realization side by side, then use ncecat. A restriction is that ncecat requires all the files to be of the same length and dimension for each variable and to contain the same variables (the names of variables and dimensions, as well as the values of the dimensions, must match exactly). In this example we have output files from five model realizations:

$ ls
scenario_00.nc
scenario_01.nc
scenario_02.nc
scenario_03.nc
scenario_04.nc

In a similar way to how we used ncrcat above, we can place the variables from each of these files into one file. This time ncecat places the same variables from each file side by side, because each file spans the same record dimension values (time is a common record dimension); do not use ncrcat for files like these that cover the same record values. If we wanted to concatenate all data from the files for scenarios two through four inclusive, we could use ncecat with the [] glob expansion:

ncecat scenario_0[234].nc scenarios_02_to_04.nc

Averaging


Average data that extends multiple files using ncra

To average multiple NetCDF files that contain the same variables over the same record dimension (usually time), ncra can be used in an analogous way to ncrcat. It is easy to keep these two straight because of their abbreviations: ncrcat, the "NetCDF record concatenator", and ncra, the "NetCDF record averager". Using the same temperature files as shown above, to average all variables in these files over the 1990-2010 record simply run ncra on all the input files:

ncra temp_*.nc temp_full_record.nc

A useful option, -d (long version --dimension), allows us to take a subset (hyperslab) of data based on a dimensional range. For example, you may want to know the 1990-2010 average of the recorded data variables (humidity, tmax, tmin, dewpoint, solrad, and windspeed) for all points north of 45 degrees latitude. This can be done via:

ncra -d lat,45.,90. temp_*.nc temp_full_record.nc

The general syntax for subsetting is -d dim,[min],[max]. Note that there are no spaces between the dimension name (lat) and the min (45.) and max (90.) latitude values. The output file from the example above will contain the record average (usually over time) of each variable, but only for the spatial coordinates above the 45th parallel; do not confuse this with averaging all the variable values above the 45th parallel into a single number. The concatenating operators ncrcat and ncecat can also use this option. For combinations of subsets over two or more dimensions (e.g. latitude and longitude), you can either use the NetCDF kitchen sink operator (ncks) first, with the syntax ncks -d dim,[min],[max],[stride] -d dim,[min],[max],[stride] in.nc out.nc, or use the weighted averager ncwa, which accepts multiple dimensional ranges directly. The resulting output file will have the desired dimensional domain. If stride is not given then all data between min and max are included; a stride of 2 selects every other value of the dimension (e.g. every other day if dim is time), 3 every third, and so on. With ncks there is no limit on how many dimensions and ranges of dimensions you can subset over. See the multislab discussion, section 3.9 in the NCO user's guide, for more info.
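To make the two-step ncks approach concrete, here is a sketch (the dimension names are assumed to match the temperature files above, the latitude-longitude box is arbitrary, and the longitude range assumes 0-360 degree coordinates) that first carves out a spatial box and then averages it over the record dimension:

ncks -O -d lat,35.,50. -d lon,245.,265. temp_full_record.nc box_subset.nc
ncra -O box_subset.nc box_subset_avg.nc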

Ensemble averaging using ncea

Recall the example on ensemble concatenation where ncecat combined data with the same record dimension from five model realizations into one NetCDF file. In a similar way, ncea will perform an ensemble average over the various model output files. We can take an ensemble average of humidity across the five climate scenarios; the output will be something like a single spatial array of humidity. Remember, humidity is a variable in the climate NetCDF files output by the climate model.

ncea -v humidity scenario_*.nc ensemble_avg_humidity.nc

Final remarks

These NCO operators are useful, fast, and free, so use them! This post went over the basics of concatenating and averaging data from multiple NetCDF files and some of the fundamentally important options. It is a good start; however, there are more operators, and even the operators described here offer more advanced options and functionality. For example, many times you might want to average over dimensions other than time (the typical record dimension), and for that the NetCDF weighted averager (ncwa) is the tool you want. To learn more about advanced averaging methods using NCO, including subsetting, conditional masking, and weighted statistics, check out this post.


Useful links

NCO's homepage: http://nco.sourceforge.net/

NCO User's Manual: http://nco.sourceforge.net/nco.html

Saturday, September 19, 2015

Getting WhatPulse installed on Linux

Whatpulse on Linux

If you are having issues installing WhatPulse on a Debian-based Linux system, hopefully this post will help. If you don't know what WhatPulse is, check it out at the official site; it is a sweet little free app that tracks your keyboard and mouse usage and gives you fun statistics like which keys you press the most or time series plots of how much you type over time, including application-specific stats. You can also compare yourself to global users' stats. I like it because it keeps me active and motivated to write. If you decide to give it a try you can register a free WhatPulse account first (note this is my referral link, unpaid, and my WhatPulse ID is darcyslaw). There is generally less documentation for Linux installation, so I decided to share what worked for me here.


Install dependencies

There are quite a few libraries required for WhatPulse, mainly from the Qt framework that WhatPulse is built on.

A list of these dependencies is given in a support document from WhatPulse:

  • libQtCore
  • libQtWebKit
  • libqt4-sql
  • libqt4-sql-sqlite
  • openssl-devel (libssl-dev)
  • libQtScript

The easiest way to get these libraries is through the apt repository. I looked up the current names of these packages, so you can try just copying and pasting:

sudo apt-get update
sudo apt-get install libqtcore4 libqtwebkit4 libqt4-sql libqt4-sql-sqlite libssl-dev libqtscript4-core

Download and setup

Next you will want to download the correct compressed file for your Linux distribution and CPU here: https://whatpulse.org/downloads/

To extract the .tar.gz file you can use:

tar -zxvf whatpulse-linux-YOUR-VERSION.tar.gz

Now the last step is to give WhatPulse the ability to access your keyboard and mouse input, along with other privileges. Change directory into the extracted WhatPulse directory, wherever you put it, and run the supplied shell script setup-input-permissions.sh as root:

sudo ./setup-input-permissions.sh

Hit return and you will be prompted for the user you would like to give WhatPulse permissions for; that is your user on the Linux machine, e.g. "john" in my case. That's it, you should be done, and the next time you reboot your system WhatPulse should start!

Sunday, September 13, 2015

Write and compile your first Fortran 95 program on Linux

Fortran is a compiled programming language commonly used in scientific and numerical computations. It is one of the oldest (if not the oldest) machine-independent programming languages, created at IBM in the 1950s. Many optimized numerical libraries were written in Fortran and are used by modern high-level languages such as Python's numerical library numpy. Scientific numerical models of physical phenomena such as weather and climate models are commonly written in Fortran; they often utilize parallel processing and run on the world's most powerful supercomputers.


Get the gfortran compiler

In this post you will learn how to write a simple Fortran 95 program and compile it on Linux. I am using a Debian-based system (Linux Mint) and compiling with gfortran. If you already have gfortran installed, skip down to "Write the program"; if you are not sure whether you have gfortran installed, you can run:


gfortran --version

If gfortran is installed you should see something like:


GNU Fortran (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

If you do not have gfortran installed you can run the following to install it from the apt repository:


sudo apt-get update
sudo apt-get install gfortran

Note, you have the option to install different versions of gfortran as opposed to the newest version in your repository. You might want to check out this StackExchange post if you're having trouble installing gfortran.

Write the program

Once you have installed gfortran we can create, compile, and then run our first Fortran 95 program. Yay! This part is easy: go ahead and create an empty file called first.f95 with your preferred text editor and then type the following Fortran source code:


!my first Fortran program ever written
 program first
    IMPLICIT NONE
    print *,'Hello World!'
    print *,'This is my first Fortran 95 program ever.'
 end program first

Save and close your file; be sure to save it as first.f95 if you did not name it when you created it.

Compile and run!

At the Linux command prompt within the same directory as first.f95 we can now compile our source code with gfortran:


gfortran first.f95 -o first

Note that here we used the -o option so that we could name our executable; otherwise the default executable from gfortran will be a.out, which may not be very appealing. Finally we need to test our new program. We can run it just like any other executable file on Linux by typing ./ in front of its name. With any luck you should get the following output:

./first
 Hello World!
 This is my first Fortran 95 program ever.

Congratulations! You just successfully wrote, compiled, and ran a simple Fortran program on Linux. More to come on Fortran in future posts!

Monday, September 7, 2015

How to highlight source code for displaying in HTML

In this post I will show you a useful web tool created by Alexander Kojevnikov that allows you to quickly and conveniently convert your source code to highlighted HTML so that you can place it on your website or blog.


Let's say you have some Python code snippet:

def convert(x):
    return(x*2.54)
## list of values to convert
vals = [31,24,15,6.3]
for each in vals:
    print convert(each)
   

What you really want is HTML that displays your source code as it appears in your development environment, for example with highlighted keywords like:


def convert(x):
    return(x*2.54)
## list of values to convert
vals = [31,24,15,6.3]
for each in vals:
    print convert(each) 

Try it

Now try it yourself: just go to the website http://hilite.me/ and type your source code in the box on the left, select the language of the source code and the output style, then hit "Highlight!". You will get HTML output for displaying your source code, and many different languages are supported. For example, the screenshot below shows how to create pretty HTML for the Python code snippet above as it would be displayed in the vim text editor using vim's default color scheme. You even have the option to display line numbers in the HTML! Example screenshot:



 
 

Hilite.me is a useful web tool for quickly highlighting and styling source code in many languages, with output in HTML, markdown, LaTeX, and other formats. With a bit of research you will find that everything the tool does can also be done locally using the Pygments Python module, which may be useful in a workflow or if you do not have internet access. If there is interest in how to use Pygments in a workflow it will be a future post. Cheers
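For a taste, Pygments ships with a command-line tool called pygmentize. Here is a minimal sketch (the file names are placeholders, and the exact styles available depend on your Pygments version) that mirrors what hilite.me does by turning a Python snippet into a standalone highlighted HTML page:

pip install Pygments
pygmentize -l python -f html -O full,style=default -o snippet.html snippet.py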