Saturday, September 26, 2015

Concatenate and average NetCDFs

Network Common Data Form (NetCDF) files are commonly used to hold multidimensional scientific data. The NetCDF project was started in the 1980s and is maintained by Unidata. If you work with NetCDF files on a routine basis you probably already have a method for merging or aggregating data across multiple files. If you are new to NetCDF, this lesson is a good starting point for basic commands that help with that task. Learning NCO should save you a lot of time compared with developing your own methods from scratch. This is an introduction to a few basic NCO operators, with examples. Everything in this post should work on Debian based Linux distributions and, with slight modifications, on any system.


NetCDF Operators (NCO) toolkit

  • Command line tools for essential data wrangling tasks on netCDF, HDF, and DAP files
  • Fast optimized algorithms, Open Source
  • Flexible syntax, shell wildcard expansion and extended regex compatibility
  • Can seem like a lot to learn at first, so start simple

Get NCO for Debian systems

You can get the newest full distribution as a compressed tarfile here. I installed NCO from the apt repository and recommend doing the same:

sudo apt-get update
sudo apt-get install nco

Concatenating


Concatenate data that spans multiple files using ncrcat

When working with NetCDF files you will often have a sequence of files that belong to the same record, most commonly a time series that spans multiple files. For example, you may have temperature data from a station that records every 15 minutes, with each file holding one year of data. Let's say you have a NetCDF file of this data for every year from 1990 to 2010:

$ ls
temp_1990.nc
temp_1991.nc
temp_1992.nc
temp_1993.nc
temp_1994.nc
temp_1995.nc
temp_1996.nc
temp_1997.nc
temp_1998.nc
temp_1999.nc
temp_2000.nc
temp_2001.nc
temp_2002.nc
temp_2003.nc
temp_2004.nc
temp_2005.nc
temp_2006.nc
temp_2007.nc
temp_2008.nc
temp_2009.nc
temp_2010.nc

We want to concatenate these files so that the time series values from each file flow together in one file. This is done with the ncrcat command, which concatenates multiple files along a record dimension (commonly time). All NCO command line operations follow the general syntax: operator -options [option_params] input_file(s) output_file(s). We can run ncrcat to concatenate all the temperature files using the standard shell * wildcard, which matches any character any number of times.

ncrcat temp_*.nc temp_full_record.nc

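The resulting temp_full_record.nc NetCDF file will contain a complete time series of all data variables indexed to their record dimensions. Note that the output name also matches temp_*.nc, so move or rename it before running another command with that pattern, otherwise it will be read as an input. The alternative to the * wildcard is to list each input file one by one, separated by spaces. For example: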

ncrcat temp_2000.nc temp_2001.nc temp_2002.nc temp_2000-2002.nc
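To preview which files a shell pattern will pick up before handing it to an NCO operator, the matching can be mimicked in Python. This is only a sketch: it uses the standard library's fnmatch module, which follows the same rules as the shell for these simple patterns, and the file names are hypothetical ones modeled on the listing above.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing: three yearly files plus one earlier output.
names = ["temp_2000.nc", "temp_2001.nc", "temp_2002.nc", "temp_full_record.nc"]

def expand(pattern, names):
    """Return the names a shell-style pattern would match, in sorted order."""
    return sorted(n for n in names if fnmatchcase(n, pattern))

star_matches = expand("temp_*.nc", names)     # '*' matches any run of characters
year_matches = expand("temp_????.nc", names)  # '?' matches exactly one character

print(star_matches)
print(year_matches)
```

The broad pattern also matches the previously written output file, while the four question marks match only the yearly files.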

Sometimes you will be working with very large files or a large number of files, and to save resources and time you may only want to concatenate one or a few variables. This is easily done using the -v (long name --variable) option of ncrcat. For example, if the temp NetCDF files contain variables such as humidity, tmax, tmin, dewpoint, solrad, and windspeed and we are only interested in humidity and windspeed, then the following command concatenates just these two variables while keeping any associated dimensions or coordinates (e.g. time, latitude, longitude, vertical layers, etc.).

ncrcat -v humidity,windspeed temp_*.nc humidity_wind.nc

Note that the variables listed after the -v option must be comma delimited without any white space. Another useful option that complements -v is the exclude option, -x or --exclude. Used together with -v, it excludes the listed variables from the output file while keeping all others found in the input files.
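Conceptually, record concatenation joins each variable's values end to end in the order the files are given. The sketch below mimics that on plain Python lists. The function record_concat and the two small "files" are hypothetical illustrations, not part of NCO, and unlike ncrcat the sketch does not carry along coordinate variables.

```python
def record_concat(files, variables=None, exclude=False):
    """Join each variable's values end to end, in the order the files are given.

    files: list of dicts mapping variable name -> list of values along time.
    variables: names to keep (like -v), or to drop when exclude is True (like -x).
    """
    names = list(files[0])
    if variables is not None:
        names = [n for n in names if (n in variables) != exclude]
    return {n: [v for f in files for v in f[n]] for n in names}

# Two hypothetical "files" with two time steps each.
y2000 = {"humidity": [40, 42], "tmax": [18, 21], "windspeed": [3, 5]}
y2001 = {"humidity": [45, 47], "tmax": [19, 22], "windspeed": [4, 6]}

subset = record_concat([y2000, y2001], variables={"humidity", "windspeed"})
others = record_concat([y2000, y2001], variables={"humidity"}, exclude=True)
print(subset)
print(others)
```

The first call keeps only the two requested variables. The second drops humidity and keeps everything else.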

Ensemble Concatenation using ncecat

Another useful NCO operator, ncecat, concatenates multiple NetCDF files that have the same variables and the same length record dimension. For example, you may have output from five climate model realizations (from the same model) that all ran for the same time period and all have the same dimensions, coordinates, and variables. If you want all of them in one file with the variables side by side, use ncecat. The catch is that ncecat requires every file to contain the same variables, with the same length and dimensions for each variable (names of variables and dimensions, as well as values of dimensions, must match exactly). In this example we have output files from five model realizations:

$ ls
scenario_00.nc
scenario_01.nc
scenario_02.nc
scenario_03.nc
scenario_04.nc

Much as we used ncrcat above, we can place the variables from each of these files in one file. This time ncecat places the same variables from each file side by side, because each file covers the same record dimension (time is a common record dimension). Do not use ncrcat for files that cover the same record values, since it would append them end to end as if they were one longer series. If we wanted to concatenate all data from scenarios two through four inclusive, we could use ncecat with the [] glob expansion:

ncecat scenario_0[234].nc scenarios_02_to_04.nc
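The difference between the two kinds of concatenation is easiest to see on a toy example. In this sketch, which uses plain Python lists and a hypothetical helper ensemble_concat, each model run is a short series: ensemble concatenation stacks the runs along a new leading dimension, while record concatenation would chain them into one longer series.

```python
def ensemble_concat(members):
    """Stack equally long series along a new leading (ensemble) dimension."""
    length = len(members[0])
    if any(len(m) != length for m in members):
        raise ValueError("all members must have the same record length")
    return [list(m) for m in members]

# Two hypothetical model runs covering the same three time steps.
run_a = [280.1, 280.4, 280.9]
run_b = [279.8, 280.6, 281.0]

stacked = ensemble_concat([run_a, run_b])   # one row per run
chained = run_a + run_b                     # what record concatenation would do

stacked_shape = (len(stacked), len(stacked[0]))
print(stacked_shape, len(chained))
```

The stacked result keeps the three time steps and gains a run dimension of length two. The chained result has six time steps and no longer records which run a value came from.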

Averaging


Average data that spans multiple files using ncra

To average multiple NetCDF files that contain the same variables over the same record dimension (usually time), ncra can be used in much the same way as ncrcat. The two are easy to remember because of their abbreviations: ncrcat is the "NetCDF record concatenator" and ncra is the "NetCDF record averager". Using the same 21 temperature files shown above, to average all variables over the record 1990-2010 run ncra on the yearly input files. The pattern below uses ? (which matches exactly one character) so that only the yearly files are read and not some other file in the directory whose name starts with temp_, such as an earlier concatenated output:

ncra temp_????.nc mean_1990-2010.nc
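As a rough picture of what a record average computes, the sketch below pools every time step from all the input "files" and takes one mean per variable. The helper record_average and the sample numbers are hypothetical, and each variable is a single station series here. With gridded data the same mean would be taken separately at every grid point.

```python
def record_average(files):
    """Average each variable over every time step found in all the files."""
    result = {}
    for name in files[0]:
        values = [v for f in files for v in f[name]]
        result[name] = sum(values) / len(values)
    return result

# Two hypothetical yearly "files" with two time steps each.
y1990 = {"tmax": [10.0, 20.0], "humidity": [40.0, 50.0]}
y1991 = {"tmax": [30.0, 40.0], "humidity": [60.0, 70.0]}

means = record_average([y1990, y1991])
print(means)
```

Because the time steps are pooled before averaging, files with different numbers of records contribute in proportion to their length.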

A useful option, -d (long version --dimension), lets us take a subset (hyperslab) of the data based on a range along a dimension. For example, you may want to know the 21 year average of the recorded variables (humidity, tmax, tmin, dewpoint, solrad, and windspeed) for all points north of 45 degrees latitude. This can be done via the command below, where the ? pattern matches the four-digit year so that only the yearly files are read:

ncra -d lat,45.,90. temp_????.nc mean_1990-2010_north45.nc

The general syntax for subsetting is -d dim,[min],[max]. Note that there are no spaces between the dimension name (lat) and the min (45.) and max (90.) latitude values. The output file from the example above holds the record average (usually over time) of each variable, restricted to the spatial coordinates north of the 45th parallel. Do not confuse this with a single average of all values north of the 45th parallel. The concatenating operators ncrcat and ncecat accept this option too. For subsets that combine two or more dimensions, e.g. a latitude and longitude box, one approach is to subset first with the NetCDF kitchen sink (ncks) operator using the syntax ncks -d dim,[min],[max],[stride] -d dim,[min],[max],[stride] -d dim,[min],[max],[stride] in.nc out.nc. The resulting output file has the desired dimensional domain. If stride is not given then all data between min and max are included; a stride of 2 selects every other value of the dimension (e.g. every other day if dim is time), 3 every third, and so on. With ncks there is no limit on how many dimensions and ranges of dimensions you can subset over. See the multislab section (3.9) of the NCO user's guide for more info. In this specific case (averaging over latitude and longitude) you would also use ncwa, the weighted averager, listing multiple dimensional ranges.
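The min, max and stride logic can be sketched in a few lines of Python. The helper hyperslab and the latitude values are hypothetical. NCO reads bounds written with a decimal point (such as 45.) as coordinate values and bare integers as indices. This sketch handles only the coordinate-value case.

```python
def hyperslab(coords, lo=None, hi=None, stride=1):
    """Return indices of coordinate values within [lo, hi], keeping every stride-th one."""
    inside = [i for i, c in enumerate(coords)
              if (lo is None or c >= lo) and (hi is None or c <= hi)]
    return inside[::stride]

# Hypothetical latitude coordinate of a small grid.
lat = [30.0, 40.0, 45.0, 50.0, 60.0, 70.0]

north = [lat[i] for i in hyperslab(lat, lo=45.0, hi=90.0)]
every_other = [lat[i] for i in hyperslab(lat, lo=45.0, hi=90.0, stride=2)]
print(north)
print(every_other)
```

Leaving out lo or hi keeps that side of the range open, much like leaving min or max empty in -d dim,[min],[max].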

Ensemble averaging using ncea

Recall the example on ensemble concatenation, where ncecat combined data with the same record dimension from five model realizations into one NetCDF file. In a similar way, ncea performs an ensemble average over the model output files. We can take an ensemble average of humidity from the five climate scenarios; the output is a single humidity field with the same dimensions as in each input file, where every value is the mean across the five scenarios. Remember that humidity is a variable in the NetCDF files that were output from the climate model.

ncea -v humidity scenario_*.nc ensemble_avg_humidity.nc
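An ensemble average works element by element across the input files rather than along time. A minimal sketch on plain Python lists, with a hypothetical helper ensemble_average and made-up humidity values for three runs:

```python
def ensemble_average(members):
    """Average equally long series element by element across the members."""
    return [sum(vals) / len(vals) for vals in zip(*members)]

# Three hypothetical runs with humidity at the same three time steps.
run_a = [40.0, 50.0, 60.0]
run_b = [50.0, 60.0, 70.0]
run_c = [60.0, 70.0, 80.0]

ens_mean = ensemble_average([run_a, run_b, run_c])
print(ens_mean)
```

The result has the same length as each input series, one mean per time step, which is the key difference from the record average.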

Final remarks

These NCO operators are useful, fast, and free, so use them! This post went over the basics of concatenating and averaging data from multiple NetCDF files and some of the most important options. It is a good start. However, there are more operators, and even the ones described here offer more advanced options and functionality. For example, you will often want to average over dimensions other than time (the typical record dimension), and for that the NetCDF weighted averager (ncwa) is the tool you want. To learn more about advanced averaging methods using NCO, including subsetting, conditional masking, and weighted statistics, check out this post.


Useful links

NCO's homepage: http://nco.sourceforge.net/

NCO User's Manual: http://nco.sourceforge.net/nco.html

Saturday, September 19, 2015

Getting WhatPulse installed on Linux

WhatPulse on Linux

If you are having issues installing WhatPulse on a Debian based Linux system, hopefully this post will help. If you don't know what WhatPulse is, check it out at the official site. It is a sweet little free app that tracks your keyboard and mouse usage and gives you fun statistics, like which keys you press the most or time series plots of how much you type, including application specific stats. You can also compare yourself to global users' stats. I like it because it keeps me active and motivated to write. If you decide to give it a try you can register a free WhatPulse account first (note this is my referral link, unpaid, and my WhatPulse ID is darcyslaw). There is generally less documentation for Linux installation, so I decided to share what worked for me here.


Install dependencies

There are quite a few libraries required for WhatPulse, most notably the Qt framework that WhatPulse is built on.

A list of these dependencies is given in a support document from WhatPulse:

  • libQtCore
  • libQtWebKit
  • libqt4-sql
  • libqt4-sql-sqlite
  • openssl-devel (libssl-dev)
  • libQtScript

The easiest way to get these libraries is from the apt repository. I looked up the current names of these packages, so you can try copying and pasting:

sudo apt-get update
sudo apt-get install libqtcore4 libqtwebkit4 libqt4-sql libqt4-sql-sqlite libssl-dev libqtscript4-core

Download and setup

Next, download the correct compressed file for your Linux distribution and CPU here: https://whatpulse.org/downloads/

To extract the .tar.gz file you can use:

tar -zxvf whatpulse-linux-YOUR-VERSION.tar.gz

Now the last step is to give WhatPulse the ability to access your keyboard and mouse input and other privileges. Change directory into the extracted WhatPulse directory wherever you put it and run the supplied shell script setup-input-permissions.sh as root:

sudo ./setup-input-permissions.sh

Hit return and you will be prompted for the user you would like to give WhatPulse permission for. That is your user on the Linux machine, e.g. "john" in my case. That's it, you should be done, and the next time you reboot your system WhatPulse should start!

Sunday, September 13, 2015

Write and compile your first Fortran 95 program on Linux

Fortran is a compiled programming language commonly used in scientific and numerical computing. It is one of the oldest (if not the oldest) machine independent programming languages, developed at IBM in the 1950s. Many optimized numerical libraries were written in Fortran and are used by modern high level languages, for example by Python's numerical library numpy. Scientific numerical models of physical phenomena, such as weather and climate models, are commonly written in Fortran. They often use parallel processing and run on the world's most powerful supercomputers.


Get the gfortran compiler

This post shows how to write a simple Fortran 95 program and compile it on Linux. I am using a Debian based system (Linux Mint) and compiling with gfortran. If you already have gfortran installed, skip down to "Write the program". If you are not sure whether gfortran is installed you can run:


gfortran --version

If gfortran is installed you should see something like:


GNU Fortran (Ubuntu 4.8.4-2ubuntu1~14.04) 4.8.4
Copyright (C) 2013 Free Software Foundation, Inc.

GNU Fortran comes with NO WARRANTY, to the extent permitted by law.
You may redistribute copies of GNU Fortran
under the terms of the GNU General Public License.
For more information about these matters, see the file named COPYING

If you do not have gfortran installed you can run the following to install it from the apt repository:


sudo apt-get update
sudo apt-get install gfortran

Note, you have the option to install different versions of gfortran as opposed to the newest version in your repository. You might want to check out this StackExchange post if you're having trouble installing gfortran.

Write the program

Once you have installed gfortran we can create, compile, and run our first Fortran 95 program. Yay! This part is easy: create an empty file called first.f95 with your preferred text editor and type the following Fortran source code:


!my first Fortran program ever written
 program first
    IMPLICIT NONE
    print *,'Hello World!'
    print *,'This is my first Fortran 95 program ever.'
 end program first

Save and close your file. Be sure to save it as first.f95 if you have not named it yet.

Compile and run!

At the Linux command prompt within the same directory as first.f95 we can now compile our source code with gfortran:


gfortran first.f95 -o first

Note that here we used the -o option to name our executable; otherwise gfortran's default executable name is a.out, which may not be very appealing. Finally we need to test our new program. We can run it like any other executable file on Linux by typing ./ in front of its name. With any luck you should get the following output:

./first
 Hello World!
 This is my first Fortran 95 program ever.

Congratulations! You just successfully wrote, compiled, and ran a simple Fortran program on Linux. More to come on Fortran in future posts!

Monday, September 7, 2015

How to highlight source code for displaying in HTML

In this post I will show you a useful web tool created by Alexander Kojevnikov that allows you to quickly and conveniently convert your source code to highlighted HTML so that you can place it on your website or blog.


Let's say you have some Python code snippet:

def convert(x):
    return(x*2.54)
## list of values to convert
vals = [31,24,15,6.3]
for each in vals:
    print(convert(each))

What you really want is HTML that displays your source code as it appears in your development environment, for example with highlighted keywords:


def convert(x):
    return(x*2.54)
## list of values to convert
vals = [31,24,15,6.3]
for each in vals:
    print(convert(each))

Try it

Now try it yourself. Go to the website http://hilite.me/, type or paste your source code in the box on the left, select the language of the source code and the output style, then hit "Highlight!". You will get output HTML for displaying your source code, and many different languages are supported. For example, the screenshot below shows how to create pretty HTML for the Python code snippet above as it would be displayed in the vim text editor using vim's default color scheme. You even have the option to display line numbers in the HTML! Example screenshot:



 
 

Hilite.me is a useful web tool for quickly highlighting and styling source code of many types for output in HTML, markdown, LaTeX, and others. With a bit of research you will find that everything done with the tool can also be done on your own system using the Pygments Python module. This is especially useful in a scripted workflow or if you do not have internet access. If there is interest in how to use Pygments in a workflow it will be a future post. Cheers
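As a small taste of the Pygments route, here is a minimal sketch that turns a shortened version of the snippet above into highlighted HTML. It assumes Pygments is installed (for example with pip install Pygments). The "vim" style is chosen only to echo the color scheme mentioned earlier, and the variable names are arbitrary.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import PythonLexer

source = "def convert(x):\n    return(x*2.54)\n"

# noclasses=True writes the colors as inline style attributes, so the HTML
# can be pasted into a page without a separate stylesheet.
formatter = HtmlFormatter(style="vim", noclasses=True)
html = highlight(source, PythonLexer(), formatter)
print(html)
```

Swapping PythonLexer for another lexer, or HtmlFormatter for another formatter such as LatexFormatter, follows the same pattern.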