Programming, hacking, Linux and beyond

2012-04-26

Google command line tools: Google calendar task scheduler

The other day I stumbled upon something called googlecl, which bundles command-line clients for several Google services. I couldn't wait to test it out, so, combining it with an idea from the work of this guy, I decided to write a task scheduler controlled entirely by Google Calendar.

Since I need practice with bash scripting, and the job is simple enough, I decided to write it as a bash script.

It is dead simple to schedule a task. It can be run once or periodically.

Sorry for the image being in Swedish; the language setting affected Gmail but not Google Calendar. Anyway, the procedure is as follows.

Install the googlecl tools. Run them against your calendar once so that you complete the verification procedure (e.g. $ google calendar today).
Fire up google calendar and create a new task.
Enter a title for the task. The title must begin with gcmd for the scheduler daemon to pick it up.
Enter the start and finish times (currently the time range cannot span over several days).
In the "where" field, start with the flag -i followed by the interval with which to run the task (in seconds, 0 to only run it once).
Continue by adding -- which means "end of arguments".
Finish up by entering your command of choice last. Note that the initial run is scheduled using at, and the job may run with a different (often minimal) PATH than your interactive shell, especially when gcmd itself is started from cron. In other words, give full path names.
Run the gcmd script.
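
To make the format of the "where" field concrete, here is a minimal sketch of how such a field splits into an interval and a command. The field value is a made-up example, and the sketch uses the shell's getopts builtin rather than the external getopt program that the gcmd script calls, so treat it as an illustration of the format and not as the script's own parsing code.

```shell
# Hypothetical "where" field, exactly as it would be typed into the event.
where='-i 60 -- /usr/bin/touch /tmp/gcmd_demo'

# Deliberately unquoted: let the shell split the field into words.
set -- $where

interval=0        # default: run once
OPTIND=1
while getopts "i:" opt; do
  case "$opt" in
    i) interval=$OPTARG ;;
    *) echo "unrecognized option" >&2 ;;
  esac
done
shift $((OPTIND - 1))   # drop the parsed options and the "--" marker

command="$*"
echo "interval=$interval"
echo "command=$command"
```

Everything after the -- ends up in command untouched, which is why the command has to come last in the field.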

It is probably best to schedule the gcmd script to run every minute or so using crontab, so that it catches the updates when you add tasks at work, on the bus, in the car, at the Empire State Building; you get the point.

Let's look at the script. It's heavily commented, so I don't feel a need to explain anything further.

gcmd

#!/usr/bin/env bash

# 1. Get today's tasks from Google, keep only those whose title starts
#    with gcmd, and loop over them.
google calendar today --fields "title,where,when,id" \
  | awk -F, '$1 ~ /^gcmd/' \
  | while IFS= read -r line; do

  # 2. Default args. These are overridden by args read from the
  #    where field of the task.

  # Interval between executions, 0 to only run once.
  interval=0

  # 3. Get arguments from the where field and overwrite default values.
  set -- $(getopt i: "$(echo "$line" | awk -F, '{print $2}')")
  while [ $# -gt 0 ]; do
    case "$1" in
      (-i) interval=$2; shift;;
      (--) shift; break;; # Terminates the args
      (-*) echo "$0 error - unrecognized arg $1" 1>&2; exit 1;;
      (*) break;;
    esac
    shift
  done

  # 4. Get the unique ID and construct a filename of it. The id is displayed
  #    as a URL, ending with a unique id.
  arr=($(echo "$line"|awk -F, '{print $4}'|tr "/" " "))
  filename="/tmp/.gcmd_${arr[${#arr[@]} - 1]}"
  command="$*"

  # 5. If the job has already been scheduled, un-schedule the job and
  #    remove the file
  if [ -e "$filename" ]; then
    echo "File exists, re-scheduling"
    atrm $(cat "$filename")
  else
    echo "New job \"$(echo ${line}|awk -F, '{print $1}')\""
  fi

  # 6. Get the desired time to run/stop. This should be improved to account for
  #    the date as well. Currently we only parse the time, which isn't enough if
  #    we need to run the task for several days.
  time="$(echo ${line}|awk -F, '{print $3}'|awk -F- '{print $1}'|awk '{print $3}')"
  stoptime="$(echo ${line}|awk -F, '{print $3}'|awk -F- '{print $2}'|awk '{print $3}')"


  # 7. If an interval is specified, make the command run according to the interval specified.
  #    This can be improved since it doesn't take into account how long the job takes to run so
  #    the interval will be more like interval + time to run task.
  if [ $interval -ne 0 ]; then
    command="stoptime=$(date -d ${stoptime} +%s); gcmdfun() { ${command}; }; while [ \$(date +%s) -lt \$stoptime ]; do gcmdfun; sleep $interval; done;"
  fi

  # 8. Schedule the command to run using at and store the job number in a file (named by the unique id).
  echo "$command" | at "$time" 2>&1 | awk '/job/{print $2}' > "$filename"
done
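
The comment in step 7 notes that the real interval becomes "interval + time to run task". One possible way around that, sketched below under the assumption that whole-second resolution is good enough, is to keep track of when the next run is due and sleep only for the remainder. The function name run_every and the tick job are invented for this illustration; they are not part of gcmd.

```shell
# Run a command every $1 seconds until epoch time $2, measuring each
# interval from the start of the previous run instead of from its end.
run_every() {
  local interval=$1 stop=$2 next now
  shift 2
  next=$(date +%s)
  while [ "$(date +%s)" -lt "$stop" ]; do
    "$@"                          # the job itself
    next=$((next + interval))     # when the following run is due
    now=$(date +%s)
    if [ "$next" -gt "$now" ]; then
      sleep $((next - now))       # sleep only for what is left
    fi
  done
}

# Demo with a hypothetical job that appends one line per run.
log=$(mktemp)
tick() { echo tick >> "$log"; }
run_every 1 $(( $(date +%s) + 2 )) tick
ticks=$(( $(wc -l < "$log") ))
rm -f "$log"
echo "job ran $ticks times"
```

If a run takes longer than the interval, the loop simply starts the next run immediately instead of sleeping.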

It's late over here now. Enjoy and Good night!

2012-04-24

Cpulimit: take control over your CPU

Cpulimit is a very handy tool to have when doing data processing and other time-consuming tasks. It limits the CPU utilization of a process to a given percentage.

Imagine you have started a CPU-hogging process and need resources for other things:

$ cpulimit --limit=50 --pid=1234

That will attach to the process with PID 1234 and limit its CPU usage to 50 percent.

Cool!

Process substitution

Process substitution takes the form of <(list) or >(list). The process list is run with its input or output connected to a FIFO or some file in /dev/fd. The name of this file is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list. If the <(list) form is used, the file passed as an argument should be read to obtain the output of list. When available, process substitution is performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.

So what does this mean? Basically, you can use a command's input or output wherever a file name is expected. Imagine you want to compare the output of two commands using diff; you could do something like

$ diff <(echo afile) <(echo anotherfile)
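
A slightly more practical variant, as a small sketch with two throwaway files created just for the demonstration: compare two files while ignoring line order, without writing sorted copies to disk first.

```shell
# Two temporary files holding the same lines in a different order.
a=$(mktemp); b=$(mktemp)
printf 'pear\napple\n' > "$a"
printf 'apple\npear\n' > "$b"

# Each <(...) expands to a file name such as /dev/fd/63, so diff sees
# two "files" that contain the sorted output.
if diff <(sort "$a") <(sort "$b") > /dev/null; then
  same_when_sorted=yes
else
  same_when_sorted=no
fi
echo "identical after sorting: $same_when_sorted"
rm -f "$a" "$b"
```

Process substitution is a bash feature (also found in zsh and ksh), so this will not work in a plain POSIX sh.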

Now, the real power comes from using process substitution in conjunction with tee. Just look at the next command and be stunned by how useful this nifty little feature is.

$ ps aux | tee >(grep ^root > /tmp/rootps) >(grep ^simon > /tmp/simon)

This lists all processes on the terminal, writes the processes owned by root to one file and the processes owned by simon to another.

Convenient, huh?

5 Linux shell commands you should know about

Over the years I have gotten used to some very nifty command-line tools that I use more or less every day. Let's go through five of them right here.

This article will assume you have basic knowledge of the bash shell (good tutorial here http://mywiki.wooledge.org/BashGuide).

#1 tee

The `tee' command copies standard input to standard output and also to any files given as arguments. This is useful when you want not only to send some data down a pipe, but also to save a copy.

$ ./longrunningprocess | tee data.log

That is probably the simplest way you can use tee. Since it passes the stream on to stdout again you can use it as a proxy - storing the data to file but passing it on to the next command.

$ cat data | awk '{print $1+1}' | tee plusone | awk '{print $1-2}' > minusone

That line takes the file data, which contains one number per row, and passes it to awk, which adds one to each line. The modified stream goes to tee, which stores it in the file plusone and passes it on to a second awk, which subtracts two and writes the result to minusone.
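
Here is the same pipeline as a self-contained sketch you can paste into a shell. The three input numbers and the temporary directory are made up for the demonstration.

```shell
dir=$(mktemp -d)
printf '1\n2\n3\n' > "$dir/data"

# tee saves the intermediate stream and still passes it down the pipe.
awk '{print $1+1}' "$dir/data" | tee "$dir/plusone" | awk '{print $1-2}' > "$dir/minusone"

plusone=$(paste -sd ' ' "$dir/plusone")
minusone=$(paste -sd ' ' "$dir/minusone")
rm -rf "$dir"
echo "plusone:  $plusone"
echo "minusone: $minusone"
```

This prints 2 3 4 for plusone and 0 1 2 for minusone.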

Another more interesting way to do the same thing is

$ cat data | tee >(awk '{print $1+1}' > plusone) >(awk '{print $1-1}' > minusone)

which sends the contents of data to two process substitutions, each running its own awk script. Note that tee also copies the data to standard output.

#2 wget

Wget is one of those tools that I use the most. Combining it with tee can make it a powerful tool.

$ wget www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.4-rc4.tar.bz2 -O - | tee kernel.tar.bz2 | tar xjvf -

That will download the Linux kernel and have tee store the archive in a file while tar extracts it on the fly.

#3 awk

GNU awk is totally invaluable to me. I won't go in-depth on it here but look at my earlier post covering awk http://simonslinuxworld.blogspot.se/2012/04/awk-tutorial-by-example.html

#4 sed

Sed is a stream editor: pass a stream to it, tell it what to edit and how, then store the output or do something equally useful with it.

Let's say I know a guy, who knows a guy who once downloaded a season of a TV-show illegally. He told me how he needed to rename the files of the show to a specific naming convention for his XBMC media center to be able to download information about the show from some website. In this case (and many others) sed is needed.

$ ls | sed -r "s/(.+)_(.+)\.mkv/mv & \"Series (2012) \2.mkv\"/" | bash

Piece of cake: list the files, turn each filename series_S01EXX.mkv into the command mv series_S01EXX.mkv "Series (2012) S01EXX.mkv", and pass it to bash for evaluation.
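
Before piping generated commands into bash, it is worth looking at them first. The sketch below is a dry run on made-up filenames: it uses -E (the portable spelling of -r), anchors the pattern, prints only the lines that matched, and quotes the source name as well. These are my additions for the illustration, not part of the one-liner above.

```shell
# Hypothetical file listing; notes.txt should be left alone.
rename_cmds=$(printf '%s\n' series_S01E01.mkv series_S01E02.mkv notes.txt |
  sed -nE 's/^(.+)_(.+)\.mkv$/mv "&" "Series (2012) \2.mkv"/p')

echo "$rename_cmds"
first_cmd=$(printf '%s\n' "$rename_cmds" | head -n 1)
cmd_count=$(( $(printf '%s\n' "$rename_cmds" | wc -l) ))
```

Once the printed mv commands look right, the same sed expression can be fed from ls and piped to bash as before.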

If you want bittorrent to still find the original files just make symlinks instead of moving the file (that's what he did).

#5 xargs

xargs reads items from stdin, delimited by blanks or newlines, and executes the command one or more times with any initial arguments followed by items read from standard input. Blank lines on the standard input are ignored.

Example: Remove all files matching the pattern *~ (temporary emacs files) recursively.

$ find . -name "*~" -type f -print | xargs rm -f

If you need more control over how the items are inserted into the command to execute you can use {}, which gets substituted by the actual item when the command is executed. E.g.

$ find . -name "*~" -type f -print | xargs -I{} mv {} /tmp

-I{} tells xargs to use {} as the placeholder. With -I, xargs runs the command once per input line, so each mv gets exactly one source file. A separate -n 1 is not needed; GNU xargs treats -n and -I as mutually exclusive.
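
One caveat with the commands above is that filenames containing spaces get split into several items. A common remedy, sketched here on a throwaway directory with made-up files, is to let find separate names with NUL bytes (-print0) and have xargs read them the same way (-0).

```shell
work=$(mktemp -d)
mkdir "$work/src" "$work/backups"
touch "$work/src/notes.txt" "$work/src/notes.txt~" "$work/src/my draft.txt~"

# NUL-separated names survive spaces; {} is replaced by one name per run.
find "$work/src" -name '*~' -type f -print0 |
  xargs -0 -I{} mv {} "$work/backups"

moved=$(( $(find "$work/backups" -type f | wc -l) ))
kept=$(ls "$work/src")
rm -rf "$work"
echo "moved $moved backup files, kept: $kept"
```

Both -print0 and -0 are extensions, but they are available in the GNU and BSD versions of find and xargs.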

Final words

I hope you have enjoyed this little informative post about useful Linux commands. They have a lot of options, so I suggest you check out the manpages to make full use of them.

2012-04-13

AWK Tutorial By Example

Awk is a really nifty tool for filtering large chunks of text as well as for gathering statistics or substituting words within the text.

Unlike many other IT tutorials, I will keep this one short and concise and get to the most basic and interesting stuff right away with a short example.

Imagine you have a CSV file with statistical data about a group of people, one person per line, giving their IQ and their age, like so (the file could actually be thousands of lines long):

data.csv


123,34
119,23
100,56

Now what I would like to do is find the mean and the standard deviation of both IQ and age, and output them in a structured way. Seems like lots and lots of code? Not with awk!

We start off by making a simple script that calculates the mean values for both columns; look at the code below.

script.awk


BEGIN {
    sum_iq=0;
    sum_age=0; 
}
{
    sum_iq = sum_iq + $1;
    sum_age = sum_age + $2;
}
END {
    print "Mean IQ: " sum_iq/NR;
    print "Mean age: " sum_age/NR;
}

Execute the following to test the script:


$ cat data.csv | awk -F, -f script.awk

This pipes the contents of data.csv to awk, which is run with the field separator , (comma) and the script file script.awk.

To explain what is happening in the script, you have to know that awk is designed to process text line by line (e.g. text files). The script is executed once for each line in the file, and variables keep their values between lines. Awk scripts are organized in the form:


pattern{action}

When a pattern is matched, the action (the stuff between the curly braces) is executed. A pattern can be a special keyword like BEGIN or END as we use here, completely empty (matching every line) or a regular expression contained between two forward slashes.
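
As a small sketch of the two line-matching kinds of pattern, run on the three sample rows from data.csv (the shell variable names are just for this illustration):

```shell
data='123,34
119,23
100,56'

# Regular-expression pattern: the action runs only for lines starting with "12".
regex_hits=$(printf '%s\n' "$data" | awk -F, '/^12/ { print $1 }')

# Empty pattern: the first action runs for every line and counts them.
line_count=$(printf '%s\n' "$data" | awk '{ n++ } END { print n }')

echo "IQs matching /^12/: $regex_hits"
echo "lines seen: $line_count"
```

Only 123 starts with "12", so that is the single value printed by the first command; the second one counts all three lines.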

So what happens in our script is the following:

Before any input is read, the BEGIN pattern is matched. Within the block we set the variables sum_age and sum_iq to 0.
The first line is fed to the script and the second curly-brace block is matched (remember, an empty pattern matches every line). Inside the block we add the value from the first column to the sum_iq variable and the value from the second column to the sum_age variable.
The rest of the lines are matched in the same way and added to the sum variables.
After the last line has been processed, the END pattern is matched. Here we simply print the sums divided by NR, which is awk's built-in line (record) counter. Anywhere in the script it can be used as a variable telling us which line we are on; in the END block it tells us how many lines the file has.
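
A quick way to see when each block runs is to print NR from all three of them. This is only a tracing sketch on the sample data, not part of the tutorial script:

```shell
trace=$(printf '123,34\n119,23\n100,56\n' | awk -F, '
  BEGIN { print "BEGIN NR=" NR }
        { print "line " NR " iq=" $1 }
  END   { print "END NR=" NR }')
echo "$trace"
```

BEGIN reports NR=0 because no input has been read yet, the middle block reports 1, 2 and 3, and END reports NR=3, the total number of lines.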

Just to show off we can add a standard deviation calculation like this.


BEGIN {
    sum_iq=0;
    sum_age=0; 
}
{
    sum_iq = sum_iq + $1;
    sum_age = sum_age + $2;
    iq[NR] = $1;
    age[NR] = $2;
}
END {
    mean_iq = sum_iq/NR;
    mean_age = sum_age/NR;
    stdd_iq = 0;
    stdd_age = 0;
    for(i=1; i<=NR; i++) {
        stdd_iq += (iq[i] - mean_iq)**2;
        stdd_age += (age[i] - mean_age)**2;
    }
    stdd_iq = sqrt(stdd_iq/(NR-1));
    stdd_age = sqrt(stdd_age/(NR-1));
    print "Mean IQ: " mean_iq " (" stdd_iq ")";
    print "Mean age: " mean_age " (" stdd_age ")";
}

A simple estimate of the standard deviation of an observed sample is

$$ s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2} $$

where N is the number of rows and \bar{x} is the mean. This means that we need to use arrays to store the values of all rows. Then we take each value, subtract the mean, square it, sum everything, divide by the number of rows (the NR variable) minus one, and finally take the square root of the result.

You should note that there is no checking for blank or invalid lines in this code; that's an exercise for the reader.
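
One possible starting point for that exercise, as a sketch only: accept just the lines that look like "number,number", keep a separate counter for them instead of relying on NR, and count everything else as skipped. The input below is made up, and the sketch computes the means only.

```shell
report=$(printf '123,34\n\n119,23  \nabc,12\n100,56\n' | awk -F, '
  { gsub(/[ \t]/, "") }          # strip stray blanks; awk re-splits the fields
  /^[0-9]+,[0-9]+$/ {            # accept only "number,number" lines
    n++; sum_iq += $1; sum_age += $2
    next
  }
  { skipped++ }                  # blank or malformed lines end up here
  END {
    if (n > 0)
      printf "%.2f %.2f skipped=%d\n", sum_iq / n, sum_age / n, skipped
  }')
echo "$report"
```

With this input the blank line and the abc,12 line are skipped, and the means are taken over the three valid rows.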

Make sure to check out this awesome and much more in-depth tutorial on awk: http://www.grymoire.com/Unix/Awk.html

For some really cool examples and ideas, this blog is number one:
http://www.thegeekstuff.com/tag/awk-tutorial-examples/

I hope you enjoyed this little tutorial. Feel free to post some links to my brand new blog all over the internet ;)