Chapter 6

Unix Commands and Utilities

0. Table of Contents

        1. Basic Commands
        2. Basic Operators
        3. Regular Expressions
        4. The 'awk' Utility
        5. The 'grep' Utility   
        6. The 'sed' Utility
        7. Additional Commands


1. Basic Commands

    'chdir'     change your current working directory (same as 'cd')
    'chgrp'     change the group of files or directories
    'chmod'     change the permissions of files or directories
    'chown'     change the ownership of files or directories
    'cp'        copy a file 
    'curl'      a simple HTTP client that fetches files
    'date'      print the current date in various formats
    'echo'      print expanded arguments ending with a newline
    'expand'    convert tabs to spaces (unexpand - vice versa)
    'ls'        list directory contents
    'mkdir'     create directories
    'mv'        move or rename files and directories
    'printf'    print formatted output; doesn't automatically add newline
    'pwd'       print the current working directory
    'sleep'     sleep for a specified number of seconds
    'touch'     create files or change modification time of existing ones


2. Basic Operators

    '|'         pipe output of one command to input of another
                % ls | wc -l
    '>'         redirect the standard output to a new file
                % ls > file.txt
    '<'         direct contents of a file to the standard input
                % grep html < file.html
    '>>'        append to the end of an existing file
                % echo 'one more line' >> lines.txt
    '&'         run a command in the background
                % find ~ -name "*.jpg" -print > file.txt &
    ';'         perform commands in sequence 
                % sleep 5 ; echo 1 ; ls
                

    Note that the following three commands produce the same output:

    % grep html talking.html
    % grep html < talking.html
    % cat talking.html | grep html
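
    The operators above can be combined in a short session.  What
    follows is a minimal sketch in POSIX 'sh' syntax (an assumption;
    it should also run under bash).  The scratch directory and the
    file name 'lines.txt' are only for illustration, and 'mktemp' is
    assumed to be installed.

```shell
# Work in a throwaway directory so nothing of yours is overwritten.
dir=$(mktemp -d)
cd "$dir" || exit 1

echo 'first line' > lines.txt             # '>' creates (or truncates) the file
echo 'one more line' >> lines.txt         # '>>' appends to it
count=$(wc -l < lines.txt)                # '<' feeds the file to standard input
count=$((count + 0))                      # drop padding that some 'wc' versions add
matches=$(cat lines.txt | grep -c line)   # '|' pipes one command into another
echo "$count lines, $matches matching" ; ls   # ';' runs commands in sequence
```

    The last line prints '2 lines, 2 matching' followed by the
    directory listing, which contains only 'lines.txt'.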


3. Regular Expressions

    Here is a summary of some of the most common regular expression
    syntax available for use in 'sed' and 'grep' commands:

        The meta characters '.', '^', '$', '[', and ']' are treated
        specially in regular expressions.

        .               matches any single character
        ^               anchors the match at the beginning of the line
        $               anchors the match at the end of the line

        The escape character backslash '\' causes any meta characters
        to be treated literally.
 
        \.              matches a period
        \$              matches a dollar sign
        \^              matches a caret
        
        A list of characters enclosed by [ and ] matches any single
        character in that list; if the first character of the list is
        the caret ^ then it matches any character not in the list.
        Most meta characters lose their special meaning when enclosed
        in a character list.

        [.]             matches a period
        [a-zA-Z0-9]     matches one alphanumeric character 
        [^a-zA-Z]       matches one non-alphabetic character

        In the following, let RE correspond to any regular expression.

        RE*             matches the regular expression zero or more times
        RE\{n\}         matches the regular expression exactly n times
        RE\{n,\}        matches the regular expression at least n times 
        RE\{n,m\}       matches the regular expression at least n times 
                        and no more than m times
        \(RE\)          matches the regular expression and saves the 
                        string of matched characters in the replacement 
                        variables \1, \2, etc. so that the first \( \) 
                        pairing is saved in the variable \1, the second 
                        pairing in \2 and so on.
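
        The repetition and grouping syntax above can be tried out
        with 'grep' and 'sed'.  This is a minimal sketch in POSIX
        'sh' syntax; the three sample lines are made up for the
        illustration.

```shell
# Hypothetical sample input: a date, a short number, and a word.
sample='2024-01-15
42
hello'

# \{4\} requires exactly four digits at the start of the line.
printf '%s\n' "$sample" | grep '^[0-9]\{4\}-'

# \{1,2\} allows one or two digits; the anchors make that the whole line.
printf '%s\n' "$sample" | grep '^[0-9]\{1,2\}$'

# \( \) saves each group, so \3, \2, \1 turn year-month-day into day/month/year.
# -n with the trailing 'p' prints only the lines where a substitution happened.
printf '%s\n' "$sample" |
    sed -n 's/^\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)$/\3\/\2\/\1/p'
```

        The three commands print '2024-01-15', '42' and '15/01/2024'
        respectively.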

        Note that different commands, e.g., 'sed' and 'grep', employ
        slightly different regular expression syntax.  This is the
        source of much confusion, and so, whenever possible, I try to
        use generic syntax that works with any command. 

        Entering a tab character in a 'sed' regular expression is a 
        little tricky. Under bash or tcsh use CTRL-v CTRL-i or CTRL-q 
        CTRL-i.  For additional tips on using 'sed' see the document
        'Handy one-line sed programs' in ./oneliners.sed. 
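
        In a script, one way to avoid typing a literal tab is to let
        'printf' produce it and keep it in a shell variable.  A
        minimal sketch in POSIX 'sh' syntax; the variable name 'tab'
        is my own choice.

```shell
# Let printf generate the tab so that no literal tab has to be typed.
tab=$(printf '\t')

# Replace every tab with a semicolon.  The double quotes let the shell
# expand $tab before 'sed' sees the expression.
printf 'a\tb\tc\n' | sed "s/$tab/;/g"
```

        This prints 'a;b;c'.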

4. The 'awk' Utility

    This is the grandfather of computational Swiss army knives.  For
    many purposes, other commands have usurped 'awk's dominant
    role, but there are certain idiomatic usages that come in handy.
    The 'awk' utility, like 'sort', 'cut' and 'join', allows you to
    specify fields in the strings corresponding to the lines in the
    input.  Fields are separated by whitespace (the default)
    or an alternative delimiter specified by the -F option.  Here are
    some of the ways I've made use of 'awk' in various scripts:

    % cat file
    1;Fred;Felicity
    7;Alice;Allegory
    4;Sally;Salacious

    % cat file | awk 'BEGIN { srand() } { print rand(), $0 }'
    0.041960 1;Fred;Felicity
    0.223255 7;Alice;Allegory
    0.252055 4;Sally;Salacious

    Seed the random number generator from the current time (that's
    the 'BEGIN { srand() }' part) and then print each line ($0) in the
    standard input preceded by a random number between 0 and 1.  The
    parentheses are needed for the command to run under every common
    version of 'awk'.

    % cat file | awk 'BEGIN { s = 0} { s += $1 } END { print s }'
    12

    Initializes 's' to 0 (not strictly necessary), adds up the
    values in the first field and prints out the resulting sum.
    With the default delimiter the first field is the entire line
    (e.g., '1;Fred;Felicity'), but 'awk' uses the leading digits when
    it converts the field to a number, so it isn't necessary to
    specify an alternative delimiter.

    % cat file | awk -F ";" '{ print $3 ", " $2 }'
    Felicity, Fred
    Allegory, Alice
    Salacious, Sally

    Using the semicolon as a delimiter, print out the third field, a 
    comma, space and then the second field for each line of the input.
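
    Two further idioms in the same spirit are selecting lines with a
    pattern and using the built-in line counter NR.  This is a
    minimal sketch in POSIX 'sh' syntax that recreates the sample
    file in a scratch directory ('mktemp' is assumed to be installed).

```shell
# Recreate the sample file from the text in a scratch directory.
dir=$(mktemp -d)
printf '1;Fred;Felicity\n7;Alice;Allegory\n4;Sally;Salacious\n' > "$dir/file"

# A pattern in front of the action selects lines: here, those whose
# first field is greater than 3.
awk -F ';' '$1 > 3 { print $2 }' "$dir/file"

# NR counts the lines read so far, so in END it holds the total and
# s / NR is the average of the first field.
awk -F ';' '{ s += $1 } END { print s / NR }' "$dir/file"
```

    The first command prints 'Alice' and 'Sally'; the second prints 4.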


5. The 'grep' Utility

    The 'grep' utility is useful for searching the content of files
    (or the standard input). 'egrep' ('extended' grep) has a more
    powerful regular expression language (check out 'info grep') and
    requires fewer escape ('\') characters in regular expressions.

    % ls 
    sara.jpg
    artemis.txt
    michelle.jpg
    vanessia.gif
    damien.txt

    % ls | egrep "[a-zA-Z]*[.][gG][iI][fF]|[a-zA-Z]*[.][jJ][pP][gG]"
    sara.jpg 
    michelle.jpg 
    vanessia.gif 

    Match file names with an alphabetic base and an extension
    corresponding to a GIF or JPEG image file.  Basic regular
    expression syntax is presented elsewhere, but note that a period
    appearing inside a character list is treated literally (i.e.,
    it only matches a period) whereas a period appearing elsewhere
    (unless preceded by an escape) matches any character.  Note
    that the disjunctive operator (if 'RE1' and 'RE2' are regular
    expressions then 'RE1|RE2' matches 'RE1' or 'RE2') is available
    in 'egrep' but not in the basic regular expressions of 'sed'.
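
    The same search can be written more compactly.  The following is
    a minimal sketch in POSIX 'sh' syntax, using 'grep -E' (the
    standard spelling of 'egrep') and feeding in the file names from
    the listing above rather than running 'ls'.

```shell
# The file names from the listing above, one per line.
names='sara.jpg
artemis.txt
michelle.jpg
vanessia.gif
damien.txt'

# -E selects the extended syntax and -i ignores case.  The anchors ^ and $
# force the whole name to match, and the parentheses group the alternatives.
printf '%s\n' "$names" | grep -E -i '^[a-z]+[.](gif|jpg)$'
```

    This prints 'sara.jpg', 'michelle.jpg' and 'vanessia.gif'.
    Because of the anchors, a name such as 'photo.jpg.bak' would not
    match.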


6. The 'sed' Utility

    The 'sed' (for stream editor) utility, like 'awk', has its own
    powerful scripting language, but the substitution command
    accounts for the lion's share of the use of 'sed' in writing shell
    scripts.  The 'sed' command uses the syntax of so-called 'basic
    regular expressions' (see 'info regex'), which are similar to, but
    not exactly the same as, 'grep' regular expressions.  Here's the
    prototypical use of the 'sed' utility:

    sed 's/PATTERN/REPLACEMENT/g'

    The 'sed' utility can also be used with the -f (for 'file') option
    to specify a 'sed' program consisting of individual 'sed' commands
    with one such command to a line as in:

    % sed -f file

    where 'file' is a file of 'sed' commands.  For example, 

    % cat file
    s/Fred/Mary/g
    s/Sally/Bill/g
    ...
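
    Here is the whole round trip as a minimal sketch in POSIX 'sh'
    syntax.  The program file name 'names.sed' and the input line are
    made up for the illustration.

```shell
dir=$(mktemp -d)

# Write a two-command 'sed' program, one command per line.
printf 's/Fred/Mary/g\ns/Sally/Bill/g\n' > "$dir/names.sed"

# Apply every command in the program, in order, to each input line.
printf 'Fred met Sally\n' | sed -f "$dir/names.sed"
```

    This prints 'Mary met Bill'.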

   Here are some examples illustrating the 'sed' substitution command:

        % set input = "Nathan Sequitur" 
        % echo $input | sed 's/\([a-zA-Z]*\)[ ]*\([a-zA-Z]*\)/\2, \1/g'
        Sequitur, Nathan

        Note that the following doesn't work as one might expect:

        % set string = '<tr align="right">$2,359</tr>'
        % echo $string | sed 's/<tr.*>/ @BEGIN_ROW@ /g'
         @BEGIN_ROW@ 

        This is because 'sed' matches 'greedily', that is to say a 
        regular expression like '.*' gobbles up as many characters
        as it possibly can and still succeed in finding a match. 

        Here's a fix using the 'complement' operator '^':

        % echo $string | sed 's/<tr[^>]*>/ @BEGIN_ROW@ /g'
         @BEGIN_ROW@ $2,359</tr>


7. Additional Commands

   The 'comm' command:    compare two files 

      % comm -23 one.txt two.txt

      List the lines that are in one.txt but not in two.txt.  The
      'comm' command relies on the two files being sorted.
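
      A minimal sketch in POSIX 'sh' syntax with made-up file
      contents; note that both files are sorted before 'comm' sees
      them.

```shell
dir=$(mktemp -d)
printf 'pear\napple\nfig\n' | sort > "$dir/one.txt"
printf 'fig\nkiwi\napple\n' | sort > "$dir/two.txt"

# -23 suppresses columns 2 and 3, leaving the lines unique to the first file.
comm -23 "$dir/one.txt" "$dir/two.txt"

# -12 suppresses columns 1 and 2, leaving the lines common to both files.
comm -12 "$dir/one.txt" "$dir/two.txt"
```

      The first command prints 'pear'; the second prints 'apple' and
      'fig'.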

   The 'cut' command:     extract specified fields in a file

      % cat artist
      1;Bill;Frisell;1951-03-18;Baltimore, Maryland
      2;Bonnie;Raitt;1949-11-08;Burbank, California
      3;Melvin;Taylor;1959-03-13;Jackson, Mississippi
      4;Robert;Cray;1953-08-01;Columbus, Georgia
      5;Keith;Jarrett;1945-05-08;Allentown, Pennsylvania
      6;Sue;Foley;1968-03-29;Ottawa, Canada

      % cat artist | cut -f 3,4 -d ";"
      Frisell;1951-03-18
      Raitt;1949-11-08
      Taylor;1959-03-13
      Cray;1953-08-01
      Jarrett;1945-05-08
      Foley;1968-03-29

      Extract fields 3 and 4 using the semicolon as a field delimiter.
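
      'cut' can also select by character position and by field
      range.  A minimal sketch in POSIX 'sh' syntax, using the first
      two lines of the 'artist' file above.

```shell
dir=$(mktemp -d)
printf '%s\n' \
    '1;Bill;Frisell;1951-03-18;Baltimore, Maryland' \
    '2;Bonnie;Raitt;1949-11-08;Burbank, California' > "$dir/artist"

# Take field 4 (the date), then keep only characters 1-4 (the year).
cut -d ';' -f 4 "$dir/artist" | cut -c 1-4

# A range selects adjacent fields: 2-3 is the first and last name.
cut -d ';' -f 2-3 "$dir/artist"
```

      The first pipeline prints 1951 and 1949; the second prints
      'Bill;Frisell' and 'Bonnie;Raitt'.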

   The 'find' command:    search the file system for specified files
                        
      % find ~ -name "*[a-z]*.???" -print 

      Find and print any file in the directory tree rooted in my home
      directory whose name contains at least one lowercase letter and
      ends in a period followed by exactly three characters.

      % find . -name "*.jpg" -size +8 -exec /bin/rm {} \;

      Find and delete every file in the directory tree rooted in my
      current working directory with the extension 'jpg' whose size
      exceeds (the '+') 8 * 512 bytes.
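
      Since the deletion cannot be undone, it is worth running the
      same tests with '-print' first.  This is a minimal sketch in
      POSIX 'sh' syntax that works on two made-up files in a scratch
      directory; it assumes 'mktemp', 'dd' and '/dev/zero' are
      available.

```shell
dir=$(mktemp -d)

# One empty file and one of 8192 bytes (16 blocks of 512 bytes).
: > "$dir/small.jpg"
dd if=/dev/zero of="$dir/big.jpg" bs=512 count=16 2>/dev/null

# Dry run: -print lists exactly what the -exec form would delete.
find "$dir" -name '*.jpg' -size +8 -print

# The same tests with the deletion attached.
find "$dir" -name '*.jpg' -size +8 -exec rm {} \;
ls "$dir"
```

      The dry run lists only 'big.jpg', and after the second 'find'
      the listing shows only 'small.jpg'.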

   The 'join' command:   join two files using a specified field

      See the exercise 'Working With Databases' on the book web page. 

   The 'paste' command:   combine the lines in two files side-by-side

      % cat letters
      a
      b
      c
      % cat numbers
      1
      2
      3
      % paste numbers letters         
      1       a               
      2       b               
      3       c               

      Paste the two files side-by-side using the default delimiter tab.

      % paste -d ";" numbers letters
      1;a                             
      2;b                             
      3;c                           

      Paste the two files side-by-side using the semicolon as delimiter.
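
      'paste' can also join the lines of a single file into one line
      with the -s option.  A minimal sketch in POSIX 'sh' syntax,
      recreating the 'numbers' file in a scratch directory.

```shell
dir=$(mktemp -d)
printf '1\n2\n3\n' > "$dir/numbers"

# -s pastes the lines of one file serially, separated by the delimiter.
paste -s -d ',' "$dir/numbers"
```

      This prints '1,2,3'.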

   The 'repeat' command:  repeat n times the specified (simple) command

      % repeat 3 echo 1
      1
      1
      1

      This turns out to be very useful for all sorts of scripting
      tricks.  For example, suppose that you want to initialize 
      an array (list) of a specified length to contain all zeros.

      % set n = 16
      % set array = ( `repeat $n echo 0` )
      % echo $#array
      16
      % echo $array[7]
      0
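
      'repeat' is built into csh and tcsh but is not available in
      every shell.  In a POSIX 'sh' script a counting loop does the
      same job.  This is a minimal sketch; the variable names are my
      own, and the positional parameters stand in for the array.

```shell
n=16
zeros=''
i=0

# Loop n times, appending one 0 per pass (a portable stand-in for 'repeat').
while [ "$i" -lt "$n" ]; do
    zeros="$zeros 0"
    i=$((i + 1))
done

# Load the words into the positional parameters, the only list that
# plain 'sh' offers.  $zeros is left unquoted on purpose so it splits.
set -- $zeros
count=$#
seventh=$7
echo "$count"
echo "$seventh"
```

      This prints 16 and then 0, matching the tcsh session above.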

   The 'sort' command:    sort by lines or fields

      % cat file | sort -rn 

      Sort the file in reverse numeric order. The 
      default is to sort in lexicographic order.
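
      To sort by fields rather than whole lines, name a delimiter
      with -t and a key with -k.  A minimal sketch in POSIX 'sh'
      syntax, reusing the sample lines from the 'awk' section.

```shell
data='1;Fred;Felicity
7;Alice;Allegory
4;Sally;Salacious'

# Sort on the second semicolon-separated field only (-k 2,2).
printf '%s\n' "$data" | sort -t ';' -k 2,2

# Sort numerically on the first field, largest first.
printf '%s\n' "$data" | sort -t ';' -k 1,1nr
```

      The first command orders the lines Alice, Fred, Sally; the
      second orders them 7, 4, 1.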

   The 'tr' command:      translate characters

      % cat file | tr "A-Z" "a-z"

      Convert all uppercase letters to lowercase.

      % cat file | tr -dc "a-z \n"

      Delete all characters other than spaces, line feeds 
      and lowercase alphabetic characters.

   The 'uniq' command:    count or remove consecutive duplicate lines

      % sort file | uniq -c 

      Count how many times each distinct line appears in a file.
      Sorting first makes the duplicate lines consecutive.
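
      'tr', 'sort' and 'uniq' combine into a word-frequency count.
      A minimal sketch in POSIX 'sh' syntax; the sample sentence is
      made up, and 'head' (not covered above) is used only to trim
      the output.

```shell
text='the cat and the dog and the bird'

# Put each word on its own line, bring identical words together with
# 'sort', count each group with 'uniq -c', then list the largest counts
# first and keep the top two.
printf '%s\n' "$text" | tr ' ' '\n' | sort | uniq -c | sort -rn | head -n 2
```

      The two lines printed show a count of 3 for 'the' and 2 for
      'and'; the amount of padding in front of each count depends on
      the version of 'uniq'.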