Software Development : Perl part II

by Guido Socher

About the author:

Guido is a long time Linux fan and Perl hacker. These days he is also very busy renovating the house and planting salad and other stuff in the garden.

Content:

A framework for your program
Using the template
If-statements
Variables
Subroutines
A real program
Hash Tables
What is next?

Perl part II

Abstract:

Perl part I provided a general overview of Perl. In Perl part II we are now going to write our first useful program.

A framework for your program

Perl is best used to write small programs, specialized for one task. To speed up the development process it is a good idea to have a framework at hand which offers some basic structure and functionality that you would like to have in most programs. The following code template offers basic command line option reading and already has a subroutine to print a help message.

#!/usr/bin/perl -w
# vim: set sw=8 ts=8 si et:
#
# uncomment strict to make the perl compiler very
# strict about declarations:
#use strict;
# global variables:
use vars qw($opt_h);
use Getopt::Std;
#
&getopts("h")||die "ERROR: No such option. -h for help\n";
&help if ($opt_h);
#
#>>your code<<
#
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
sub help{
print "help message\n";
exit;
}
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
__END__

Let's look at the code. &getopts() calls a subroutine in the library Getopt::Std to read the command line options. It sets global variables named $opt_<option> according to the options provided on the command line. All options on the command line start with a "-" (minus sign) and must come after the program name and before any other arguments (Note: this is a general Unix rule). The string given to &getopts (the "h" in the above program) lists all option letters that are allowed. If an option takes an argument then a colon must be written after the option letter. &getopts("d:x:h") says that this program has the options -d, -x and -h. Options -d and -x take an argument, so "-d something" would be valid but "-d -x foo" is a usage error: the -d is not followed by its argument, and getopts would simply take "-x" as the argument of -d.
If the option -h is given on the command line then the variable $opt_h is set, and &help if ($opt_h); therefore calls the subroutine help. The statement sub help{ declares the subroutine. It is not important for the moment that you understand every detail of the code. Just take it as a template to which you add your main functionality.
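
To see what getopts does with a concrete command line, the following minimal sketch fills @ARGV by hand instead of reading real arguments. The option values ("1A", "file.txt") are made up for illustration; only getopts and the $opt_ variables come from the template above.

```perl
#!/usr/bin/perl -w
use strict;
use vars qw($opt_d $opt_x $opt_h);
use Getopt::Std;

# Pretend the program was started as: prog -d 1A -h file.txt
@ARGV = ("-d", "1A", "-h", "file.txt");

&getopts("d:x:h") || die "ERROR: No such option. -h for help\n";

print "-d was given with argument $opt_d\n";
print "-h was given\n" if ($opt_h);
print "-x was not given\n" unless (defined($opt_x));
# getopts removes the options it has processed from @ARGV:
print "remaining arguments: @ARGV\n";
```

Note that after the call only the non-option arguments are left in @ARGV, which is why options have to come first on the command line.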

Using the template

Let's write a little number converter which makes use of this framework. The program, let's call it numconv, should convert hex to decimal numbers and vice versa.
numconv -x 30 should print the hex equivalent of decimal 30.
numconv -d 1A should print the decimal equivalent of hex 1A.
numconv -h should print a help text.
The perl function hex() converts hex numbers into decimal and the function printf() can be used to convert decimal into hex. Inserting this into our template gives us a nice program:

#!/usr/bin/perl -w
# vim: set sw=8 ts=8 si et:
#
# uncomment strict to make the perl compiler very
# strict about declarations:
#use strict;
# global variables:
use vars qw($opt_d $opt_x $opt_h);
use Getopt::Std;
#
&getopts("d:x:h")||die "ERROR: No such option. -h for help\n";
&help if ($opt_h);
if ($opt_d && $opt_x){
    die "ERROR: options -x and -d are mutually exclusive.\n";
}
if ($opt_d){
    printf("decimal: %d\n",hex($opt_d));
}elsif ($opt_x){
    printf("hex: %X\n",$opt_x);
}else{
    # wrong usage -d or -x must be given:
    &help;
}
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
sub help{
    print "convert a number to hex or dec.
USAGE: numconv [-h] -d hexnum
       numconv [-h] -x decnum

OPTIONS: -h this help
EXAMPLE: numconv -d 1af
\n";
    exit;
}
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
__END__

Click here to download the numconv program code shown above.
In the following paragraphs we will look at this program a bit closer and try to understand it.
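
The two conversions at the heart of numconv can be tried in isolation. This is a small sketch that uses sprintf() (which is not used in the article; it formats like printf() but returns the string) so that the results can be stored in variables:

```perl
#!/usr/bin/perl -w
use strict;

# hex() interprets a string as a hexadecimal number:
my $decimal = hex("1A");
# sprintf() formats like printf() but returns the string
# instead of printing it; %X means upper case hex digits:
my $hexstring = sprintf("%X", 30);

print "hex 1A is decimal $decimal\n";
print "decimal 30 is hex $hexstring\n";
```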

If-statements

The if-statement in perl comes in 2 forms:
expr if (cond);
or
if (cond) BLOCK [[elsif (cond) BLOCK ...] else BLOCK]

BLOCK is a number of statements enclosed in curly braces {}. This means that you can write e.g:

printf("hello\n") if ($i);

if ($i == 2){
   printf("i is 2\n");
}elsif ($i == 4){
   printf("i is 4\n");
}else{
   printf("i is neither 2 nor 4\n");
}

Like in C it is also possible to use the short-circuit operators && and ||.
printf("hello\n") if ($i);
can therefore also be written as
($i) && printf("hello\n");
Especially the || as used in our template translates quite well into spoken language.
&getopts("d:x:h")||die "ERROR\n";
"Get the options or die". The function die() is basically equivalent to a printf followed by exit: it prints a message (to standard error) and terminates the program with a non-zero exit status.
&getopts("d:x:h")||die "ERROR\n";
is equivalent to
die "ERROR\n" if (! &getopts("d:x:h"));
where the ! is a logical not operator. Again this can also be written as
die "ERROR\n" unless (&getopts("d:x:h"));
unless is the same as if-not and is nicer to read than if(!..)

As you can see, there is more than one way of writing an if-statement in perl. You don't have to use them all. Use the one you feel most comfortable with.
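
The following sketch runs the same test in four of the forms discussed above. The counter variable $count is only there for illustration; each form increments it once when $i is true:

```perl
#!/usr/bin/perl -w
use strict;
my $i=2;
my $count=0;

# statement followed by if:
$count++ if ($i);
# short-circuit &&: the right side runs only if the left side is true
($i) && $count++;
# unless is if-not, so "unless not $i" is the same as "if $i":
$count++ unless (!$i);
# the block form:
if ($i){
    $count++;
}
print "all four forms ran: count is $count\n";
```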

Variables

In the first perl article we saw that scalar variables (the $-variables) were used without declaring them. They come into existence the very moment they are used. This is a nice feature for small programs but it can lead to errors which are difficult to find in larger programs. Declaring a variable gives the compiler the possibility to do some extra checks for typing errors.

"use strict;" forces you to declare everything.

Consider the following correct code example:

#!/usr/bin/perl
use strict;
my $i=1;
print "i is $i\n";

This program is correct and produces "i is 1". Now assume that we type by mistake j instead of i:

#!/usr/bin/perl
#
$i=1;
print "i is $j\n";

This code runs fine in perl and produces "i is ". The perl pragma "use strict;" forces the compiler to complain about such a program. When you use "strict", everything must be declared, otherwise an error message is returned.

#!/usr/bin/perl
use strict;
my $i=1;
print "i is $j\n";

This causes the following message and makes it easy to spot the typing error.

Global symbol "$j" requires explicit package name at ./vardec line 4.
Execution of ./vardec aborted due to compilation errors.
Exit 255

Variables can be declared in perl by using "my" or, as we already saw in the framework, "use vars qw()":
use vars qw($opt_h);

Global variables are declared with use vars. These variables are global even across all included libraries.
Variables local to the current program file (global among all subroutines in this file) are declared with my at the beginning of the program (outside a subroutine).
Variables local to the current subroutine are declared with my inside the subroutine.

People experienced in shell programming might be tempted to leave out the $-sign when declaring the variable or assigning it a value. This is not possible in perl. You always write a $-sign when you use a scalar variable, no matter what you do with it.

You can also directly assign a value to the variable when you declare it. my $myvar=10; declares the variable $myvar and sets its initial value to 10.
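
The difference between a my declared at the beginning of the file and a my declared inside a subroutine can be seen in this small sketch. The names $counter and inner are made up for the example:

```perl
#!/usr/bin/perl -w
use strict;
my $counter=10;   # visible in the whole file

sub inner{
    # this "my" creates a new variable that hides the outer
    # $counter until the end of the subroutine:
    my $counter=99;
    "inside the subroutine: $counter";
}

my $inside=&inner();
print "$inside\n";
# the outer variable was never touched:
print "outside the subroutine: $counter\n";
```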

Subroutines

We have already used the "help" subroutine in the numconv program above. Subroutines can be used to program your own functions. They help to structure your program.
A subroutine can be inserted at any place in the program text (before or after it is called; it does not matter). You start a subroutine with sub name{... and you call it with $retval=&name(...arguments...). The return value is the value of the last executed statement in the subroutine. The arguments given to the subroutine are passed to the code inside the subroutine in the special array @_. We will look at this in more detail when we talk about arrays in Perl III. For the moment it is enough to know that the arguments can be read one after the other inside the subroutine using shift. Here is an example:

#!/usr/bin/perl
use strict;
my $result;
my $b;
my $a;
$result=&add_and_duplicate(2,3);
print "2*(2+3) is $result\n";

$b=5;$a=10;
$result=&add_and_duplicate($a,$b);
print "2*($a+$b) is $result\n";

# add two numbers and multiply with 2:
sub add_and_duplicate{
    my $locala=shift;
    my $localb=shift;
    ($localb+$locala)*2;
}
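
Perl also has an explicit return statement, which the example above does not use. As a small sketch (the subroutine names are made up), the two subroutines below give the same result; which style to prefer is a matter of taste:

```perl
#!/usr/bin/perl -w
use strict;

# returns the value of the last executed statement:
sub implicit_double{
    my $n=shift;
    $n*2;
}
# the same with an explicit return statement:
sub explicit_double{
    my $n=shift;
    return $n*2;
}

my $first=&implicit_double(21);
my $second=&explicit_double(21);
print "implicit: $first, explicit: $second\n";
```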

A real program

Now that we have covered a lot of perl syntax and elements of the language, it is time to write a real program.
Perl was designed to manipulate text files with very little programming effort. Our first Perl program should read a list of abbreviations and find the duplicates in that list. By duplicates we mean abbreviations that appear several times in the list. The list looks as follows:

AC Access Class
AC Air Conditioning
AFC Automatic Frequency Control
AFS Andrew File System
...

You can download the list here. The syntax of this file is:

one abbreviation per line
abbreviation and meaning are on the same line separated by space
the first word on the line is the abbreviation
the first word starts at the beginning of the line

How to read such a text file? Here is some perl code to read text line by line:

....
open(FD,"abb.txt")||die "ERROR: can not read file abb.txt\n";
while(<FD>){
#do something
}
close FD;
....

The open function takes a file descriptor (Perl's own documentation calls it a filehandle) as first argument and the name of the file to read as second argument. File descriptors are a special kind of variable. You put one in the open function, you use it in the function that reads the data from the file and finally you give it to the close function. Reading the file is done with <FD>. The <FD> can be used as the condition of a while loop, which then reads the file line by line.
Traditionally file descriptors are written in all upper case letters in Perl.

Where does our data go? Perl has a number of implicit variables. These are variables which you did not declare. They are always there. One such variable is $_. This variable holds the line which is currently read inside the above while loop. Let's try it (download the code):

#!/usr/bin/perl
use strict;
my $i=0;
open(FD,"abb.txt")||die "ERROR: can not read file abb.txt\n";
while(<FD>){
   # increment the line counter. You probably
   # know the ++ from C:
   $i++;
   print "Line $i is $_";
}
close FD;

As you can see we did NOT write print "Line $i is $_ \n". The $_ variable holds the current line from the text file including the newline character (\n).
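
If the newline at the end of $_ is not wanted, it can be removed with chomp, a function not used in the article. The sketch below is self-contained: it first writes a two-line sample file (the name abb_demo.txt is hypothetical), reads it back line by line and deletes it again:

```perl
#!/usr/bin/perl -w
use strict;
my $file="abb_demo.txt";   # hypothetical sample file
my $i=0;
my $last="";

# create a small sample file (">" opens a file for writing):
open(OUT,">$file")||die "ERROR: can not write file $file\n";
print OUT "AC Access Class\n";
print OUT "AFC Automatic Frequency Control\n";
close OUT;

open(FD,$file)||die "ERROR: can not read file $file\n";
while(<FD>){
    $i++;
    # chomp without argument removes the newline from $_:
    chomp;
    print "Line $i is [$_]\n";
    $last=$_;
}
close FD;
# remove the sample file again:
unlink($file);
```

The square brackets in the output make it visible that the line no longer ends in a newline.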

Now we know how to read the file. To actually complete our program we need to learn 2 more things:

How to read the abbreviation from the start of the line.
How perl hash tables work

Regular expressions provide sophisticated means to search for a pattern in a text string. We are looking for the first string in a line until the first space. In other words our pattern is "start of line-->a number of characters but not space-->a space". In terms of perl regular expressions this is ^\S+\s. If we put this inside m//; then perl will apply this expression to the $_ variable (Remember: this variable holds the current line; nice, isn't it?!). The \S+ in the regular expression corresponds to "a number of characters but not space". If we put parentheses around the \S+ then we get the "not space characters" back in the variable $1. We can add this to our program:

#!/usr/bin/perl -w
# vim: set sw=8 ts=8 si et:
#
use strict;
# global variables:
use vars qw($opt_h);
my $i=0;
use Getopt::Std;
#
&getopts("h")||die "ERROR: No such option. -h for help.\n";
&help if ($opt_h);
#
open(FD,"abb.txt")||die "ERROR: can not read file abb.txt\n";
while(<FD>){
    $i++;
    if (m/^(\S+)\s/){
        # $1 holds now the first word (\S+)
        print "$1 is the abbreviation on line $i\n";
    }else{
        print "Line $i does not start with an abbreviation\n";
    }
}
close FD;
#
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
sub help{
     print "help text\n";
     exit;
}
#-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
__END__

The match operator (m/ /) returns a true value (1) if the regular expression matches the current line. We can therefore use it inside an if-statement. You should always use an if-statement around the match operator before you use $1 to ensure that $1 really contains valid data.
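
The pattern can also be tried on a few strings without reading a file. The sketch below uses the =~ operator, which is not introduced in the article: it applies the match to a named variable instead of $_. The subroutine name first_word and the test strings are made up:

```perl
#!/usr/bin/perl -w
use strict;

# return the first word of a line, or an empty string if the
# line does not start with a word followed by white space:
sub first_word{
    my $line=shift;
    if ($line =~ m/^(\S+)\s/){
        return $1;
    }
    return "";
}

print "[", &first_word("AFS Andrew File System\n"), "]\n";
# a leading space means there is no word at the start of the line:
print "[", &first_word(" AFS with a leading space\n"), "]\n";
# the newline at the end of a line also counts as white space:
print "[", &first_word("AFS\n"), "]\n";
```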

Hash Tables

Now we can read the file and get the abbreviation, and all that is missing is some means to see if we have already read this abbreviation before. Here we need a new perl data type: Hash Tables. Hash Tables are arrays which can be indexed by a string. When you mean the whole Hash Table you write a % sign in front of the variable name. To read out an individual value you use $variable_name{"index_string"}. We use the same $ as for other scalar variables because a field inside the Hash Table is just a normal scalar variable. Here is an example:

#!/usr/bin/perl -w
my %htab;
my $index;
# load the hash with data:
$htab{"something"}="value of something";
$htab{"somethingelse"}=42;
# get the data back:
$index="something";
print "%htab at index \"$index\" is $htab{$index}\n";
$index="somethingelse";
print "%htab at index \"$index\" is $htab{$index}\n";

When running this program we get:

%htab at index "something" is value of something
%htab at index "somethingelse" is 42
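
For our duplicate finder it matters what a Hash Table gives back for an index that was never stored. The following sketch shows that such a field is simply false in an if-statement. It also uses exists(), a function not covered in the article, which tests for the index itself; the index "zero" is made up to show the difference:

```perl
#!/usr/bin/perl -w
use strict;
my %htab;
my $status_ac;
my $status_afc;

$htab{"AC"}="AC Access Class\n";

# an index that has been stored gives back a true value:
if ($htab{"AC"}){
    $status_ac="already stored";
}else{
    $status_ac="not stored yet";
}
# an index that was never stored gives back nothing (false):
if ($htab{"AFC"}){
    $status_afc="already stored";
}else{
    $status_afc="not stored yet";
}
print "AC is $status_ac\n";
print "AFC is $status_afc\n";

# a stored value of 0 is false as well; exists() still finds it:
$htab{"zero"}=0;
print "zero is false but exists\n" if (!$htab{"zero"} && exists($htab{"zero"}));
```

Testing the value directly is sufficient as long as the stored values are never empty or 0, which holds when whole lines of a file are stored.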

Now our program is complete:

1  #!/usr/bin/perl -w
2  # vim: set sw=4 ts=4 si et:
3  #
4  use strict;
5  # global variables:
6  use vars qw($opt_h);
7  my %htab;
8  use Getopt::Std;
9  #
10  &getopts("h")||die "ERROR: No such option. -h for help.\n";
11  &help if ($opt_h);
12  #
13  open(FD,"abb.txt")||die "ERROR: can not read file abb.txt\n";
14  print "Abbreviations with several meanings in file abb.txt:\n";
15  while(<FD>){
16      if (m/^(\S+)\s/){
17          # we use the first word as index to the hash:
18          if ($htab{$1}){
19              # again this abbrev:
20              if ($htab{$1} eq "_repeated_"){
21                  print; # same as print "$_";
22              }else{
23                  # this is the first duplicate; we print the first
24                  # occurrence of this abbreviation:
25                  print $htab{$1};
26                  # print the abbreviation line that we are currently reading:
27                  print;
28                  # mark as repeated (= appears at least twice)
29                  $htab{$1}="_repeated_";
30              }
31          }else{
32              # the first time we load the whole line:
33              $htab{$1}=$_;
34          }
35      }
36  }
37  close FD;
38  #
39  #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
40  sub help{
41          print "finddup -- Find abbreviations with several meanings in the
42  file abb.txt. The lines in this file must have the format:
43  abbrev meaning
44  \n";
45          exit;
46  }
47  #-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
48  __END__

You can download the program by clicking here.

How does it work? We read the file line by line and store the lines in our hash called %htab (line 33). The index to the hash is the abbreviation. Before we load the hash we test if there is already something stored in the hash (line 18). If there is already something in the hash then we have two possibilities:

This is the first duplicate
We had already several duplicates of this abbreviation

To distinguish between the two cases we write the string "_repeated_" into the hash to mark that we have already found a duplicate in the file (line 29).
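
The three states of a hash field (empty, holding the first line, marked as "_repeated_") can be followed in a sketch that works on a few lines given directly in the code instead of a file. The subroutine check_line, the variable $out and the line "AC Alternating Current" are made up for this illustration; the logic follows the program above, with the current line printed in both duplicate cases:

```perl
#!/usr/bin/perl -w
use strict;
my %htab;
my $out="";

sub check_line{
    my $line=shift;
    if ($line =~ m/^(\S+)\s/){
        if ($htab{$1}){
            if ($htab{$1} ne "_repeated_"){
                # first duplicate: output the stored first occurrence
                # and mark the abbreviation as repeated
                $out.=$htab{$1};
                $htab{$1}="_repeated_";
            }
            # in both cases output the line we are looking at:
            $out.=$line;
        }else{
            # first time we see this abbreviation: store the line
            $htab{$1}=$line;
        }
    }
}

&check_line("AC Access Class\n");
&check_line("AFC Automatic Frequency Control\n");
&check_line("AC Air Conditioning\n");
&check_line("AC Alternating Current\n");
print $out;
```

All three AC lines end up in the output, each exactly once, while the AFC line is stored but never printed.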

It is probably best to download the code and try it out.

What is next?

In this article you have already learned some details of the perl language. We have not yet covered all data types that perl has and you probably wonder if it is possible to avoid hard-coding the file name "abb.txt" in our program above. You already know how you could use an option to avoid this (e.g. finddup -f abb.txt). Try to change the program! The general way to read the command line and the datatype array will be covered in the next article.
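
One possible starting point for the -f exercise, as a sketch only: the option letter f takes an argument, so it is written as "f:" in the getopts string, and "abb.txt" stays the default when -f is not given. The command line is simulated by filling @ARGV by hand, and the file name mylist.txt is made up:

```perl
#!/usr/bin/perl -w
use strict;
use vars qw($opt_f $opt_h);
use Getopt::Std;

# Pretend the program was started as: finddup -f mylist.txt
@ARGV = ("-f", "mylist.txt");

&getopts("f:h")||die "ERROR: No such option. -h for help\n";

# use the default file name unless -f was given:
my $file="abb.txt";
$file=$opt_f if ($opt_f);
print "would read the abbreviations from $file\n";
```

In the real program the variable $file would then replace the string "abb.txt" in the open call and in the error message.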

1999-11-06, generated by lfparser version 0.9