lf131, UNIX Basics: GNU file utilities


by Manuel Muriel Cordero

About the author:
Manuel Muriel Cordero studies at the Statistics and Computer Science Faculty at Sevilla University.
Content:

Introduction: the Unix way of working
The genesis of GNU utils
grep
Regular expressions
Find
cut & paste
sort
wc
Comparison tools : cmp,comm,diff
uniq
sed
awk
The shell scripts
Resources
Bibliography

GNU file utilities

Abstract:

The previous article in this series (Basic UNIX commands) gave a general overview of Linux. It introduced the elements of Linux so that the reader could acquire basic skills and manage the operating system, but the user may also want to learn the usual set of Unix commands. Using these commands and the shell you can achieve very efficient file and system management. This article deals with those tools, which are advanced and yet basic.

Introduction: the Unix way of working

Before describing the commands, the reader should know some facts about their history. Ken Thompson and Dennis Ritchie, when developing Unix at the beginning of the seventies, wanted to make an operating system that would ease the life of programmers. They decided that the best way to achieve that goal was to define a small number of simple tools, each extremely good at one specialized task. More complicated tasks could then be performed by joining those tools, using the output from one as the input for another.

Information is passed between programs through the standard input and the standard output (by default the keyboard and the screen). Thanks to pipes and redirection (seen in the previous article), commands can be combined.

It is very easy to demonstrate using an example. A user writes:

$ who | grep pepe

who and grep are two separate programs joined with the pipe "|". who lists every user logged on to the computer at this moment. Typical output could be something like:

$ who
manolo	tty1	Dec 22	13:15
pepe	ps/2	Dec 22	14:36
root	tty2	Dec 22	10:03
pepe	ps/2	Dec 22	14:37

The output consists of 4 fields separated by white space: the username (login), the login terminal, and the date and time of the connection.

"grep pepe" searches for the lines containing the string "pepe".

And the output is:

$ who | grep pepe
pepe	ps/2	Dec 22	14:36
pepe	ps/2	Dec 22	14:37

Maybe you want something simpler than knowing whether a particular user is logged in. You can check the number of terminals in use at that moment with the program wc.

wc counts characters, words and lines. In this case we only need the number of lines, so we use the option -l:

$ who | wc -l
	4
$ who | grep pepe | wc -l
	2

There are 4 sessions in total, and pepe is logged in at 2 terminals.

If we now check for antonio:

$ who | grep antonio | wc -l
	0

antonio is not logged in.
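
The same pipeline can be tried without depending on who is really logged in. The following is only a sketch: sample_who is a hypothetical shell function, not a real command, that prints the sample listing shown above in place of who.

```shell
# sample_who is a hypothetical stand-in for `who`: it prints the sample
# listing from this article so the result does not depend on real logins.
sample_who() {
    printf '%s\n' \
        'manolo tty1 Dec 22 13:15' \
        'pepe   ps/2 Dec 22 14:36' \
        'root   tty2 Dec 22 10:03' \
        'pepe   ps/2 Dec 22 14:37'
}

# Each program reads the output of the previous one through the pipe.
total=$(sample_who | wc -l)
pepe=$(sample_who | grep pepe | wc -l)

echo "sessions: $total, sessions of pepe: $pepe"
```

Replacing sample_who with who gives the figures for the machine you are working on.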

The genesis of GNU utils

Richard Stallman, the founder of the GNU project, raised the discussion about the control that a few large software companies had over the Unix OS, which prevented computer science from growing naturally. After developing the emacs editor while working at MIT, he strongly disliked the fact that big commercial firms took his work to make proprietary versions. Confronted with this situation, he decided to begin a project in which the source code of the software was available to everybody. That was GNU. The long-term goal was to build a complete open-source operating system. The first steps were a new open-source version of emacs and a C compiler (gcc), as well as some typical tools for Unix systems. These tools are discussed in this article.

grep

Our first example showed the main functionality of grep. Now we will explain it in greater detail.

Basic usage of grep is

$ grep [-options] pattern files

And the most used options are
-n prints the line number before each matching line (useful when searching big files, to know exactly where the match is located)
-c prints the number of matching lines
-v searches for non-matching lines (lines where the pattern is not present)
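
As an illustration, the three options can be tried on a small throwaway file. This is a minimal sketch; the file name menu.txt and its contents are made up for the example.

```shell
# Work in a scratch directory so that no real file is touched.
dir=$(mktemp -d)
printf '%s\n' 'apple pie' 'banana split' 'apple juice' 'cherry tart' > "$dir/menu.txt"

# -n: prefix each matching line with its line number.
numbered=$(grep -n apple "$dir/menu.txt")

# -c: print only the number of lines that matched.
count=$(grep -c apple "$dir/menu.txt")

# -v: print the lines that do NOT contain the pattern.
others=$(grep -v apple "$dir/menu.txt")

printf '%s\n' "$numbered" "matching lines: $count" "$others"
rm -r "$dir"
```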

The pattern is any group of characters to search for. If it contains an embedded blank, the pattern must be quoted (") so that the shell does not confuse the pattern with the files to be searched. For example:

$ grep "Hola mundo" file.txt

If we are looking for strings that include wildcards, apostrophes, quotes or backslashes, these characters must be escaped (preceded by a backslash) or quoted to avoid substitution by the shell.

$ grep \*\"\'\?\< file.txt
This finds:
Esto es una cadena chunga -> *"'?<

Regular expressions

grep and other GNU utils are able to perform more advanced searches by means of regular expressions. These are similar to shell wildcards, in the sense that they stand for characters or groups of characters. The resources at the end of the article also include a link to a separate article explaining regular expressions in detail.
Some examples:

$ grep c.n

searches for any occurrence of a string with c, any character and n.

$ grep "[Bc]el"

search every occurrence of Bel and cel.

$ grep "[m-o]ata"

find those lines containing mata, nata or oata.

$ grep "[^m-o]ata"

Lines containing the string 'ata' preceded by any character other than m, n or o.

$ grep "^Martin come"

Every line beginning with 'Martin come'. As ^ is out of brackets, it means the beginning of a line, not a negation of a group as in the previous example.

$ grep "durmiendo$"

All the lines ending with the string 'durmiendo'. $ stands for the end of the line.

$ grep "^Caja San Fernando gana la liga$"

Those lines exactly matching the string.

To avoid the special meaning of any of these characters they must be backslashed. For example:

$ grep "E\.T\."

search for the string 'E.T.'.
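
These patterns can be checked against a small word list. The sketch below is only illustrative; the file words.txt and the words in it are assumptions chosen to show which lines each pattern selects.

```shell
dir=$(mktemp -d)
printf '%s\n' 'mata' 'nata' 'pata' 'la gata' 'E.T. phone home' 'ExTx' > "$dir/words.txt"

# [m-o]ata: "ata" preceded by m, n or o.
range=$(grep '[m-o]ata' "$dir/words.txt")

# [^m-o]ata: "ata" preceded by any character other than m, n or o.
negated=$(grep '[^m-o]ata' "$dir/words.txt")

# ^ anchors the match at the start of the line, $ at its end.
anchored=$(grep '^pata$' "$dir/words.txt")

# The backslashes make the dots literal, so "ExTx" is not matched.
literal=$(grep 'E\.T\.' "$dir/words.txt")

printf '%s\n' "$range" "$negated" "$anchored" "$literal"
rm -r "$dir"
```

Single quotes are used here so that the shell passes the pattern to grep untouched.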

Find

This command is used to find files. Another LinuxFocus article explains this command, and the best we can do is to point to it.

cut & paste

Within Unix, information used to be stored in ASCII files with one record per line and fields delimited by some special character, usually a tabulation mark or a colon (:). A typical task is to take some fields from a file and join them into another one. cut and paste are able to do this work.

Let us use as an example the file /etc/passwd, which holds the user information. It contains 7 fields, delimited with ":". The fields hold the login name, the encrypted password, the user ID, the group ID, the GECOS (comment) field, the user's home directory and the user's preferred shell.

Here is a typical piece from this file:

root:x:0:0:root:/root:/bin/bash
murie:x:500:500:Manuel Muriel Cordero:/home/murie:/bin/bash
practica:x:501:501:Usuario de practicas para Ksh:/home/practica:/bin/ksh
wizard:x:502:502:Wizard para nethack:/home/wizard:/bin/bash

If we want to pair the users with their shells, we must cut fields 1 and 7. Let's go:

$ cut -f1,7 -d: /etc/passwd
root:/bin/bash
murie:/bin/bash
practica:/bin/ksh
wizard:/bin/bash

The option -f specifies the fields to cut, and -d defines the field separator (tabulation mark is the default).

And it is possible to select a range of fields:

$ cut -f5-7 -d: /etc/passwd
root:/root:/bin/bash
Manuel Muriel Cordero:/home/murie:/bin/bash
Usuario de practicas para Ksh:/home/practica:/bin/ksh
Wizard para nethack:/home/wizard:/bin/bash

If we have redirected these two outputs with '>' to two different files and want to join them line by line, we can use the command paste. By default paste joins the lines with a tabulation mark; the option -d: selects the colon instead:

$ paste -d: output1 output2
root:/bin/bash:root:/root:/bin/bash
murie:/bin/bash:Manuel Muriel Cordero:/home/murie:/bin/bash
practica:/bin/ksh:Usuario de practicas para Ksh:/home/practica:/bin/ksh
wizard:/bin/bash:Wizard para nethack:/home/wizard:/bin/bash
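
The whole round trip can be reproduced on a private copy of the data. This is a sketch under the assumption that a three-line sample in /etc/passwd format is enough; the file names names and shells are made up.

```shell
dir=$(mktemp -d)
# A sample in the format of /etc/passwd, taken from the excerpt above.
cat > "$dir/passwd" <<'EOF'
root:x:0:0:root:/root:/bin/bash
murie:x:500:500:Manuel Muriel Cordero:/home/murie:/bin/bash
practica:x:501:501:Usuario de practicas para Ksh:/home/practica:/bin/ksh
EOF

# Save the login names and the shells in two separate files.
cut -d: -f1 "$dir/passwd" > "$dir/names"
cut -d: -f7 "$dir/passwd" > "$dir/shells"

# paste joins line N of each file; -d sets the joining character
# (a tabulation mark is used when -d is omitted).
joined=$(paste -d: "$dir/names" "$dir/shells")
echo "$joined"
rm -r "$dir"
```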

sort

Let's assume that we want to sort /etc/passwd by the GECOS field. To achieve this, we will use sort, the Unix sorting tool:

$ sort -t: +4 /etc/passwd
murie:x:500:500:Manuel Muriel Cordero:/home/murie:/bin/bash
practica:x:501:501:Usuario de practicas para Ksh:/home/practica:/bin/ksh
wizard:x:502:502:Wizard para nethack:/home/wizard:/bin/bash
root:x:0:0:root:/root:/bin/bash

It is easy to see that the file has been sorted, but in ASCII table order, where all capital letters come before the lowercase ones. If we do not want to distinguish between capital and lowercase letters, we can use:

$ sort -t: +4f  /etc/passwd
murie:x:500:500:Manuel Muriel Cordero:/home/murie:/bin/bash
root:x:0:0:root:/root:/bin/bash
practica:x:501:501:Usuario de practicas para Ksh:/home/practica:/bin/ksh
wizard:x:502:502:Wizard para nethack:/home/wizard:/bin/bash

-t selects the field separator. +4 is the number of fields to skip before comparing the lines, and f means to sort regardless of upper and lower case.

A much more complicated sort can be achieved. For example, we can sort by the shell (in reverse order) first and then by the GECOS field:

$ sort -t: +6r +4f /etc/passwd
practica:x:501:501:Usuario de practicas para Ksh:/home/practica:/bin/ksh
murie:x:500:500:Manuel Muriel Cordero:/home/murie:/bin/bash
root:x:0:0:root:/root:/bin/bash
wizard:x:502:502:Wizard para nethack:/home/wizard:/bin/bash

Suppose you have a file listing the people you have lent money to and the amount each one owes. Take 'deudas' as an example:

Son Goku:23450
Son Gohan:4570
Picolo:356700
Ranma 1/2:700

If you want to know whom to 'visit' first, you need a sorted list.
Just type:

$ sort +1 deudas
Ranma 1/2:700
Son Gohan:4570
Son Goku:23450
Picolo:356700

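which is not the desired result: no field separator was given, so the number of fields varies from line to line, and the amounts are compared as text rather than as numbers. The solution is to select the separator with -t: and to add the 'n' (numeric) and 'r' (reverse) options:

$ sort -t: +1nr deudas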
Picolo:356700
Son Goku:23450
Son Gohan:4570
Ranma 1/2:700

Basic options for sort are:
+n.m skips the first n fields and the next m characters before beginning the sort
-n.m stops the sort on reaching the m-th character of the n-th field

The following are modification parameters:
-b jumps over leading whitespaces
-d dictionary sort (just using letters, numbers and whitespace)
-f ignores case distinction
-n sort numerically
-r reverse order
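
On current systems the +n and -n position arguments are considered obsolete and may not be accepted. The -k option describes the same sort keys, counting fields from 1 instead of skipping them. The following is a sketch, assuming a sort that supports the POSIX -k option, applied to a private copy of the 'deudas' file:

```shell
dir=$(mktemp -d)
cat > "$dir/deudas" <<'EOF'
Son Goku:23450
Son Gohan:4570
Picolo:356700
Ranma 1/2:700
EOF

# -t: sets the separator; -k2,2 limits the key to field 2 (the amount);
# n compares numerically and r reverses, so the largest debt comes first.
by_amount=$(sort -t: -k2,2nr "$dir/deudas")

# -k1,1f sorts on the name only, ignoring case.
by_name=$(sort -t: -k1,1f "$dir/deudas")

echo "$by_amount"
rm -r "$dir"
```

With this notation, -k5 corresponds to the +4 used earlier for the fifth field of /etc/passwd.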

wc

As we have seen before, wc counts characters, words and lines. The default output contains the number of lines, words and characters of the input file(s).

The output can be restricted with the options:

-l shows only the number of lines
-w shows only the number of words
-c shows only the number of characters
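
A short sketch of the three options follows; the file sample.txt and its two lines are made up for the example.

```shell
dir=$(mktemp -d)
printf 'one two three\nfour five\n' > "$dir/sample.txt"

# Reading from standard input keeps the file name out of the output.
lines=$(wc -l < "$dir/sample.txt")
words=$(wc -w < "$dir/sample.txt")
chars=$(wc -c < "$dir/sample.txt")

echo "lines=$lines words=$words chars=$chars"
rm -r "$dir"
```

The character count includes the newline at the end of each line.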

Comparison tools : cmp,comm,diff

Sometimes we need to know the differences between two versions of the same file. This is especially common in programming, when several people work on the same project and modify the same source code. These tools find the variations from one version to the other.

cmp is the most basic one. It compares two files and reports the position of the first difference (character number and line number):

$ cmp old new
old new differ: char 11234, line 333

comm is a bit more advanced. It compares two sorted files, and its output has 3 columns. The first one contains the lines unique to the first file, the second one the lines unique to the second file, and the third one the lines common to both. Numeric options remove some of these columns:
-1, -2 and -3 tell comm not to display the first, second or third column. This example shows the lines appearing only in the first file and the common ones:

$ comm -2 old new
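
The options can be combined to keep a single column. The following sketch uses two small sorted files whose contents are made up for the example:

```shell
dir=$(mktemp -d)
# comm expects both inputs to be sorted.
printf '%s\n' apple banana cherry > "$dir/old"
printf '%s\n' banana cherry date  > "$dir/new"

# -23 hides columns 2 and 3: lines found only in old.
only_old=$(comm -23 "$dir/old" "$dir/new")
# -13 hides columns 1 and 3: lines found only in new.
only_new=$(comm -13 "$dir/old" "$dir/new")
# -12 hides columns 1 and 2: lines common to both files.
common=$(comm -12 "$dir/old" "$dir/new")

printf '%s\n' "only in old: $only_old" "only in new: $only_new"
rm -r "$dir"
```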

Last but not least there is diff. It is an essential tool for advanced programming projects. If you have ever downloaded a kernel to compile it, you know that you can download either the full source code of the new version or a patch against the previous version, the latter being much smaller. Such a patch often has a diff suffix, which means that it is diff output. diff can express the differences as editor commands (ed, rcs) that make the files identical. It also works on directories and the files they contain. The use case is quite obvious: you download less source code (just the changes), apply the patch and compile. Without parameters, the output describes, in a notation resembling editor commands, the changes needed to make the first file equal to the second one.

$ diff old new
3c3
< The Hobbit
---
> The Lord of the Rings
78a79,86
> Three Rings for the Elven-kings under the sky,
> Seven for the Dwarf-lords in their halls of stone,
> Nine for Mortal Men doomed to die,
> One for the Dark Lord on his dark throne
> In the Land of Mordor where the Shadows lie.
> One Ring to rule them all, One Ring to find them,
> One Ring to bring them all and in the darkness bind them
> In the Land of Mordor where the Shadows lie.

3c3 means that line 3 of the first file has to be changed to match line 3 of the second: "The Hobbit" is removed and replaced with "The Lord of the Rings". 78a79,86 means that after line 78 of the first file you must add the new lines, which are lines 79 to 86 of the second file.
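
A smaller case makes the notation easier to follow. This is only a sketch; the two files and their contents are invented for the example.

```shell
dir=$(mktemp -d)
printf '%s\n' 'Title' 'The Hobbit' 'End' > "$dir/old"
printf '%s\n' 'Title' 'The Lord of the Rings' 'End' 'Appendix' > "$dir/new"

# diff exits with status 1 when the files differ, so "|| true" keeps a
# script that runs under "set -e" from stopping here.
changes=$(diff "$dir/old" "$dir/new" || true)
echo "$changes"
rm -r "$dir"
```

Here the result contains one change (2c2) and one addition (3a4): line 2 is replaced, and a new line 4 is added after line 3 of the first file.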

uniq

uniq is a redundancy cleaner. For example, if we want to know who is currently connected to the computer, we can use the commands who and cut.

$ who | cut -f1 -d' '
root
murie
murie
practica

But the output is not quite what we want. We need to delete the second entry for user murie. That means:

$ who | cut -f1 -d' ' | sort | uniq
murie
practica
root

The option -d' ' means that the field separator is the white space, because the output from who uses that character instead of tabulation marks.

uniq compares only consecutive lines. In our case the two "murie" lines happened to be adjacent, but they could have appeared in a different order. It is therefore a good idea to sort the output before giving it to uniq.
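
The effect of sorting first can be seen in a short sketch. The list of names below is an assumption standing in for the first column of who, with the two "murie" entries deliberately not adjacent.

```shell
# The names below stand in for the first column of `who`.
users=$(printf '%s\n' murie root murie practica)

# Without sorting, the two "murie" lines are not adjacent and both survive.
unsorted=$(printf '%s\n' "$users" | uniq)

# After sorting, the duplicates are adjacent and uniq removes one of them.
distinct=$(printf '%s\n' "$users" | sort | uniq)

# uniq -c also reports how many times each line occurred.
counted=$(printf '%s\n' "$users" | sort | uniq -c)

printf '%s\n' "$distinct" "$counted"
```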

sed

sed is one of the most peculiar Unix tools. The name means stream editor. An ordinary editor accepts the user's modifications interactively. sed instead lets us write the modifications down as small scripts, similar to batch files in MS-DOS, and so gives us the ability to modify the content of a file without user interaction. The editor's capabilities are rather complete, and going deeper into the subject would make this article too long. So we will give only a brief introduction, leaving the man and info pages for the interested user.

sed is usually invoked as:

$ sed 'command' files

Take as example a file where we want to replace every presence of "Manolo" with "Fernando". Let's do it:

$ sed 's/Manolo/Fernando/g' file

It writes the modified file to standard output. If you want to keep the result, just redirect it with ">" to a different file.

Many users will recognize the common search & replace command of vi. In fact, most of the ":" commands (those that call ex) are also sed commands.

Usually, sed instructions consist of one or two addresses (to select lines) and the command to execute. An address can be a line number, a range of lines or a pattern.
The most widely used commands are:

Command	  Action
-------   ------
a\         appends the given text after the addressed lines
c\         changes the addressed lines, replacing them with the given text
d          deletes the line(s)
g          flag of s: substitutes every occurrence of the pattern in the
           line instead of only the first one
i\         inserts the given text before the addressed lines
p          prints the current line, even when the -n option is used
q          finishes (quits) on reaching the addressed line
r file     reads a file, appending its contents to the output
s/one/two/ replaces string "one" with string "two"
w file     copies the current line to a different file
=          prints the line number
! command  applies the command to the lines NOT selected by the address

Using sed you can specify which lines or range of lines you want to act on:

$ sed '3d' file

will delete the third line of the file

$ sed '2,4s/e/#/' file

will substitute the first appearance of character "e" with the character "#" in lines 2 to 4 (including both).

Lines containing a string can be selected using the regular expressions described above. For example:

$ sed '/[Qq]ueen/d' songs

will delete every line including the word "Queen" or "queen".

It's very easy to delete empty lines from a file using patterns:

$ sed '/^$/d' file

although those lines containing white spaces will not be deleted. To achieve this, you must use a slightly wider pattern

$ sed '/^ *$/d' file

where the "*" character means any number of occurrences of the previous character, " " (space) in this case.
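
The three kinds of address (line number, range and pattern) can be compared on one small file. The following is a sketch; the file songs and its five lines are made up, with line 2 empty and line 4 holding only spaces.

```shell
dir=$(mktemp -d)
printf '%s\n' 'first line' '' 'Queen song' '   ' 'last line' > "$dir/songs"

# Line number address: delete line 1 only.
no_first=$(sed '1d' "$dir/songs")

# Pattern address: delete every line containing "Queen" or "queen".
no_queen=$(sed '/[Qq]ueen/d' "$dir/songs")

# Pattern address: delete lines that are empty or contain only spaces.
no_blank=$(sed '/^ *$/d' "$dir/songs")

# Range address: replace the first "e" with "#" on lines 3 to 5.
hashed=$(sed '3,5s/e/#/' "$dir/songs")

printf '%s\n' "$no_blank" "$hashed"
rm -r "$dir"
```

None of these commands changes the file itself; each result is written to standard output and captured in a variable.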

$ sed '/InitMenu/a\
> the text to append' file.txt

This example searches for the lines containing the string "InitMenu" and inserts a new line after each of them. It works as shown only with bash or sh as the shell. You type up to a\, then press return and type the rest.

Tcsh handles newlines inside quotes in a different way. Therefore in tcsh you must use a double backslash:

$ sed '/InitMenu/a\\
? the text to append' file.txt

The ? comes from the shell just as the > in the bash example.

awk

Last but not least: awk. Its peculiar name comes from the names of its original developers: Alfred Aho, Peter Weinberger and Brian Kernighan.

The awk program is one of the most interesting Unix utilities. It is a mature and complex tool that makes it possible to perform a wide variety of actions from the command line.

It should be noted that awk and sed are key pieces of the more complex shell scripts. What you can do without C or any other compiled language is really impressive. The Slackware Linux distribution setup, as well as many CGI web programs, are just shell scripts.

Nowadays the command line tools have been somewhat left aside, as they seem too old for today's window environments, and with the arrival of Perl many shell scripts have been replaced by Perl scripts. It might look as if these command line tools will be forgotten. However, my own experience says that many applications can be written with a few lines of shell script (including a small database manager). Apart from that, you can be very productive if you know how to use these commands and the shell.

If you combine the power of awk and sed, you can quickly do things that usually require a small database manager plus a spreadsheet.

Take an invoice listing the articles you bought, how many pieces of each one, and the price per piece. Let's call this file "sales":

oranges 	5	250
peras   	3	120
apples  	2	360

It's a file with 3 fields, with tabulation marks as field separators. Now you want to define a fourth field with the total price per product.

$ awk '{total=$2*$3; print $0 , total }' sales
oranges	5	250	1250
peras	3	120	360
apples	2	360	720

total is the variable which will contain the product of the values stored in the second and third fields. After calculation, the whole input line and the total value are printed.
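
The same idea extends to a grand total. The sketch below is an assumption about how one might do it, not part of the original invoice example: it sets the field separators explicitly with -F and OFS and adds an END block that runs after the last line.

```shell
dir=$(mktemp -d)
# Fields are separated by single tabulation marks, as in the "sales" file.
printf 'oranges\t5\t250\nperas\t3\t120\napples\t2\t360\n' > "$dir/sales"

# -F '\t' reads tab-separated fields and OFS writes them back with tabs.
# The END block runs once after the last line and prints the grand total.
report=$(awk -F '\t' 'BEGIN { OFS = "\t" }
    { total = $2 * $3; sum += total; print $0, total }
    END { print "TOTAL", sum }' "$dir/sales")

echo "$report"
rm -r "$dir"
```

The variable sum needs no declaration; awk treats an unset variable as 0 the first time it is used in arithmetic.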

awk is nearly a programming environment in itself, very well suited to automated work with information from text files. If you are interested in this tool, I encourage you to learn more from its man and info pages.

The shell scripts

Shell scripts are sequences of system commands stored in a file to be executed.

Shell scripts are similar to batch files from DOS but more powerful. They allow users to make their own commands just by combining existing ones.

Shell scripts can accept parameters, of course. They are stored in the variables $0 (the command/script name), $1, $2, ... up to $9. All the command line parameters together can be referred to with $*.

Any text editor can create shell scripts. To execute a script just type:

$ sh shell-script

Or, much better, you can give execution permission with

$ chmod 700 shell-script

and execute it by just typing its name (or ./shell-script if the current directory is not in your PATH):

$ shell-script
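
The steps above can be put together in a short sketch. The script name greet and its contents are hypothetical, and $# (the number of parameters) is used in addition to the variables mentioned in this article.

```shell
dir=$(mktemp -d)

# Create a small script. The quoted 'EOF' keeps $1, $# and $* from being
# expanded now, so they are written to the file literally.
cat > "$dir/greet" <<'EOF'
#!/bin/sh
# $1 is the first parameter, $# the parameter count, $* all parameters.
echo "first: $1"
echo "count: $#"
echo "all: $*"
EOF

# Give the owner read, write and execute permission.
chmod 700 "$dir/greet"

# The scratch directory is not in PATH, so the script is called by its path.
output=$("$dir/greet" uno dos tres)
echo "$output"
rm -r "$dir"
```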

We will finish this article here; the discussion of shell scripts is postponed to a future article. The next article will introduce the most common Unix text editors: vi & emacs. Every Linux user should know them well.

Resources

This is an introductory article, and readers could learn more details within other LinuxFocus articles:

Bibliography

For further reading:

The lord of the rings, J.R.R Tolkien.

© Manuel Muriel Cordero, FDL

2001-01-27, generated by lfparser version 2.8