Computerworld

The A-Z of Programming Languages: AWK

Alfred V. Aho of AWK fame talks about the history and continuing popularity of his pattern matching language.

Computer scientist and compiler expert Alfred V. Aho is a man at the forefront of computer science research. He has been involved in the development of programming languages from his days working as the vice president of the Computing Sciences Research Center at Bell Labs to his current position as Lawrence Gussman Professor in the Computer Science Department at Columbia University.

As well as co-authoring the 'Dragon' book series, Aho was one of the three developers of the AWK pattern matching language in the mid-1970s, along with Brian Kernighan and Peter Weinberger.

Computerworld recently spoke to Professor Aho to learn more about the development of AWK, in the first of a series of investigations into the most widely-used programming languages.

How did the idea/concept of the AWK language develop and come into practice?

As with a number of languages, it was born from the necessity to meet a need. As a researcher at Bell Labs in the early 1970s, I found myself keeping track of budgets, and keeping track of editorial correspondence. I was also teaching at a nearby university at the time, so I had to keep track of student grades as well.

I wanted to have a simple little language in which I could write one- or two-line programs to do these tasks. Brian Kernighan, a researcher next door to me at the Labs, also wanted to create a similar language. We had daily conversations which culminated in a desire to create a pattern-matching language suitable for simple data-processing tasks.

We were heavily influenced by GREP, a popular string-matching utility on UNIX, which had been created in our research center. GREP would search a file of text looking for lines matching a pattern consisting of a limited form of regular expressions, and then print all lines in the file that matched that regular expression.

We thought that we'd like to generalize the class of patterns to deal with numbers as well as strings. We also thought that we'd like to have more computational capability than just printing the line that matched the pattern.

So out of this grew AWK, a language based on the principle of pattern-action processing. It was built to do simple data processing: the ordinary data processing that we routinely did on a day-to-day basis. We just wanted to have a very simple scripting language that would allow us, and people who weren't very computer savvy, to be able to write throw-away programs for routine data processing.
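The step from GREP-style string patterns to numeric patterns can be illustrated with a small sketch. The budget data and the shell variable names below are hypothetical, chosen only for illustration; the two AWK programs use a pattern with no action, in which case AWK prints the matching line.

```shell
# Hypothetical budget records: a category followed by an amount.
budget='travel 1200
books 85
travel 40'

# GREP-style: print every line that matches a regular expression.
matches=$(printf '%s\n' "$budget" | awk '/travel/')

# Beyond GREP: print every line whose second field is numerically above 100.
large=$(printf '%s\n' "$budget" | awk '$2 > 100')

printf '%s\n' "$matches"
printf '%s\n' "$large"
```

The first command selects both travel lines, as GREP would; the second selects only the line whose amount exceeds 100, a test that a string-only matcher cannot express directly.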

Were there any programs or languages that already had these functions at the time you developed AWK?

Our original model was GREP. But GREP had a very limited form of pattern-action processing, so we generalized the capabilities of GREP considerably. I was also interested at that time in string pattern matching algorithms and context-free grammar parsing algorithms for compiler applications. This means that you can see a certain similarity between what AWK does and what the compiler construction tools LEX and YACC do.

LEX and YACC were tools that were built around string pattern matching algorithms that I was working on: LEX was designed to do lexical analysis and YACC syntax analysis. These tools were compiler construction utilities which were widely used in Bell Labs, and later elsewhere, to create all sorts of little languages. Brian Kernighan was using them to make languages for typesetting mathematics and picture processing.

LEX is a tool that looks for lexemes in input text. Lexemes are sequences of characters that make up logical units. For example, a keyword like 'then' in a programming language is a lexeme. The character 't' by itself isn't interesting, 'h' by itself isn't interesting, but the combination 'then' is interesting. One of the first tasks a compiler has to do is read the source program and group its characters into lexemes.

AWK was influenced by this kind of textual processing, but AWK was aimed at data-processing tasks and it assumed very little background on the part of the user in terms of programming sophistication.

Can you provide Computerworld readers with a brief summary in your own words of AWK as a language?

AWK is a language for processing files of text. A file is treated as a sequence of records, and by default each line is a record. Each line is broken up into a sequence of fields, so we can think of the first word in a line as the first field, the second word as the second field, and so on. An AWK program is a sequence of pattern-action statements. AWK reads the input a line at a time. A line is scanned for each pattern in the program, and for each pattern that matches, the associated action is executed.

A simple example should make this clear. Suppose we have a file in which each line is a name followed by a phone number. Let's say the file contains the line 'Naomi 1234'. In the AWK program the first field is referred to as $1, the second field as $2, and so on. Thus, we can create an AWK program to retrieve Naomi's phone number by simply writing $1 == "Naomi" {print $2} which means if the first field matches Naomi, then print the second field. Now you're an AWK programmer! If you typed that program into AWK and presented it with a file that had names and phone numbers, then it would print 1234 as Naomi's phone number.
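The phone-number program can be tried from a shell roughly as follows. This is a minimal sketch: the second record ('Brian 5678') and the shell variable names are assumptions added for illustration, while the AWK program itself is the one quoted in the interview.

```shell
# Sample input: the 'Naomi 1234' line from the text plus one
# hypothetical extra record so there is something to filter out.
phones='Naomi 1234
Brian 5678'

# If the first field equals "Naomi", print the second field.
result=$(printf '%s\n' "$phones" | awk '$1 == "Naomi" { print $2 }')

echo "$result"   # prints 1234
```

Only the line whose first field is Naomi triggers the action, so the other record produces no output.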

A typical AWK program would have several pattern-action statements. The patterns can be Boolean combinations of strings and numbers; the actions can be statements in a C-like programming language.
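A program with several pattern-action statements might look like the following sketch. The grade records, thresholds, and labels are hypothetical; the point is only to show a numeric pattern, a Boolean combination of a number test and a string test, an action with no pattern (run for every line), and the special END pattern (run after the last line).

```shell
# Hypothetical grade records: a student name followed by a score.
grades='alice 90
bob 58
carol 77'

report=$(printf '%s\n' "$grades" | awk '
$2 >= 90                  { print $1, "honors" }   # numeric pattern
$2 < 60 || $1 == "carol"  { print $1, "review" }   # Boolean combination
                          { total += $2 }          # no pattern: every line
END                       { print "average", total / NR }
')

printf '%s\n' "$report"
```

Each input line is tested against every pattern in order, so a single line can trigger more than one action. With the data above the output is 'alice honors', 'bob review', 'carol review', and 'average 75'.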

AWK became popular since it was one of the standard programs that came with every UNIX system.

What are you most proud of in the development of AWK?

AWK was developed by three people: me, Brian Kernighan and Peter Weinberger. Peter Weinberger was interested in what Brian and I were doing right from the start. We had created a grammatical specification for AWK but hadn't yet created the full run-time environment. Weinberger came along and said 'hey, this looks like a language I could use myself', and within a week he created a working run time for AWK. This initial form of AWK was very useful for writing the data processing routines that we were all interested in but more importantly it provided an evolvable platform for the language.

One of the most interesting parts of this project for me was that I got to know how Kernighan and Weinberger thought about language design: it was a really enlightening process! With the flexible compiler construction tools we had at our disposal, we very quickly evolved the language to adopt new useful syntactic and semantic constructs. We spent a whole year intensely debating what constructs should and shouldn't be in the language.

Language design is a very personal activity and each person brings to a language the classes of problems that they'd like to solve, and the manner in which they'd like them to be solved. I had a lot of fun creating AWK, and working with Kernighan and Weinberger was one of the most stimulating experiences of my career. However, I also learned I would not want to get into a programming contest with either of them! Their programming abilities are formidable.

Interestingly, we did not intend the language to be used except by the three of us. But very quickly we discovered lots of other people had the need for the routine kind of data processing that AWK was good for. People didn't want to write hundred-line C programs to do data processing that could be done with a few lines of AWK, so lots of people started using AWK.

For many years AWK was one of the most popular commands on UNIX, and today, even though a number of other similar languages have come on the scene, AWK still ranks among the top 25 or 30 most popular programming languages in the world. And it all began as a little exercise to create a utility that the three of us would find useful for our own use.

How do you feel about AWK being so popular?

I am very happy that other people have found AWK useful. And not only did AWK attract a lot of users, other language designers later used it as a model for developing more powerful languages.

About 10 years after AWK was created, Larry Wall created a language called Perl, which was patterned after AWK and some other UNIX commands. Perl is now one of the most popular programming languages in the world. So not only was AWK popular when it was introduced but it also stimulated the creation of other popular languages.

AWK has inspired many other languages as you've already mentioned: why do you think this is?

What made AWK popular initially was its simplicity and the kinds of tasks it was built to do. It has a very simple programming model. The idea of pattern-action programming is very natural for people. We also made the language compatible with pipes in UNIX. The actions in AWK are really simple forms of C programs. You can write a simple action like {print $2} or you can write a much more complex C-like program as an action associated with a pattern. Some Wall Street financial houses used AWK when it first came out to balance their books because it was so easy to write data-processing programs in AWK.
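The fit with UNIX pipes can be sketched as below. The ledger data and variable names are hypothetical; sort and tail are the standard UNIX utilities. AWK reads from standard input and writes to standard output, so it can sit in the middle of a pipeline like any other filter.

```shell
# Hypothetical ledger: an expense name followed by an amount.
ledger='rent 900
food 250
phone 60'

# AWK extracts the amounts; sort and tail then pick the largest one.
biggest=$(printf '%s\n' "$ledger" | awk '{ print $2 }' | sort -n | tail -n 1)

echo "$biggest"   # prints 900
```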

AWK turned a number of people into programmers because the learning curve for the language was very shallow. Even today a large number of people continue to use AWK, saying languages such as Perl have become too complicated. Some say Perl has become such a complex language that it's become almost impossible to understand the programs once they've been written.

Another advantage of AWK is that the language is stable. We haven't changed it since the mid-1980s. And there are also lots of other people who've implemented versions of AWK on different platforms such as Windows.

How did you determine the order of initials in AWK?

This was not our choice. When our research colleagues saw the three of us in one or another's office, they'd walk by the open door and say 'AWK! AWK!'. So, we called the language AWK because of the good natured ribbing we received from our colleagues. We also thought it was a great name, and we put the AUK bird picture on the AWK book when we published it.

What did you learn from developing AWK that you still apply in your work today?

My research specialties include algorithms and programming languages. Many more people know me for AWK as they've used it personally. Fewer people know me for my theoretical papers even though they may be using the algorithms in them that have been implemented in various tools. One of the nice things about AWK is that it incorporates efficient string pattern matching algorithms that I was working on at the time we developed AWK. These pattern matching algorithms are also found in other UNIX utilities such as EGREP and FGREP, two string-matching tools I had written when I was experimenting with string pattern matching algorithms.

What AWK represents is a beautiful marriage of theory and practice. The best engineering is often built on top of a sound scientific foundation. In AWK we have taken expressive notations and efficient algorithms founded in computer science and engineered them to run well in practice.

I feel you gain wisdom by working with great people. Brian Kernighan is a master of useful programming language design. His basic precept of language design is to keep a language simple, so that a language is easy to understand and easy to use. I think this is great advice for any language designer.

Have you had any surprises in the way that AWK has developed over the years?

One Monday morning I walked into my office to find a person from the Bell Labs micro-electronics product division who had used AWK to create a multi-thousand-line computer-aided design system. I was just stunned. I thought that no one would ever write an AWK program with more than a handful of statements. But he had written a powerful CAD development system in AWK because he could do it so quickly and with such facility. My biggest surprise is that AWK has been used in many different applications that none of us had initially envisaged. But perhaps that's the sign of a good tool, as you use a screwdriver for many more things than turning screws.

Do you still work with AWK today?

Since it's so useful for routine data processing I use it daily. For example, I use it whenever I'm writing papers and books. Because it has associative arrays, I have a simple two-line AWK program that translates symbolically named figures and examples into numerically encoded figures and examples; for instance, it translates 'Figure AWK-program' into 'Figure 1.1'. This AWK program allows me to rearrange and renumber figures and examples at will in my papers and books. I once saw a paper that had a 1000-line C program that had less functionality than these two lines of AWK. The economy of expression you can get from AWK can be very impressive.
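The interview does not show the two-line program, so the following is only a sketch of how associative arrays make this kind of renumbering short, not a reconstruction of it. It assumes that each symbolic label is a single word following the word 'Figure', that labels carry no trailing punctuation, that figures are numbered in order of first appearance, and that the chapter prefix is '1.'; the draft text and the names num and count are hypothetical.

```shell
# Hypothetical draft text using symbolic figure labels.
draft='See Figure AWK-program for the code
Figure grep-demo shows the older tool
Compare Figure AWK-program again'

numbered=$(printf '%s\n' "$draft" | awk '
{
    for (i = 1; i < NF; i++)
        if ($i == "Figure") {
            label = $(i + 1)
            # First time a label is seen, give it the next number.
            if (!(label in num)) num[label] = ++count
            # Replace the symbolic label with "1." plus its number.
            $(i + 1) = "1." num[label]
        }
    print
}')

printf '%s\n' "$numbered"
```

The array num is indexed by the label string itself, so no table has to be declared or searched by hand. Here 'AWK-program' becomes 1.1 in both places it appears and 'grep-demo' becomes 1.2; reordering the figures in the draft changes the numbers automatically.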

How has being one of the three creators of AWK impacted your career?

As I said, many programmers know me for AWK, but the computer science research community is much more familiar with my theoretical work. So I initially viewed the creation of AWK as a learning experience and a diversion rather than part of my regular research activities. However, the experience of implementing AWK has greatly influenced how I now teach programming languages and compilers, and software engineering.

What I've noticed is that some scientists aren't as well known for their primary field of research by the world at large as they are for their useful tools. Don Knuth, for example, is one of the world's foremost computer scientists, a founder of the field of computer algorithms. However, he developed a language for typesetting technical papers, called TeX. This wasn't his main avenue of research but TeX became very widely used throughout the world by many scientists outside of computer science. Knuth was passionate about having a mathematical typesetting system that could be used to produce beautiful looking papers and books.

Many other computer science researchers have developed useful programming languages as a by-product of their main line of research as well. As another example, Bjarne Stroustrup developed the widely used C++ programming language because he wanted to write network simulators.

Would you do anything differently in the development of AWK looking back?

One of the things that I would have done differently is instituting rigorous testing as we started to develop the language. We initially created AWK as a 'throw-away' language, so we didn't do rigorous quality control as part of our initial implementation.

I mentioned to you earlier that there was a person who wrote a CAD system in AWK. The reason he initially came to see me was to report a bug in the AWK compiler. He was very testy with me, saying I had wasted three weeks of his life, as he had been looking for a bug in his own code only to discover that it was a bug in the AWK compiler! I huddled with Brian Kernighan after this, and we agreed we really needed to do something differently in terms of quality control. So we instituted a rigorous regression test for all of the features of AWK. From then on, any of the three of us who put a new feature into the language first had to write a test for it.

I have been teaching the programming languages and compilers course at Columbia University for several years. The course has a semester-long project in which students work in teams of four or five to design their own innovative little language and to make a compiler for it.

Students coming into the course have never looked inside a compiler before, but in all the years I've been teaching this course, never has a team failed to deliver a working compiler at the end of the course. All of this is due to the experience I had in developing AWK with Kernighan and Weinberger. In addition to learning the principles of language and compiler design, the students learn good software engineering practices. Rigorous testing is something students do from the start. The students also learn the elements of project management, teamwork, and communication skills, both oral and written. So from that perspective AWK has significantly influenced how I teach programming languages and compilers and software development.