BEST·BOOKS
+ MENU
← Back to Software Tools in Pascal

AI Study Notebook AI-generated

Software Tools in Pascal

Brian Kernighan and P. J. Plauger

Key points Not available
On this page

Software Tools in Pascal — Chapter-by-Chapter Outline

Author: Brian W. Kernighan and P. J. Plauger First published: 1981 Edition covered: First and only edition, Addison-Wesley, 1981 (ISBN 0-201-10342-7). The Pascal version is a direct rewrite of the authors' earlier Software Tools (1976, in Ratfor/Fortran), retaining the same chapter structure and program set but recasting everything in standard Pascal with explicit attention to portability across Pascal compilers.


Central thesis

Good software is built from good tools — small, well-designed programs that do one thing well, communicate through standard streams, and compose into larger solutions without modification. Kernighan and Plauger argue that the craft of writing such tools is learnable, transferable, and language-independent: the same habits of mind that produce a clean character-counting utility produce a clean macro processor, and both are products of disciplined top-down design, systematic testing, and an obsession with portability.

The book's method is pedagogical-by-construction. Rather than describing principles in the abstract, the authors build a progression of real, useful Unix-style tools in Pascal — from a one-page character copier to a full macro processor — and let the design lessons emerge from the code. Each program is a working artifact the reader can compile and use; each exercise asks the reader to extend or repair it. The philosophy is that programmers learn to write good software by reading and writing good software, not by reading about it.

How do you write programs that are useful, reliable, portable, and composable — and how do you teach others to do the same?


Chapter 1 — Getting Started

Central question

What is the simplest possible set of programs that demonstrates the discipline of writing good tools, and what fundamental design principles do even the simplest programs reveal?

Main argument

The filter model and standard I/O

The chapter opens by establishing the book's foundational architectural choice: every program in the book reads from standard input and writes to standard output. This deliberate constraint — which the authors acknowledge some systems do not make easy — is what makes programs composable. A character copier, a word counter, and a tab expander can be chained together precisely because none of them cares where its input comes from or where its output goes.

The five programs of Chapter 1 are chosen for minimal complexity, not for triviality. copyprog (a character-by-character copy of input to output, analogous to Unix cat) establishes the basic idiom: a loop over getc/putc calls, a NEWLINE character as the line terminator, and the special sentinel value ENDFILE to signal end-of-input. The character-level I/O abstraction, rather than Pascal's built-in file operations, is introduced immediately and used throughout the entire book — a design decision the authors defend on portability grounds.

Counting programs and incremental complexity

charcount adds a single accumulator to the copy loop; linecount tests for NEWLINE; wordcount introduces a simple state machine (in-word / not-in-word) to count words defined as maximal non-blank, non-tab, non-newline sequences. The progression from copyprog to wordcount is deliberate: each program adds one conceptual mechanism — accumulation, line detection, state — without changing the underlying I/O skeleton. The reader sees that structural regularity before encountering more complex tools.

detab and the first taste of parameterization

detab replaces tab characters with spaces, using tab stops spaced every n columns (the book uses a default of 4). This requires the program to track its current column position — the first program in the book that maintains non-trivial internal state. The authors introduce the concept of a tab stop array and show how to compute whether a given column is a tab stop. Notably, the include mechanism needed to share the tab stop definition between detab and entab (Chapter 2) is alluded to but not available until Chapter 3, giving the reader an early preview of a problem that will be solved later.

Documentation as a first-class deliverable

Each program is presented with a brief but complete manual page in Unix style: name, synopsis, description, and notes. The authors argue explicitly that documentation is part of the program, not an afterthought, and that the discipline of writing a one-paragraph description of what a program does before writing the code is a reliable filter against muddled designs.

Key ideas

  • Standard input and standard output as the universal interface; programs that respect this constraint compose without modification.
  • A portable character I/O layer (getc/putc, ENDFILE, NEWLINE) replacing Pascal's built-in file operations — necessary because Pascal's standard I/O varies across compilers.
  • The simplest useful program is a character copier; everything else adds mechanism on top of this skeleton.
  • State machines (in-word / out-of-word) as the natural structure for programs that must track context across characters.
  • Column tracking as a primitive that recurs throughout the book whenever tab-sensitive output is required.
  • Writing the program's description before the code as a design discipline, not just a documentation chore.
  • Exercises as integral to learning: each chapter section ends with exercises that extend or generalize the tool just built.

Key takeaway

The simplest tools establish the foundational idioms — standard I/O, character-level abstraction, incremental state — that all subsequent, more complex tools inherit.


Chapter 2 — Filters

Central question

What family of text-transformation programs can be built entirely as filters (reading stdin, writing stdout), and what does building them teach about abstraction, portability, and small domain-specific languages?

Main argument

The filter as a reusable program unit

A filter is any program that reads a stream of characters from standard input, transforms it in some way, and writes the result to standard output. The filter abstraction is powerful not because individual filters are complex but because filters compose: piping the output of compress into expand yields the identity transformation, and piping translit into compress yields a combined operation that neither tool anticipated. The chapter builds six filters of increasing sophistication and uses each to sharpen the reader's sense of what makes a filter design good or bad.

entab: the companion to detab

entab is the inverse of detab: it replaces runs of spaces with the minimum number of tabs and spaces that produce the same column layout. Its structure mirrors detab almost exactly — both maintain a column counter and a tab-stop array — which gives the authors an opportunity to discuss code reuse. The include mechanism for sharing tab stop definitions is previewed here; it will be implemented properly in Chapter 3.

overstrike: historical I/O abstraction

overstrike handles the historical convention of overstriking: some terminals and printers rendered bold or underlined text by printing a character, backspacing, and printing again (or printing an underscore). Overstrike reads a stream containing embedded backspaces and converts it into multiple output lines, one per "layer" of characters. This program is less practically relevant in 1981 than the others but serves as an exercise in managing multiple simultaneous output streams and in thinking carefully about what a "character" means at the byte level.

compress and expand: run-length encoding

compress encodes repeated characters as a count followed by the character (a simple form of run-length encoding), and expand reverses the operation. Together they demonstrate the general pattern of paired encode/decode filters and introduce the question of how to choose an encoding that is unambiguous — that is, how to ensure the decoder can always determine where one encoded run ends and the next begins. The authors choose a simple scheme: runs of two or more identical characters are encoded as count character, while single characters pass through unchanged. The pair also introduces the idea that a filter's output should itself be valid input to a related filter.

echo: introducing command-line arguments

echo is trivially simple as a text transformer (it just writes its arguments to stdout separated by spaces, followed by a newline) but introduces something new: processing command-line arguments. The getarg primitive, which retrieves the nth command-line argument as a string, is introduced here and will be used in nearly every subsequent program. The authors discuss argument conventions — how to number arguments, what to do if too few are supplied, whether flags should be prefixed with a dash — and establish the conventions the rest of the book follows.

translit: a small domain-specific language

translit is the chapter's most sophisticated program and the one with the most lasting design lessons. It takes two character-set specifications as arguments and translates characters in the first set to corresponding characters in the second, deleting characters that appear in the first set but not the second (if the second is shorter). The power of translit comes from its argument language: a-z denotes the range of lowercase letters, [] encloses an explicit set, and a special escape character handles characters that cannot be typed directly. This miniature language — a domain-specific language for character-set specification — allows the user to express complex transformations concisely. The authors use translit to introduce the idea that a well-designed argument syntax can multiply a program's utility far beyond what enumeration would allow.

Key ideas

  • Filters compose: the output of one filter is valid input to another, enabling pipelines not anticipated at design time.
  • Paired encode/decode filters (compress/expand) demonstrate symmetric design and the importance of unambiguous encoding.
  • Command-line argument conventions — numbering, validation, error reporting — established by echo and carried through the book.
  • Domain-specific languages (the translit argument syntax) as a design pattern: a small, well-defined notation that lets users express complex operations concisely.
  • Code reuse between entab and detab previews the need for the include mechanism developed in Chapter 3.
  • The choice of what to include in a filter's "language" is a design decision: too little makes the tool inflexible; too much makes it a programming language.

Key takeaway

Filters multiply their utility through composition and through small, well-designed argument languages; the translit program shows that a few metacharacters can make a simple tool vastly more expressive.


Chapter 3 — Files

Central question

How do programs that must handle named files, multiple input sources, and collections of files differ from pure filters, and what mechanisms — file inclusion, file comparison, concatenation, and archiving — are needed to build a useful file-handling toolkit?

Main argument

Beyond standard I/O: named files and argument-driven I/O

The programs in Chapters 1 and 2 read only from standard input. Real tools often need to operate on named files passed as arguments, open multiple files in sequence, and report errors per file rather than globally. Chapter 3 introduces the portable file-open and file-close primitives (fopen, fclose, xopen, xclose) that abstract over Pascal's non-standard file-handling mechanisms, and uses them in every program in the chapter.

compare: file differencing

compare reads two files in parallel and reports lines that differ between them, printing the line number and both versions. The implementation introduces the challenge of reading from two files simultaneously, managing two "current line" buffers, and deciding what "end of one file but not the other" means. The program is a direct precursor to Unix diff, though simpler: it reports only exact-line differences without computing a minimal edit distance.

include: the file-inclusion preprocessor

include reads a file and, whenever it encounters a line of the form #include filename, replaces that line with the contents of the named file (recursively). This is the mechanism that Chapter 1 promised for sharing tab-stop definitions between detab and entab, now delivered. The implementation is notable for its use of a stack of open file handles to manage nested includes, and for the explicit discussion of error handling: what to do if the included file does not exist, or if includes are nested too deeply.

The include tool is also significant as an architectural example: rather than building file inclusion into every program that might need it, the authors factor it out as a standalone preprocessor. Programs that need it invoke include; programs that do not are unaffected.

concat: file concatenation with argument-driven input

concat concatenates one or more files, writing all of them in sequence to stdout. If no file arguments are given, it reads from standard input — following the Unix convention that makes programs behave sensibly both alone and in pipelines. The implementation demonstrates the standard idiom for argument-driven I/O: loop over arguments, open each file, copy it, close it; if no arguments, copy standard input.

print: printing with headers

print is a filter that adds page headers (file name, page number, date) and handles pagination — inserting form feeds and resetting line counters at page boundaries. It introduces the concept of output state (current page number, current line on page, header format) that the format program of Chapter 7 will generalize into a full document formatter. The choice of page width and length as parameters rather than hard-coded constants illustrates the general principle: programs should not have magic numbers baked into them.

archive: a file-collection manager

archive is the chapter's most substantial program. It manages a single archive file containing multiple member files, supporting operations to create, list contents, extract, delete, update, and print members. The archive format is a simple sequential file with a header record for each member (name, size in bytes) followed by the member's content. The implementation demonstrates how to design a binary file format that is self-describing, how to scan a sequential file for a named member, and how to update an archive by rewriting it (since Pascal does not support in-place modification of arbitrary file positions).

The archive program is pedagogically important because it is the first program in the book that has multiple distinct subcommands — the same program invoked with different first arguments (create, list, extract, etc.) performs different operations. This models the design of tools like Unix ar and tar.

Key ideas

  • Portable file-open/close primitives as a necessary abstraction layer over non-standard Pascal file handling.
  • Argument-driven I/O: programs that loop over file arguments and fall back to stdin when none are given compose cleanly with pipelines.
  • include as a standalone preprocessor rather than a built-in feature: factoring common mechanisms into separate tools.
  • Stack-based recursion for nested file includes.
  • The archive program's design of a self-describing binary format with header records.
  • Subcommand dispatch (create/list/extract/delete/update) as a pattern for multi-operation tools.
  • Error reporting with file names and line numbers so the user can locate problems.

Key takeaway

Programs that handle files need a portable I/O abstraction layer and a consistent convention for argument-driven vs. stdin-driven input; the archive program shows how a single executable can provide a family of related operations through subcommand dispatch.


Chapter 4 — Sorting

Central question

How do you design a family of sorting tools that work correctly on arbitrary input sizes, handle the full range of sort keys and comparison orders, and scale from in-memory to external (disk-based) sorting?

Main argument

The progression of sorting algorithms

The chapter opens with the simplest correct sorting algorithm — bubble sort — not because it is efficient but because its correctness is transparent and it establishes the data structure (an array of strings) and comparison function that all subsequent variants use. The authors then replace bubble sort with Shell sort, which is dramatically faster in practice while remaining straightforward to implement and understand. Shell sort works by making multiple passes over the array with decreasing gap sizes, eventually reducing to a gap of 1 (insertion sort). The key insight is that by sorting elements far apart first, Shell sort reduces the amount of work needed in later, fine-grained passes.

Quicksort and its implementation subtleties

The chapter's main sorting algorithm is quicksort, presented both for its practical efficiency and as an opportunity to discuss implementation subtleties: pivot selection (using the middle element to avoid worst-case behavior on already-sorted input), partition correctness, and the recursive structure. The authors are careful to present a version that handles edge cases — empty subarrays, arrays of size one, arrays with many duplicate keys — and to test it against adversarial inputs. The quicksort implementation uses Pascal's recursive procedure calls and demonstrates how a clean recursive structure produces correct, readable code.

In-memory sort with a string table

A practical in-memory sort reads all input lines into a string table (a large character array with a separate array of pointers into it), sorts the pointer array without moving the strings themselves, and writes the sorted lines. This data-structure choice — separating the string content from the sort keys — is a recurring pattern in systems programming. The authors show how to size the string table, what to do if input exceeds it, and how to write the sorted output by iterating over the sorted pointer array.

External sort for large files

When input is too large to fit in memory, the sort must work in passes: a run-generation phase reads chunks, sorts them in memory, and writes sorted runs to temporary files; a merge phase merges the runs into a single sorted output. The chapter implements both phases and discusses the merge algorithm, including a k-way merge using a priority queue (the authors use a simple selection over the heads of k runs rather than a heap). External sort is the most algorithmically complex program in the book and the one that most directly engages with the constraints of 1981-era systems (limited RAM, relatively slow disk I/O).

unique: removing duplicates from sorted input

unique reads a sorted stream and removes consecutive duplicate lines — a simple filter whose practical importance depends entirely on composing it with sort. The authors use unique as an example of a program whose design is determined by its place in a pipeline: it assumes sorted input (a precondition it cannot check) and produces deduplicated output that other tools can consume.

kwic: keyword-in-context indexing

kwic (keyword-in-context) takes a set of input lines and produces an index in which each line appears once for each significant word it contains, rotated so that the keyword appears at a fixed column. The output, when sorted, produces a concordance useful for finding all occurrences of a word in context. The program requires sorting rotated versions of lines without actually storing all the rotations, accomplished by storing rotation offsets rather than copies. unrotate reconstructs the original lines from the rotated, sorted output. The kwic/unrotate pair is the most demanding end-to-end pipeline in the book and demonstrates the power of composing small, single-purpose tools.

Key ideas

  • The sorting algorithm progression (bubble → Shell → quicksort) as a pedagogical sequence: correctness first, then efficiency.
  • Shell sort's gap sequence as a practical improvement that does not require a complex proof of correctness.
  • Quicksort pivot selection as a design decision that prevents worst-case behavior on common inputs.
  • Separating string content from sort keys (pointer array over string table) as an efficiency pattern.
  • Run-generation and merge-phase architecture for external sorting.
  • unique as a filter that assumes a precondition (sorted input) rather than establishing it — a legitimate design choice when composing tools.
  • kwic/unrotate as a showcase of the filter-pipeline model's expressive power.

Key takeaway

A well-designed sort tool progresses from simple correctness to practical efficiency, handles both in-memory and external cases, and composes with filters like unique and kwic to solve index-building problems that no single program could solve alone.


Chapter 5 — Text Patterns

Central question

How do you write programs that search for and transform text based on patterns rather than literal strings, and how do you design a pattern language that is expressive, consistent, and implementable?

Main argument

find: the grep precursor

find reads lines from stdin (or named files) and prints those that match a given pattern. It is the direct predecessor of Unix grep. The initial version matches literal strings; subsequent refinements introduce metacharacters that make the pattern language far more expressive. The authors discuss why a pattern-based tool is fundamentally more useful than a string-search tool: the programmer specifies what kind of text to find, not exactly which text.

The pattern language

The book's pattern language uses these metacharacters:

  • . — matches any single character
  • * — matches zero or more occurrences of the preceding pattern element
  • [...] — matches any character in the specified set; [^...] matches any character not in the set
  • % — matches the beginning of a line (the book uses % rather than ^ to avoid conflict with Pascal's comment syntax and some terminal conventions)
  • $ — matches the end of a line
  • @ — escapes the next metacharacter (analogous to \ in modern regex)

This language is a recognizable subset of what would become POSIX regular expressions, designed to be implementable with a simple recursive descent matcher and to cover the most common text-search use cases.

Pattern compilation and matching

The authors implement pattern matching in two stages: compilation converts the textual pattern into an internal representation (a compact array encoding each pattern element and its type), and matching applies the compiled pattern to a text string. The two-stage design allows the same compiled pattern to be applied to many lines efficiently, which matters for the find program (which applies the pattern to every input line). The matching algorithm handles anchoring (patterns that must match at the beginning or end of a line), character classes, and the * quantifier's greedy behavior.

change: search-and-replace

change extends find with a replacement: for each line containing a match of the source pattern, it substitutes the matched portion with a replacement string. The replacement string may include & to refer to the entire matched text — the first appearance in the book of a backreference. The implementation of change requires the match function to return not just a yes/no answer but the start and end positions of the match within the line, so that the surrounding text can be reassembled around the substitution.

The chapter discusses the interaction between greedy matching and substitution: * matches as much as possible, which can lead to surprising behavior when the pattern is applied to a line with multiple possible match positions. The authors do not implement non-greedy matching but note the issue explicitly.

Key ideas

  • Pattern languages as a way to specify classes of text rather than specific strings.
  • The metacharacter set (., *, [...], %, $, @) as a minimal but expressive pattern vocabulary.
  • Two-stage design: compile the pattern once, apply the compiled form many times.
  • The matching algorithm's handling of anchoring, character classes, and *.
  • The & backreference in change as a simple but powerful substitution feature.
  • Greedy vs. non-greedy matching as a design choice with user-visible consequences.
  • Pattern matching as the enabling technology for the editor built in Chapter 6.

Key takeaway

A small, carefully chosen set of metacharacters gives a pattern language far greater expressive power than literal-string matching at the cost of a pattern compiler and a slightly more complex matching algorithm — a trade-off that pays off immediately in the editor chapter.


Chapter 6 — Editing

Central question

How do you build a line-oriented text editor from the pattern-matching and file-handling primitives developed in earlier chapters, and what design decisions determine the editor's usability and extensibility?

Main argument

The line editor model

The edit program is a line-oriented text editor in the tradition of Unix ed. It maintains the text of a file as an array of lines in memory (the edit buffer), reads commands from the user one line at a time, modifies the buffer according to each command, and writes the buffer back to a file on exit or explicit save. The line-oriented model — as opposed to a screen-oriented editor — is both a product of 1981-era terminals and a deliberate pedagogical choice: a line editor is complex enough to demonstrate real design challenges (address parsing, command dispatch, undo) while remaining small enough to implement and understand completely.

Address specification

The editor's command language allows lines to be addressed by number (5), by pattern (/pattern/), by the current line (.), by the last line ($), or by a range (1,$, .,+3). The address parser is the most syntactically complex component of the editor and occupies a significant portion of the chapter. The authors use a recursive descent approach: they parse addresses before dispatching to the command handler, separating address computation from command execution. This separation makes it easy to add new commands without touching the address parser.

Command set and dispatch

The editor supports the standard ed-like command set: p (print), d (delete), a (append after), i (insert before), c (change — delete and replace), s (substitute, using the pattern language from Chapter 5), r (read a file into the buffer), w (write the buffer to a file), q (quit), and = (print current line number). Command dispatch is a simple character-switch: read the first non-blank character after the address, dispatch to the corresponding procedure.

The s (substitute) command integrates the pattern compiler and match function from Chapter 5, demonstrating how the earlier chapters' work pays off in a real application. The editor's substitute command supports the g (global) flag to replace all occurrences on a line and the p flag to print the result.

Global commands and pattern-driven operation

The g global command applies a sequence of other commands to every line that matches a pattern — an enormously powerful extension that allows operations like "delete all blank lines" or "substitute across all matching lines" to be expressed as single commands. The implementation of global requires iterating over the buffer, marking matching lines, and then executing the command sequence on each marked line without corrupting the iteration. This is one of the more subtle implementation challenges in the chapter.

Design lessons from the editor

The editor chapter is the longest in the book and the one that most directly engages with software design at the architectural level. Key lessons include: keeping the address parser separate from command execution; designing commands to compose (global + substitute is more powerful than either alone); keeping the internal representation simple (an array of strings rather than a gap buffer or linked list) and accepting the performance limitations that entails; and writing each command as a self-contained procedure with a clean interface to the buffer and the address state.

Key ideas

  • The edit buffer as an in-memory array of line strings: simple, correct, and adequate for files that fit in memory.
  • Address parsing as a separate, self-contained component that feeds all commands.
  • Relative, absolute, pattern-based, and range-based addressing as a unified system.
  • Command dispatch via character-switch: clean, extensible, easy to read.
  • The s substitute command as the integration point for Chapter 5's pattern matching.
  • Global commands as the multiplicative interaction of address matching and command execution.
  • The deliberate choice of simplicity (array buffer) over efficiency (gap buffer), appropriate for a pedagogical implementation.

Key takeaway

A line editor is built from three separable components — address parsing, command dispatch, and buffer management — and the discipline of keeping them separate is what makes the editor extensible and its behavior predictable.


Chapter 7 — Formatting

Central question

How do you build a document formatter — a program that takes marked-up text and produces filled, justified, paginated output — and what is the minimal command set needed to make it genuinely useful?

Main argument

The formatter model

The format program reads a stream of text lines, some of which are formatting commands (lines beginning with a dot, like .fi, .nf, .sp, .bp, .ce), and produces formatted output: filled paragraphs, page headers, justified lines, centered text, and controlled spacing. This is the direct predecessor of Unix roff and troff. The formatter represents a significant increase in program complexity over any previous chapter: it must maintain a substantial amount of output state (current line being assembled, current fill mode, current page position, output line width, number of blank lines pending) and update it correctly in response to each command and each word of input.

Fill mode and justification

In fill mode (the default), the formatter collects words from the input into an output line, breaking to a new line only when the next word would exceed the line width. When justification is enabled, each completed line (except the last line of a paragraph) is padded with extra spaces between words so that it reaches exactly the desired line width. The padding algorithm distributes extra spaces as evenly as possible, alternating from which end spaces are added to avoid the visual "river" effect (columns of white space running diagonally through a paragraph).

No-fill mode and verbatim output

In no-fill mode (.nf), each input line becomes one output line, preserving whitespace and line breaks exactly. This mode is used for program listings, tables, and any text where the author's line breaks are meaningful. The formatter tracks fill mode as a boolean and switches between modes at each .fi (fill) or .nf (no-fill) command.

Spacing, pagination, and page headers

The formatter maintains a count of lines output on the current page. When a .bp (break page) command is encountered, or when the page fills, it outputs a form feed, resets the line counter, and prints the page header (which includes the page number, updated on each new page). The .sp n command outputs n blank lines (or as many as will fit on the current page without triggering a page break). The .ce n command centers the next n input lines.

The formatter's command set

The full command set implemented in the chapter includes: .fi (fill on), .nf (fill off), .br (break — flush the current partial line without justification), .sp n (space n lines), .bp (begin new page), .ce n (center next n lines), .ls n (line spacing — single or double), .pl n (page length), .ll n (line length), and .ti n (temporary indent — indent the next output line by n spaces). This set is deliberately minimal — the authors note that a production formatter would need dozens more commands — but sufficient for formatting the programs and documentation in the book itself.

Key ideas

  • The formatter as a state machine: every input token (word, blank line, command) transitions the formatter's output state.
  • Fill mode as the default: most text should be reflowed to fit the output width; no-fill is the exception for structured content.
  • Justification via space redistribution: distributing padding evenly and alternating from which end prevents visual rivers.
  • Page headers and pagination as output state, updated automatically at page boundaries.
  • The minimal command set as a design philosophy: implement the 10 commands that handle 90% of documents rather than the 100 commands that handle everything.
  • The formatter is itself formatted with format: the book's source was processed by the tools it describes.

Key takeaway

A document formatter is a state machine that translates marked-up text into filled, justified, paginated output; its complexity lies not in any single operation but in correctly composing a dozen pieces of output state in the presence of arbitrary command sequences.


Chapter 8 — Macro Processing

Central question

How do you build a general-purpose macro processor — a program that performs textual substitution driven by user-defined patterns — and how do macro arguments, built-in operations, and recursive expansion interact?

Main argument

What a macro processor does

A macro processor reads a stream of text and, whenever it encounters a defined macro name (possibly followed by parenthesized arguments), replaces it with the macro's body, optionally with the arguments substituted for formal parameters in the body. The result may itself contain macro calls, which are expanded recursively. This is the technology behind m4 (Unix's general-purpose macro processor), C preprocessor #define, and TeX's \def command. The macro program built in this chapter is a scaled-down but fully functional implementation.

The define built-in

The fundamental operation is define(name, body), which associates the string name with the string body in the macro table. After this definition, every occurrence of name in the input (not inside a quoted string) is replaced with body. The macro table is implemented as a simple hash table (or linear search for small tables) of name-body pairs.

Argument substitution

Macros can take arguments: define(greet, Hello, $1!) defines a macro that, when called as greet(World), expands to Hello, World!. The $1, $2, etc. placeholders in the body are replaced by the corresponding actual arguments at expansion time. The argument-parsing logic — finding the matching closing parenthesis, splitting arguments at commas, handling nested parentheses — is a non-trivial piece of the implementation and a recurring theme in language processors.

Recursive expansion and quoting

Because macro expansion may produce text that itself contains macro calls, the expander must apply itself recursively to its own output. This creates a potential for infinite loops (a macro that expands to a call to itself) but also the power to write macros that produce structured text. The quoting mechanism ([ and ] in the book's implementation, though the exact characters vary) suppresses expansion of the quoted text, allowing macro bodies to contain literal uses of macro names.

Built-in operations

The macro program includes several built-in macros that cannot be defined by the user:

  • define(name, body) — define a macro
  • undefine(name) — remove a definition
  • ifdef(name, yes, no) — conditional: expands to yes if name is defined, no otherwise
  • len(string) — returns the length of its argument as a decimal string
  • substr(string, start, length) — returns a substring
  • expr(arithmetic-expression) — evaluates a simple integer arithmetic expression and returns the decimal result

The expr built-in is the most complex: it implements a simple expression evaluator supporting +, −, ×, ÷, and parentheses, with the result returned as a string for further expansion. This allows macros to perform numeric computation, enabling uses like automatic section numbering.

Design of the expansion engine

The expander reads input tokens (words, punctuation, parentheses), looks up each token in the macro table, and if found, reads the argument list, substitutes arguments into the body, and pushes the expanded text back onto the input for re-scanning. The use of a pushback buffer — a stack of characters to be re-read before fresh input — is the key mechanism that implements recursive expansion without explicit recursion in the expander itself.

Key ideas

  • Macro expansion as systematic textual substitution: the macro processor does not understand the text it processes, only its syntactic structure (names and parenthesized argument lists).
  • The define built-in as the user's interface to the macro table.
  • Argument substitution via $1, $2 placeholders — a minimal but sufficient parameter mechanism.
  • Quoting to suppress expansion: necessary for writing macros that produce macro definitions.
  • Recursive expansion via a pushback buffer: the output of one expansion becomes the input to the next.
  • Built-in macros (ifdef, expr) that extend the macro language with control flow and arithmetic.
  • The macro processor as the book's final and most general tool: a programmable text transformer.

Key takeaway

A macro processor is a recursive text substitution engine whose power comes from the combination of user-defined patterns, argument substitution, quoting, and built-in operations — and the pushback-buffer design cleanly separates the expansion mechanism from the macro table management.


The book's overall argument

  1. Chapter 1 (Getting Started) — establishes the foundational idioms: standard I/O, portable character-level primitives, and simple state machines; the simplest programs set the template all later programs follow.
  2. Chapter 2 (Filters) — extends the filter model to transformation tools and introduces domain-specific argument languages; demonstrates that composition multiplies utility.
  3. Chapter 3 (Files) — adds portable multi-file I/O, the include preprocessor, and the archive manager; shows how standalone tools can replace built-in language features.
  4. Chapter 4 (Sorting) — builds a complete sorting toolkit from naive to external, showing the progression from correctness to efficiency; the kwic/unrotate pair proves the pipeline model's expressive power at scale.
  5. Chapter 5 (Text Patterns) — introduces the pattern language and a two-stage compiler/matcher; pattern matching is the enabling technology for the editor and is reused directly in Chapter 6.
  6. Chapter 6 (Editing) — assembles address parsing, command dispatch, and buffer management into a full line editor; demonstrates that separating concerns makes a complex program tractable.
  7. Chapter 7 (Formatting) — builds a document formatter as a state machine over a minimal command set; the formatter is the first tool complex enough to process its own source.
  8. Chapter 8 (Macro Processing) — crowns the progression with a recursive text-substitution engine; the macro processor is the most general tool in the book and can be used to pre-process input to any of the earlier tools.

Common misunderstandings

Misunderstanding: The book is about Pascal.

The book uses Pascal as its implementation language, but its subject is software design. The authors say explicitly in the preface that the lessons — top-down design, portability, composability, testing discipline — apply in any language. Readers who skip the book because they do not use Pascal, or who use it only as a Pascal tutorial, miss the point. The Pascal code is the medium, not the message.

Misunderstanding: The tools described are obsolete because Unix already provides grep, sort, ed, etc.

The book is not primarily a tools-delivery mechanism; it is a methods textbook. The fact that Unix ships with grep does not diminish the value of understanding how to build a pattern matcher from first principles, just as the existence of compiled math libraries does not make it useless to understand how numerical algorithms work. The programs are vehicles for design lessons.

Misunderstanding: The portable I/O layer is a workaround for 1981 Pascal deficiencies, not a design principle.

The portable I/O layer (getc, putc, ENDFILE) is partly motivated by Pascal's non-standard I/O across compilers, but the underlying principle — that programs should depend on an abstraction, not on a platform — is a permanent design principle. Modern equivalents include abstracting over file handles, network streams, and in-memory buffers behind a common interface.

Misunderstanding: Top-down design means writing the main procedure first and filling in the rest later.

The authors mean something more specific: decompose the problem into sub-problems whose specifications can be written down independently, choose names that make the main procedure read like a description of the algorithm, and implement sub-procedures only after the top-level structure is clear. This is a discipline for managing complexity, not a mechanical order for writing code.

Misunderstanding: Each program in the book is a finished, production-ready tool.

The programs are complete and tested but deliberately simplified. The format program does not handle footnotes or tables; the editor lacks undo; the macro processor does not handle all edge cases of quoting. The authors explicitly use exercises to point toward the missing features. The programs are teaching artifacts, not production software, though several are close enough to production quality to be directly useful.


Central paradox / key insight

The book's central insight is that constraining programs makes them more powerful. Specifically: by insisting that every program read from standard input and write to standard output — forgoing named files, GUI interfaces, and complex state — the authors create programs that combine with each other in ways that no designer anticipated. The constraint that seems to limit each tool is the very property that makes the toolkit as a whole more capable than any single general-purpose program could be.

This is the Unix philosophy stated as a design principle rather than a cultural observation: small programs with narrow interfaces compose into large solutions; large programs with rich interfaces do not compose at all. The paradox is that adding capability to an individual tool (by giving it file arguments, configuration files, and multiple modes) typically subtracts capability from the system of tools taken together.

"Programs that do one thing and do it well, and programs that work together, are more powerful than large programs that do many things."

The book does not state this as an aphorism but demonstrates it operationally: the kwic/sort/unrotate pipeline, the include preprocessor feeding into the formatter, the pattern-matching code from Chapter 5 appearing unchanged inside the editor in Chapter 6 — all of these are consequences of the single design discipline of keeping programs small and their interfaces simple.


Important concepts

Standard input / standard output (stdin/stdout)

The universal I/O interface: a program that reads only from stdin and writes only to stdout can be used anywhere in a pipeline without modification. The discipline of building every tool as a pure filter is the architectural foundation of the entire book.

Filter

A program that reads a character stream from stdin, transforms it in some way, and writes the result to stdout. Filters compose by connecting the stdout of one to the stdin of the next.

ENDFILE sentinel

The special character value (defined as a constant, distinct from any valid input byte) that getc returns at end-of-input. Using an explicit sentinel rather than a boolean flag unifies the "more input" and "end of input" cases in the same loop, making filter loops uniformly structured.

Portable I/O primitives (getc, putc, fopen, fclose)

A thin abstraction layer over Pascal's non-standard file I/O, designed to behave identically across all Pascal implementations. The rest of the book's programs depend only on these primitives, not on any compiler-specific file operations.

Top-down design

The practice of writing the main procedure of a program first in terms of named sub-procedures whose specifications are clear but whose implementations are deferred. The main procedure serves as a specification of the algorithm; the sub-procedures are then implemented and tested independently.

Portability

The property of code that compiles and runs correctly on multiple hardware architectures and operating systems without modification. In 1981, Pascal varied significantly across compilers; the book's I/O layer, use of symbolic constants for all magic values, and avoidance of non-standard features are all portability disciplines.

Pattern language

A small formal language for specifying classes of text strings, using metacharacters (., *, [...], %, $, @). More expressive than literal-string matching; the basis for find, change, and the editor's substitute command.

Pattern compilation

The two-stage design of pattern matching: first compile the textual pattern into a compact internal representation; then apply the compiled form to each input line. Separating compilation from matching allows the same compiled pattern to be used many times efficiently.

Run-length encoding

The compress/expand scheme: represent a run of n identical characters as the count n followed by the character, reducing the size of text with many repeated characters. Used in Chapter 2 as an example of paired encode/decode filters.

Shell sort

A sorting algorithm that makes multiple passes with decreasing gap sizes, performing insertion sort within each gap. Faster than bubble sort in practice for medium-sized inputs and simple enough to implement correctly, Shell sort serves as the first "efficient" sort in the Chapter 4 progression.

KWIC (Keyword-In-Context)

A concordance technique: rotate each input line so that each significant word appears at a fixed column position, sort the rotations, and the result is an index that groups all occurrences of each word together in context. The kwic/sort/unrotate pipeline in Chapter 4 is the book's most elaborate demonstration of tool composition.

Edit buffer

The in-memory representation of the file being edited: an array of line strings, addressed by line number. The simplest possible representation that supports all of the editor's operations.

Macro expansion

The process of replacing a macro invocation (a name, optionally with arguments) with the macro's body, with argument values substituted for formal parameters. Expansion is applied recursively, so the result of one expansion may trigger further expansions.

Pushback buffer

A stack of characters to be re-read before fresh input is consumed. In the macro processor, expanded text is pushed back onto the input for re-scanning, implementing recursive expansion without explicit recursion in the expander.

define

The fundamental built-in macro: define(name, body) associates name with body in the macro table, so that subsequent occurrences of name in the input are replaced by body. The macro table is the macro processor's entire state.


Primary book and edition information

Background and overview

Source code implementations and reimplementations

Related works

  • Kernighan, Brian W. and Plauger, P. J. The Elements of Programming Style. McGraw-Hill, 1974. — The companion volume on code clarity that informs the writing style of every program in this book.
  • Software Tools in Free Pascal — oberon07.com — notes on porting the programs to Free Pascal, with discussion of portability issues.

Additional chapter summaries and study resources

These are secondary summaries and should be used alongside, rather than instead of, the original book.