for Zlex Version 1.02.
Copyright (C) 1995 Zerksis D. Umrigar
Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.
Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one.
Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions.
In order to introduce Zlex, the process of scanning is reviewed and some terms are introduced. Zlex is compared with similar programs and the motivation for the development of yet another scanner generator is presented. An example is used to illustrate the operation of Zlex.
A scanner is a program or portion of a program which performs the task of partitioning an input stream into a token stream. Writing a scanner is a very common programming task: all applications which analyze some form of text input will usually contain some kind of scanner (even though it may not be identified as such). Such scanners are usually written by hand, typically in a procedural programming language like C or Pascal, where knowledge about the syntax of tokens is intimately interwoven into programming constructs which specify how to recognize those tokens. Even though scanners are usually programs of relatively modest complexity, maintaining non-trivial hand-written scanners can still be a demanding task.
A scanner-generator is a program which is given a formal specification of the syntax of tokens and automatically generates a scanner from those specifications. Since the specifications are largely declarative in that they specify only what constitutes a token without specifying how to recognize one, they are much easier to maintain than hand-written scanners. A pattern language based on regular expressions is the formalization used to specify the syntax of tokens for most scanner generators (see section Patterns). The efficiency of an automatically generated scanner can be comparable to that of a typical hand-written scanner.
Zlex is such a scanner generator, automatically transforming a scanner specification into a scanner program. It accepts a scanner specification given in a file referred to as the Zlex source file and generates a C code file referred to as the generated scanner file. The Zlex source file contains (among other things) patterns specifying the syntax of tokens. For each pattern there is also a corresponding action, consisting of arbitrary C code, which is to be executed when the input matches the pattern. These actions are copied verbatim into the generated scanner. The generated scanner file needs to be compiled and linked with the rest of the program and with the Zlex library to produce an executable program.
The generated scanner provides a function which is the main scanner function. Whenever this scanner function is called, it scans its input stream looking for a match with any of the specified patterns. If it finds such a match, it executes the action corresponding to the pattern. If the action terminates in a return, it returns to its caller; otherwise it continues scanning the input looking for the next match. If no pattern matches the current input character, then the scanner executes a predefined action which defaults to merely echoing the unmatched character to the standard output.
Once the generated scanner has recognized a token, the token is typically transformed into a small integer for processing by the rest of the program. A token typically has at least one attribute: its lexeme, which is the actual subsequence of the input stream corresponding to that token. The generated scanner allows its actions to access the lexeme for the current token.
Scanner-generators were popularized by the scanner-generator lex, which is distributed with the popular Unix operating system. Unfortunately, lex-generated scanners had the reputation of being less efficient than hand-written scanners. An attempt to remedy this efficiency problem resulted in flex (see section `Flex' in Flex - a scanner generator), which was extremely successful in attaining this stated goal. Zlex is largely upward compatible with both flex and lex. Its raison d'etre is multi-faceted:
Although lex and flex have a large number of features, they are not flexible enough to be used in certain situations.
Despite the area in which lex and flex specialize, the lexical specifications for their own input languages are severely restrictive in not allowing the free-form input typical of modern programming languages.
The performance of Zlex-generated scanners is comparable to those generated by flex, but its additional features enable tasks which would be very difficult if not impossible with flex.
The following Zlex program counts the number of lines, words and characters in its standard input, where a word is a maximal string of characters not containing a whitespace character (a whitespace character is defined to be either a space, tab or new-line).
001 /* Word-count program for stdin. */
002 %%
003 %{
004   unsigned cc= 0;  /* # of chars seen so far. */
005   unsigned wc= 0;  /* # of words seen so far. */
006   unsigned lc= 0;  /* # of lines seen so far. */
007 %}
008
009
010 [^\t \n]+  wc++; cc+= yyleng;
011 [\t ]+     cc+= yyleng;
012 \n+        lc+= yyleng; cc+= yyleng;
013 <<EOF>>    printf("%u %u %u\n", lc, wc, cc);
The above program consists of two sections separated by a line containing only %%. The first section is the declarations section which is used to declare Zlex and C entities (in the above program it is empty). The second section contains the patterns along with the corresponding C actions.
Lines enclosed within decorated braces %{ and %} are copied directly into the generated C file. In this example, the lines within decorated braces at the start of the second section are used to declare and initialize C variables local to the generated scanner function yylex. These variables are counters which keep track of the number of characters, words and newlines seen so far.
Line 10 in the second section consists of a pattern to match our specification of a word, followed by a C action. The `[' and `]' delimit a character-class which specifies a set of characters. A character-class is a regular expression which matches any character in that class. For example, [\t \n] matches any character which is a tab, blank or newline (Zlex allows C-style escape sequences starting with `\' within character-classes). The `^' at the beginning of a class denotes the negation of that character-class: hence [^\t \n] denotes any character except a tab, blank or newline, i.e. a non-whitespace character. The postfix operator `+' denotes one or more repetitions of the previous regular expression: hence [^\t \n]+ denotes a sequence of one or more non-whitespace characters. Since Zlex always prefers the longest possible match, the specified regular expression will match "a maximal string of characters not containing a whitespace character" -- namely a word.
The action for the first pattern simply increments the word count wc by 1 and increments the character count cc by the number of characters matched (the variable yyleng always contains the length of the current lexeme). Lines 11 and 12 handle blanks/tabs and newlines in a similar manner. Line 13 contains a special pattern which matches end-of-file and an action which prints out the values of the three counters.
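To make the effect of the three patterns concrete, the following is a hand-written C sketch of the same counting logic. It is not the code Zlex generates; the names counts, is_ws and count_text are hypothetical and chosen only for this illustration. Each branch consumes a maximal run of characters, mirroring the longest-match behavior of the corresponding pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Counters corresponding to lc, wc and cc in the Zlex program
 * (hypothetical type, for illustration only). */
struct counts {
    unsigned lines;
    unsigned words;
    unsigned chars;
};

/* A whitespace character as defined for this example:
 * space, tab or newline. */
int is_ws(char c)
{
    return c == ' ' || c == '\t' || c == '\n';
}

/* Count lines, words and characters in a NUL-terminated string. */
struct counts count_text(const char *text)
{
    struct counts n = {0, 0, 0};
    size_t i = 0;

    assert(text != NULL);
    while (text[i] != '\0') {
        size_t start = i;
        if (!is_ws(text[i])) {
            /* [^\t \n]+ : a maximal run of non-whitespace is one word. */
            while (text[i] != '\0' && !is_ws(text[i]))
                i++;
            n.words++;
        } else if (text[i] == '\n') {
            /* \n+ : every newline in the run counts as a line. */
            while (text[i] == '\n')
                i++;
            n.lines += (unsigned)(i - start);
        } else {
            /* [\t ]+ : blanks and tabs only add to the character count. */
            while (text[i] == ' ' || text[i] == '\t')
                i++;
        }
        /* Equivalent of cc += yyleng in every action. */
        n.chars += (unsigned)(i - start);
    }
    return n;
}
```

Calling count_text on a buffer holding the whole input and printing the three fields plays the role of the end-of-file action in the Zlex program.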
Assuming that the above program is in the file `wc.l', it can be compiled and executed using a sequence of commands similar to the following:
$ zlex wc.l -o wc.c
$ cc wc.c -lzlex -o wc
$ wc
'Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.
^D
=> 4 23 135
$
The option `-o' for both Zlex and the C-compiler cc allows naming the output file. The first line transforms the Zlex file `wc.l' to a C file `wc.c'. The second line compiles the C file into an executable, linking it with the Zlex library (which provides a default main program which merely calls the generated scanner function yylex). The third line runs the executable: the next six lines are input followed by an end-of-file (shown as a ^D). This is followed by a line which is the executable's output containing the number of lines, number of words and number of characters.
The enhancements provided by Zlex over lex and flex are the following:
yylineno feature of lex. The method used does not require the generated scanner to test each incoming character to see if it is a newline.
If <stdio> input functions are not used (`--stdio' option, see section Alphabetical Listing of all Options), then all Zlex generated scanners can operate interactively without any performance degradation.
When the main scanner function is entered, it initializes its data structures if it is the first time it has been called. It then enters a select-act loop, where it recognizes a pattern which matches a prefix of the current input, carries out the specified C action and then repeats the process on the unprocessed suffix of the input. If the action terminates in a return from the main scanner function, then when the scanner function is called again, it merely reenters the select-act loop.
The following possibilities arise as the scanner attempts to match its patterns with a prefix of its input:
When multiple patterns match a prefix of the input, the scanner needs to choose between these conflicting patterns. These choices are governed by the following rules:
Hence given the scanner specification:
%%
while                      `Action for keyword while.'
[[:alpha:]_][[:alnum:]_]+  `Action for an identifier.'
The first pattern simply matches the keyword `while'. The second pattern matches an identifier which starts with an alphabetic character or `_' and is followed by one or more alphanumeric characters or `_'s. If the input is `while', the identifier pattern would match the prefixes `wh', `whi', `whil' and `while'; the while keyword pattern would match the entire input. By rules (2) and (3), the `Action for keyword while.' will be that executed by the generated scanner.
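The choice between the two patterns can be sketched in C. This sketch assumes the conventional lex disambiguation, which the example above appears to follow: the longest match wins and, among matches of equal length, the pattern listed first wins. The function names are hypothetical and are not part of Zlex.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the prefix of input matched by the pattern `while',
 * or 0 if the pattern does not match. */
size_t match_keyword_while(const char *input)
{
    assert(input != NULL);
    return strncmp(input, "while", 5) == 0 ? 5 : 0;
}

/* Length of the longest prefix of input matched by
 * [[:alpha:]_][[:alnum:]_]+, or 0 if the pattern does not match. */
size_t match_identifier(const char *input)
{
    size_t n;

    assert(input != NULL);
    if (!(isalpha((unsigned char)input[0]) || input[0] == '_'))
        return 0;
    n = 1;
    while (isalnum((unsigned char)input[n]) || input[n] == '_')
        n++;
    /* The `+' requires at least one character after the first. */
    return n >= 2 ? n : 0;
}

/* Index of the rule to run, or -1 if no rule matches.
 * Assumed policy: longest match first; on a tie the earliest rule wins,
 * which the strict `>' comparison guarantees. */
int pick_rule(const size_t lengths[], int n_rules)
{
    int best = -1;
    size_t best_len = 0;
    int i;

    for (i = 0; i < n_rules; i++) {
        if (lengths[i] > best_len) {
            best = i;
            best_len = lengths[i];
        }
    }
    return best;
}

/* Rule 0 is the keyword pattern, rule 1 the identifier pattern. */
int choose_action(const char *input)
{
    size_t lengths[2];

    lengths[0] = match_keyword_while(input);
    lengths[1] = match_identifier(input);
    return pick_rule(lengths, 2);
}
```

For the input `while' both patterns match five characters, so the tie goes to rule 0, the keyword. For an input such as `while1' the identifier pattern matches six characters and wins on length.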
As a character is scanned by a Zlex scanner, it may tentatively be matched with a pattern, but subsequently it may be discovered that the tentative pattern match is incorrect and that the character needs to be rescanned for an alternate match. Backtracking refers to the rescanning of characters to identify alternate matches.
Most Zlex scanners will do some backtracking under normal operation. Backtracking can also be forced by the Zlex programmer by using the special REJECT action (see section Forced Backtracking: REJECT).
As a Zlex scanner scans its input, it usually looks ahead by a single character to decide which pattern it is in, and whether it has reached the end of a pattern. That single character lookahead is not always sufficient: the scanner may have to scan several extra characters before it can be sure which action to take.
Consider the following scanner which ignores an alphabetic string if it is followed by a digit, but outputs an alphabetic string in blank-separated groups of up to 4 characters when it is not followed by a digit.
%%
[[:alpha:]]{4}            printf("%s ", yytext);
[[:alpha:]]+/[[:digit:]]  /* No action. */
.|\n                      ECHO;
The first pattern matches a sequence of exactly four alphabetical characters (indicated by the `{4}'). The second pattern matches a sequence of one or more alphabetical characters only if it is followed by a digit (indicated by the trailing context `/[[:digit:]]'). The final pattern matches any single character.
Consider the input line
abcdefg
The scanner will scan all the characters in `abcdefg' before it realizes that the newline terminating this alphabetic string is not a digit and hence the second pattern cannot match. It will match the first pattern using `abcd', returning `efg' to the input stream. As part of the action of matching the first pattern it will output `abcd ', and will then resume scanning. It will then look at the `efg' it pushed back, scanning past all three characters before realizing that the input does not match either of the first two patterns. The only alternative is the third pattern which it matches, ECHOing `e', and pushing back `fg'. The same sequence of overscan and pushback repeats for `fg' with output `f' and pushback `g'. Finally the remaining `g' matches the last pattern. The output is:
abcd efg
Note that the `e' is scanned twice, the `f' thrice, and the `g' four times.
This sort of backtracking in Zlex is not inordinately expensive, but should be avoided if possible. As illustrated by the above example, the backtracking arises because of overlapping patterns: hence overlapping patterns should be avoided as far as possible.
ECHOed to yyout (see section Output in a Zlex Scanner). This default action can be suppressed by specifying the --suppress-default option (see section Alphabetical Listing of all Options).
A scanner which marks all lines containing character sequences which look like ANSI-C trigraphs (which start with a sequence of 2 `?'s) can be generated from the following:
%%
.*"??".+  printf("*** %s", yytext);
The `.' is a regular expression which matches any character except newline; the `*' is a postfix operator which specifies 0 or more repetitions of the preceding regular expression; the `+' is a postfix operator which specifies 1 or more repetitions of the preceding regular expression. Hence the pattern will match only lines containing a sequence of at least two `?' followed by at least one other (non-newline) character. The action specified for that pattern prints out the contents of the matching line (yytext) preceded by a mark `*** '. Lines which do not match the specified pattern will be handled character by character by the default action and echoed to yyout.
For all applications except very simple filters, usually it is not a good idea to depend on this default behavior for the following reasons:
.|\n (with an error action) to ensure that every character will be matched. Such patterns will also need to be provided for every exclusive start state (see section Start State Patterns).
A Zlex file consists of up to three sections, with each section used for different purposes. The delimiter sequence %% on a line by itself is used to separate sections.
The text contained in these sections is of two types:
#define the macro END to be `}', and then used END instead of `}': Zlex does not know anything about C-preprocessing.
%{. The rest of the line and all subsequent lines are copied to the generated scanner until a line starting with a decorated right-brace %} is encountered. The delimiting %{ and %} are not copied. The contents of the copied text are not analyzed at all.
Note that the text within decorated-brace, indented or pattern code blocks is not analyzed in any way. This has the advantage of language independence: if Zlex were to be retargeted to generate a scanner in a language other than C, there would be no change in the specifications for these code blocks. The disadvantage is that it is impossible for Zlex to recognize the terminating delimiter for the code block when it occurs within a target language construct like a comment or string.
%%
which merely copies its input to its output using the default rule (see section Default Action).
Besides comments, this declarations section can contain the following:
% character followed by an alphabetic string specifying the directive. The directives currently accepted include:
%option
%option directives must precede all other directives.
%array
%option --array.
%pointer
%option --pointer.
%s or %S
%x or %X
If the optional initial C-code section is present, it is copied into the beginning of the generated scanner function. It can be used to declare and initialize any local variables needed by the Zlex programmer.
A pattern-action rule consists of a pattern (see section Patterns) followed by an action. The possibilities for an action are:
At any point in a Zlex source file, outside a code block or a comment, a line which looks like
%line nnn file-name
causes Zlex to pretend that the following line is line number nnn from file-name. The file-name is any string not containing newlines enclosed within double quotes `"': it may contain ANSI-C escape sequences. Both the line number nnn and file-name are optional.
A %line directive, like all other directives, is only recognized when it occurs at the start of a line. It is useful to track the origin of source lines when a Zlex file is generated automatically from another source file by a preprocessor. The %line directive is similar to the #line directive accepted by C-preprocessors.
If a %line directive occurs in section 2 of the Zlex source file, then it may break old lex or flex programs which would regard the character sequence `%line' at the start of a line as a pattern. Hence the %line directive is not recognized in section 2 of the Zlex source file when the `--lex-compat' option is specified (see section Alphabetical Listing of all Options).
The Zlex programmer is provided with C objects for accessing information about and controlling the operation of the generated scanner. The C entities used for this interface are functions, variables and macros. For example, Zlex provides a main scanner function with default name yylex; the text of the last matched token can be accessed using variables with default names yytext and yyleng; Zlex provides the C macro REJECT to find an alternate way to tokenize the current input.
Certain conventions are used in naming these entities. Many of these names can be changed by the Zlex programmer. Many entities also have alternate names. When we refer to an entity in this manual we usually refer to it by its common name, which is the way it was referred to in historical implementations.
Certain conventions are used by Zlex in choosing the names for programmer-visible C entities.
YY_REJECT and YY_NEW_FILE.
In order to retain compatibility with lex and flex, alternate names are provided for some macros. These alternate names do not meet the above conventions for canonical names. For example, REJECT is an alternate name for YY_REJECT.
Unfortunately, there is no consistency whether the call of a macro M which does not require any arguments is written as M() or simply M. This inconsistency arises because of the need to maintain backward compatibility with lex and flex.
For example, the default name of the variable which holds the length of the current lexeme is `yyleng' (see section Current Lexeme Length: yyleng). If at scanner generation time, the programmer specifies the option `--prefix=lex_', then the name of the variable will be `lex_len'. If the programmer #defines the macro YY_LENG to be tokLength, then the name will be `tokLength'.
yylex. Its name can be changed in a manner similar to variable names (see section Variable Names), by either specifying the `--prefix' option during scanner generation or by defining the macro YY_LEX. For example, if during scanner generation the programmer specifies the option `--prefix=scan', then the name of the main scanning function will be `scanlex'. If the programmer #defines the macro YY_LEX to be scan, then the name of the function will be `scan'.
The other documented functions are those in the Zlex library. With one exception, the names of all these library functions start with the prefix `yy'. They do not contain any underscores, but the first letter of each word is capitalized. It is not possible to change these names as they are precompiled into the Zlex library when it is built during installation. Examples of these library function names are yyCreateBuffer, yyTopState and yySwitchToBuffer.
The one exception to the rule for library function names is the function main, which is a default main program which simply invokes yylex().
YY_REJECT and yy_act.
The effect of using a name outside its intended scope is undefined. In practice, it will usually result in a compiler error when compiling the generated scanner.
void, typedef'd as a YYDataHandle. This pointer can be found in a variable with default name yydataP having extern linkage. Like all other variable names, the name of this variable can be changed to an arbitrary name by defining the macro YY_DATA_P to the new name. Alternatively, the prefix used for the name can be changed by using the `--prefix' option (see section Variable Names).
Since this variable has external linkage, it can be accessed from files other than the generated scanner file and passed as a handle to Zlex library routines.
The pattern language used for expressing the syntax of tokens is essentially the language of regular expressions, extended with constructs which allow context-dependent matching.
Note that we distinguish between patterns and regular expressions. Patterns are regular expressions augmented with context-sensitive operators. All regular expressions are patterns but not all patterns are regular expressions.
In patterns, most characters usually stand for themselves: i.e. the occurrence of a particular character in a pattern specifies that that particular character should be matched. However some characters do not stand for themselves but are special meta-characters which tell Zlex how to combine patterns. The meta-characters used by Zlex are the following:
\ " ( ) { } < > . ? | + * / ^ $ , -
If a pattern is required to match any of the above characters, then the character can be quoted by preceding it by a backslash `\'. If a pattern is required to match a backslash, then the backslash itself can be quoted by using \\. Any character other than a `\' or `"' can also be quoted by simply enclosing it within `"' delimiters.
There are some contexts within which the above set of meta-characters is reduced. Since it can be difficult to remember exactly which characters are special within which contexts, it is advisable for the Zlex programmer to quote all non-alphanumeric characters which are to be matched literally.
A matches the character `A'.
The regular expression # matches the character `#'. However, since `#' is a non-alphanumeric character it is advisable to quote it by escaping it using a `\' as \# or by enclosing it within `"' delimiters as "#".
+ is not a regular expression since `+' is a meta-character.
\a, \b, \f, \n, \r, \t, \v are regular expressions which match the characters BEL (bell), BS (backspace), FF (form-feed), NL (newline), CR (carriage-return), TAB (tab) and VT (vertical-tab) respectively.
lo-hi, where lo and hi are the first and last characters in the range. A negated character class is a character class whose first character is `^' and denotes the complement of the character class. Escape sequences (see section Escape Sequences) are recognized within a character-class. The rules for recognizing special characters within character-classes are different from those for other patterns and are given below:
Any whitespace character (except newline) is significant within a character class and specifies a character within the class. Comments are never recognized within a character class. Hence the character class [/*a */ a] would contain the four characters ` ', `/', `*' and `a'. Newlines are not allowed directly within a character class and must be specified using the escape sequence `\n'.
[lL] matches any one of the characters `l' or `L'.
[0-9a-fA-F] matches any hexadecimal digit.
[^0-9a-zA-Z] matches any non-alphanumeric character.
[-+] or [+-] matches any one of the characters `+' or `-'.
[\t \n] matches a tab, space or newline character.
Specifying lowercase alphabetic characters using a pattern like [a-z] may not work with character sets other than ASCII as the character codes for lower-case letters may not always be contiguous in the underlying character set. To remedy this problem, POSIX introduced named character classes of the form [:Name:]. The class represented by [:Name:] is precisely the set of characters c for which the standard C-library function isName(c) returns non-zero.
[:alnum:]
[:alpha:]
[:blank:]
[:cntrl:]
[:digit:]
[:graph:]
[:lower:]
[:print:]
[:punct:]
[:space:]
[:upper:]
[:xdigit:]
These named classes cannot occur directly in a pattern but only as members of a character class. Hence [:alpha:] is not a valid pattern but [[:alpha:]] is.
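The correspondence between a class name and the C-library function isName can be sketched as a small lookup. The function in_named_class below is hypothetical and only illustrates the definition given above; it is not how Zlex represents character classes. Note that isblank was added to the C library by C99, so the sketch assumes a C99 or later compiler.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Return 1 if character c belongs to the named class [:name:], and 0
 * otherwise (including when name is not one of the listed classes).
 * Membership is decided by the matching <ctype.h> function, so the
 * result depends on the current locale. */
int in_named_class(const char *name, int c)
{
    /* The <ctype.h> functions require an unsigned char value. */
    unsigned char uc = (unsigned char)c;

    assert(name != NULL);
    if (strcmp(name, "alnum") == 0)  return isalnum(uc) != 0;
    if (strcmp(name, "alpha") == 0)  return isalpha(uc) != 0;
    if (strcmp(name, "blank") == 0)  return isblank(uc) != 0;
    if (strcmp(name, "cntrl") == 0)  return iscntrl(uc) != 0;
    if (strcmp(name, "digit") == 0)  return isdigit(uc) != 0;
    if (strcmp(name, "graph") == 0)  return isgraph(uc) != 0;
    if (strcmp(name, "lower") == 0)  return islower(uc) != 0;
    if (strcmp(name, "print") == 0)  return isprint(uc) != 0;
    if (strcmp(name, "punct") == 0)  return ispunct(uc) != 0;
    if (strcmp(name, "space") == 0)  return isspace(uc) != 0;
    if (strcmp(name, "upper") == 0)  return isupper(uc) != 0;
    if (strcmp(name, "xdigit") == 0) return isxdigit(uc) != 0;
    return 0;  /* unknown class name */
}
```

Because the answer comes from the C library rather than from a hard-coded range, a class such as [[:lower:]] remains correct on character sets where the lower-case letters are not contiguous.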
.|\n is a pattern which matches any character (`|' is a regular expression operator specifying the union of two regular expressions).
"[]" matches the string consisting of the two characters `[]'.
"\x30\0\"\\\n" matches the string containing five characters: the first character has the hexadecimal character code 30, the second character has the character code 0, the third and fourth characters are `"' and `\' respectively, and the last character is the newline character.
{M} within a regular expression is expanded to (R). Note that the definition is restricted to be a regular expression; it cannot be a pattern containing any context operators (see section Patterns).
A macro name can contain alphanumeric characters or `_' or `-', but must start with an alphabetic character or `_'. When the macro is defined in section 1 of the Zlex file, the name must occur at the beginning of a line. This must be followed by whitespace followed by the macro definition on the same line. The regular expression comprising the macro definition can contain calls to other macros, including those which have not yet been defined. It is an error for a macro to contain a call to itself, either directly or indirectly via calls to other macros. A macro is not expanded until a call to the macro is encountered in section 2 of the Zlex file.
Whitespace is allowed within the defining regular expression (see section Whitespace Within Patterns). When a macro name is used within braces, no whitespace or comments are allowed within the braces. This makes it easier for Zlex to disambiguate a macro use, from the start of a block of C-code.
Macros can be used to make patterns more readable if the Zlex programmer chooses suitable mnemonic macro names. They do not add anything to the expressive power of the pattern language since every use of a macro name is fully equivalent to its defining regular expression enclosed in parentheses.
lex and flex programs one often encounters macros like the following:
alpha [a-zA-Z]
This is not portable across all character sets and is no longer necessary since named character classes (see section Named Character Classes) can be used instead.
R? is a regular expression which matches zero or one occurrences of R.
[-+]? denotes an optional sign.
\.? can be used to denote an optional decimal point.
R* is a regular expression which matches zero or more repetitions of R. `*' is often referred to as the Kleene-closure or simply closure operator.
[[:alnum:]]* will match a sequence of zero or more alphanumeric characters.
[*]* will match a sequence of 0 or more `*'s.
R+ is a regular expression which matches one or more repetitions of R.
[[:xdigit:]]+ will match a sequence of one or more hexadecimal digits.
.+ will match the rest of the current line provided it is nonempty (recall that `.' matches any character except a newline).
R{lo,hi} where lo and hi are positive integers matches lo through hi occurrences of R; R{num} where num is a positive integer matches exactly num occurrences of R. R{lo,} matches at least lo occurrences of R.
No whitespace or comments are ever allowed between the starting `{' and the first digit of the repetition count. This restriction makes it easier for Zlex to disambiguate counted repetition from a C-code block.
[[:alpha:]]{1,6} matches any nonempty string of alphabetic characters up to 6 letters long.
[[:digit:]]{3} matches exactly three digits.
[[:alnum:]]{5,} matches a sequence of at least 5 alphanumeric characters.
RS is a regular expression which matches the concatenation of R and S.
[a-zA-Z_][-0-9a-zA-Z_]* can be used to denote a Zlex macro name.
The regular expression 0[0-7]* can be used to denote an octal number in ANSI-C.
R|S is a regular expression which matches either R or S.
[0-9]+[lL]?[uU]?|[0-9]+[uU]?[lL]? denotes an ANSI-C integer which consists of a sequence of digits followed optionally by `l' or `L' (denoting long), or by `u' or `U' (denoting unsigned) in either order. The above can be expressed slightly more succinctly by using parentheses to factor out the [0-9]+, as [0-9]+([lL]?[uU]? | [uU]?[lL]?).
(R) is also a regular expression equivalent to R. The parentheses are used for grouping regular expressions to override the default precedence of the regular expression operators.
(a|b)c matches either the string `ac' or `bc'. If the parentheses were omitted and the pattern was written as a|bc, then the pattern would match the string `a' or the string `bc'.
Some of the operators are merely syntactic sugar and can be expressed in terms of the other operators. For example:
R+      => RR*
R{2,4}  => RR|RRR|RRRR
R{2,}   => RRR*
[/*-]   => "/"|"*"|"-"
The following examples use Zlex regular expressions to specify the syntax of comments in various programming languages.
An Ada comment starts with the characters `--' and continues to end-of-line. A suitable pattern is
"--".*
Pascal has two commenting conventions. One of them is to enclose the body of a possibly multi-line comment within braces `{' and `}'. The body of the comment should not contain any `}' characters. An incorrect attempt to write a regular expression for such a comment is:
"{"(.|\n)*"}" /* Wrong. */
The problem is that the regular expression does not enforce the restriction that the body of the comment should not contain any `}' characters. In fact, with Zlex's rule for preferring the longest match, the above regular expression will interpret all the text between the first `{' and the last `}' in a Pascal file as a comment!
A correct regular expression is:
"{"[^}]*"}"
Though this is correct, it has the disadvantage that it forces Zlex to save the text of a long comment. For more efficient ways of processing long comments, see section Start States Example: C comments.
("/*" "/"* ([^*/] | [^*]"/" | "*"[^/])* "*"* "*/") /* Wrong. */
Analyzing the above expression we realize that the expression within the inner parentheses corresponds to the body of the comment except for a possibly empty prefix containing only `/'s and a possibly empty suffix containing only `*'s. Analyzing the inner expression further, it specifies 0 or more repetitions of
Though the above seems correct, it is not. A counterexample is the valid comment `/**1/*/' which does not match the above regular expression. The problem is caused by ignoring the possible overlap between the subpatterns `[^*]"/"' and `"*"[^/]' where the negated character classes in both patterns may need to match the same character (`1' in the counterexample).
A solution which is claimed to be correct is the following:
("/*" [^*]* "*"+ ([^/*][^*]*"*"+)* "/")
The [^*]* deals with that prefix of the comment body which does not contain any `*'s. When a `*' occurs, we need to have a sequence of one or more of them ("*"+). The inner closure ([^/*][^*]*"*"+)* specifies that the sequence of `*'s be followed by 0 or more repetitions of text not starting with `/' or `*' and terminating in a sequence of `*'s. So irrespective of the number of iterations of the inner closure, the input character at the end of the closure must be a `*'. Hence a further `/' in the input terminates the comment.
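The structure of this regular expression can be followed step by step in a hand-written C matcher. The sketch below is a direct transcription of the expression, one loop per subexpression, and is not code produced by Zlex; the name match_c_comment is hypothetical. Greedy loops are sufficient here because at each point the next input character determines which subexpression can continue.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length of the C comment at the start of s, or 0 if s does
 * not start with a complete comment.  Transcribes the expression
 *   "/*" [^*]* "*"+ ([^/*][^*]*"*"+)* "/"
 * one subexpression at a time. */
size_t match_c_comment(const char *s)
{
    size_t i;

    assert(s != NULL);
    /* "/*" : opening delimiter. */
    if (s[0] != '/' || s[1] != '*')
        return 0;
    i = 2;
    /* [^*]* : body prefix containing no `*'. */
    while (s[i] != '\0' && s[i] != '*')
        i++;
    /* "*"+ : one or more `*'s. */
    if (s[i] != '*')
        return 0;
    while (s[i] == '*')
        i++;
    /* ([^/*][^*]*"*"+)* : text not starting with `/' or `*', ending in
     * a run of `*'s, repeated zero or more times. */
    while (s[i] != '\0' && s[i] != '/' && s[i] != '*') {
        i++;
        while (s[i] != '\0' && s[i] != '*')
            i++;
        if (s[i] != '*')
            return 0;       /* input ended inside the comment */
        while (s[i] == '*')
            i++;
    }
    /* "/" : the last character consumed was a `*', so a `/' closes. */
    if (s[i] != '/')
        return 0;
    return i + 1;
}
```

Applied to the counterexample `/**1/*/' from above, the function consumes all seven characters, whereas it rejects `/*/', which is not a complete comment.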
Once again, this is not the recommended way to specify C comments in Zlex because of the possibly excessive growth of the text saved by Zlex. For the recommended method, see section Start States Example: C comments. As the preceding remark makes clear, these complicated regular expressions are mainly useful as exercises with which to plague students. What is more interesting is the non-eureka process by which these expressions may be constructed, but that is beyond the scope of this manual.
    digit   [0-9]
    sign    [-+]
    exp     E{sign}?{digit}+
    real    {digit}+\.{digit}*{exp}?
The above definition allows numbers like `22.' with an empty fraction and exponent. Unfortunately, constructs like `1..10' are commonly used in Modula-2 to indicate subranges, and should be scanned as three tokens `1', `..' and `10'. However since Zlex always prefers the longest match, the effect of the pattern {real} on the input `1..10' will be to scan the first token as `1.', which is wrong for Modula-2. One solution is to scan a number as a real only if it is not followed by a `.' character. This can be achieved by suffixing the above pattern with a special right-context construct which imposes this restriction:
{real}/[^.]
`/' is the right-context operator. If R and C are arbitrary regular expressions, then R/C is a pattern which matches input R' iff R matches R' and the input after R' matches C. Note that the input which matched C is available to be rescanned.
Returning to the Modula-2 example, {real}/[^.] will not match the input `1..10'. Instead the `1' can be matched by a pattern for an integer, the `..' can be matched by an appropriate pattern, and the `10' can be matched by the pattern for an integer. On the other hand if the input is `1.+2', then {real}/[^.] will match the `1.', since `+' matches [^.]. The `+' will then be rescanned and can be matched by a suitable pattern.
There are no restrictions on the regular expressions on either side of the `/'. Unfortunately, this freedom allows ambiguous patterns like [a-zA-Z0-9]+/[0-9]+"#", for which there are multiple ways to match an input like `aA12b123#'. Specifically, the prefixes `aA12b', `aA12b1' and `aA12b12' all match the specified pattern. It is necessary for Zlex to use a disambiguating rule to resolve the ambiguity: it always matches the longest prefix. For the above example, Zlex would match `aA12b12'. Note that other scanner generators may get confused by similar patterns.
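The longest-prefix rule for this particular ambiguous pattern can be sketched in plain C. This is only an illustration of the rule as stated above, not how a Zlex scanner is implemented; the function name matchWithRightContext is hypothetical, and the code assumes the C locale so that isalnum() and isdigit() correspond to [a-zA-Z0-9] and [0-9].

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: length of the longest prefix of s which consists of
 * one or more alphanumerics and is immediately followed by one or more
 * digits and a '#'.  Returns 0 if no prefix qualifies. */
size_t matchWithRightContext(const char *s)
{
  size_t runLen = 0;    /* length of the leading alphanumeric run */
  size_t n;

  while (isalnum((unsigned char)s[runLen])) runLen++;

  for (n = runLen; n > 0; n--) {            /* try the longest prefix first */
    size_t j = n;
    while (isdigit((unsigned char)s[j])) j++;
    if (j > n && s[j] == '#') return n;     /* context: digits, then '#' */
  }
  return 0;
}
```

For the input `aA12b123#' the sketch returns 7, the length of `aA12b12', leaving `3#' as the right context to be rescanned.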
It is sometimes necessary to match a regular expression R only at the end of a line. This can be achieved by using the pattern R/\n. The $ end-of-line anchor is available to abbreviate this pattern to R$. The `$' character is special only at the end of a pattern.
The pattern [A-Za-z0-9]+/[\t ]+$, which attempts to recognize an alphanumeric word only when it occurs at the end of a line, is illegal, since `$' provides an additional right-context operator. Instead, the pattern can be written as [A-Za-z0-9]+/[\t ]+\n, which is legal.
In a Zlex scanner, it is possible to use two methods for allowing left-context to influence a match. The first is useful when the interpretation of a token is affected by whether or not it is at the start of a line. The second is more general, and allows encapsulating the left-context into a state which selects a subset of the patterns which are allowed to match.
In a C preprocessor, `#' signals a preprocessor directive only if it occurs at the beginning of a line (preceded optionally by whitespace). A pattern which recognizes a `#' only when it signals a preprocessor directive is the following:
^[\t \v\f]*\#
The `^' is the start-of-line anchor: the following pattern is matched only if the previous character was a newline character.
When a scanner uses one or more patterns containing the start-of-line anchor `^', it is possible to query and set the current start-of-line condition during scanning. See section Querying Beginning of Line: YY_AT_BOL and section Setting Beginning of Line: yy_set_bol.
The pattern <<EOF>> (which cannot contain any internal whitespace or comments) is used to match the end of the input file. It may be qualified with a set of start conditions using a syntax identical to that used for qualifying regular expressions. The end-of-file pattern is useful for doing special processing at end-of-file. The following example shows how it can be used to signal that a construct like a comment was not terminated before end-of-file was encountered:
<COMMENT><<EOF>> fprintf(stderr, "EOF detected within comment.");
It is assumed that the scanner entered a COMMENT start state when a comment was encountered.
For special Zlex actions which can be used in <<EOF>> patterns, see section End-of-File and Termination.
flex
and lex
.
The rules for how whitespace is treated within different constructs are as follows:
<<EOF>>
which is
regarded as an indivisible token. This behavior is independent of the
`--whitespace' option.
{macro}
which are regarded as indivisible tokens. This
behavior is independent of the `--whitespace' option.
One consequence of these rules is that when the `--whitespace' option is used, it is not possible to include an action for a pattern in section 2 of the Zlex file without enclosing the action within braces.
Two variables with external linkage allow accessing the characters constituting the last matched token, as well as its length.
yytext
YY_TEXT
to the new
name. Alternatively, the prefix used for the name can be changed by
using the `--prefix' option (see section Variable Names). Its default declaration depends on whether the option `--pointer' or `--array' is used (see section Alphabetical Listing of all Options). When `--pointer' is used, its default declaration is char *yytext; when `--array' is used, its default definition is equivalent to
    char yytext[YYLMAX];
where YYLMAX is a macro which gives the size of the array. YYLMAX can be defined by the user in section 1 of the Zlex file if a value different from the default value (8192) is desired.
When yytext is declared to be an array and the length of a matched lexeme is greater than the value of YYLMAX, then the yytext array will silently overflow with unpredictable results. When yytext is declared to be a pointer, there is no possibility of overflow as the lexeme text is maintained within the scanner's buffer (which is grown dynamically as needed).
A scanner in which yytext is declared to be a pointer is usually faster than one in which it is declared to be an array. This fact, coupled with the overflow problem mentioned previously, makes a %array declaration fairly useless except for backward compatibility with lex.
The Zlex programmer should always treat yytext as a read-only variable.
The following program fragment shows a pattern-action pair which matches the occurrence of an identifier at the beginning of a line and saves it in dynamic memory pointed to by the variable text.
    %%
    ^[[:alpha:]_][[:alnum:]_]*  {
        text= malloc(yyleng + 1);   /* +1 for terminating NUL. */
        if (!text) { `Call an error routine.' }
        strcpy(text, yytext);
    }
yyleng
The variable int yyleng holds the length of the current token. The length of a token is the number of characters in the lexeme of the token (not counting any terminating '\0').
Like all other variable names, the name of this variable can be changed to an arbitrary name by defining the macro YY_LENG to the new name. Alternatively, the prefix used for the name can be changed by using the `--prefix' option (see section Variable Names).
The Zlex programmer should always treat yyleng as a read-only variable.
The following program produces a histogram of word-lengths, where a word is defined to be a maximal sequence of characters not containing a space, tab or newline.
    %{
    enum { MAX_WORD_LEN= 10 };
    /* One extra slot so that a word of length MAX_WORD_LEN has an entry. */
    static unsigned freq[MAX_WORD_LEN + 1];
    %}
    %%
    [^\t \n]+   { if (yyleng > MAX_WORD_LEN) { `Signal error;' }
                  else { freq[yyleng]++; }
                }
    [\t \n]+    /* No action. */
    <<EOF>>     { unsigned i;
                  for (i= 0; i <= MAX_WORD_LEN; i++) {
                    printf("%u: %u\n", i, freq[i]);
                  }
                }
yymore
If an action contains a call to the yymore() macro, then the lexeme for that token is prefixed to the lexeme of the next token recognized. Effectively, this allows the programmer to recognize subtokens within a larger token. The canonical form YY_MORE() can also be used instead. The library function yyMore(YYDataHandle) can also be used from files other than the generated scanner file.
For example, let us suppose that an application requires printing out the input lines in reverse order, and printing the total number of words in the input. Whenever a token within a line is recognized the scanner executes a yymore action: hence when the `\n' terminating a line is finally matched, yytext contains the text for the entire line. This is saved in a stack of lines using a function pushLine() shown below. Finally at <<EOF>> this stack is traversed with the lines being printed in reverse order.
    %{
    #include <stdio.h>
    #include <stddef.h>
    #include <stdlib.h>     /* malloc(), exit() */
    #include <string.h>     /* strcpy() */

    typedef struct LineStruct {
      struct LineStruct *last;
      char *text;
    } LineStruct;

    static LineStruct *pushLine(LineStruct *lines, const char *text,
                                int textLen);
    %}
    %%
      /* Declare local variables. */
      int wc= 0;
      LineStruct *lines= NULL;

    [\t ]+      yymore();
    [^\t \n]+   wc++; yymore();
    \n          lines= pushLine(lines, yytext, yyleng);
    <<EOF>>     { LineStruct *p;
                  for (p= lines; p; p= p->last) fputs(p->text, stdout);
                  printf("# of words= %d\n", wc);
                }
    %%
    static LineStruct *
    pushLine(LineStruct *lines, const char *text, int textLen)
    {
      char *const savedText= malloc(textLen + 1);
      LineStruct *const lineP= malloc(sizeof(LineStruct));
      if (!savedText || !lineP) {
        fprintf(stderr, "Out of memory.\n"); exit(1);
      }
      strcpy(savedText, text);
      lineP->text= savedText; lineP->last= lines;
      return lineP;
    }
A log of running the scanner generated from the above follows:
    "Beware the Jabberwock, my son!
    The jaws that bite, the claws that catch!
    Beware the Jubjub bird, and shun
    The frumious Bandersnatch!"
    ^D
    => The frumious Bandersnatch!"
    Beware the Jubjub bird, and shun
    The jaws that bite, the claws that catch!
    "Beware the Jabberwock, my son!
    # of words= 22
Start states allow the behavior of the scanner to depend on the left context within the input. Several actions allow the scanner to control or access its current start state.
Start state qualified patterns can occur only in section 2 of the Zlex file. The syntax for qualifying patterns is to prefix the pattern with the names of the start states separated by commas `,', and enclosed within angle brackets `<' and `>'. The following patterns are examples of start state qualified patterns:
    <INITIAL>"/*"
    <COMMENT>"*/"
    <INITIAL,COMMENT>\n
where it is assumed that INITIAL and COMMENT are suitably declared start states.
Before a start state name can be used in section 2 of the Zlex file, it must be declared in section 1 of the Zlex file. An exclusive (inclusive) start state is declared in section 1 by a line starting with %x (%s) or %X (%S) followed by whitespace followed by the name of the start state on the same line. Multiple start states of the same type can be declared by including multiple names on the same line separated by space. The characters allowed within a start state name are identical to those allowed in a macro name: a sequence of alphanumeric or `_' or `-' characters starting with an alphabetic or `_' character.
The following are examples of start state declarations:
    %x COMMENT C_CODE       /* Exclusive start states. */
    %s RANGE SS_USE         /* Inclusive start states. */
In the generated scanner, the programmer declared start state names are #defined to be small integers. Hence the programmer should not use these names in any other context.
All Zlex generated scanners predefine an inclusive start state called INITIAL which is the initial start state for the scanner when it is first called. INITIAL is #defined to be 0. The user should not make any assumptions about the assignment of integers to other start states, and should always refer to them using their symbolic names.
BEGIN
BEGIN is used to set the current start state. To set the current start state to one with name ss, BEGIN(ss) can be used. For backwards compatibility reasons, BEGIN ss without the parentheses can also be used.
The canonical name YY_BEGIN can be used instead; unlike BEGIN, the parentheses are always required. To begin start-state ss the form YY_BEGIN(ss) is used.
Since the INITIAL start state (see section Start State Declarations) is #defined to be 0, BEGIN 0 is synonymous with BEGIN INITIAL.
The following example shows how inclusive start states can be used to recognize numbers in different bases depending on a specific directive. The base is set by a `%bin', `%oct' or `%hex' directive which must occur at the start of a line.
    %s BIN OCT HEX
    %%
    ^"%bin"                 BEGIN BIN;
    ^"%oct"                 BEGIN OCT;
    ^"%hex"                 BEGIN HEX;
    <BIN>[01]+              `Action for a binary number.'
    <OCT>[0-7]+             `Action for an octal number.'
    <HEX>[a-fA-F0-9]+       `Action for a hexadecimal number.'
    `Other non-qualified patterns.'
YY_START
The macro YY_START returns the current start state (an unsigned integer). YYSTATE is synonymous with YY_START.
Accessing the current start state using YY_START allows the Zlex programmer to use start-state subroutines. For example, in the scanner for Zlex, C-style comments are allowed within several constructs. These comments are processed using an exclusive start state COMMENT (see section Start States Example: C comments). When we are in a construct and see the start of a comment, we do a BEGIN COMMENT after saving the current start state in a global variable, say commentRet. Then when in the COMMENT state we see the end of the comment we do a BEGIN commentRet, which puts us back in the start state in which we originally saw the comment.
In the above situation, we could predict exactly how many start states we need to save at any time (exactly one). That may not be possible in general. Start state stacks may be used in such situations (see section Start State Stacks).
In a Zlex scanner, start state stacks can be created and manipulated using three routines.
yy_push_state
The macro yy_push_state(ss) pushes the current start-state on top of the start state stack and does a BEGIN ss action. The canonical name YY_PUSH_STATE may be used synonymously. From files other than the generated scanner, the programmer can call the Zlex library function yyPushState with prototype:
    void yyPushState(YYDataHandle d, YYState ss);
to push the current start state on the start state stack of the scanner specified by d and enter start state ss.
yy_pop_state
The macro yy_pop_state() sets the current start state to the state on top of the start state stack and pops the start state stack. The canonical name YY_POP_STATE may be used synonymously. From files other than the generated scanner, the programmer can call the Zlex library function yyPopState with prototype:
    void yyPopState(YYDataHandle d);
to set the current start state to the state on top of the start state stack of the scanner specified by d and pop its start state stack.
yy_top_state
The macro yy_top_state() returns the start state on top of the start state stack. The start state stack is not changed. The canonical name YY_TOP_STATE may be used synonymously. From files other than the generated scanner, the programmer can call the Zlex library function yyTopState with prototype:
    YYState yyTopState(YYDataHandle d);
to return the start state on top of the start state stack of the scanner specified by d.
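The semantics of the three routines can be summarized by a small model in plain C. This is only a sketch of the behavior described above, not the Zlex implementation: the type StartStates, the functions ssInit, ssPush, ssPop and ssTop, and the fixed capacity MAX_DEPTH are all hypothetical names chosen for illustration.

```c
#include <assert.h>

enum { MAX_DEPTH = 32 };            /* hypothetical fixed capacity */

typedef struct {
  unsigned current;                 /* current start state */
  unsigned stack[MAX_DEPTH];        /* saved start states */
  int depth;                        /* number of saved states */
} StartStates;

/* Start in state 0, which models INITIAL, with an empty stack. */
void ssInit(StartStates *s) { s->current = 0; s->depth = 0; }

/* Models yy_push_state(ss): save the current state, then enter ss.
 * Returns 0 without changing anything if the model's stack is full. */
int ssPush(StartStates *s, unsigned ss)
{
  if (s->depth == MAX_DEPTH) return 0;
  s->stack[s->depth++] = s->current;
  s->current = ss;
  return 1;
}

/* Models yy_top_state(): the most recently saved state, stack unchanged. */
unsigned ssTop(const StartStates *s)
{
  assert(s->depth > 0);
  return s->stack[s->depth - 1];
}

/* Models yy_pop_state(): re-enter the most recently saved state. */
void ssPop(StartStates *s)
{
  assert(s->depth > 0);
  s->current = s->stack[--s->depth];
}
```

Note that the state on top of the stack is the one that was current before the last push, not the state that was pushed into.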
The following example is the recommended way to process C-style comments using Zlex. It illustrates the use of exclusive start states to allow the scanner to process the comments in reasonable line-sized chunks.
When the generated scanner sees a `/*' it enters an exclusive start state named COMMENT where it is looking for the terminating `*/'. Because COMMENT is an exclusive start state, Zlex will ignore all patterns not qualified by COMMENT when in the COMMENT state.
    001 %x COMMENT          /* Declare start-state. */
    002 %%
    003 "/*"                BEGIN COMMENT;
    004 <COMMENT>"*/"       BEGIN INITIAL;
    005 <COMMENT>[^*\n]+
    006 <COMMENT>\n
    007 <COMMENT>"*"+/[^/]
Line 1 declares the identifier COMMENT to be an exclusive start-state. Line 3 has a pattern for recognizing the `/*' which begins a comment. Since the pattern is not qualified by any start states, it will be active in all inclusive start states: namely INITIAL. Its action uses the special Zlex macro BEGIN (see section Entering a Start State: BEGIN) to enter the special COMMENT state.
Line 4 recognizes the terminating `*/' only when the scanner is in the COMMENT state. Its action is to change the scanner state back to INITIAL. Once the scanner is back in the INITIAL state, the patterns prefixed by COMMENT are ignored, and other patterns (not shown) become active.
Line 5 recognizes any prefix of a comment line which does not contain `*'. Note the use of \n in the negated character class; if we had simply used the regular expression [^*]+, then it could conceivably match several lines of text -- something which is undesirable as the yytext saved by the scanner may become excessively large.
Lines 6 and 7 recognize those portions of a comment not recognized by line 5. Line 6 recognizes a newline occurring within a comment. The given code does not have any action but if the scanner is keeping track of line numbers, an appropriate action would be to increment a line number counter. Line 7 recognizes `*'s occurring within a comment which are not followed by a `/'. We use "*"+/[^/] rather than simply "*"/[^/], as it is always desirable to scan as large a token as possible to reduce scanner overhead.
{lo,hi}
to separate lo from hi.
The Zlex scanner defines two inclusive start states RANGE and SS_USE which return a comma as a special token. A highly simplified version of the code is shown below.
    %s RANGE        /* Start state for counted repetition. */
    %s SS_USE       /* Start state for start state list. */
    %%
    "{"                     BEGIN RANGE;
    "<"                     BEGIN SS_USE;
    <RANGE,SS_USE>","       return ',';
    .                       return CHAR_TOK;
If a comma is encountered when the scanner is in either one of the states RANGE or SS_USE it is returned as the special token ','. Otherwise it is simply returned as a `CHAR_TOK'. Note that any other characters will be matched using the patterns without any start-state qualifications: in the currently popular object-oriented parlance, RANGE and SS_USE inherit behavior from the patterns without start-state qualifications.
A Zlex scanner reads its input from a stdio FILE pointer with default name yyin. For performance reasons, it buffers its input. Normally, it is the main scanner function which reads its input directly from the buffer, but it is also possible for the Zlex programmer to read directly from the buffer using the input macro. It is possible for the programmer to specify the method by which the scanner fills its buffer by defining the YY_INPUT() macro.
The programmer is allowed to modify the characters in the scanner buffer and backtrack to alternate matches with the prefix of the input. It is also possible for the Zlex programmer to query the position in the current input stream, or the current line or column number. When patterns involving the start-of-line anchor `^' have been used, Zlex makes it possible to query and set the current start-of-line condition.
yyin
yyin is the default name of the variable with declaration FILE *yyin which Zlex uses to read its input. Like all other variable names, the name of this variable can be changed to an arbitrary name by defining the macro YY_IN to the new name. Alternatively, the prefix used for the name can be changed by using the `--prefix' option (see section Variable Names).
When the scanner function is first entered it initializes yyin to stdin, unless the user has already initialized it to a non-NULL FILE pointer. So if the generated scanner should read from a file other than the standard input, the programmer need only initialize yyin to a suitable FILE pointer. For example, the following main program illustrates how to set up the scanner to read from the file specified by the first command-line argument.
    %{
    #define YY_IN inFile    /* Use inFile instead of yyin. */
    %}
    %%
    `Patterns go here.'
    %%
    int main(int argc, const char *argv[])
    {
      if (argc < 2) { `Usage error.' }
      if (!(inFile= fopen(argv[1], "r"))) { `File open error.' }
      return yylex();       /* Call generated scanner function. */
    }
input
The input() macro returns the next character from the input buffer, returning -1 if end-of-file is encountered. If C++ is being used, then the alternate name yy_input() is used instead. The canonical name YY_GET() is also recognized. The Zlex library function int yyGet(YYDataHandle) can also be used to read the next character from the input. It returns -1 on EOF. Use YY_GET() to read the input when the call is within the scanner file. Outside the scanner file it is necessary to call yyGet() passing it the data handle of the relevant scanner.
The following excerpt illustrates a common use of input() to ignore C-style comments:
    "/*"    { int ch0, ch1= ' ';
              do {
                ch0= ch1; ch1= input();
              } while (ch1 != EOF && (ch0 != '*' || ch1 != '/'));
              if (ch1 == EOF) error("EOF within comment.");
            }
Note that this is not the recommended way to process comments in Zlex. For the recommended method, see section Start States Example: C comments.
YY_INPUT
YY_INPUT(buf, result, maxSize) provides input to Zlex buffers. It should fill the char *buf with up to int maxSize characters and return in int result either the number of characters read or YY_EOF_IN to indicate end-of-file. Its default definition uses the system read routine, but if the `--stdio' option is specified (see section Alphabetical Listing of all Options), then its default definition uses the fread routine from the stdio library.
The definition of the macro YY_EOF_IN to be returned by YY_INPUT defaults to YY_NULL (see section The Null Value: YY_NULL), but it can be redefined by the programmer in section 1 of the Zlex source file to some other value.
This macro can be redefined in section 1 of the Zlex file to get input some other way. For example, Zlex currently supports processing of only 7-bit or 8-bit characters. However, it is possible to use Zlex to process words of size larger than that of a char, if those words can be mapped into characters without loss of information. This can be done as follows:
    #define YY_INPUT(buf, result, n) result= wordInput(buf, n)

    int wordInput(char *buf, unsigned n)
    {
      Word *wordBuf= (Word *)malloc(n * sizeof(Word));
      unsigned nWords;
      int result;
      if (!wordBuf) { `Signal memory allocation error.' }
      nWords= readWords(wordBuf, n);    /* Read words from source. */
      if (nWords == 0) {
        result= YY_EOF_IN;
      }
      else {
        unsigned i;
        for (i= 0; i < nWords; i++) buf[i]= mapWordToChar(wordBuf[i]);
        result= nWords;
      }
      free(wordBuf);
      return result;
    }
where mapWordToChar() maps a word into a character. Note that the scanner will maintain the current lexeme in yytext using characters; it will be the programmer's responsibility to map these characters back into Words.
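As a further illustration of the contract described above (fill the buffer with at most the requested number of characters, and report either the count or an end-of-file value), here is a sketch in plain C of a reader which serves input from an in-memory string. The names setSource and stringInput and the two static variables are hypothetical and not part of Zlex. The sketch returns 0 at end of input, which assumes YY_EOF_IN has been left at its default of YY_NULL, itself 0 by default.

```c
#include <stddef.h>
#include <string.h>

static const char *srcText = "";    /* hypothetical in-memory input */
static size_t srcPos = 0;           /* how much of srcText was consumed */

/* Select the NUL-terminated string to be served as input. */
void setSource(const char *text) { srcText = text; srcPos = 0; }

/* Copy up to maxSize characters of the remaining input into buf.
 * Returns the number of characters copied, or 0 when no input remains
 * (0 being the assumed end-of-file value, see the lead-in). */
int stringInput(char *buf, int maxSize)
{
  size_t left = strlen(srcText) - srcPos;
  size_t n;

  if (maxSize <= 0 || left == 0) return 0;
  n = left < (size_t)maxSize ? left : (size_t)maxSize;
  memcpy(buf, srcText + srcPos, n);
  srcPos += n;
  return (int)n;
}
```

Following the form of the example above, such a function would presumably be hooked up in section 1 with a definition like #define YY_INPUT(buf, result, n) result= stringInput(buf, n).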
YY_NULL
The macro YY_NULL
is used for two purposes:
YY_INPUT
(see section Redefining the Input Macro YY_INPUT
) on end-of-file.
The default definition for YY_NULL is 0, but the programmer can redefine this macro in a C-code section in section 1 of the Zlex file.
Zlex uses YY_NULL only for compatibility with undocumented behavior of flex. Its use is discouraged, as it has two distinct purposes. Instead, the programmer should use YY_EOF_IN (see section Redefining the Input Macro YY_INPUT) or YY_EOF_OUT (see section Return Value on Termination: YY_EOF_OUT) for each respective purpose.
Two methods can be used by a Zlex programmer to force the generated scanner to insert characters into the input stream. The first of these is YY_LESS which returns characters from the current lexeme to the input stream; the other is YY_UNPUT which can be used to insert arbitrary characters (not necessarily from the current lexeme) into the input stream.
yyless
yyless(n) returns all but the first n characters of the current lexeme back to the input stream. yytext and yyleng are suitably adjusted. The canonical form YY_LESS(n) can also be used. The library function yyLess(YYDataHandle d, int n) can also be used from files other than the generated scanner file.
Note that if it is necessary to look ahead in the input stream in order to recognize a token, it is preferable to use right context patterns (see section Right Context). Note also that yyless(0) will cause the scanner to enter an infinite loop unless its state is changed in some way.
The following excerpt illustrates the use of yyless to generate multiple tokens from the same subsequence of the input stream. This may be useful in a situation where a single input subsequence signals both the end of a syntactic construct and the start of the next syntactic construct. If we assume that xxx is a Zlex macro defining the subsequence of interest then the following code should achieve our goal:
    {xxx}   { if (flag == 0) {
                flag= 1; yyless(0); return TOK0;
              }
              else {
                flag= 0; return TOK1;
              }
            }
We assume that flag is a suitably declared C variable, and TOK0 and TOK1 are the token values.
unput
unput(c) puts the character c onto the input stream to be the next character read. The Zlex programmer should ensure that 0 <= c < character set size; unput cannot be used to unput an EOF character. The contents of yytext are unaffected. Note that it is more efficient to use yyless if all that is desired is to unput a suffix of yytext.
The canonical form YY_UNPUT(c) may also be used. The Zlex library function yyUnput(YYDataHandle d, int c) may also be used from files other than the generated scanner file.
The following excerpt illustrates the use of unput to translate character sequences. If an application dictates that the input sequences `%%(' and `%%)' be translated to the sequences `[' and `]' respectively before any tokenizing occurs, and it is known that the sequences cannot occur within other tokens, then we can use the following pattern-action pairs:
    "%%("   unput('[');
    "%%)"   unput(']');
Note that this suffices only because it is specified that the sequences cannot occur within other tokens. If that is not the case, then the above code would not be correct and we would either need to redefine YY_INPUT (see section Redefining the Input Macro YY_INPUT) appropriately, or use intra-token patterns (see section Using Intra-Token Patterns).
REJECT
REJECT
action. The initial choice
of pattern is governed by the rules built into the generated scanner. When
multiple patterns match the input to a Zlex generated scanner, the choice of
pattern is governed by rules which first prefer the longest match and then
the pattern which occurs earlier in the Zlex source file (see section Pattern Conflicts).
REJECT transfers control to the action of the next pattern which matches the current lexeme or a prefix of the current lexeme. This action can also be referred to using the canonical name YY_REJECT.
REJECT performs a transfer of control -- it is equivalent to an unconditional goto and the code immediately following the REJECT will never be executed. Also REJECT has function scope and hence it cannot be used outside the actions.
REJECT is useful when overlapping subsequences of the input are to be recognized as tokens. This is illustrated by the following scanner which outputs all the prefixes of the words in its input, where a word is a maximal sequence not containing tab, blank or newline.
    %%
    [^\t \n]+       printf("%s\n", yytext); REJECT;
    .|\n
yytext is a NUL-terminated C-string giving the text of the current lexeme (see section Current Lexeme Text: yytext). Given the word `abc' the first pattern will match; its action will first output `abc' on a separate line. When the REJECT action is executed, there is no other pattern to match `abc'. Hence it will try to match a prefix of `abc': `ab' matches the first pattern. So it will again output `ab' and execute a REJECT action. This REJECT results again in a match with the first pattern and an output of `a'. The subsequent REJECT matches the second pattern with `a' but no action is taken. Hence the output will be:
abc ab a
The REJECT
action is not inordinately expensive.
YY_CHAR_NUM
YY_CHAR_NUM returns the number of characters read by the scanner from the current file or memory buffer up to the start of the current yytext. The position does not include any of the characters of yytext. The returned position is zero-origin: hence the character just after yytext will be at absolute position YY_CHAR_NUM + yyleng in the file or in-memory buffer.
The value returned by YY_CHAR_NUM will not be correct if the unput (see section Unputting Characters: unput) action is used.
Many scanning applications require tracking the current line and column number. If newlines can occur within other tokens, then the `--yylineno' option provides suitable facilities (see section Current Line Number: yylineno). If newlines cannot occur within other tokens, then the recommended method is illustrated by the following code fragment which shows how YY_CHAR_NUM can be used to compute the current column number within a line.
    %{
    int lineStartPos= 0;    /* Starting YY_CHAR_NUM for a line. */
    int lineNum= 1;         /* 1 + # of '\n's seen so far. */
    #define COL_NUM (YY_CHAR_NUM - lineStartPos)
    %}
    %%
    \n      { lineStartPos= YY_CHAR_NUM + 1; lineNum++;
              `Other actions for a newline.'
            }
The macro COL_NUM can now be used within other actions to access the column number.
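The bookkeeping in this fragment can be modelled outside a scanner. The following plain C sketch is an illustration only: the LineTracker type and its functions are hypothetical names, and the charNum arguments stand in for the values YY_CHAR_NUM would have when the corresponding token is matched.

```c
#include <stddef.h>

typedef struct {
  size_t lineStartPos;  /* character number at which the current line starts */
  int lineNum;          /* 1 + number of newlines seen so far */
} LineTracker;

void trackerInit(LineTracker *t) { t->lineStartPos = 0; t->lineNum = 1; }

/* To be called when a newline token starting at charNum is matched:
 * the next line starts at the character just after the newline. */
void trackerNewline(LineTracker *t, size_t charNum)
{
  t->lineStartPos = charNum + 1;
  t->lineNum++;
}

/* 0-origin column of a token which starts at character number charNum. */
size_t trackerColumn(const LineTracker *t, size_t charNum)
{
  return charNum - t->lineStartPos;
}
```

For the five-character input `ab', newline, `cd', the newline is at character number 2, so a token starting at `d' (character number 4) is on line 2 at column 1.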
yylineno
If the `--yylineno' option is specified when the scanner is generated, then the current line number (1-origin) is maintained in the variable whose default name is yylineno and declaration int yylineno. Like all other variable names, the name of this variable can be changed to an arbitrary name by defining the macro YY_LINENO to the new name. Alternatively, the prefix used for the name can be changed by using the `--prefix' option (see section Variable Names).
Unlike the implementation of yylineno by other scanner generators, a Zlex generated scanner does not test every character to see if it is a newline. It does these tests only when it is known that a lexeme contains or is followed by a newline character: this information is obtained using a hidden intra-token pattern (see section Using Intra-Token Patterns). Hence scanning of lexemes which do not contain newlines is not slowed down except for a simple test of a flag which is performed once per action rather than once per character.
Since a hidden intra-token pattern +\n (see section Using Intra-Token Patterns) is used to implement the yylineno feature, this feature will not work if the user specifies an intra-token pattern which overlaps with the hidden pattern. It will also not work correctly if the programmer uses unput to put newline characters into the buffer.
This feature was added to Zlex for backward compatibility with an undocumented feature of lex (documented in flex). When newlines cannot occur within other tokens it is usually not necessary to use this feature as it is easy enough for the programmer to update a line number counter whenever a pattern containing a newline character is matched (see section Current Character Count: YY_CHAR_NUM).
If the `--yylineno' option is specified, then the macro YY_COL_NUM returns the 0-origin column number within the current line. If newlines cannot occur within other tokens, see the example in section Current Character Count: YY_CHAR_NUM, for the recommended way to track this information.
The current column number is computed only when the Zlex programmer uses the YY_COL_NUM macro. The implementation uses a hidden intra-token pattern +\n (see section Using Intra-Token Patterns) to implement the YY_COL_NUM macro. Hence this feature will not work if the user specifies an intra-token pattern which overlaps with the hidden pattern. It will also not work correctly if the programmer uses unput to put newline characters into the buffer.
YY_AT_BOL
YY_AT_BOL() returns non-zero if the next token to be matched can match beginning-of-line patterns having a `^' anchor.
Note that this macro is provided only when there is at least one pattern which uses the beginning-of-line `^' anchor.
yy_set_bol
yy_set_bol(v) sets the beginning-of-line condition for the next pattern to true if v is non-zero; false if v is zero. When the beginning-of-line condition is set true, the next pattern can match beginning-of-line patterns having a `^' anchor; when it is set false, the next pattern cannot match beginning-of-line patterns having a `^' anchor.
The canonical macro name YY_SET_BOL can be used synonymously with yy_set_bol.
Note that these macros are provided only when there is at least one pattern which uses the beginning-of-line `^' anchor.
Limited facilities are provided in a Zlex scanner for echoing the current
lexeme to a FILE
pointer with default name yyout.
yyout
yyout
is the default name of the variable with declaration
FILE *yyout
which Zlex uses to echo the current lexeme
(see section Echoing Lexeme Text: ECHO
). Like all other variable names,
the name of this variable can be changed to an arbitrary name by
defining the macro YY_OUT
to the new name. Alternatively, the
prefix used for the name can be changed by using the `--prefix'
option (see section Variable Names).
When the scanner function is first entered it initializes yyout
to
stdout
, unless the user has already initialized it to a non-NULL
FILE
pointer. So if the generated scanner should echo to a file
other than the standard output, the programmer need only initialize
yyout
to a suitable FILE
pointer.
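The initialization rule just described can be pictured with a small stand-alone sketch. The helper defaultStream() is a hypothetical name, not part of Zlex; it only mirrors the documented behaviour that a pointer already set by the programmer is left alone.

```c
#include <assert.h>
#include <stdio.h>

/* Return the stream to use: keep a non-NULL pointer chosen by the
 * programmer, otherwise fall back to the given default. */
FILE *defaultStream(FILE *current, FILE *fallback)
{
  assert(fallback != NULL);
  return (current != NULL) ? current : fallback;
}
```

In a scanner, assigning yyout (for example, to the result of fopen()) before the first call to the scanner function should therefore be enough to redirect echoed text.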
ECHO
The ECHO
macro echoes the current lexeme to yyout
. The
canonical name YY_ECHO
can also be used.
The following example removes all lines starting with #.
%%
^#.*\n    |
^#.*      /* No action: don't echo. */
.*\n      |
.*        ECHO;
The patterns ^#.*
and .*
take care of processing the
last line in the file when it does not end with a newline.
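Since the scanner buffers its input, it is not sufficient to switch input files in mid-scan merely by reassigning yyin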
(see section Input File Pointer: yyin
), as the scanner will
continue reading from its previously buffered input. It is necessary to
switch to a buffer for the new file. Buffer management actions provide
facilities for doing this.
Buffers need not necessarily be associated with files. It is possible to create buffers whose contents are taken from a string or some other in-memory structure. When the scanner reaches the end of an in-memory buffer, it does normal end-of-file processing.
Tokens are not allowed to span buffer boundaries.
YY_BUFFER_STATE
typedef void *YYBufHandle;
For compatibility with
flex
, the programmer can also refer to this
type using the macro YY_BUFFER_STATE
. This opaque type can be passed
to and returned from the buffer management actions.
yy_current_buffer
yy_current_buffer
is the default name of a variable which
contains a YY_BUFFER_STATE
handle to the current buffer. Like
all other variable names, the name of this variable can be changed to an
arbitrary name by defining the macro YY_CURRENT_BUFFER
to the new
name. Alternatively, the prefix used for the name can be changed by
using the `--prefix' option (see section Variable Names).
The user should never assign a value to this variable explicitly; it should be changed
only implicitly by calling the appropriate buffer management routine
(see section Switching Buffers: yy_switch_to_buffer
).
yy_create_buffer
yy_create_buffer(f, s)
creates a buffer for
FILE
pointer f
, having space for at least s
characters.
(The macro YY_BUF_SIZE
contains a recommended value for s
.)
The value returned is a YY_BUFFER_STATE
(see section The Buffer Type: YY_BUFFER_STATE
).
The canonical name YY_CREATE_BUFFER
can also be used for this macro.
From files other than the Zlex source file, the library function with
prototype
YY_BUFFER_STATE yyCreateBuffer(YYDataHandle d, FILE *f, yy_size_t s);
can be used to create and initialize a buffer for the file with
FILE
pointer f, having space for at least s
characters. It returns the
handle of the newly created buffer, aborting execution on error.
yy_delete_buffer
yy_delete_buffer(b)
deletes the buffer with
YY_BUFFER_STATE b
. b
must have been previously returned by
one of the buffer creation actions.
The canonical name YY_DELETE_BUFFER
can also be used for this macro.
From files other than the Zlex source file, the library function with
prototype
void yyDeleteBuffer(YYDataHandle d, YYBufHandle b);
can be used to delete buffer with handle b
for the scanner with
handle d
.
yy_flush_buffer
yy_flush_buffer(b)
flushes the buffer with
YY_BUFFER_STATE b
. When the scanner subsequently tries to read a
character from the buffer, the buffer will be refreshed. There is no
canonical name for the yy_flush_buffer
macro as, for backwards
compatibility with flex
, the name YY_FLUSH_BUFFER
does something
somewhat different: specifically, it is used without any arguments to
specify an action to flush the current buffer (equivalent to
yy_flush_buffer(YY_CURRENT_BUFFER)
).
From files other than the Zlex source file, the library function with prototype
void yyFlushBuffer(YYDataHandle d, YY_BUFFER_STATE b);
can be used to flush buffer b
for the scanner whose internal state is
encapsulated in d
.
yy_scan_buffer
yy_scan_buffer(memBuf, len)
creates and returns a
YY_BUFFER_STATE
which contains the contents of char *memBuf
having a total of yy_size_t len
bytes. memBuf
is not copied:
hence the programmer should ensure that memBuf
is retained until the
processing of the YY_BUFFER_STATE
returned by yy_scan_buffer
is completed. memBuf
will be used when the newly created buffer is
scanned: in fact, memBuf
may even be temporarily modified during the
course of scanning.
The last two bytes of memBuf
must be sentinel characters (the
sentinel character defaults to '\0'
unless changed by the
`--sentinel' option (see section Alphabetical Listing of all Options)). If this is not true, then a
NULL
YY_BUFFER_STATE
is returned. These two sentinel
characters will not be scanned when the scanner switches to this buffer:
hence the characters which will be scanned will be memBuf[0]
...
memBuf[len - 3]
inclusive.
The canonical name YY_MEM_BUFFER
can also be used for this macro.
From files other than the Zlex source file, the library function with prototype
YY_BUFFER_STATE yyMemBuffer(YYDataHandle d, char *memBuf, yy_size_t len);
can be used to create a memory buffer for the scanner whose state is
encapsulated in d
, with the other arguments being as defined for the
macro. It returns the YY_BUFFER_STATE
handle for the created buffer;
NULL
if memBuf
does not have the two sentinel characters at
its end; it aborts with an error message if it cannot create the buffer
because it is out of memory.
It is important to realize that if the same memory area is used to create multiple Zlex buffers, then each Zlex buffer must be deleted before a new Zlex buffer is created from the same memory area.
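The sentinel requirement is easy to get wrong, so a small helper may be worth having. The following is a minimal sketch which assumes the default '\0' sentinel; makeSentinelBuffer() is a hypothetical name and is not provided by Zlex.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy text into a freshly allocated area whose
 * last two bytes are sentinel characters.  The total size of the area
 * (text length + 2) is stored in *lenP.  Returns NULL if memory cannot
 * be allocated.  Assumes the default '\0' sentinel. */
char *makeSentinelBuffer(const char *text, size_t *lenP)
{
  size_t textLen;
  char *buf;
  assert(text != NULL && lenP != NULL);
  textLen = strlen(text);
  buf = malloc(textLen + 2);
  if (buf == NULL) return NULL;
  memcpy(buf, text, textLen);
  buf[textLen] = buf[textLen + 1] = '\0';   /* the two sentinels */
  *lenP = textLen + 2;
  return buf;
}
```

The returned area and length would then be passed as the two arguments of yy_scan_buffer; since the memory is not copied, it should be freed only after the resulting Zlex buffer has been deleted.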
The following function illustrates the use of in-memory buffers to paste
tokens together as is required by the ##
operator in a C
preprocessor. We assume that a Token
is a struct
with two
fields: a small integer tok
giving the token number, and another
small integer id
which gives the text associated with the token. We
also assume the existence of the following routines:
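getIDString(id)
Returns the text associated with id.
getIDLen(id)
Returns the length of the text associated with id.
MALLOC()
FREE()
Counterparts of malloc() and free() respectively.
error()
Reports an error, taking printf()-style arguments.
yylex()
The scanner function, declared to return a Token instead of simply an int
(see section Return Value on Termination: YY_EOF_OUT
).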
Token tokenPaste(Token token1, Token token2)
/* Paste tokens token1 and token2 together, returning resulting token.
 * Signal an error if the pasted token is not proper.
 */
{
  const unsigned id1= token1.id;
  const unsigned id2= token2.id;
  const unsigned len1= getIDLen(id1);
  const unsigned len2= getIDLen(id2);
  const unsigned bufSize= len1 + len2 + 1 + 2; /* 1 '\n' + 2 sentinel chars. */
  enum { AUTO_BUF_SIZE= 100 };
  char autoBuffer[AUTO_BUF_SIZE];
  char *const autoBuf= autoBuffer;
  char *const dynamicBuf= (bufSize <= AUTO_BUF_SIZE) ? NULL : MALLOC(bufSize);
  char *const *bufP= (dynamicBuf) ? &dynamicBuf : &autoBuf;
  Token tokenZ, eolToken;
  YY_BUFFER_STATE oldBuf= YY_CURRENT_BUFFER;
  YY_BUFFER_STATE pasteBuf;
  strncpy(*bufP, getIDString(id1), len1);
  strncpy(*bufP + len1, getIDString(id2), len2);
  *(*bufP + bufSize - 3)= '\n';
  *(*bufP + bufSize - 2)= *(*bufP + bufSize - 1)= '\0';
  pasteBuf= yy_scan_buffer(*bufP, bufSize);
  yy_switch_to_buffer(pasteBuf);
  tokenZ= yylex();
  eolToken= yylex();
  if (eolToken.tok != '\n') {
    error("Invalid token produced by ## pasting of `%s' and `%s'.",
          getIDString(id1), getIDString(id2));
  }
  yy_delete_buffer(pasteBuf);
  yy_switch_to_buffer(oldBuf);
  if (dynamicBuf) FREE(dynamicBuf);
  return tokenZ;
}
The function creates the in-memory buffer on the runtime stack if the required amount of memory is smaller than a predetermined amount; otherwise it creates the in-memory buffer on the heap. It uses
bufP
to point
to the chosen buffer. It remembers the original Zlex buffer in the
YY_BUFFER_STATE
variable oldBuf
. It then uses the standard
library function strncpy()
to copy the text of the tokens to be
catenated into the chosen buffer. It terminates the copied text by a
'\n'
followed by the two required '\0'
sentinel characters.
It then creates a Zlex buffer using yy_scan_buffer()
. It switches to
the newly created buffer (see section Switching Buffers: yy_switch_to_buffer
) and then reads two
tokens from it: it expects the first token to be the catenated token which
is desired, and the second token to be a '\n'
. It then deletes the
created Zlex buffer and switches back to the original Zlex buffer
oldBuf
. Finally, if the in-memory buffer was allocated on the heap
it frees it.
yy_scan_bytes
yy_scan_bytes(bytes, len)
creates and returns a
YY_BUFFER_STATE
which contains the contents of char *bytes
having a total of yy_size_t len
bytes. The contents of bytes
is copied into the newly created buffer. bytes
itself will not be
used at all when the newly created buffer is scanned and can be destroyed
once the buffer has been created.
The canonical name YY_BYTES_BUFFER
can also be used for this macro.
From files other than the Zlex source file, the library function with prototype
YY_BUFFER_STATE yyBytesBuffer(YYDataHandle d, char *bytes, yy_size_t len);
can be used to create a memory buffer for the scanner whose state is
encapsulated in d
, with the other arguments being as defined for the
macro. It returns the YY_BUFFER_STATE
handle for the created buffer,
aborting with an error message if it cannot create the buffer because it is
out of memory.
yy_scan_string
yy_scan_string(str)
creates and returns a
YY_BUFFER_STATE
which contains the contents of the
NUL-terminated C-string char *str
having a total of
yy_size_t len
bytes (not counting the terminating NUL). The
contents of str
is copied into the newly created buffer. str
itself will not be used at all when the newly created buffer is scanned
and can be destroyed once the buffer has been created.
The canonical name YY_STRING_BUFFER
can also be used for this macro.
From files other than the Zlex source file, the library function with prototype
YY_BUFFER_STATE yyStringBuffer(YYDataHandle d, char *str);
can be used to create a memory buffer for the scanner whose state is
encapsulated in d
, with the str
argument as for the
macro. It returns the YY_BUFFER_STATE
handle for the created buffer,
aborting with an error message if it cannot create the buffer because it is
out of memory.
yy_switch_to_buffer
yy_switch_to_buffer(b)
sets up the scanner to scan from
the previously created buffer identified by the YY_BUFFER_STATE
b
. The contents of neither the old nor the new buffer are affected.
The canonical name YY_SWITCH_TO_BUFFER
can also be used for this macro.
From files other than the Zlex source file, the library function with prototype
void yyStringBuffer(YYDataHandle d, YY_BUFFER_STATE b);
can be used to switch buffers for the scanner whose state is
encapsulated in d
, with the b
argument as for the
macro.
yy_switch_to_buffer
should be the only way the Zlex programmer
changes the current buffer.
%{
enum { MAX_INCL_DEPTH= 3 };
static YY_BUFFER_STATE inclStk[MAX_INCL_DEPTH];
static unsigned inclSP= 0;
static void includeFile(char *fName);
%}
fileName                [0-9a-zA-Z./]+
%x INCLUDE
%%
^[\t ]*#[\t ]*include   BEGIN INCLUDE;
<INCLUDE>{fileName}     includeFile(yytext); BEGIN INITIAL;
<INCLUDE>[\t ]+         /* No action. */
<INCLUDE>\n             BEGIN INITIAL;
<<EOF>>                 {
                          if (inclSP == 0)
                            yyterminate();
                          else {
                            yy_switch_to_buffer(inclStk[--inclSP]);
                            BEGIN INCLUDE;
                          }
                        }
%%
static void includeFile(char *fName)
{
  if (inclSP == MAX_INCL_DEPTH) {
    fprintf(stderr, "Includes nested too deeply.\n");
    return;
  }
  inclStk[inclSP++]= YY_CURRENT_BUFFER;
  yyin= fopen(fName, "r");
  if (!yyin) {
    fprintf(stderr, "Could not open %s.\n", fName);
    exit(1);
  }
  yy_switch_to_buffer(yy_create_buffer(yyin, YY_BUF_SIZE));
}
Sometimes it is necessary to do pre-lexical processing on the characters
scanned by the generated scanner before they are tokenized by the scanner.
An example of pre-lexical processing would be mapping certain sequences of
characters into others. The easiest way to do so is to intercept the input
to the scanner: Zlex provides a way for the Zlex programmer to do just that
by redefining the YY_INPUT()
macro. See section Redefining the Input Macro YY_INPUT
. Though this is adequate for most situations, it is not appropriate
for all situations since Zlex buffers its input. The following example
illustrates the problem.
ANSI-C requires that all occurrences of escaped newlines (a `\'
followed by a newline) in the input be deleted. It is required that this be
done before tokens be recognized: for example a `/' followed by an
escaped newline followed immediately by a `*' should be recognized as
the start of a comment. This can be done relatively easily by defining
YY_INPUT()
to read the C source file into a buffer which is then
processed to delete escaped newlines. Unfortunately this does not allow the
scanner to keep track of the source line number of the current token for
proper reporting of error messages. What is needed is to perform the
pre-lexical processing incrementally as each token is scanned.
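The whole-buffer approach just described might look like the following sketch. It is plain C and independent of Zlex; spliceLines() is a hypothetical name for a routine which a redefined YY_INPUT() could call on each block of characters it reads.

```c
#include <assert.h>
#include <stddef.h>

/* Delete every backslash-newline pair from buf[0..len-1] in place.
 * Returns the new length; *removedP receives the number of pairs
 * which were deleted. */
size_t spliceLines(char *buf, size_t len, unsigned *removedP)
{
  size_t src = 0, dst = 0;
  assert(buf != NULL && removedP != NULL);
  *removedP = 0;
  while (src < len) {
    if (buf[src] == '\\' && src + 1 < len && buf[src + 1] == '\n') {
      src += 2;              /* skip the escaped newline */
      (*removedP)++;
    }
    else {
      buf[dst++] = buf[src++];
    }
  }
  return dst;
}
```

The sketch makes the drawback visible: the number of deleted newlines is known only for the block as a whole, not for the token in which each one occurred. It also does not handle a backslash which ends one block when its newline begins the next.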
Intra-token patterns allow this type of incremental pre-lexical processing. When an intra-token pattern is recognized within a token, the scanning of the token is suspended and the action associated with the intra-token pattern is executed. Then the suspended scanning of the token is resumed. Thus, the only effects of recognizing the intra-token pattern are the possible side-effects (if any) of the intra-token action.
Syntactically, an intra-token pattern consists of the `+' character followed by a regular expression subject to the following restrictions:
For example, +\\\n
is an intra-token pattern which matches an
escaped newline.
yytext
and yyleng
(see section Accessing the Current Lexeme) are
available as usual within intra-token pattern actions. The special
YY_BACKUP
action (see section Backing Up Within a Intra-Token Pattern: YY_BACKUP
) is also available within
intra-token pattern actions. Since an intra-token pattern represents
an interrupted scan of a token, the actions for an intra-token pattern are
subject to the following rather severe restrictions:
return
.
YY_BACKUP
Intra-token patterns are intended for doing pre-lexical processing which
needs to be done incrementally during scanning. It is expected that this
processing will usually involve translating the token matching the
intra-token pattern to another token and then continue scanning. The
YY_BACKUP
action is tailored for such processing.
The macro YY_BACKUP(len, string)
specifies an action to
be used only within actions for intra-token patterns. The length len
must be no greater than the length of the intra-token pattern and
string should be a NUL-terminated C-string. The effect of
YY_BACKUP(len, string)
is to back up the scanner automaton
over the last len characters of the intra-token pattern, replacing
those len characters by string. YY_BACKUP
may perform
a control transfer: hence it should not be followed by any code to which
control is expected to fall through after YY_BACKUP
.
For example, to delete escaped newlines in a C-scanner, we could use the following pattern-action pair.
+\\\n YY_BACKUP(2, "");
The generated scanner normally terminates when an EOF
is received on
the input stream. This default action can be changed in two ways:
EOF
on the input stream need not terminate the scanner provided
the Zlex programmer sets up the input to come from another source.
EOF
is received from the input
stream.
yywrap
int yywrap(void)
is called by
the scanner when it detects end-of-file on its current input stream
yyin
. If the function returns non-zero then the
scanner proceeds to wrap-up its processing; it processes its
<<EOF>>
actions if any (see section End of File Patterns) and if these
actions do not change its flow of control, it returns YY_EOF_OUT
(which defaults to 0
) indicating an end-of-file token.
If the call to yywrap
returns 0
, then the scanner assumes that
the function has set up yyin
to continue scanning. It does not
execute the actions associated with any <<EOF>>
patterns but merely
continues scanning.
Like Zlex variable names, the name of this function can be changed to an
arbitrary name by defining the macro YY_WRAP
to the new name.
Alternatively, the prefix used for the name can be changed by using the
`--prefix' option (see section Function Names).
The Zlex library provides a yywrap()
function which simply returns
1
.
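A programmer-supplied yywrap can instead move on to further input. The following is a minimal sketch under stated assumptions: the yyin declared here is only a stand-in so that the fragment compiles on its own (a generated scanner provides the real one), and pendingFiles and setPendingFiles() are hypothetical names.

```c
#include <assert.h>
#include <stdio.h>

/* Stand-in for the scanner's input pointer so that the sketch compiles
 * on its own; a generated scanner provides the real yyin. */
FILE *yyin = NULL;

/* Hypothetical NULL-terminated list of files still to be scanned. */
static const char *const *pendingFiles = NULL;

void setPendingFiles(const char *const *names)
{
  assert(names != NULL);
  pendingFiles = names;
}

/* Return 0 after pointing yyin at the next file which can be opened;
 * return 1 when no files remain, so that the scanner wraps up. */
int yywrap(void)
{
  while (pendingFiles != NULL && *pendingFiles != NULL) {
    FILE *f = fopen(*pendingFiles++, "r");
    if (f != NULL) {
      if (yyin != NULL && yyin != stdin) fclose(yyin);
      yyin = f;
      return 0;
    }
  }
  return 1;
}
```

Files which cannot be opened are silently skipped in this sketch; a real program would probably want to report them.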
<<EOF>>
Pattern Actions
If the yywrap
function (see section Wrapping Up: yywrap
) returns non-zero, then
the scanner executes the actions associated with its <<EOF>>
patterns
if any. These actions provide another opportunity for the programmer to
reset the scanner so as to continue scanning. The following points need to
be noted:
yytext
and yyleng
are not defined for <<EOF>>
patterns.
Hence the actions for <<EOF>>
patterns should not refer to these
variables.
If yyin
is pointed to a new FILE
pointer within an
<<EOF>>
action, then scanning will continue. For compatibility
with old versions of flex
, the YY_NEW_FILE
and
YY_RESTART
actions (see section Restarting a Scanner: yyrestart
) may be used
after resetting yyin
but are not necessary.
yy_switch_to_buffer
action (see section Switching Buffers: yy_switch_to_buffer
).
yyterminate()
action (see section Terminating a Scanner: yyterminate
).
<<EOF>>
action. In that
case, the scanner will return to its caller with a YY_EOF_OUT
(which
defaults to YY_NULL
).
yyterminate
yyterminate()
terminates the scanner and returns a YY_EOF_OUT
(which defaults to 0) to the caller. Subsequent calls to the scanner will
continue to return with YY_EOF_OUT
, until a yyrestart
,
YY_NEW_FILE
or yy_switch_to_buffer
action is executed.
The canonical form YY_TERMINATE()
can also be used instead.
yyrestart
yyrestart(fileP)
restarts scanning, taking input from the
file with the stdio
FILE
pointer fileP
. The current
contents of the Zlex buffer are discarded. The canonical name
YY_RESTART
can also be used instead.
If yyin
has been pointed to a new file, then the action
YY_NEW_FILE
(without any arguments) tells the scanner that a new file
has been setup in yyin
. YY_NEW_FILE
is equivalent to
yyrestart(yyin)
.
The top-level scanner function in a Zlex scanner has default name
yylex
. The user can customize the name and declaration of the
function, as well as define macros to cause certain actions to be taken.
yylex
The name of the main scanning function defaults to yylex
, but
like other names, the name of this function can be changed to an
arbitrary name by defining the macro YY_LEX
to the new name.
Alternatively, the prefix used for the name can be changed by using the
`--prefix' option (see section Function Names). Hence to change the
name of the main scanning function to scan
, the programmer need merely
include the line
#define YY_LEX scan
in the C-code section of section 1 of the Zlex file.
YY_DECL
YY_DECL
gives the default declaration of the scanner
function. Its definition is equivalent to:
#ifndef YY_DECL
#define YY_DECL int YY_LEX(void)
#endif
The programmer can change this declaration by suitably `#define'ing
YY_DECL
in the C-code section of section 1 of the Zlex file. See
section Return Value on Termination: YY_EOF_OUT
for how to use YY_DECL
to declare a
scanner function which returns a struct
rather than an int
.
YY_USER_INIT
YY_USER_INIT
can be defined by the programmer to be code
which will be executed when the scanner first starts up, before the first
scan. Its definition defaults to empty code. It is useful for initializing
variables used by the programmer. Before the first call to the main scanner
function, the only Zlex actions guaranteed to work are the buffer creation
and switching routines (see section Buffer Management).
The scanner's buffer is statically initialized to a special initialization
state. If there is no buffer switching action before the first call to the
scanner function, then the scanner buffer will still be in this special
initialization state at the first call. The first call to the scanner
checks whether the buffer is in this special initialization state: if it is,
it creates a new buffer corresponding to the current yyin
; if it is not,
then it assumes that the programmer has created and switched to a valid
buffer and uses that buffer without modification. Note that the effect of
using such a special initialization buffer in subsequent scanner calls is
undefined.
YY_USER_ACTION
YY_USER_ACTION
can be defined by the programmer to be code
which will always be executed before any matched rule action. Its
definition defaults to empty code.
YY_BREAK
YY_BREAK
is used
to separate the actions within the switch
statement. Its definition
defaults to break
.
Redefining this macro appears to be of limited utility. This feature is
included for compatibility with flex
. The rationale for
including this feature in flex
was to prevent unreachable
statement warnings when a user action naturally terminates with a
control transfer like a return
. With this feature, the user can
define YY_BREAK
to be empty while ensuring that every action
terminates with a transfer of control (inserting explicit breaks, if
necessary), thus avoiding the warnings.
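The effect of an empty YY_BREAK can be seen in a simplified, hand-written stand-in for the action switch of a generated scanner. This is only a model for illustration, not the code which Zlex actually generates; runAction() and the TOK_ names are hypothetical.

```c
#include <assert.h>

/* YY_BREAK is defined to be empty, so every action below must end with
 * an explicit transfer of control. */
#define YY_BREAK

enum { TOK_EOF = 0, TOK_NUMBER = 1, TOK_WORD = 2 };

/* act plays the role of the number of the matched rule. */
int runAction(int act)
{
  assert(act >= 1);
  switch (act) {
    case 1:
      return TOK_NUMBER;   /* action ends with a return */
    YY_BREAK
    case 2:
      return TOK_WORD;     /* action ends with a return */
    YY_BREAK
    default:
      break;               /* explicit break inserted by the programmer */
    YY_BREAK
  }
  return TOK_EOF;
}
```

With the default definition of YY_BREAK, each return above would be followed by an unreachable break, which is the kind of statement some compilers warn about.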
YY_EOF_OUT
The macro YY_EOF_OUT specifies the value to be returned by the scanner on
end-of-file after yywrap
returns 1 and the <<EOF>>
actions (if
any) do not reset the input yyin
. Its definition defaults to
YY_NULL
(see section The Null Value: YY_NULL
) but it can be redefined by the
programmer in section 1 of the Zlex file.
For example, by using YY_DECL
(see section Declaring the Scanner Function: YY_DECL
) macro, it is possible for the Zlex programmer to make the scanner
return a struct
rather than an int
. If this is done, then the
value returned on end-of-file must also be a suitable struct
: this
can be achieved by defining YY_EOF_OUT
to a call of a suitable
function returning the suitable struct
. Appropriate definitions and
declarations are shown below:
%{
/* Define the type returned by the scanner. */
typedef struct { ... } Token;
/* Declare a function returning a special EOF Token struct. */
Token eofToken(void);
/* Scanner declaration. */
#define YY_DECL Token YY_LEX(void)
/* EOF return value definition. */
#define YY_EOF_OUT eofToken()
%}
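The function named in such a YY_EOF_OUT definition has to be supplied by the programmer. One possible shape is sketched below; the fields of Token are left unspecified in the declarations above, so the tok and id fields used here (borrowed from the token-pasting example), the EOF_TOK value and the makeToken() helper are all assumptions made for illustration.

```c
#include <assert.h>

/* Hypothetical token layout, following the tok/id fields assumed in the
 * token-pasting example. */
typedef struct {
  int tok;        /* token number; 0 is used here for end-of-file */
  unsigned id;    /* identifies the token text; unused for end-of-file */
} Token;

enum { EOF_TOK = 0 };

/* Hypothetical constructor for Token values. */
Token makeToken(int tok, unsigned id)
{
  Token t;
  assert(tok >= 0);
  t.tok = tok;
  t.id = id;
  return t;
}

/* The value which YY_EOF_OUT would expand to a call of. */
Token eofToken(void)
{
  return makeToken(EOF_TOK, 0);
}
```

The caller of the scanner function can then recognize end-of-file by comparing the tok field of the returned struct with EOF_TOK.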
yy_act
and YY_NUM_RULES
Within the actions, the macro yy_act
refers to the pattern number
which is currently being matched where the patterns from the Zlex source
file are numbered starting at 1. The macro YY_NUM_RULES
refers to
the total number of patterns for which actions exist in the generated
scanner; this will usually be greater than the number of patterns explicitly
specified by the programmer in the Zlex source file, since Zlex uses several
pseudo-actions for its own purposes.
Several equivalent macros and a single variable control whether debugging
messages are output to stderr
as patterns are matched.
If the `--debug' option is specified when the scanner file is generated
(see section Alphabetical Listing of all Options), or if the macro YYDEBUG
is defined when the
generated scanner file is compiled, and if at runtime, the variable with
default name yy_Zlex_debug
has a non-zero value, then messages are
printed on stderr
as patterns are matched. Where applicable, the
printed messages include the source file name and line number of the matched
pattern, as well as the contents of yytext
. The format is similar to
that of compiler error messages of popular compilers like gcc
; this
makes it possible to use tools like emacs
compile-mode
to
point to the appropriate pattern in the source file. See section `Compiling within emacs' in The GNU Emacs Manual.
The macro YY_ZL_DEBUG
is equivalent to YYDEBUG
. This
alternate name is useful when a project uses both a Zlex scanner and a parser
generated by a member of the yacc
-family of parser generators. The
reason is that YYDEBUG
is also used for similar purposes by such
parser-generators; if the option -DYYDEBUG
is passed as a C-compiler
option to a `Makefile' for the project, both the generated Zlex scanner
as well as the parser will run in debug-mode, resulting in rather confusing
output.
The extern
variable with default name yy_Zlex_debug
allows
debugging messages to be turned on and off dynamically: messages are printed
only when the variable has a non-zero value. When debugging is turned on
as described above, the variable is declared in the generated scanner and
initialized to 1
: hence message printing is initially enabled.
Like all other variable names, the name of this variable can be changed
to an arbitrary name by defining the macro YY_ZLEX_DEBUG
to the
new name. Alternatively, the prefix used for the name can be changed by
using the `--prefix' option (see section Variable Names).
Assume that the following scanner is defined in file `debug.l'.
01 /* Scanner which illustrates debugging messages. */
02
03 %option debug
04
05 %%
06
07 [[:digit:]]+    |
08 [[:alpha:]]+    REJECT;
09 .|\n
If the generated scanner is compiled and run on the input consisting of the single line
ab12
the following output is produced on stderr
.
debug.l:8: yytext= `ab'.
debug.l:8: yytext= `a'.
debug.l:9: yytext= `a'.
debug.l:8: yytext= `b'.
debug.l:9: yytext= `b'.
debug.l:7: yytext= `12'.
debug.l:7: yytext= `1'.
debug.l:9: yytext= `1'.
debug.l:7: yytext= `2'.
debug.l:9: yytext= `2'.
debug.l:9: yytext= `
'.
--EOF.
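The combination of a run-time switch with messages in a compiler-like format can be imitated in a few lines of plain C. The sketch below is not Zlex's implementation; debugFlag merely plays the role of the debugging variable, and formatDebugMessage() is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

/* Run-time switch, playing the role of the scanner's debug variable. */
int debugFlag = 1;

/* Format a message in the style shown above into buf (of size n).
 * Returns the length of the formatted message, or 0 if messages are
 * currently turned off. */
int formatDebugMessage(char *buf, size_t n,
                       const char *file, int line, const char *text)
{
  assert(buf != NULL && n > 0);
  if (!debugFlag) {
    buf[0] = '\0';
    return 0;
  }
  return snprintf(buf, n, "%s:%d: yytext= `%s'.", file, line, text);
}
```

Because each message starts with a file name and line number, tools which parse compiler diagnostics can jump directly to the matched pattern.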
YYTRACE
If the macro YYTRACE is #defined,
then when the scanner is run it provides a detailed trace showing the action
it takes at every character it scans. The trace shows the transitions
of the underlying finite automaton. To decipher this trace, it is
necessary to have compiled the scanner using the `--trace' option
(see section Alphabetical Listing of all Options). This is useful mainly for maintaining Zlex.
yyerr
It is possible for Zlex to encounter runtime errors under several conditions:
NULL
buffer.
--suppress-default
option (see section Alphabetical Listing of all Options) has been specified.
When a runtime error is encountered, the generated scanner writes a message
on the FILE
pointer yyerr
and terminates execution of the
program.
Like all other variable names, the name of this variable can be changed
to an arbitrary name by defining the macro YY_ERR
to the
new name. Alternatively, the prefix used for the name can be changed by
using the `--prefix' option (see section Variable Names).
When the scanner function is first entered it initializes yyerr
to
stderr
, unless the programmer has already initialized it to a
non-NULL
FILE
pointer. So if the generated scanner should
output error messages to a file other than the standard error, the
programmer need only initialize yyerr
to a suitable FILE
pointer.
All error messages are preceded by the string which is the value of the
macro YY_PROGRAM_NAME
which defaults to "Zlex scanner"
. This
macro can be redefined by the programmer in section 1 of the Zlex source file.
Sometimes it is necessary to include multiple scanners in a program. For example, an application may need one scanner to scan a data file and another scanner to scan interactive user input.
Most of the global objects used by a generated scanner are declared
static
. Hence their names are local to the generated C file. Since
different scanners are generated in different C files, the semantics of C
preclude the possibility of a clash between the static
names used in
different scanners. However there will be a link-time clash between the
extern
names used for global objects declared in different scanners.
To circumvent this problem, Zlex allows the programmer to choose the names
for the extern
objects using one of the following schemes:
extern
object.
extern
scanner object starts with the
prefix `yy'. This prefix can be changed by using the `--prefix'
option (see section Alphabetical Listing of all Options). The names which are affected are:
yy_current_buffer
yy_Zlex_debug
yydataP
yyerr
yyin
yyleng
yylex
yylineno
yyout
yytext
yywrap
extern
objects do not
clash between multiple scanners is to redefine the macros which specify the
names of these objects. This should be done in a C-code section in section
1 of the Zlex file. The macros and the objects they name are:
YY_CURRENT_BUFFER
yy_current_buffer
(see section The Current Buffer: yy_current_buffer
).
YY_DATA_P
yydataP
(see section Passing the Scanner State to Zlex Library Routines).
YY_ERR
yyerr
(see section Runtime Errors: yyerr
).
YY_IN
yyin
(see section Input File Pointer: yyin
).
YY_LENG
yyleng
(see section Current Lexeme Length: yyleng
).
YY_LEX
yylex
(see section The Main Scanner Function: yylex
).
YY_LINENO
yylineno
(see section Current Line Number: yylineno
).
YY_OUT
yyout
(see section Output File Pointer: yyout
).
YY_TEXT
yytext
(see section Current Lexeme Text: yytext
).
YY_WRAP
yywrap
(see section Wrapping Up: yywrap
).
YY_ZLEX_DEBUG
yy_Zlex_debug
(see section Debugging Control).
The generated scanner sets these macros to their default values (the default values factor in the `--prefix' option if it has been specified) only if they have not already been defined in section 1 of the Zlex file. Hence the Zlex programmer can easily make the scanner use different external names, by simply defining these macros to suitable names in section 1 of the Zlex file.
For example, to use the names yylex1
,
yytext1
, yyleng1
, yyin1
, yyout1
for some of the
extern
objects, section 1 of the Zlex file would contain the
following #define
's:
%{
#define YY_LEX yylex1
#define YY_TEXT yytext1
#define YY_LENG yyleng1
#define YY_IN yyin1
#define YY_OUT yyout1
%}
The rest of the program would access this generated scanner as follows: The
main scanning function would be called as yylex1()
. The lexeme text
and length of the current token would be found in yytext1
and
yyleng1
. The FILE
pointers yyin1
and yyout1
would be used for the input and output files of the generated scanner. The
function yywrap()
will be called on end-of-file; this will be the
yywrap()
provided by the Zlex library, unless the Zlex programmer
defines one elsewhere in the program.
The command line needed to invoke Zlex has the format:
zlex [Options List] lex-file [lex-file...]
A word which constitutes a command-line argument has two possible types: it is an option word if it begins with a `-' or `--' (with certain exceptions noted below), or if it follows an option word which requires an argument. Otherwise it is a non-option word. An option word specifies the value of a Zlex option; a non-option word specifies a file name.
Besides the command-line, Zlex can read its options from several different sources. In order of increasing priority these sources are the following:
ZLEX_OPTIONS
. If this variable is set, then
its value should contain only options and option values separated by
whitespace as on the command-line. The procedure for setting environment
variables depends on the system you are using: under the UNIX shell
csh
the setenv
command can be used, under the MS-DOS
command-interpreter the set
command can be used; under the UNIX shell
sh
or ksh
the export
command can be used.
%option
, %array
or %pointer
directives (see section Declarations Section).
Options specified by the environment variable ZLEX_OPTIONS
override
the options specified in the `zlex.opt' file. Options specified in the
Zlex source file override options specified in the `zlex.opt' file or
ZLEX_OPTIONS
environment variable. Finally, command-line options
always override options specified by all other sources.
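The priority ordering among option sources can be pictured with a small C sketch. The enumeration names and the function effective_value() are hypothetical; a NULL entry stands for a source which does not mention the option, and the real Zlex option machinery is certainly more involved than this.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Option sources in order of increasing priority. */
enum {
    SRC_OPT_FILE,       /* the `zlex.opt' file */
    SRC_ENVIRONMENT,    /* the ZLEX_OPTIONS environment variable */
    SRC_SOURCE_FILE,    /* directives in the Zlex source file */
    SRC_COMMAND_LINE,   /* the command-line */
    N_SOURCES
};

/* Return the value given by the highest-priority source which sets the
 * option, or dflt if no source sets it. */
const char *effective_value(const char *values[N_SOURCES], const char *dflt)
{
    int i;
    assert(values != NULL);
    for (i = N_SOURCES - 1; i >= 0; i--)
        if (values[i] != NULL)
            return values[i];
    return dflt;
}
```

For example, if `zlex.opt' asks for one compression setting and the source file asks for another, the source file's setting wins unless the command-line also specifies the option.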
Zlex provides a largely orthogonal set of options. We can roughly classify the options according to which aspect of Zlex's functionality they affect.
These options control how the generated Zlex scanner treats its input.
--16-bit
--7-bit
-7
--8-bit
-8
--ignore-case[=1|0]
-i[1|0]
--caseless[=1|0]
--case-insensitive[=1|0]
yytext
. If this option is not specified, then the
generated scanner will be case-sensitive.
--sentinel=CHAR-CODE
-S CHAR-CODE
CHAR-CODE
as the sentinel
character. Scanning the sentinel character is likely to be slower than
scanning non-sentinel characters; this option allows the programmer to
change the sentinel character to a character which may not occur frequently
in the scanner input. If this option is not specified, then the sentinel
character defaults to the character whose code is 0.
--stdio[=1|0]
read() function to read its input. If this option is specified, it uses the stdio fread() function instead. It may be necessary to specify this option if your system does not take kindly to mixing stdio FILE descriptors with read(). The generated scanner may be somewhat slower, and its interactive operation may suffer, depending on the implementation of the fread() function provided by the stdio library. This option will not have any effect if the programmer has redefined the YY_INPUT macro (see section Redefining the Input Macro YY_INPUT).
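The remark under --sentinel that scanning the sentinel character is likely to be slower can be illustrated with the general sentinel technique, sketched below in C. This is not taken from the Zlex sources; it assumes the usual arrangement in which the buffer is terminated by the sentinel so that the inner loop needs no separate end-of-buffer test, and the function slow_path_hits() is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Walk over buf[0..len-1], where buf[len] holds the sentinel.  Returns
 * how often the slow path was taken for a data character which merely
 * happens to equal the sentinel. */
size_t slow_path_hits(const unsigned char *buf, size_t len,
                      unsigned char sentinel)
{
    size_t i = 0, hits = 0;
    assert(buf != NULL && buf[len] == sentinel);
    for (;;) {
        while (buf[i] != sentinel)  /* fast path: no bounds test */
            i++;
        if (i == len)               /* slow path: genuine end of buffer */
            break;
        hits++;                     /* slow path: sentinel value in data */
        i++;
    }
    return hits;
}
```

Under this arrangement every occurrence of the sentinel value in the input costs an extra check, which is why choosing a character which rarely occurs in the input can help.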
The options described in this section affect aspects of the algorithm used by the generated scanner. Options which affect scanner tables are described in section Table Scanner Options. Options which affect generated scanners which minimize their use of tables are described in section Code Scanner Options.
--array[=1|0]
yytext
as an array instead of the default pointer. This
will usually lead to a slower scanner.
--backup-optimize[=1|0]
REJECT
. By default, backup-optimization is off.
See section Efficiency.
--default-action echo|error|fatal|ignore
echo
yyout
. This is the default.
error
yyerr
and scanning continues.
fatal
yyerr
and the program is terminated.
ignore
--equiv-classes[=1|0]
--ecs[=1|0]
-E[1|0]
--prefix PREFIX
-P PREFIX
--reject[=1|0]
If --reject=0 is specified, then any REJECT action in the source file will result in compile-time errors when the generated scanner is compiled. The default is to support REJECT actions.
--yylineno[=1|0]
yylineno
(see section Current Line Number: yylineno
). The default action is to not
support yylineno
. Using yylineno
may lead to a somewhat
larger scanner but will not slow down the matching of patterns which do not
contain newlines. It is the programmer's responsibility to suitably update
yylineno
after scanner actions like yyless
or unput
(see section Modifying Characters in the Input Stream). yylineno is maintained on a per buffer basis and
is automatically saved and restored on a buffer switch.
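As one possible way of meeting the responsibility mentioned under --yylineno, the following C sketch computes the line number to store after some of the current lexeme is pushed back with yyless. It assumes that the line number had already been advanced past the whole lexeme when the action runs; that assumption, and the helper names count_newlines() and lineno_after_yyless(), are not taken from the Zlex sources.

```c
#include <assert.h>
#include <stddef.h>

/* Count the newlines in text[from..to-1]. */
int count_newlines(const char *text, int from, int to)
{
    int i, n = 0;
    assert(text != NULL && 0 <= from && from <= to);
    for (i = from; i < to; i++)
        if (text[i] == '\n')
            n++;
    return n;
}

/* Line number to store after keeping only the first `keep' characters
 * of a lexeme of length `leng': newlines which are pushed back will be
 * scanned (and counted) again, so they are subtracted here. */
int lineno_after_yyless(int lineno, const char *lexeme, int leng, int keep)
{
    return lineno - count_newlines(lexeme, keep, leng);
}
```

A scanner action might then contain something like `yylineno = lineno_after_yyless(yylineno, yytext, yyleng, n); yyless(n);', computing the adjustment before the lexeme is shortened.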
Zlex supports the generation of code-scanners which do not use an explicit scanner state or scan tables. Instead these scanners use the program counter to implicitly maintain the scanner state. The current implementation is disappointing: the generated scanners are fairly large but are not appreciably faster.
Some limitations are imposed by the current implementation on code scanners:
--backup-optimize=1
.
When Zlex builds a code scanner it analyzes each state before deciding what kind of code to build for a state. The kinds of code built for a state are:
The options for generating code scanners allow the programmer to control the parameters of the algorithm Zlex uses for choosing between the above code alternatives:
--code-scan[=1|0]
--bin-code-param N
N
. Otherwise switch
code is used. The default value of N
is 16.
--lin-code-param N
N
. Otherwise binary search or switch code
is used. The default value of N
is 4.
--transition-compress[=1|0]
If both --bin-code-param and --lin-code-param are specified as 0, then only switch code is produced for all state transitions.
In addition, if the compiler supports labels as first class objects and provides a method to access the addresses of code labels, switches are directly coded as a branch through a jump table. For example, gcc allows taking the address of a label using &&label.
See section `Labels as Values' in Using and Porting GNU cc.
Specifically, the following macros can be used to support this:
YY_LABEL_VARS
YY_LABEL_TYPEDEF(type)
typedef
used to define the type
used to represent the address of a code label. For gcc
, the default
definition is:
#ifndef YY_LABEL_TYPEDEF
/* How compiler declares vars containing labels. */
#define YY_LABEL_TYPEDEF(type) typedef void *type
#endif
YY_LABEL_ADDR(label)
gcc
, the
default definition is:
#ifndef YY_LABEL_ADDR
/* How compiler takes the address of a label. */
#define YY_LABEL_ADDR(label) &&label
#endif
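To show what a branch through a jump table looks like with these two macros, here is a small C sketch. It relies on the gcc `Labels as Values' extension and so will not compile with a compiler lacking it; the type name LabelAddr and the function count_digits() are hypothetical, and the code only mimics the style of a code scanner rather than reproducing Zlex output.

```c
#include <assert.h>
#include <stddef.h>

#ifndef YY_LABEL_TYPEDEF
#define YY_LABEL_TYPEDEF(type) typedef void *type
#endif
#ifndef YY_LABEL_ADDR
#define YY_LABEL_ADDR(label) &&label
#endif

YY_LABEL_TYPEDEF(LabelAddr);   /* hypothetical name for the label type */

/* Count the decimal digits in s, dispatching on each character with an
 * indirect goto through a two-entry jump table instead of a switch. */
int count_digits(const char *s)
{
    LabelAddr jump[2];
    int count = 0;
    assert(s != NULL);
    jump[0] = YY_LABEL_ADDR(other);
    jump[1] = YY_LABEL_ADDR(digit);
next:
    if (*s == '\0')
        return count;
    goto *jump[*s >= '0' && *s <= '9'];  /* gcc computed goto */
digit:
    count++;
other:
    s++;
    goto next;
}
```

The table is indexed by the result of the character test, so each input character costs one indirect branch; a real scanner state would have one table entry per character or equivalence class.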
Zlex supports 3 choices for the table compression algorithm and 3 choices for the type of table entries. This leads to a total of 9 different kinds of table accesses.
The options are the following:
--compress compression_algorithm
none
comb
iterative
--table entry_type
address
difference
state
--col-waste-percent percent
--compress=no and --table=state, use a 2-dimensional table with the number of columns a power of 2 if the percentage of wasted columns is <= percent. The default value is 50%.
--align[=1|0]
-a[1|0]
int
.
This section describes miscellaneous Zlex options including options to print out generated scanner statistics and options to turn on runtime tracing in the generated scanner.
--debug[=1|0]
-d[1|0]
YYDEBUG
or YY_ZL_DEBUG
when compiling the
scanner (see section Debugging and Errors).
--help
-h
--lex-compat
-l
--array
--yylineno
--reject
). Also if no lex-files are specified on the
command-line, then read the zlex source file from the standard input.
--line-dir
#line
directives to the generated scanner (default).
--output filename
-o filename
--to-stdout[=1|0]
-t[1|0]
--trace[=trace_file]
DO_TRACE
is defined when Zlex is built. If
trace_file is omitted, then the trace is produced in a file whose name
is the basename of the first source file with its extension `.l' (if any)
removed and extension `.trc' added.
--verbose[=1|0]
-v[1|0]
--version
-V
--whitespace
-w
This section contains a short description of all options, sorted by long option name. Each option contains a reference to the section where it is discussed in more detail.
--16-bit
--7-bit
-7
--8-bit
-8
--align[=1|0]
-a[1|0]
0
).
See section Table Scanner Options.
--array[=1|0]
0
).
See section Runtime Algorithm Options.
--backup-optimize[=1|0]
0
).
See section Runtime Algorithm Options.
-c
--bin-code-param n
16
).
See section Code Scanner Options.
--code-scan[=1|0]
0
).
See section Code Scanner Options.
--compress type
-C type
no
, comb
or
iterative
(default: comb
).
See section Table Scanner Options.
--col-waste-percent percent
0
<= percent and percent <= 100)
(default: 50
).
See section Table Scanner Options.
--debug[=1|0]
-d[1|0]
0
).
See section Miscellaneous Options.
--default-action act
-s act
echo
,
error
, fatal
or ignore
(default: echo
).
See section Runtime Algorithm Options.
--ecs[=1|0]
-E[1|0]
--equiv-classes[=1|0]
1
).
See section Runtime Algorithm Options.
--help
-h
--ignore-case[=1|0]
-i[1|0]
--caseless[=1|0]
--case-insensitive[=1|0]
0
).
See section Runtime Input Options.
--lex-compat
-l
--lin-code-param n
4
).
See section Code Scanner Options.
--line-dir[=1|0]
1
).
See section Miscellaneous Options.
-n
--output filename
-o filename
lex.yy.c
).
See section Miscellaneous Options.
--prefix prefix
-P prefix
yy
).
See section Runtime Algorithm Options.
--sentinel char-code
-S char-code
0
).
See section Runtime Input Options.
--reject[=1|0]
1
).
See section Runtime Algorithm Options.
--stdio[=1|0]
0
).
See section Runtime Input Options.
--table type
-T type
address
,
difference
or state
(default: state
).
See section Table Scanner Options.
--to-stdout[=1|0]
-t[1|0]
0
).
See section Miscellaneous Options.
--trace[=trace_file]
-T[trace_file]
--transition-compress[=1|0]
0
).
See section Code Scanner Options.
--verbose[=1|0]
-v[1|0]
0
).
See section Miscellaneous Options.
--version
-V
--whitespace[=1|0]
-w[1|0]
0
).
See section Miscellaneous Options.
--yylineno[=1|0]
0
).
See section Runtime Algorithm Options.
When Zlex is run, it looks for certain data files in certain standard directories: a skeleton file `zlexskl.c' and an options file `zlex.opt' (see section Option Sources). The skeleton file must exist, but the options file need not exist. The search list specifying these standard directories is fixed when Zlex is installed; it can be printed out using Zlex's `--help' option (see section Alphabetical Listing of all Options).
The search list consists of a list of colon-separated directory names (the
directory names may or may not have terminating slashes) or environment
variables (starting with a `$'). If a directory name starts with a
`$', then the first (only the first) `$' must be repeated. An
empty component in the search list specifies the current directory.
Typically the search list contains the current directory.
Also typically, the environment variable ZLEX_SEARCH_PATH
is present
in the search list -- this causes Zlex to check if the variable is set in
the environment. If it is, then Zlex expects it to specify a search list
which it recursively searches.
Typically, the search list compiled into Zlex looks something like the following:
$ZLEX_SEARCH_PATH:.:$HOME:/usr/local/share/zlex-Version
Since the search list will typically contain an environment variable like ZLEX_SEARCH_PATH, it is possible to change the set of standard directories searched by Zlex even after installation by specifying a value for the variable. For example, if, with the above search list, ZLEX_SEARCH_PATH is set to /usr/lib:/usr/opt/lib, then the effective search list becomes:
/usr/lib:/usr/opt/lib:.:$HOME:/usr/local/share/zlex-Version
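The expansion of a search list can be sketched in C as shown below. This is an illustration of the rules stated above rather than the code used by Zlex. The names LookupFn, expand_search_list() and demo_lookup() are hypothetical, variables are looked up through a callback (for which a wrapper around getenv() could be supplied), unset variables are assumed to contribute nothing, and a recursion limit of 8 is an arbitrary safeguard.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef const char *(*LookupFn)(const char *name);

/* Append one directory to the colon-separated list in out. */
static void append_dir(char *out, size_t cap, const char *dir, size_t len)
{
    size_t used = strlen(out);
    size_t sep = used > 0 ? 1 : 0;
    if (len == 0) {                 /* empty component: current directory */
        dir = ".";
        len = 1;
    }
    if (used + sep + len >= cap)
        return;                     /* no room: component is dropped */
    if (sep)
        out[used++] = ':';
    memcpy(out + used, dir, len);
    out[used + len] = '\0';
}

static void expand(const char *list, LookupFn lookup, int depth,
                   char *out, size_t cap)
{
    const char *p = list;
    for (;;) {
        size_t len = strcspn(p, ":");
        if (len >= 2 && p[0] == '$' && p[1] == '$') {
            append_dir(out, cap, p + 1, len - 1);  /* "$$d" is directory "$d" */
        } else if (len >= 1 && p[0] == '$') {
            char name[64];
            if (len - 1 < sizeof name && depth < 8) {
                const char *value;
                memcpy(name, p + 1, len - 1);
                name[len - 1] = '\0';
                value = lookup(name);
                if (value != NULL)                 /* set: search it recursively */
                    expand(value, lookup, depth + 1, out, cap);
            }
        } else {
            append_dir(out, cap, p, len);
        }
        if (p[len] == '\0')
            break;
        p += len + 1;
    }
}

/* Write the fully expanded, colon-separated directory list into out. */
void expand_search_list(const char *list, LookupFn lookup,
                        char *out, size_t cap)
{
    assert(list != NULL && lookup != NULL && out != NULL && cap > 0);
    out[0] = '\0';
    expand(list, lookup, 0, out, cap);
}

/* Hypothetical environment used for demonstration only. */
const char *demo_lookup(const char *name)
{
    if (strcmp(name, "ZLEX_SEARCH_PATH") == 0)
        return "/usr/lib:/usr/opt/lib";
    if (strcmp(name, "HOME") == 0)
        return "/home/user";
    return NULL;
}
```

With demo_lookup(), the compiled-in list shown earlier expands to the two directories from ZLEX_SEARCH_PATH, followed by the current directory, the home directory and the installation directory, in that order.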
To produce a high performance scanner, the Zlex programmer needs to understand the performance tradeoffs between different Zlex features.
The primary consideration used when designing Zlex was to maximize the performance in the basic task of a scanner: recognizing tokens. The performance of special actions was a secondary consideration, except that the presence of such actions was not allowed to impact the performance of those parts of the scanner which did not depend on them.
These design decisions make it desirable for the Zlex programmer to use patterns rather than actions whenever possible. Many of the actions involve a function call with its consequent overhead. For example, it is preferable to process comments using start states (see section Start States Example: C comments), rather than processing them using input (see section Direct Input: input).
There is some overhead involved in setting up for scanning a token and completing a token. To minimize this overhead, it is preferable to maximize the token length. For example, when scanning through a comment (see section Start States Example: C comments), it is preferable to process the comment a line at a time, rather than a single character at a time.
Backtracking (both that caused during scanning due to overlapping patterns, and that forced by explicit REJECTs) will naturally lead to somewhat lower performance because characters will be scanned multiple times. The backtracking performance of Zlex is reasonably good and backtracking can be used moderately within a scanner without impacting the overall performance terribly.
In a generalized right context pattern of the form
RE/context
the efficiency of the pattern matching depends
on the form of RE and context. If the length of the string
which matches RE is m, and the length of the string matching
context is n then:
m +
n
.
2 * (m + n)
.
Hence generalized right context with the context overlapping the regular expression should be avoided if possible.
When a program is completed and is ready for distribution, there are two common distribution models used:
There are two possibilities for distributing the Zlex library sources:
(1) is conceptually straightforward. (2) is also straightforward, except for figuring out which parts of the Zlex library are required by the generated scanner. Fortunately, the Zlex distribution comes with a shell script which automates that task.
The script understands the interdependencies of the modules which constitute the Zlex library. When it is run, it analyzes the object file for the generated scanner and produces a C file which contains all the source code which will be required by the generated scanner. The distribution should include this C file as well as the file `libZlexp.h' which will be found in the Zlex library source directory.
The script is called mklibsrc; it usually resides in the Zlex library source directory. It can be invoked as:
SCRIPT_PATH/mklibsrc OFILE [LIB_SRC_DIR] [DEST_FILE]
where the parameters are defined as follows:
mklibsrc
script.
For example, if we assume that the environment variable ZLEX_LIB_SRC contains the path to the Zlex library source directory, then the invocation:
$ZLEX_LIB_SRC/mklibsrc scan.o
will produce a file `libsrc.c' which contains all the source code required from the Zlex library for the scanner object file `scan.o'.
It is unlikely but possible that the generated scanner can be compiled with different options which affect which routines will be required from the Zlex library. In that case, it is necessary to repeat the above procedure for each scanner object file produced using the different options. The script will accumulate the code in the specified C file.
The mklibsrc script is automatically generated by m4 using a skeletal script. The interdependencies among the library modules are automatically extracted using a Perl script. The main portability problem within the mklibsrc script is likely to be the command `nm -u' which is used to analyze the scanner object file.
Internal versions of the Zlex scanner generator have been used by me since late 1993. In 1996, it was used by about 20 students in a compiler course: they uncovered 2 bugs. There has been a major rewrite since then.
assertion failures caused by an erroneous Zlex source file. In that case, the Zlex programmer can simply correct the error in the Zlex source file and continue on with reasonable confidence (after submitting a bug report of course).
First you will need to be sure that you have found a Zlex bug:
If you are sure that you have uncovered a bug, try to distil it down to a test program which is as short as possible while still exhibiting the bug. Record a log which exhibits the bug. Make sure that you mention the version of Zlex you are using in your bug report.
Bug reports can be mailed to:
zdu@acm.org
An informal description of the lexical and grammatical syntax of Zlex programs follows:
This is an informal description of the lexical syntax of non-trivial Zlex tokens.
ACT_TOK
CHAR_TOK
COLON_BEGIN_TOK
"[:"
.
COLON_END_TOK
":]"
.
EOF_PAT_TOK
"<<EOF>>"
.
ID_TOK
LEX_DIR_TOK
MACRO_TOK
NEXT_ACT_TOK
NL_TOK
NUM_TOK
OPTION_LINE_TOK
%option
.
OPTION_TOK
^"%option"
.
SEC_TOK
^"%%"
.
SS_ID_TOK
STARTX_TOK
^"%"[xX]
signalling the start of an exclusive start state declaration.
START_TOK
^"%"[sS]
signalling the start of an inclusive start state declaration.
X_OPTION_TOK
^("%array" | "%pointer")
.
lexProgram
  : section1 SEC_TOK section2
  ;
section1
  : options restSection1
  | options
  ;
options
  : nonEmptyOptions
  | /* EMPTY */
  ;
nonEmptyOptions
  : optionLine
  | nonEmptyOptions optionLine
  ;
optionLine
  : OPTION_TOK OPTION_LINE_TOK NL_TOK
  | X_OPTION_TOK NL_TOK
  ;
restSection1
  : section1Line
  | restSection1 section1Line
  ;
section1Line
  : startDec
  | def
  | LEX_DIR_TOK OPTION_LINE_TOK NL_TOK
  ;
startDec
  : START_TOK ssDefList
  | STARTX_TOK ssDefList
  ;
ssDefList
  : ssDefList SS_ID_TOK
  | /* EMPTY */
  ;
def
  : ID_TOK regExp
  | ID_TOK
  ;
section2
  : ACT_TOK sec2Patterns
  ;
sec2Patterns
  : sec2Patterns actPatterns
  | /* EMPTY */
  ;
actPatterns
  : patternActions
  | '+' regExp ACT_TOK
  ;
patternActions
  : pattern ACT_TOK
  | pattern NEXT_ACT_TOK patternActions
  ;
pattern
  : optSSList regExp optRightContext
  | optSSList '^' regExp optRightContext
  | optSSList rightContext
  | optSSList '^' rightContext
  | optSSList EOF_PAT_TOK
  ;
optRightContext
  : rightContext
  | /* EMPTY */
  ;
rightContext
  : '$'
  | '/' regExp
  ;
optSSList
  : /* EMPTY */
  | '<' ssUseList '>'
  ;
ssUseList
  : ssUseList ',' SS_ID_TOK
  | SS_ID_TOK
  ;
regExp
  : regExp '|' catRegExp
  | catRegExp
  ;
catRegExp
  : catRegExp postRegExp
  | postRegExp
  ;
postRegExp
  : postRegExp '*'
  | postRegExp '?'
  | postRegExp '+'
  | postRegExp numRange
  | baseRegExp
  ;
baseRegExp
  : '(' regExp ')'
  | '.'
  | CHAR_TOK
  | MACRO_TOK
  | '[' classElements ']'
  | '[' '^' classElements ']'
  ;
classElements
  : classElement
  | classElements classElement
  ;
classElement
  : CHAR_TOK
  | CHAR_TOK '-' CHAR_TOK
  | COLON_BEGIN_TOK ID_TOK COLON_END_TOK
  ;
numRange
  : '{' NUM_TOK '}'
  | '{' NUM_TOK ',' '}'
  | '{' NUM_TOK ',' NUM_TOK '}'
  ;
Zlex: A lex/flex compatible scanner generator.
Copyright (C) 1995 Zerksis D. Umrigar
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation with one ADDENDUM mentioned below; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program in the file GPL included in the Zlex distribution; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
The addendum to the GNU General Public License is as follows: Permission is hereby given to use the output of Zlex in non-free programs.
The reason for the addendum is that the output of Zlex -- the generated scanner file -- contains code chunks which are verbatim copies of sizable sections of Zlex sources. These chunks include the code for parts of the `yylex' function, as well as code for Zlex library functions. If only the terms of the GPL were to be applied to the code within the generated scanner file, the effect would be to restrict the use of Zlex output to free software. Hence this document amends the terms of the GNU General Public License to explicitly allow the use of the output of Zlex in non-free programs.
The addendum has not been added because of sympathy for people who want to make software proprietary. Software should be free. Unfortunately, it appears that limiting Zlex's use to free software does little to encourage people to make other software free. So the addendum makes the practical conditions for using Zlex match the practical conditions for using other free tools.
Questions and comments regarding Zlex can be directed to me at zdu@acm.org
The above conditions were derived from the copying conditions published for bison by the Free Software Foundation, Inc.
$
End of line anchor
*
operator
+
operator
-
character class operator
.
regular expression
<ctype.h>
character class element
?
operator
^
character class operator
^
Start of line anchor
FILE
FILE
yylex
|
operator