source: trunk/minix/man/man1/flexdoc.1@ 9

Last change on this file since 9 was 9, checked in by Mattia Monga, 13 years ago

Minix 3.1.2a

File size: 63.9 KB
Line 
1.TH FLEX 1 "26 May 1990" "Version 2.3"
2.SH NAME
3flexdoc - fast lexical analyzer generator
4.SH SYNOPSIS
5.B flex
6.B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
7.I [filename ...]
8.SH DESCRIPTION
9.I flex
10is a tool for generating
11.I scanners:
12programs which recognized lexical patterns in text.
13.I flex
14reads
15the given input files, or its standard input if no file names are given,
16for a description of a scanner to generate. The description is in
17the form of pairs
18of regular expressions and C code, called
19.I rules. flex
20generates as output a C source file,
21.B lex.yy.c,
22which defines a routine
23.B yylex().
24This file is compiled and linked with the
25.B -lfl
26library to produce an executable. When the executable is run,
27it analyzes its input for occurrences
28of the regular expressions. Whenever it finds one, it executes
29the corresponding C code.
30.SH SOME SIMPLE EXAMPLES
31.LP
32First some simple examples to get the flavor of how one uses
33.I flex.
34The following
35.I flex
36input specifies a scanner which whenever it encounters the string
37"username" will replace it with the user's login name:
38.nf
39
40 %%
41 username printf( "%s", getlogin() );
42
43.fi
44By default, any text not matched by a
45.I flex
46scanner
47is copied to the output, so the net effect of this scanner is
48to copy its input file to its output with each occurrence
49of "username" expanded.
50In this input, there is just one rule. "username" is the
51.I pattern
52and the "printf" is the
53.I action.
54The "%%" marks the beginning of the rules.
55.LP
56Here's another simple example:
57.nf
58
59 int num_lines = 0, num_chars = 0;
60
61 %%
62 \\n ++num_lines; ++num_chars;
63 . ++num_chars;
64
65 %%
66 main()
67 {
68 yylex();
69 printf( "# of lines = %d, # of chars = %d\\n",
70 num_lines, num_chars );
71 }
72
73.fi
74This scanner counts the number of characters and the number
75of lines in its input (it produces no output other than the
76final report on the counts). The first line
77declares two globals, "num_lines" and "num_chars", which are accessible
78both inside
79.B yylex()
80and in the
81.B main()
82routine declared after the second "%%". There are two rules, one
83which matches a newline ("\\n") and increments both the line count and
84the character count, and one which matches any character other than
85a newline (indicated by the "." regular expression).
86.LP
87A somewhat more complicated example:
88.nf
89
90 /* scanner for a toy Pascal-like language */
91
92 %{
93 /* need this for the call to atof() below */
94 #include <math.h>
95 %}
96
97 DIGIT [0-9]
98 ID [a-z][a-z0-9]*
99
100 %%
101
102 {DIGIT}+ {
103 printf( "An integer: %s (%d)\\n", yytext,
104 atoi( yytext ) );
105 }
106
107 {DIGIT}+"."{DIGIT}* {
108 printf( "A float: %s (%g)\\n", yytext,
109 atof( yytext ) );
110 }
111
112 if|then|begin|end|procedure|function {
113 printf( "A keyword: %s\\n", yytext );
114 }
115
116 {ID} printf( "An identifier: %s\\n", yytext );
117
118 "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
119
120 "{"[^}\\n]*"}" /* eat up one-line comments */
121
122 [ \\t\\n]+ /* eat up whitespace */
123
124 . printf( "Unrecognized character: %s\\n", yytext );
125
126 %%
127
128 main( argc, argv )
129 int argc;
130 char **argv;
131 {
132 ++argv, --argc; /* skip over program name */
133 if ( argc > 0 )
134 yyin = fopen( argv[0], "r" );
135 else
136 yyin = stdin;
137
138 yylex();
139 }
140
141.fi
142This is the beginnings of a simple scanner for a language like
143Pascal. It identifies different types of
144.I tokens
145and reports on what it has seen.
146.LP
147The details of this example will be explained in the following
148sections.
149.SH FORMAT OF THE INPUT FILE
150The
151.I flex
152input file consists of three sections, separated by a line with just
153.B %%
154in it:
155.nf
156
157 definitions
158 %%
159 rules
160 %%
161 user code
162
163.fi
164The
165.I definitions
166section contains declarations of simple
167.I name
168definitions to simplify the scanner specification, and declarations of
169.I start conditions,
170which are explained in a later section.
171.LP
172Name definitions have the form:
173.nf
174
175 name definition
176
177.fi
178The "name" is a word beginning with a letter or an underscore ('_')
179followed by zero or more letters, digits, '_', or '-' (dash).
180The definition is taken to begin at the first non-white-space character
181following the name and continuing to the end of the line.
182The definition can subsequently be referred to using "{name}", which
183will expand to "(definition)". For example,
184.nf
185
186 DIGIT [0-9]
187 ID [a-z][a-z0-9]*
188
189.fi
190defines "DIGIT" to be a regular expression which matches a
191single digit, and
192"ID" to be a regular expression which matches a letter
193followed by zero-or-more letters-or-digits.
194A subsequent reference to
195.nf
196
197 {DIGIT}+"."{DIGIT}*
198
199.fi
200is identical to
201.nf
202
203 ([0-9])+"."([0-9])*
204
205.fi
206and matches one-or-more digits followed by a '.' followed
207by zero-or-more digits.
208.LP
209The
210.I rules
211section of the
212.I flex
213input contains a series of rules of the form:
214.nf
215
216 pattern action
217
218.fi
219where the pattern must be unindented and the action must begin
220on the same line.
221.LP
222See below for a further description of patterns and actions.
223.LP
224Finally, the user code section is simply copied to
225.B lex.yy.c
226verbatim.
227It is used for companion routines which call or are called
228by the scanner. The presence of this section is optional;
229if it is missing, the second
230.B %%
231in the input file may be skipped, too.
232.LP
233In the definitions and rules sections, any
234.I indented
235text or text enclosed in
236.B %{
237and
238.B %}
239is copied verbatim to the output (with the %{}'s removed).
240The %{}'s must appear unindented on lines by themselves.
241.LP
242In the rules section,
243any indented or %{} text appearing before the
244first rule may be used to declare variables
245which are local to the scanning routine and (after the declarations)
246code which is to be executed whenever the scanning routine is entered.
247Other indented or %{} text in the rule section is still copied to the output,
248but its meaning is not well-defined and it may well cause compile-time
249errors (this feature is present for
250.I POSIX
251compliance; see below for other such features).
252.LP
253In the definitions section, an unindented comment (i.e., a line
254beginning with "/*") is also copied verbatim to the output up
255to the next "*/". Also, any line in the definitions section
256beginning with '#' is ignored, though this style of comment is
257deprecated and may go away in the future.
258.SH PATTERNS
259The patterns in the input are written using an extended set of regular
260expressions. These are:
261.nf
262
263 x match the character 'x'
264 . any character except newline
265 [xyz] a "character class"; in this case, the pattern
266 matches either an 'x', a 'y', or a 'z'
267 [abj-oZ] a "character class" with a range in it; matches
268 an 'a', a 'b', any letter from 'j' through 'o',
269 or a 'Z'
270 [^A-Z] a "negated character class", i.e., any character
271 but those in the class. In this case, any
272 character EXCEPT an uppercase letter.
273 [^A-Z\\n] any character EXCEPT an uppercase letter or
274 a newline
275 r* zero or more r's, where r is any regular expression
276 r+ one or more r's
277 r? zero or one r's (that is, "an optional r")
278 r{2,5} anywhere from two to five r's
279 r{2,} two or more r's
280 r{4} exactly 4 r's
281 {name} the expansion of the "name" definition
282 (see above)
283 "[xyz]\\"foo"
284 the literal string: [xyz]"foo
285 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
286 then the ANSI-C interpretation of \\x.
287 Otherwise, a literal 'X' (used to escape
288 operators such as '*')
289 \\123 the character with octal value 123
290 \\x2a the character with hexadecimal value 2a
291 (r) match an r; parentheses are used to override
292 precedence (see below)
293
294
295 rs the regular expression r followed by the
296 regular expression s; called "concatenation"
297
298
299 r|s either an r or an s
300
301
302 r/s an r but only if it is followed by an s. The
303 s is not part of the matched text. This type
304 of pattern is called as "trailing context".
305 ^r an r, but only at the beginning of a line
306 r$ an r, but only at the end of a line. Equivalent
307 to "r/\\n".
308
309
310 <s>r an r, but only in start condition s (see
311 below for discussion of start conditions)
312 <s1,s2,s3>r
313 same, but in any of start conditions s1,
314 s2, or s3
315
316
317 <<EOF>> an end-of-file
318 <s1,s2><<EOF>>
319 an end-of-file when in start condition s1 or s2
320
321.fi
322The regular expressions listed above are grouped according to
323precedence, from highest precedence at the top to lowest at the bottom.
324Those grouped together have equal precedence. For example,
325.nf
326
327 foo|bar*
328
329.fi
330is the same as
331.nf
332
333 (foo)|(ba(r*))
334
335.fi
336since the '*' operator has higher precedence than concatenation,
337and concatenation higher than alternation ('|'). This pattern
338therefore matches
339.I either
340the string "foo"
341.I or
342the string "ba" followed by zero-or-more r's.
343To match "foo" or zero-or-more "bar"'s, use:
344.nf
345
346 foo|(bar)*
347
348.fi
349and to match zero-or-more "foo"'s-or-"bar"'s:
350.nf
351
352 (foo|bar)*
353
354.fi
355.LP
356Some notes on patterns:
357.IP -
358A negated character class such as the example "[^A-Z]"
359above
360.I will match a newline
361unless "\\n" (or an equivalent escape sequence) is one of the
362characters explicitly present in the negated character class
363(e.g., "[^A-Z\\n]"). This is unlike how many other regular
364expression tools treat negated character classes, but unfortunately
365the inconsistency is historically entrenched.
366Matching newlines means that a pattern like [^"]* can match an entire
367input (overflowing the scanner's input buffer) unless there's another
368quote in the input.
369.IP -
370A rule can have at most one instance of trailing context (the '/' operator
371or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
372can only occur at the beginning of a pattern, and, as well as with '/' and '$',
373cannot be grouped inside parentheses. A '^' which does not occur at
374the beginning of a rule or a '$' which does not occur at the end of
375a rule loses its special properties and is treated as a normal character.
376.IP
377The following are illegal:
378.nf
379
380 foo/bar$
381 <sc1>foo<sc2>bar
382
383.fi
384Note that the first of these, can be written "foo/bar\\n".
385.IP
386The following will result in '$' or '^' being treated as a normal character:
387.nf
388
389 foo|(bar$)
390 foo|^bar
391
392.fi
393If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
394could be used (the special '|' action is explained below):
395.nf
396
397 foo |
398 bar$ /* action goes here */
399
400.fi
401A similar trick will work for matching a foo or a
402bar-at-the-beginning-of-a-line.
403.SH HOW THE INPUT IS MATCHED
404When the generated scanner is run, it analyzes its input looking
405for strings which match any of its patterns. If it finds more than
406one match, it takes the one matching the most text (for trailing
407context rules, this includes the length of the trailing part, even
408though it will then be returned to the input). If it finds two
409or more matches of the same length, the
410rule listed first in the
411.I flex
412input file is chosen.
413.LP
414Once the match is determined, the text corresponding to the match
415(called the
416.I token)
417is made available in the global character pointer
418.B yytext,
419and its length in the global integer
420.B yyleng.
421The
422.I action
423corresponding to the matched pattern is then executed (a more
424detailed description of actions follows), and then the remaining
425input is scanned for another match.
426.LP
427If no match is found, then the
428.I default rule
429is executed: the next character in the input is considered matched and
430copied to the standard output. Thus, the simplest legal
431.I flex
432input is:
433.nf
434
435 %%
436
437.fi
438which generates a scanner that simply copies its input (one character
439at a time) to its output.
440.SH ACTIONS
441Each pattern in a rule has a corresponding action, which can be any
442arbitrary C statement. The pattern ends at the first non-escaped
443whitespace character; the remainder of the line is its action. If the
444action is empty, then when the pattern is matched the input token
445is simply discarded. For example, here is the specification for a program
446which deletes all occurrences of "zap me" from its input:
447.nf
448
449 %%
450 "zap me"
451
452.fi
453(It will copy all other characters in the input to the output since
454they will be matched by the default rule.)
455.LP
456Here is a program which compresses multiple blanks and tabs down to
457a single blank, and throws away whitespace found at the end of a line:
458.nf
459
460 %%
461 [ \\t]+ putchar( ' ' );
462 [ \\t]+$ /* ignore this token */
463
464.fi
465.LP
466If the action contains a '{', then the action spans till the balancing '}'
467is found, and the action may cross multiple lines.
468.I flex
469knows about C strings and comments and won't be fooled by braces found
470within them, but also allows actions to begin with
471.B %{
472and will consider the action to be all the text up to the next
473.B %}
474(regardless of ordinary braces inside the action).
475.LP
476An action consisting solely of a vertical bar ('|') means "same as
477the action for the next rule." See below for an illustration.
478.LP
479Actions can include arbitrary C code, including
480.B return
481statements to return a value to whatever routine called
482.B yylex().
483Each time
484.B yylex()
485is called it continues processing tokens from where it last left
486off until it either reaches
487the end of the file or executes a return. Once it reaches an end-of-file,
488however, then any subsequent call to
489.B yylex()
490will simply immediately return, unless
491.B yyrestart()
492is first called (see below).
493.LP
494Actions are not allowed to modify yytext or yyleng.
495.LP
496There are a number of special directives which can be included within
497an action:
498.IP -
499.B ECHO
500copies yytext to the scanner's output.
501.IP -
502.B BEGIN
503followed by the name of a start condition places the scanner in the
504corresponding start condition (see below).
505.IP -
506.B REJECT
507directs the scanner to proceed on to the "second best" rule which matched the
508input (or a prefix of the input). The rule is chosen as described
509above in "How the Input is Matched", and
510.B yytext
511and
512.B yyleng
513set up appropriately.
514It may either be one which matched as much text
515as the originally chosen rule but came later in the
516.I flex
517input file, or one which matched less text.
518For example, the following will both count the
519words in the input and call the routine special() whenever "frob" is seen:
520.nf
521
522 int word_count = 0;
523 %%
524
525 frob special(); REJECT;
526 [^ \\t\\n]+ ++word_count;
527
528.fi
529Without the
530.B REJECT,
531any "frob"'s in the input would not be counted as words, since the
532scanner normally executes only one action per token.
533Multiple
534.B REJECT's
535are allowed, each one finding the next best choice to the currently
536active rule. For example, when the following scanner scans the token
537"abcd", it will write "abcdabcaba" to the output:
538.nf
539
540 %%
541 a |
542 ab |
543 abc |
544 abcd ECHO; REJECT;
545 .|\\n /* eat up any unmatched character */
546
547.fi
548(The first three rules share the fourth's action since they use
549the special '|' action.)
550.B REJECT
551is a particularly expensive feature in terms scanner performance;
552if it is used in
553.I any
554of the scanner's actions it will slow down
555.I all
556of the scanner's matching. Furthermore,
557.B REJECT
558cannot be used with the
559.I -f
560or
561.I -F
562options (see below).
563.IP
564Note also that unlike the other special actions,
565.B REJECT
566is a
567.I branch;
568code immediately following it in the action will
569.I not
570be executed.
571.IP -
572.B yymore()
573tells the scanner that the next time it matches a rule, the corresponding
574token should be
575.I appended
576onto the current value of
577.B yytext
578rather than replacing it. For example, given the input "mega-kludge"
579the following will write "mega-mega-kludge" to the output:
580.nf
581
582 %%
583 mega- ECHO; yymore();
584 kludge ECHO;
585
586.fi
587First "mega-" is matched and echoed to the output. Then "kludge"
588is matched, but the previous "mega-" is still hanging around at the
589beginning of
590.B yytext
591so the
592.B ECHO
593for the "kludge" rule will actually write "mega-kludge".
594The presence of
595.B yymore()
596in the scanner's action entails a minor performance penalty in the
597scanner's matching speed.
598.IP -
599.B yyless(n)
600returns all but the first
601.I n
602characters of the current token back to the input stream, where they
603will be rescanned when the scanner looks for the next match.
604.B yytext
605and
606.B yyleng
607are adjusted appropriately (e.g.,
608.B yyleng
609will now be equal to
610.I n
611). For example, on the input "foobar" the following will write out
612"foobarbar":
613.nf
614
615 %%
616 foobar ECHO; yyless(3);
617 [a-z]+ ECHO;
618
619.fi
620An argument of 0 to
621.B yyless
622will cause the entire current input string to be scanned again. Unless you've
623changed how the scanner will subsequently process its input (using
624.B BEGIN,
625for example), this will result in an endless loop.
626.IP -
627.B unput(c)
628puts the character
629.I c
630back onto the input stream. It will be the next character scanned.
631The following action will take the current token and cause it
632to be rescanned enclosed in parentheses.
633.nf
634
635 {
636 int i;
637 unput( ')' );
638 for ( i = yyleng - 1; i >= 0; --i )
639 unput( yytext[i] );
640 unput( '(' );
641 }
642
643.fi
644Note that since each
645.B unput()
646puts the given character back at the
647.I beginning
648of the input stream, pushing back strings must be done back-to-front.
649.IP -
650.B input()
651reads the next character from the input stream. For example,
652the following is one way to eat up C comments:
653.nf
654
655 %%
656 "/*" {
657 register int c;
658
659 for ( ; ; )
660 {
661 while ( (c = input()) != '*' &&
662 c != EOF )
663 ; /* eat up text of comment */
664
665 if ( c == '*' )
666 {
667 while ( (c = input()) == '*' )
668 ;
669 if ( c == '/' )
670 break; /* found the end */
671 }
672
673 if ( c == EOF )
674 {
675 error( "EOF in comment" );
676 break;
677 }
678 }
679 }
680
681.fi
682(Note that if the scanner is compiled using
683.B C++,
684then
685.B input()
686is instead referred to as
687.B yyinput(),
688in order to avoid a name clash with the
689.B C++
690stream by the name of
691.I input.)
692.IP -
693.B yyterminate()
694can be used in lieu of a return statement in an action. It terminates
695the scanner and returns a 0 to the scanner's caller, indicating "all done".
696Subsequent calls to the scanner will immediately return unless preceded
697by a call to
698.B yyrestart()
699(see below).
700By default,
701.B yyterminate()
702is also called when an end-of-file is encountered. It is a macro and
703may be redefined.
704.SH THE GENERATED SCANNER
705The output of
706.I flex
707is the file
708.B lex.yy.c,
709which contains the scanning routine
710.B yylex(),
711a number of tables used by it for matching tokens, and a number
712of auxiliary routines and macros. By default,
713.B yylex()
714is declared as follows:
715.nf
716
717 int yylex()
718 {
719 ... various definitions and the actions in here ...
720 }
721
722.fi
723(If your environment supports function prototypes, then it will
724be "int yylex( void )".) This definition may be changed by redefining
725the "YY_DECL" macro. For example, you could use:
726.nf
727
728 #undef YY_DECL
729 #define YY_DECL float lexscan( a, b ) float a, b;
730
731.fi
732to give the scanning routine the name
733.I lexscan,
734returning a float, and taking two floats as arguments. Note that
735if you give arguments to the scanning routine using a
736K&R-style/non-prototyped function declaration, you must terminate
737the definition with a semi-colon (;).
738.LP
739Whenever
740.B yylex()
741is called, it scans tokens from the global input file
742.I yyin
743(which defaults to stdin). It continues until it either reaches
744an end-of-file (at which point it returns the value 0) or
745one of its actions executes a
746.I return
747statement.
748In the former case, when called again the scanner will immediately
749return unless
750.B yyrestart()
751is called to point
752.I yyin
753at the new input file. (
754.B yyrestart()
755takes one argument, a
756.B FILE *
757pointer.)
758In the latter case (i.e., when an action
759executes a return), the scanner may then be called again and it
760will resume scanning where it left off.
761.LP
762By default (and for purposes of efficiency), the scanner uses
763block-reads rather than simple
764.I getc()
765calls to read characters from
766.I yyin.
767The nature of how it gets its input can be controlled by redefining the
768.B YY_INPUT
769macro.
770YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
771action is to place up to
772.I max_size
773characters in the character array
774.I buf
775and return in the integer variable
776.I result
777either the
778number of characters read or the constant YY_NULL (0 on Unix systems)
779to indicate EOF. The default YY_INPUT reads from the
780global file-pointer "yyin".
781.LP
782A sample redefinition of YY_INPUT (in the definitions
783section of the input file):
784.nf
785
786 %{
787 #undef YY_INPUT
788 #define YY_INPUT(buf,result,max_size) \\
789 { \\
790 int c = getchar(); \\
791 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
792 }
793 %}
794
795.fi
796This definition will change the input processing to occur
797one character at a time.
798.LP
799You also can add in things like keeping track of the
800input line number this way; but don't expect your scanner to
801go very fast.
802.LP
803When the scanner receives an end-of-file indication from YY_INPUT,
804it then checks the
805.B yywrap()
806function. If
807.B yywrap()
808returns false (zero), then it is assumed that the
809function has gone ahead and set up
810.I yyin
811to point to another input file, and scanning continues. If it returns
812true (non-zero), then the scanner terminates, returning 0 to its
813caller.
814.LP
815The default
816.B yywrap()
817always returns 1. Presently, to redefine it you must first
818"#undef yywrap", as it is currently implemented as a macro. As indicated
819by the hedging in the previous sentence, it may be changed to
820a true function in the near future.
821.LP
822The scanner writes its
823.B ECHO
824output to the
825.I yyout
826global (default, stdout), which may be redefined by the user simply
827by assigning it to some other
828.B FILE
829pointer.
830.SH START CONDITIONS
831.I flex
832provides a mechanism for conditionally activating rules. Any rule
833whose pattern is prefixed with "<sc>" will only be active when
834the scanner is in the start condition named "sc". For example,
835.nf
836
837 <STRING>[^"]* { /* eat up the string body ... */
838 ...
839 }
840
841.fi
842will be active only when the scanner is in the "STRING" start
843condition, and
844.nf
845
846 <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
847 ...
848 }
849
850.fi
851will be active only when the current start condition is
852either "INITIAL", "STRING", or "QUOTE".
853.LP
854Start conditions
855are declared in the definitions (first) section of the input
856using unindented lines beginning with either
857.B %s
858or
859.B %x
860followed by a list of names.
861The former declares
862.I inclusive
863start conditions, the latter
864.I exclusive
865start conditions. A start condition is activated using the
866.B BEGIN
867action. Until the next
868.B BEGIN
869action is executed, rules with the given start
870condition will be active and
871rules with other start conditions will be inactive.
872If the start condition is
873.I inclusive,
874then rules with no start conditions at all will also be active.
875If it is
876.I exclusive,
877then
878.I only
879rules qualified with the start condition will be active.
880A set of rules contingent on the same exclusive start condition
881describe a scanner which is independent of any of the other rules in the
882.I flex
883input. Because of this,
884exclusive start conditions make it easy to specify "mini-scanners"
885which scan portions of the input that are syntactically different
886from the rest (e.g., comments).
887.LP
888If the distinction between inclusive and exclusive start conditions
889is still a little vague, here's a simple example illustrating the
890connection between the two. The set of rules:
891.nf
892
893 %s example
894 %%
895 <example>foo /* do something */
896
897.fi
898is equivalent to
899.nf
900
901 %x example
902 %%
903 <INITIAL,example>foo /* do something */
904
905.fi
906.LP
907The default rule (to
908.B ECHO
909any unmatched character) remains active in start conditions.
910.LP
911.B BEGIN(0)
912returns to the original state where only the rules with
913no start conditions are active. This state can also be
914referred to as the start-condition "INITIAL", so
915.B BEGIN(INITIAL)
916is equivalent to
917.B BEGIN(0).
918(The parentheses around the start condition name are not required but
919are considered good style.)
920.LP
921.B BEGIN
922actions can also be given as indented code at the beginning
923of the rules section. For example, the following will cause
924the scanner to enter the "SPECIAL" start condition whenever
925.I yylex()
926is called and the global variable
927.I enter_special
928is true:
929.nf
930
931 int enter_special;
932
933 %x SPECIAL
934 %%
935 if ( enter_special )
936 BEGIN(SPECIAL);
937
938 <SPECIAL>blahblahblah
939 ...more rules follow...
940
941.fi
942.LP
943To illustrate the uses of start conditions,
944here is a scanner which provides two different interpretations
945of a string like "123.456". By default it will treat it as
946as three tokens, the integer "123", a dot ('.'), and the integer "456".
947But if the string is preceded earlier in the line by the string
948"expect-floats"
949it will treat it as a single token, the floating-point number
950123.456:
951.nf
952
953 %{
954 #include <math.h>
955 %}
956 %s expect
957
958 %%
959 expect-floats BEGIN(expect);
960
961 <expect>[0-9]+"."[0-9]+ {
962 printf( "found a float, = %f\\n",
963 atof( yytext ) );
964 }
965 <expect>\\n {
966 /* that's the end of the line, so
967 * we need another "expect-number"
968 * before we'll recognize any more
969 * numbers
970 */
971 BEGIN(INITIAL);
972 }
973
974 [0-9]+ {
975 printf( "found an integer, = %d\\n",
976 atoi( yytext ) );
977 }
978
979 "." printf( "found a dot\\n" );
980
981.fi
982Here is a scanner which recognizes (and discards) C comments while
983maintaining a count of the current input line.
984.nf
985
986 %x comment
987 %%
988 int line_num = 1;
989
990 "/*" BEGIN(comment);
991
992 <comment>[^*\\n]* /* eat anything that's not a '*' */
993 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
994 <comment>\\n ++line_num;
995 <comment>"*"+"/" BEGIN(INITIAL);
996
997.fi
998Note that start-conditions names are really integer values and
999can be stored as such. Thus, the above could be extended in the
1000following fashion:
1001.nf
1002
1003 %x comment foo
1004 %%
1005 int line_num = 1;
1006 int comment_caller;
1007
1008 "/*" {
1009 comment_caller = INITIAL;
1010 BEGIN(comment);
1011 }
1012
1013 ...
1014
1015 <foo>"/*" {
1016 comment_caller = foo;
1017 BEGIN(comment);
1018 }
1019
1020 <comment>[^*\\n]* /* eat anything that's not a '*' */
1021 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1022 <comment>\\n ++line_num;
1023 <comment>"*"+"/" BEGIN(comment_caller);
1024
1025.fi
1026One can then implement a "stack" of start conditions using an
1027array of integers. (It is likely that such stacks will become
1028a full-fledged
1029.I flex
1030feature in the future.) Note, though, that
1031start conditions do not have their own name-space; %s's and %x's
1032declare names in the same fashion as #define's.
1033.SH MULTIPLE INPUT BUFFERS
1034Some scanners (such as those which support "include" files)
1035require reading from several input streams. As
1036.I flex
1037scanners do a large amount of buffering, one cannot control
1038where the next input will be read from by simply writing a
1039.B YY_INPUT
1040which is sensitive to the scanning context.
1041.B YY_INPUT
1042is only called when the scanner reaches the end of its buffer, which
1043may be a long time after scanning a statement such as an "include"
1044which requires switching the input source.
1045.LP
1046To negotiate these sorts of problems,
1047.I flex
1048provides a mechanism for creating and switching between multiple
1049input buffers. An input buffer is created by using:
1050.nf
1051
1052 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1053
1054.fi
1055which takes a
1056.I FILE
1057pointer and a size and creates a buffer associated with the given
1058file and large enough to hold
1059.I size
1060characters (when in doubt, use
1061.B YY_BUF_SIZE
1062for the size). It returns a
1063.B YY_BUFFER_STATE
1064handle, which may then be passed to other routines:
1065.nf
1066
1067 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1068
1069.fi
1070switches the scanner's input buffer so subsequent tokens will
1071come from
1072.I new_buffer.
1073Note that
1074.B yy_switch_to_buffer()
1075may be used by yywrap() to sets things up for continued scanning, instead
1076of opening a new file and pointing
1077.I yyin
1078at it.
1079.nf
1080
1081 void yy_delete_buffer( YY_BUFFER_STATE buffer )
1082
1083.fi
1084is used to reclaim the storage associated with a buffer.
1085.LP
1086.B yy_new_buffer()
1087is an alias for
1088.B yy_create_buffer(),
1089provided for compatibility with the C++ use of
1090.I new
1091and
1092.I delete
1093for creating and destroying dynamic objects.
1094.LP
1095Finally, the
1096.B YY_CURRENT_BUFFER
1097macro returns a
1098.B YY_BUFFER_STATE
1099handle to the current buffer.
1100.LP
1101Here is an example of using these features for writing a scanner
1102which expands include files (the
1103.B <<EOF>>
1104feature is discussed below):
1105.nf
1106
1107 /* the "incl" state is used for picking up the name
1108 * of an include file
1109 */
1110 %x incl
1111
1112 %{
1113 #define MAX_INCLUDE_DEPTH 10
1114 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1115 int include_stack_ptr = 0;
1116 %}
1117
1118 %%
1119 include BEGIN(incl);
1120
1121 [a-z]+ ECHO;
1122 [^a-z\\n]*\\n? ECHO;
1123
1124 <incl>[ \\t]* /* eat the whitespace */
1125 <incl>[^ \\t\\n]+ { /* got the include file name */
1126 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1127 {
1128 fprintf( stderr, "Includes nested too deeply" );
1129 exit( 1 );
1130 }
1131
1132 include_stack[include_stack_ptr++] =
1133 YY_CURRENT_BUFFER;
1134
1135 yyin = fopen( yytext, "r" );
1136
1137 if ( ! yyin )
1138 error( ... );
1139
1140 yy_switch_to_buffer(
1141 yy_create_buffer( yyin, YY_BUF_SIZE ) );
1142
1143 BEGIN(INITIAL);
1144 }
1145
1146 <<EOF>> {
1147 if ( --include_stack_ptr < 0 )
1148 {
1149 yyterminate();
1150 }
1151
1152 else
1153 yy_switch_to_buffer(
1154 include_stack[include_stack_ptr] );
1155 }
1156
1157.fi
1158.SH END-OF-FILE RULES
1159The special rule "<<EOF>>" indicates
1160actions which are to be taken when an end-of-file is
1161encountered and yywrap() returns non-zero (i.e., indicates
1162no further files to process). The action must finish
1163by doing one of four things:
1164.IP -
1165the special
1166.B YY_NEW_FILE
1167action, if
1168.I yyin
1169has been pointed at a new file to process;
1170.IP -
1171a
1172.I return
1173statement;
1174.IP -
1175the special
1176.B yyterminate()
1177action;
1178.IP -
1179or, switching to a new buffer using
1180.B yy_switch_to_buffer()
1181as shown in the example above.
1182.LP
1183<<EOF>> rules may not be used with other
1184patterns; they may only be qualified with a list of start
1185conditions. If an unqualified <<EOF>> rule is given, it
1186applies to
1187.I all
1188start conditions which do not already have <<EOF>> actions. To
1189specify an <<EOF>> rule for only the initial start condition, use
1190.nf
1191
1192 <INITIAL><<EOF>>
1193
1194.fi
1195.LP
1196These rules are useful for catching things like unclosed comments.
1197An example:
1198.nf
1199
1200 %x quote
1201 %%
1202
1203 ...other rules for dealing with quotes...
1204
1205 <quote><<EOF>> {
1206 error( "unterminated quote" );
1207 yyterminate();
1208 }
1209 <<EOF>> {
1210 if ( *++filelist )
1211 {
1212 yyin = fopen( *filelist, "r" );
1213 YY_NEW_FILE;
1214 }
1215 else
1216 yyterminate();
1217 }
1218
1219.fi
1220.SH MISCELLANEOUS MACROS
1221The macro
1222.B YY_USER_ACTION
1223can be redefined to provide an action
1224which is always executed prior to the matched rule's action. For example,
1225it could be #define'd to call a routine to convert yytext to lower-case.
1226.LP
1227The macro
1228.B YY_USER_INIT
1229may be redefined to provide an action which is always executed before
1230the first scan (and before the scanner's internal initializations are done).
1231For example, it could be used to call a routine to read
1232in a data table or open a logging file.
1233.LP
1234In the generated scanner, the actions are all gathered in one large
1235switch statement and separated using
1236.B YY_BREAK,
1237which may be redefined. By default, it is simply a "break", to separate
1238each rule's action from the following rule's.
1239Redefining
1240.B YY_BREAK
1241allows, for example, C++ users to
1242#define YY_BREAK to do nothing (while being very careful that every
1243rule ends with a "break" or a "return"!) to avoid suffering from
1244unreachable statement warnings where because a rule's action ends with
1245"return", the
1246.B YY_BREAK
1247is inaccessible.
1248.SH INTERFACING WITH YACC
1249One of the main uses of
1250.I flex
1251is as a companion to the
1252.I yacc
1253parser-generator.
1254.I yacc
1255parsers expect to call a routine named
1256.B yylex()
1257to find the next input token. The routine is supposed to
1258return the type of the next token as well as putting any associated
1259value in the global
1260.B yylval.
1261To use
1262.I flex
1263with
1264.I yacc,
1265one specifies the
1266.B -d
1267option to
1268.I yacc
1269to instruct it to generate the file
1270.B y.tab.h
1271containing definitions of all the
1272.B %tokens
1273appearing in the
1274.I yacc
1275input. This file is then included in the
1276.I flex
1277scanner. For example, if one of the tokens is "TOK_NUMBER",
1278part of the scanner might look like:
1279.nf
1280
1281 %{
1282 #include "y.tab.h"
1283 %}
1284
1285 %%
1286
1287 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
1288
1289.fi
1290.SH TRANSLATION TABLE
1291In the name of POSIX compliance,
1292.I flex
1293supports a
1294.I translation table
1295for mapping input characters into groups.
1296The table is specified in the first section, and its format looks like:
1297.nf
1298
1299 %t
1300 1 abcd
1301 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1302 52 0123456789
1303 6 \\t\\ \\n
1304 %t
1305
1306.fi
1307This example specifies that the characters 'a', 'b', 'c', and 'd'
1308are to all be lumped into group #1, upper-case letters
1309in group #2, digits in group #52, tabs, blanks, and newlines into
1310group #6, and
1311.I
1312no other characters will appear in the patterns.
1313The group numbers are actually disregarded by
1314.I flex;
1315.B %t
1316serves, though, to lump characters together. Given the above
1317table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
1318They both say, "match any character in group #1, followed by
1319zero-or-more pairs of characters
1320from group #2, followed by a character from group #52." Thus
1321.B %t
1322provides a crude way for introducing equivalence classes into
1323the scanner specification.
1324.LP
1325Note that the
1326.B -i
1327option (see below) coupled with the equivalence classes which
1328.I flex
1329automatically generates take care of virtually all the instances
1330when one might consider using
1331.B %t.
1332But what the hell, it's there if you want it.
1333.SH OPTIONS
1334.I flex
1335has the following options:
1336.TP
1337.B -b
1338Generate backtracking information to
1339.I lex.backtrack.
1340This is a list of scanner states which require backtracking
1341and the input characters on which they do so. By adding rules one
1342can remove backtracking states. If all backtracking states
1343are eliminated and
1344.B -f
1345or
1346.B -F
1347is used, the generated scanner will run faster (see the
1348.B -p
1349flag). Only users who wish to squeeze every last cycle out of their
1350scanners need worry about this option. (See the section on PERFORMANCE
1351CONSIDERATIONS below.)
1352.TP
1353.B -c
1354is a do-nothing, deprecated option included for POSIX compliance.
1355.IP
1356.B NOTE:
1357in previous releases of
1358.I flex
1359.B -c
1360specified table-compression options. This functionality is
1361now given by the
1362.B -C
1363flag. To ease the the impact of this change, when
1364.I flex
1365encounters
1366.B -c,
1367it currently issues a warning message and assumes that
1368.B -C
1369was desired instead. In the future this "promotion" of
1370.B -c
1371to
1372.B -C
1373will go away in the name of full POSIX compliance (unless
1374the POSIX meaning is removed first).
1375.TP
1376.B -d
1377makes the generated scanner run in
1378.I debug
1379mode. Whenever a pattern is recognized and the global
1380.B yy_flex_debug
1381is non-zero (which is the default),
1382the scanner will write to
1383.I stderr
1384a line of the form:
1385.nf
1386
1387 --accepting rule at line 53 ("the matched text")
1388
1389.fi
1390The line number refers to the location of the rule in the file
1391defining the scanner (i.e., the file that was fed to flex). Messages
1392are also generated when the scanner backtracks, accepts the
1393default rule, reaches the end of its input buffer (or encounters
1394a NUL; at this point, the two look the same as far as the scanner's concerned),
1395or reaches an end-of-file.
1396.TP
1397.B -f
1398specifies (take your pick)
1399.I full table
1400or
1401.I fast scanner.
1402No table compression is done. The result is large but fast.
1403This option is equivalent to
1404.B -Cf
1405(see below).
1406.TP
1407.B -i
1408instructs
1409.I flex
1410to generate a
1411.I case-insensitive
1412scanner. The case of letters given in the
1413.I flex
1414input patterns will
1415be ignored, and tokens in the input will be matched regardless of case. The
1416matched text given in
1417.I yytext
1418will have the preserved case (i.e., it will not be folded).
1419.TP
1420.B -n
1421is another do-nothing, deprecated option included only for
1422POSIX compliance.
1423.TP
1424.B -p
1425generates a performance report to stderr. The report
1426consists of comments regarding features of the
1427.I flex
1428input file which will cause a loss of performance in the resulting scanner.
1429Note that the use of
1430.I REJECT
1431and variable trailing context (see the BUGS section in flex(1))
1432entails a substantial performance penalty; use of
1433.I yymore(),
1434the
1435.B ^
1436operator,
1437and the
1438.B -I
1439flag entail minor performance penalties.
1440.TP
1441.B -s
1442causes the
1443.I default rule
1444(that unmatched scanner input is echoed to
1445.I stdout)
1446to be suppressed. If the scanner encounters input that does not
1447match any of its rules, it aborts with an error. This option is
1448useful for finding holes in a scanner's rule set.
1449.TP
1450.B -t
1451instructs
1452.I flex
1453to write the scanner it generates to standard output instead
1454of
1455.B lex.yy.c.
1456.TP
1457.B -v
1458specifies that
1459.I flex
1460should write to
1461.I stderr
1462a summary of statistics regarding the scanner it generates.
1463Most of the statistics are meaningless to the casual
1464.I flex
1465user, but the
1466first line identifies the version of
1467.I flex,
1468which is useful for figuring
1469out where you stand with respect to patches and new releases,
1470and the next two lines give the date when the scanner was created
1471and a summary of the flags which were in effect.
1472.TP
1473.B -F
1474specifies that the
1475.I fast
1476scanner table representation should be used. This representation is
1477about as fast as the full table representation
1478.RB ( \-f ),
1479and for some sets of patterns will be considerably smaller (and for
1480others, larger). In general, if the pattern set contains both "keywords"
1481and a catch-all, "identifier" rule, such as in the set:
1482.nf
1483
1484 "case" return TOK_CASE;
1485 "switch" return TOK_SWITCH;
1486 ...
1487 "default" return TOK_DEFAULT;
1488 [a-z]+ return TOK_ID;
1489
1490.fi
1491then you're better off using the full table representation. If only
1492the "identifier" rule is present and you then use a hash table or some such
1493to detect the keywords, you're better off using
1494.BR \-F .
1495.IP
1496This option is equivalent to
1497.B -CF
1498(see below).
1499.TP
1500.B -I
1501instructs
1502.I flex
1503to generate an
1504.I interactive
1505scanner. Normally, scanners generated by
1506.I flex
1507always look ahead one
1508character before deciding that a rule has been matched. At the cost of
1509some scanning overhead,
1510.I flex
1511will generate a scanner which only looks ahead
1512when needed. Such scanners are called
1513.I interactive
1514because if you want to write a scanner for an interactive system such as a
1515command shell, you will probably want the user's input to be terminated
1516with a newline, and without
1517.B -I
1518the user will have to type a character in addition to the newline in order
1519to have the newline recognized. This leads to dreadful interactive
1520performance.
1521.IP
1522If all this seems to confusing, here's the general rule: if a human will
1523be typing in input to your scanner, use
1524.B -I,
1525otherwise don't; if you don't care about squeezing the utmost performance
1526from your scanner and you
1527don't want to make any assumptions about the input to your scanner,
1528use
1529.B -I.
1530.IP
1531Note,
1532.B -I
1533cannot be used in conjunction with
1534.I full
1535or
1536.I fast tables,
1537i.e., the
1538.B -f, -F, -Cf,
1539or
1540.B -CF
1541flags.
1542.TP
1543.B -L
1544instructs
1545.I flex
1546not to generate
1547.B #line
1548directives. Without this option,
1549.I flex
1550peppers the generated scanner
1551with #line directives so error messages in the actions will be correctly
1552located with respect to the original
1553.I flex
1554input file, and not to
1555the fairly meaningless line numbers of
1556.B lex.yy.c.
1557(Unfortunately
1558.I flex
1559does not presently generate the necessary directives
1560to "retarget" the line numbers for those parts of
1561.B lex.yy.c
1562which it generated. So if there is an error in the generated code,
1563a meaningless line number is reported.)
1564.TP
1565.B -T
1566makes
1567.I flex
1568run in
1569.I trace
1570mode. It will generate a lot of messages to
1571.I stdout
1572concerning
1573the form of the input and the resultant non-deterministic and deterministic
1574finite automata. This option is mostly for use in maintaining
1575.I flex.
1576.TP
1577.B -8
1578instructs
1579.I flex
1580to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1581characters. On some sites,
1582.I flex
1583is installed with this option as the default. On others, the default
1584is 7-bit characters. To see which is the case, check the verbose
1585.B (-v)
1586output for "equivalence classes created". If the denominator of
1587the number shown is 128, then by default
1588.I flex
1589is generating 7-bit characters. If it is 256, then the default is
15908-bit characters and the
1591.B -8
1592flag is not required (but may be a good idea to keep the scanner
1593specification portable). Feeding a 7-bit scanner 8-bit characters
1594will result in infinite loops, bus errors, or other such fireworks,
1595so when in doubt, use the flag. Note that if equivalence classes
1596are used, 8-bit scanners take only slightly more table space than
15977-bit scanners (128 bytes, to be exact); if equivalence classes are
1598not used, however, then the tables may grow up to twice their
15997-bit size.
1600.TP
1601.B -C[efmF]
1602controls the degree of table compression.
1603.IP
1604.B -Ce
1605directs
1606.I flex
1607to construct
1608.I equivalence classes,
1609i.e., sets of characters
1610which have identical lexical properties (for example, if the only
1611appearance of digits in the
1612.I flex
1613input is in the character class
1614"[0-9]" then the digits '0', '1', ..., '9' will all be put
1615in the same equivalence class). Equivalence classes usually give
1616dramatic reductions in the final table/object file sizes (typically
1617a factor of 2-5) and are pretty cheap performance-wise (one array
1618look-up per character scanned).
1619.IP
1620.B -Cf
1621specifies that the
1622.I full
1623scanner tables should be generated -
1624.I flex
1625should not compress the
1626tables by taking advantages of similar transition functions for
1627different states.
1628.IP
1629.B -CF
1630specifies that the alternate fast scanner representation (described
1631above under the
1632.B -F
1633flag)
1634should be used.
1635.IP
1636.B -Cm
1637directs
1638.I flex
1639to construct
1640.I meta-equivalence classes,
1641which are sets of equivalence classes (or characters, if equivalence
1642classes are not being used) that are commonly used together. Meta-equivalence
1643classes are often a big win when using compressed tables, but they
1644have a moderate performance impact (one or two "if" tests and one
1645array look-up per character scanned).
1646.IP
1647A lone
1648.B -C
1649specifies that the scanner tables should be compressed but neither
1650equivalence classes nor meta-equivalence classes should be used.
1651.IP
1652The options
1653.B -Cf
1654or
1655.B -CF
1656and
1657.B -Cm
1658do not make sense together - there is no opportunity for meta-equivalence
1659classes if the table is not being compressed. Otherwise the options
1660may be freely mixed.
1661.IP
1662The default setting is
1663.B -Cem,
1664which specifies that
1665.I flex
1666should generate equivalence classes
1667and meta-equivalence classes. This setting provides the highest
1668degree of table compression. You can trade off
1669faster-executing scanners at the cost of larger tables with
1670the following generally being true:
1671.nf
1672
1673 slowest & smallest
1674 -Cem
1675 -Cm
1676 -Ce
1677 -C
1678 -C{f,F}e
1679 -C{f,F}
1680 fastest & largest
1681
1682.fi
1683Note that scanners with the smallest tables are usually generated and
1684compiled the quickest, so
1685during development you will usually want to use the default, maximal
1686compression.
1687.IP
1688.B -Cfe
1689is often a good compromise between speed and size for production
1690scanners.
1691.IP
1692.B -C
1693options are not cumulative; whenever the flag is encountered, the
1694previous -C settings are forgotten.
1695.TP
1696.B -Sskeleton_file
1697overrides the default skeleton file from which
1698.I flex
1699constructs its scanners. You'll never need this option unless you are doing
1700.I flex
1701maintenance or development.
1702.SH PERFORMANCE CONSIDERATIONS
1703The main design goal of
1704.I flex
1705is that it generate high-performance scanners. It has been optimized
1706for dealing well with large sets of rules. Aside from the effects
1707of table compression on scanner speed outlined above,
1708there are a number of options/actions which degrade performance. These
1709are, from most expensive to least:
1710.nf
1711
1712 REJECT
1713
1714 pattern sets that require backtracking
1715 arbitrary trailing context
1716
1717 '^' beginning-of-line operator
1718 yymore()
1719
1720.fi
1721with the first three all being quite expensive and the last two
1722being quite cheap.
1723.LP
1724.B REJECT
1725should be avoided at all costs when performance is important.
1726It is a particularly expensive option.
1727.LP
1728Getting rid of backtracking is messy and often may be an enormous
1729amount of work for a complicated scanner. In principal, one begins
1730by using the
1731.B -b
1732flag to generate a
1733.I lex.backtrack
1734file. For example, on the input
1735.nf
1736
1737 %%
1738 foo return TOK_KEYWORD;
1739 foobar return TOK_KEYWORD;
1740
1741.fi
1742the file looks like:
1743.nf
1744
1745 State #6 is non-accepting -
1746 associated rule line numbers:
1747 2 3
1748 out-transitions: [ o ]
1749 jam-transitions: EOF [ \\001-n p-\\177 ]
1750
1751 State #8 is non-accepting -
1752 associated rule line numbers:
1753 3
1754 out-transitions: [ a ]
1755 jam-transitions: EOF [ \\001-` b-\\177 ]
1756
1757 State #9 is non-accepting -
1758 associated rule line numbers:
1759 3
1760 out-transitions: [ r ]
1761 jam-transitions: EOF [ \\001-q s-\\177 ]
1762
1763 Compressed tables always backtrack.
1764
1765.fi
1766The first few lines tell us that there's a scanner state in
1767which it can make a transition on an 'o' but not on any other
1768character, and that in that state the currently scanned text does not match
1769any rule. The state occurs when trying to match the rules found
1770at lines 2 and 3 in the input file.
1771If the scanner is in that state and then reads
1772something other than an 'o', it will have to backtrack to find
1773a rule which is matched. With
1774a bit of headscratching one can see that this must be the
1775state it's in when it has seen "fo". When this has happened,
1776if anything other than another 'o' is seen, the scanner will
1777have to back up to simply match the 'f' (by the default rule).
1778.LP
1779The comment regarding State #8 indicates there's a problem
1780when "foob" has been scanned. Indeed, on any character other
1781than a 'b', the scanner will have to back up to accept "foo".
1782Similarly, the comment for State #9 concerns when "fooba" has
1783been scanned.
1784.LP
1785The final comment reminds us that there's no point going to
1786all the trouble of removing backtracking from the rules unless
1787we're using
1788.B -f
1789or
1790.B -F,
1791since there's no performance gain doing so with compressed scanners.
1792.LP
1793The way to remove the backtracking is to add "error" rules:
1794.nf
1795
1796 %%
1797 foo return TOK_KEYWORD;
1798 foobar return TOK_KEYWORD;
1799
1800 fooba |
1801 foob |
1802 fo {
1803 /* false alarm, not really a keyword */
1804 return TOK_ID;
1805 }
1806
1807.fi
1808.LP
1809Eliminating backtracking among a list of keywords can also be
1810done using a "catch-all" rule:
1811.nf
1812
1813 %%
1814 foo return TOK_KEYWORD;
1815 foobar return TOK_KEYWORD;
1816
1817 [a-z]+ return TOK_ID;
1818
1819.fi
1820This is usually the best solution when appropriate.
1821.LP
1822Backtracking messages tend to cascade.
1823With a complicated set of rules it's not uncommon to get hundreds
1824of messages. If one can decipher them, though, it often
1825only takes a dozen or so rules to eliminate the backtracking (though
1826it's easy to make a mistake and have an error rule accidentally match
1827a valid token. A possible future
1828.I flex
1829feature will be to automatically add rules to eliminate backtracking).
1830.LP
1831.I Variable
1832trailing context (where both the leading and trailing parts do not have
1833a fixed length) entails almost the same performance loss as
1834.I REJECT
1835(i.e., substantial). So when possible a rule like:
1836.nf
1837
1838 %%
1839 mouse|rat/(cat|dog) run();
1840
1841.fi
1842is better written:
1843.nf
1844
1845 %%
1846 mouse/cat|dog run();
1847 rat/cat|dog run();
1848
1849.fi
1850or as
1851.nf
1852
1853 %%
1854 mouse|rat/cat run();
1855 mouse|rat/dog run();
1856
1857.fi
1858Note that here the special '|' action does
1859.I not
1860provide any savings, and can even make things worse (see
1861.B BUGS
1862in flex(1)).
1863.LP
1864Another area where the user can increase a scanner's performance
1865(and one that's easier to implement) arises from the fact that
1866the longer the tokens matched, the faster the scanner will run.
1867This is because with long tokens the processing of most input
1868characters takes place in the (short) inner scanning loop, and
1869does not often have to go through the additional work of setting up
1870the scanning environment (e.g.,
1871.B yytext)
1872for the action. Recall the scanner for C comments:
1873.nf
1874
1875 %x comment
1876 %%
1877 int line_num = 1;
1878
1879 "/*" BEGIN(comment);
1880
1881 <comment>[^*\\n]*
1882 <comment>"*"+[^*/\\n]*
1883 <comment>\\n ++line_num;
1884 <comment>"*"+"/" BEGIN(INITIAL);
1885
1886.fi
1887This could be sped up by writing it as:
1888.nf
1889
1890 %x comment
1891 %%
1892 int line_num = 1;
1893
1894 "/*" BEGIN(comment);
1895
1896 <comment>[^*\\n]*
1897 <comment>[^*\\n]*\\n ++line_num;
1898 <comment>"*"+[^*/\\n]*
1899 <comment>"*"+[^*/\\n]*\\n ++line_num;
1900 <comment>"*"+"/" BEGIN(INITIAL);
1901
1902.fi
1903Now instead of each newline requiring the processing of another
1904action, recognizing the newlines is "distributed" over the other rules
1905to keep the matched text as long as possible. Note that
1906.I adding
1907rules does
1908.I not
1909slow down the scanner! The speed of the scanner is independent
1910of the number of rules or (modulo the considerations given at the
1911beginning of this section) how complicated the rules are with
1912regard to operators such as '*' and '|'.
1913.LP
1914A final example in speeding up a scanner: suppose you want to scan
1915through a file containing identifiers and keywords, one per line
1916and with no other extraneous characters, and recognize all the
1917keywords. A natural first approach is:
1918.nf
1919
1920 %%
1921 asm |
1922 auto |
1923 break |
1924 ... etc ...
1925 volatile |
1926 while /* it's a keyword */
1927
1928 .|\\n /* it's not a keyword */
1929
1930.fi
1931To eliminate the back-tracking, introduce a catch-all rule:
1932.nf
1933
1934 %%
1935 asm |
1936 auto |
1937 break |
1938 ... etc ...
1939 volatile |
1940 while /* it's a keyword */
1941
1942 [a-z]+ |
1943 .|\\n /* it's not a keyword */
1944
1945.fi
1946Now, if it's guaranteed that there's exactly one word per line,
1947then we can reduce the total number of matches by a half by
1948merging in the recognition of newlines with that of the other
1949tokens:
1950.nf
1951
1952 %%
1953 asm\\n |
1954 auto\\n |
1955 break\\n |
1956 ... etc ...
1957 volatile\\n |
1958 while\\n /* it's a keyword */
1959
1960 [a-z]+\\n |
1961 .|\\n /* it's not a keyword */
1962
1963.fi
1964One has to be careful here, as we have now reintroduced backtracking
1965into the scanner. In particular, while
1966.I we
1967know that there will never be any characters in the input stream
1968other than letters or newlines,
1969.I flex
1970can't figure this out, and it will plan for possibly needing backtracking
1971when it has scanned a token like "auto" and then the next character
1972is something other than a newline or a letter. Previously it would
1973then just match the "auto" rule and be done, but now it has no "auto"
1974rule, only a "auto\\n" rule. To eliminate the possibility of backtracking,
1975we could either duplicate all rules but without final newlines, or,
1976since we never expect to encounter such an input and therefore don't
1977how it's classified, we can introduce one more catch-all rule, this
1978one which doesn't include a newline:
1979.nf
1980
1981 %%
1982 asm\\n |
1983 auto\\n |
1984 break\\n |
1985 ... etc ...
1986 volatile\\n |
1987 while\\n /* it's a keyword */
1988
1989 [a-z]+\\n |
1990 [a-z]+ |
1991 .|\\n /* it's not a keyword */
1992
1993.fi
1994Compiled with
1995.B -Cf,
1996this is about as fast as one can get a
1997.I flex
1998scanner to go for this particular problem.
1999.LP
2000A final note:
2001.I flex
2002is slow when matching NUL's, particularly when a token contains
2003multiple NUL's.
2004It's best to write rules which match
2005.I short
2006amounts of text if it's anticipated that the text will often include NUL's.
2007.SH INCOMPATIBILITIES WITH LEX AND POSIX
2008.I flex
2009is a rewrite of the Unix
2010.I lex
2011tool (the two implementations do not share any code, though),
2012with some extensions and incompatibilities, both of which
2013are of concern to those who wish to write scanners acceptable
2014to either implementation. At present, the POSIX
2015.I lex
2016draft is
2017very close to the original
2018.I lex
2019implementation, so some of these
2020incompatibilities are also in conflict with the POSIX draft. But
2021the intent is that except as noted below,
2022.I flex
2023as it presently stands will
2024ultimately be POSIX conformant (i.e., that those areas of conflict with
2025the POSIX draft will be resolved in
2026.I flex's
2027favor). Please bear in
2028mind that all the comments which follow are with regard to the POSIX
2029.I draft
2030standard of Summer 1989, and not the final document (or subsequent
2031drafts); they are included so
2032.I flex
2033users can be aware of the standardization issues and those areas where
2034.I flex
2035may in the near future undergo changes incompatible with
2036its current definition.
2037.LP
2038.I flex
2039is fully compatible with
2040.I lex
2041with the following exceptions:
2042.IP -
2043The undocumented
2044.I lex
2045scanner internal variable
2046.B yylineno
2047is not supported. It is difficult to support this option efficiently,
2048since it requires examining every character scanned and reexamining
2049the characters when the scanner backs up.
2050Things get more complicated when the end of buffer or file is reached or a
2051NUL is scanned (since the scan must then be restarted with the proper line
2052number count), or the user uses the yyless(), unput(), or REJECT actions,
2053or the multiple input buffer functions.
2054.IP
2055The fix is to add rules which, upon seeing a newline, increment
2056yylineno. This is usually an easy process, though it can be a drag if some
2057of the patterns can match multiple newlines along with other characters.
2058.IP
2059yylineno is not part of the POSIX draft.
2060.IP -
2061The
2062.B input()
2063routine is not redefinable, though it may be called to read characters
2064following whatever has been matched by a rule. If
2065.B input()
2066encounters an end-of-file the normal
2067.B yywrap()
2068processing is done. A ``real'' end-of-file is returned by
2069.B input()
2070as
2071.I EOF.
2072.IP
2073Input is instead controlled by redefining the
2074.B YY_INPUT
2075macro.
2076.IP
2077The
2078.I flex
2079restriction that
2080.B input()
2081cannot be redefined is in accordance with the POSIX draft, but
2082.B YY_INPUT
2083has not yet been accepted into the draft (and probably won't; it looks
2084like the draft will simply not specify any way of controlling the
2085scanner's input other than by making an initial assignment to
2086.I yyin).
2087.IP -
2088.I flex
2089scanners do not use stdio for input. Because of this, when writing an
2090interactive scanner one must explicitly call fflush() on the
2091stream associated with the terminal after writing out a prompt.
2092With
2093.I lex
2094such writes are automatically flushed since
2095.I lex
2096scanners use
2097.B getchar()
2098for their input. Also, when writing interactive scanners with
2099.I flex,
2100the
2101.B -I
2102flag must be used.
2103.IP -
2104.I flex
2105scanners are not as reentrant as
2106.I lex
2107scanners. In particular, if you have an interactive scanner and
2108an interrupt handler which long-jumps out of the scanner, and
2109the scanner is subsequently called again, you may get the following
2110message:
2111.nf
2112
2113 fatal flex scanner internal error--end of buffer missed
2114
2115.fi
2116To reenter the scanner, first use
2117.nf
2118
2119 yyrestart( yyin );
2120
2121.fi
2122.IP -
2123.B output()
2124is not supported.
2125Output from the
2126.B ECHO
2127macro is done to the file-pointer
2128.I yyout
2129(default
2130.I stdout).
2131.IP
2132The POSIX draft mentions that an
2133.B output()
2134routine exists but currently gives no details as to what it does.
2135.IP -
2136.I lex
2137does not support exclusive start conditions (%x), though they
2138are in the current POSIX draft.
2139.IP -
2140When definitions are expanded,
2141.I flex
2142encloses them in parentheses.
2143With lex, the following:
2144.nf
2145
2146 NAME [A-Z][A-Z0-9]*
2147 %%
2148 foo{NAME}? printf( "Found it\\n" );
2149 %%
2150
2151.fi
2152will not match the string "foo" because when the macro
2153is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
2154and the precedence is such that the '?' is associated with
2155"[A-Z0-9]*". With
2156.I flex,
2157the rule will be expanded to
2158"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
2159Note that because of this, the
2160.B ^, $, <s>, /,
2161and
2162.B <<EOF>>
2163operators cannot be used in a
2164.I flex
2165definition.
2166.IP
2167The POSIX draft interpretation is the same as
2168.I flex's.
2169.IP -
2170To specify a character class which matches anything but a left bracket (']'),
2171in
2172.I lex
2173one can use "[^]]" but with
2174.I flex
2175one must use "[^\\]]". The latter works with
2176.I lex,
2177too.
2178.IP -
2179The
2180.I lex
2181.B %r
2182(generate a Ratfor scanner) option is not supported. It is not part
2183of the POSIX draft.
2184.IP -
2185If you are providing your own yywrap() routine, you must include a
2186"#undef yywrap" in the definitions section (section 1). Note that
2187the "#undef" will have to be enclosed in %{}'s.
2188.IP
2189The POSIX draft
2190specifies that yywrap() is a function and this is very unlikely to change; so
2191.I flex users are warned
2192that
2193.B yywrap()
2194is likely to be changed to a function in the near future.
2195.IP -
2196After a call to
2197.B unput(),
2198.I yytext
2199and
2200.I yyleng
2201are undefined until the next token is matched. This is not the case with
2202.I lex
2203or the present POSIX draft.
2204.IP -
2205The precedence of the
2206.B {}
2207(numeric range) operator is different.
2208.I lex
2209interprets "abc{1,3}" as "match one, two, or
2210three occurrences of 'abc'", whereas
2211.I flex
2212interprets it as "match 'ab'
2213followed by one, two, or three occurrences of 'c'". The latter is
2214in agreement with the current POSIX draft.
2215.IP -
2216The precedence of the
2217.B ^
2218operator is different.
2219.I lex
2220interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
2221or 'bar' anywhere", whereas
2222.I flex
2223interprets it as "match either 'foo' or 'bar' if they come at the beginning
2224of a line". The latter is in agreement with the current POSIX draft.
2225.IP -
2226To refer to yytext outside of the scanner source file,
2227the correct definition with
2228.I flex
2229is "extern char *yytext" rather than "extern char yytext[]".
2230This is contrary to the current POSIX draft but a point on which
2231.I flex
2232will not be changing, as the array representation entails a
2233serious performance penalty. It is hoped that the POSIX draft will
2234be emended to support the
2235.I flex
2236variety of declaration (as this is a fairly painless change to
2237require of
2238.I lex
2239users).
2240.IP -
2241.I yyin
2242is
2243.I initialized
2244by
2245.I lex
2246to be
2247.I stdin;
2248.I flex,
2249on the other hand,
2250initializes
2251.I yyin
2252to NULL
2253and then
2254.I assigns
2255it to
2256.I stdin
2257the first time the scanner is called, providing
2258.I yyin
2259has not already been assigned to a non-NULL value. The difference is
2260subtle, but the net effect is that with
2261.I flex
2262scanners,
2263.I yyin
2264does not have a valid value until the scanner has been called.
2265.IP -
2266The special table-size declarations such as
2267.B %a
2268supported by
2269.I lex
2270are not required by
2271.I flex
2272scanners;
2273.I flex
2274ignores them.
2275.IP -
2276The name
2277.B FLEX_SCANNER
2278is #define'd so scanners may be written for use with either
2279.I flex
2280or
2281.I lex.
2282.LP
2283The following
2284.I flex
2285features are not included in
2286.I lex
2287or the POSIX draft standard:
2288.nf
2289
2290 yyterminate()
2291 <<EOF>>
2292 YY_DECL
2293 #line directives
2294 %{}'s around actions
2295 yyrestart()
2296 comments beginning with '#' (deprecated)
2297 multiple actions on a line
2298
2299.fi
2300This last feature refers to the fact that with
2301.I flex
2302you can put multiple actions on the same line, separated with
2303semi-colons, while with
2304.I lex,
2305the following
2306.nf
2307
2308 foo handle_foo(); ++num_foos_seen;
2309
2310.fi
2311is (rather surprisingly) truncated to
2312.nf
2313
2314 foo handle_foo();
2315
2316.fi
2317.I flex
2318does not truncate the action. Actions that are not enclosed in
2319braces are simply terminated at the end of the line.
2320.SH DIAGNOSTICS
2321.I reject_used_but_not_detected undefined
2322or
2323.I yymore_used_but_not_detected undefined -
2324These errors can occur at compile time. They indicate that the
2325scanner uses
2326.B REJECT
2327or
2328.B yymore()
2329but that
2330.I flex
2331failed to notice the fact, meaning that
2332.I flex
2333scanned the first two sections looking for occurrences of these actions
2334and failed to find any, but somehow you snuck some in (via a #include
2335file, for example). Make an explicit reference to the action in your
2336.I flex
2337input file. (Note that previously
2338.I flex
2339supported a
2340.B %used/%unused
2341mechanism for dealing with this problem; this feature is still supported
2342but now deprecated, and will go away soon unless the author hears from
2343people who can argue compellingly that they need it.)
2344.LP
2345.I flex scanner jammed -
2346a scanner compiled with
2347.B -s
2348has encountered an input string which wasn't matched by
2349any of its rules.
2350.LP
2351.I flex input buffer overflowed -
2352a scanner rule matched a string long enough to overflow the
2353scanner's internal input buffer (16K bytes by default - controlled by
2354.B YY_BUF_SIZE
2355in "flex.skel". Note that to redefine this macro, you must first
2356.B #undefine
2357it).
2358.LP
2359.I scanner requires -8 flag -
2360Your scanner specification includes recognizing 8-bit characters and
2361you did not specify the -8 flag (and your site has not installed flex
2362with -8 as the default).
2363.LP
2364.I
2365fatal flex scanner internal error--end of buffer missed -
2366This can occur in an scanner which is reentered after a long-jump
2367has jumped out (or over) the scanner's activation frame. Before
2368reentering the scanner, use:
2369.nf
2370
2371 yyrestart( yyin );
2372
2373.fi
2374.LP
2375.I too many %t classes! -
2376You managed to put every single character into its own %t class.
2377.I flex
2378requires that at least one of the classes share characters.
2379.SH DEFICIENCIES / BUGS
2380See flex(1).
2381.SH "SEE ALSO"
2382.LP
2383flex(1), lex(1), yacc(1), sed(1), awk(9).
2384.LP
2385M. E. Lesk and E. Schmidt,
2386.I LEX - Lexical Analyzer Generator
2387.SH AUTHOR
2388Vern Paxson, with the help of many ideas and much inspiration from
2389Van Jacobson. Original version by Jef Poskanzer. The fast table
2390representation is a partial implementation of a design done by Van
2391Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
2392.LP
2393Thanks to the many
2394.I flex
2395beta-testers, feedbackers, and contributors, especially Casey
2396Leedom, benson@odi.com, Keith Bostic,
2397Frederic Brehm, Nick Christopher, Jason Coughlin,
2398Scott David Daniels, Leo Eskin,
2399Chris Faylor, Eric Goldman, Eric
2400Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
2401Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
2402Jef Poskanzer, Jim Roskind,
2403Dave Tallman, Frank Whaley, Ken Yap, and those whose names
2404have slipped my marginal mail-archiving skills but whose contributions
2405are appreciated all the same.
2406.LP
2407Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
2408Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
2409headaches.
2410.LP
2411Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
2412to Benson Margulies and Fred
2413Burke for C++ support; to Ove Ewerlid for the basics of support for
2414NUL's; and to Eric Hughes for the basics of support for multiple buffers.
2415.LP
2416Work is being done on extending
2417.I flex
2418to generate scanners in which the
2419state machine is directly represented in C code rather than tables.
2420These scanners may well be substantially faster than those generated
2421using -f or -F. If you are working in this area and are interested
2422in comparing notes and seeing whether redundant work can be avoided,
2423contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE).
2424.LP
2425This work was primarily done when I was at the Real Time Systems Group
2426at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
2427for the support I received.
2428.LP
2429Send comments to:
2430.nf
2431
2432 Vern Paxson
2433 Computer Science Department
2434 4126 Upson Hall
2435 Cornell University
2436 Ithaca, NY 14853-7501
2437
2438 vern@cs.cornell.edu
2439 decvax!cornell!vern
2440
2441.fi
2442.\" ref. to awk(9) man page corrected -- ASW 2005-01-15
Note: See TracBrowser for help on using the repository browser.