flexdoc.1@ 15

Last change on this file since 15 was 9, checked in by Mattia Monga, 14 years ago
Minix 3.1.2a
File size: 63.9 KB

1	.TH FLEX 1 "26 May 1990" "Version 2.3"
2	.SH NAME
3	flexdoc - fast lexical analyzer generator
4	.SH SYNOPSIS
5	.B flex
6	.B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
7	.I [filename ...]
8	.SH DESCRIPTION
9	.I flex
10	is a tool for generating
11	.I scanners:
12	programs which recognize lexical patterns in text.
13	.I flex
14	reads
15	the given input files, or its standard input if no file names are given,
16	for a description of a scanner to generate. The description is in
17	the form of pairs
18	of regular expressions and C code, called
19	.I rules. flex
20	generates as output a C source file,
21	.B lex.yy.c,
22	which defines a routine
23	.B yylex().
24	This file is compiled and linked with the
25	.B -lfl
26	library to produce an executable. When the executable is run,
27	it analyzes its input for occurrences
28	of the regular expressions. Whenever it finds one, it executes
29	the corresponding C code.
30	.SH SOME SIMPLE EXAMPLES
31	.LP
32	First some simple examples to get the flavor of how one uses
33	.I flex.
34	The following
35	.I flex
36	input specifies a scanner which, whenever it encounters the string
37	"username", will replace it with the user's login name:
38	.nf
39
40	%%
41	username printf( "%s", getlogin() );
42
43	.fi
44	By default, any text not matched by a
45	.I flex
46	scanner
47	is copied to the output, so the net effect of this scanner is
48	to copy its input file to its output with each occurrence
49	of "username" expanded.
50	In this input, there is just one rule. "username" is the
51	.I pattern
52	and the "printf" is the
53	.I action.
54	The "%%" marks the beginning of the rules.
55	.LP
56	Here's another simple example:
57	.nf
58
59	    int num_lines = 0, num_chars = 0;
60
61	%%
62	\\n ++num_lines; ++num_chars;
63	. ++num_chars;
64
65	%%
66	main()
67	{
68	yylex();
69	printf( "# of lines = %d, # of chars = %d\\n",
70	num_lines, num_chars );
71	}
72
73	.fi
74	This scanner counts the number of characters and the number
75	of lines in its input (it produces no output other than the
76	final report on the counts). The first line
77	declares two globals, "num_lines" and "num_chars", which are accessible
78	both inside
79	.B yylex()
80	and in the
81	.B main()
82	routine declared after the second "%%". There are two rules, one
83	which matches a newline ("\\n") and increments both the line count and
84	the character count, and one which matches any character other than
85	a newline (indicated by the "." regular expression).
86	.LP
87	A somewhat more complicated example:
88	.nf
89
90	/* scanner for a toy Pascal-like language */
91
92	%{
93	/* need this for the calls to atoi() and atof() below */
94	#include <stdlib.h>
95	%}
96
97	DIGIT [0-9]
98	ID [a-z][a-z0-9]*
99
100	%%
101
102	{DIGIT}+ {
103	printf( "An integer: %s (%d)\\n", yytext,
104	atoi( yytext ) );
105	}
106
107	{DIGIT}+"."{DIGIT}* {
108	printf( "A float: %s (%g)\\n", yytext,
109	atof( yytext ) );
110	}
111
112	if|then|begin|end|procedure|function {
113	printf( "A keyword: %s\\n", yytext );
114	}
115
116	{ID} printf( "An identifier: %s\\n", yytext );
117
118	"+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
119
120	"{"[^}\\n]*"}" /* eat up one-line comments */
121
122	[ \\t\\n]+ /* eat up whitespace */
123
124	. printf( "Unrecognized character: %s\\n", yytext );
125
126	%%
127
128	main( argc, argv )
129	int argc;
130	char **argv;
131	{
132	++argv, --argc; /* skip over program name */
133	if ( argc > 0 )
134	yyin = fopen( argv[0], "r" );
135	else
136	yyin = stdin;
137
138	yylex();
139	}
140
141	.fi
142	This is the beginnings of a simple scanner for a language like
143	Pascal. It identifies different types of
144	.I tokens
145	and reports on what it has seen.
146	.LP
147	The details of this example will be explained in the following
148	sections.
149	.SH FORMAT OF THE INPUT FILE
150	The
151	.I flex
152	input file consists of three sections, separated by a line with just
153	.B %%
154	in it:
155	.nf
156
157	definitions
158	%%
159	rules
160	%%
161	user code
162
163	.fi
164	The
165	.I definitions
166	section contains declarations of simple
167	.I name
168	definitions to simplify the scanner specification, and declarations of
169	.I start conditions,
170	which are explained in a later section.
171	.LP
172	Name definitions have the form:
173	.nf
174
175	name definition
176
177	.fi
178	The "name" is a word beginning with a letter or an underscore ('_')
179	followed by zero or more letters, digits, '_', or '-' (dash).
180	The definition is taken to begin at the first non-white-space character
181	following the name and continuing to the end of the line.
182	The definition can subsequently be referred to using "{name}", which
183	will expand to "(definition)". For example,
184	.nf
185
186	DIGIT [0-9]
187	ID [a-z][a-z0-9]*
188
189	.fi
190	defines "DIGIT" to be a regular expression which matches a
191	single digit, and
192	"ID" to be a regular expression which matches a letter
193	followed by zero-or-more letters-or-digits.
194	A subsequent reference to
195	.nf
196
197	{DIGIT}+"."{DIGIT}*
198
199	.fi
200	is identical to
201	.nf
202
203	([0-9])+"."([0-9])*
204
205	.fi
206	and matches one-or-more digits followed by a '.' followed
207	by zero-or-more digits.
208	.LP
209	The
210	.I rules
211	section of the
212	.I flex
213	input contains a series of rules of the form:
214	.nf
215
216	pattern action
217
218	.fi
219	where the pattern must be unindented and the action must begin
220	on the same line.
221	.LP
222	See below for a further description of patterns and actions.
223	.LP
224	Finally, the user code section is simply copied to
225	.B lex.yy.c
226	verbatim.
227	It is used for companion routines which call or are called
228	by the scanner. The presence of this section is optional;
229	if it is missing, the second
230	.B %%
231	in the input file may be skipped, too.
232	.LP
233	In the definitions and rules sections, any
234	.I indented
235	text or text enclosed in
236	.B %{
237	and
238	.B %}
239	is copied verbatim to the output (with the %{}'s removed).
240	The %{}'s must appear unindented on lines by themselves.
241	.LP
242	In the rules section,
243	any indented or %{} text appearing before the
244	first rule may be used to declare variables
245	which are local to the scanning routine and (after the declarations)
246	code which is to be executed whenever the scanning routine is entered.
247	Other indented or %{} text in the rule section is still copied to the output,
248	but its meaning is not well-defined and it may well cause compile-time
249	errors (this feature is present for
250	.I POSIX
251	compliance; see below for other such features).
252	.LP
253	In the definitions section, an unindented comment (i.e., a line
254	beginning with "/*") is also copied verbatim to the output up
255	to the next "*/". Also, any line in the definitions section
256	beginning with '#' is ignored, though this style of comment is
257	deprecated and may go away in the future.
258	.SH PATTERNS
259	The patterns in the input are written using an extended set of regular
260	expressions. These are:
261	.nf
262
263	x              match the character 'x'
264	.              any character except newline
265	[xyz]          a "character class"; in this case, the pattern
266	                 matches either an 'x', a 'y', or a 'z'
267	[abj-oZ]       a "character class" with a range in it; matches
268	                 an 'a', a 'b', any letter from 'j' through 'o',
269	                 or a 'Z'
270	[^A-Z]         a "negated character class", i.e., any character
271	                 but those in the class. In this case, any
272	                 character EXCEPT an uppercase letter.
273	[^A-Z\\n]       any character EXCEPT an uppercase letter or
274	                 a newline
275	r*             zero or more r's, where r is any regular expression
276	r+             one or more r's
277	r?             zero or one r's (that is, "an optional r")
278	r{2,5}         anywhere from two to five r's
279	r{2,}          two or more r's
280	r{4}           exactly 4 r's
281	{name}         the expansion of the "name" definition
282	                 (see above)
283	"[xyz]\\"foo"
284	               the literal string: [xyz]"foo
285	\\X             if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
286	                 then the ANSI-C interpretation of \\x.
287	                 Otherwise, a literal 'X' (used to escape
288	                 operators such as '*')
289	\\123           the character with octal value 123
290	\\x2a           the character with hexadecimal value 2a
291	(r)            match an r; parentheses are used to override
292	                 precedence (see below)
293
294
295	rs             the regular expression r followed by the
296	                 regular expression s; called "concatenation"
297
298
299	r|s            either an r or an s
300
301
302	r/s            an r but only if it is followed by an s. The
303	                 s is not part of the matched text. This type
304	                 of pattern is called "trailing context".
305	^r             an r, but only at the beginning of a line
306	r$             an r, but only at the end of a line. Equivalent
307	                 to "r/\\n".
308
309
310	<s>r           an r, but only in start condition s (see
311	                 below for discussion of start conditions)
312	<s1,s2,s3>r
313	               same, but in any of start conditions s1,
314	                 s2, or s3
315
316
317	<<EOF>>        an end-of-file
318	<s1,s2><<EOF>>
319	               an end-of-file when in start condition s1 or s2
320
321	.fi
322	The regular expressions listed above are grouped according to
323	precedence, from highest precedence at the top to lowest at the bottom.
324	Those grouped together have equal precedence. For example,
325	.nf
326
327	foo|bar*
328
329	.fi
330	is the same as
331	.nf
332
333	(foo)|(ba(r*))
334
335	.fi
336	since the '*' operator has higher precedence than concatenation,
337	and concatenation higher than alternation ('|'). This pattern
338	therefore matches
339	.I either
340	the string "foo"
341	.I or
342	the string "ba" followed by zero-or-more r's.
343	To match "foo" or zero-or-more "bar"'s, use:
344	.nf
345
346	foo|(bar)*
347
348	.fi
349	and to match zero-or-more "foo"'s-or-"bar"'s:
350	.nf
351
352	(foo|bar)*
353
354	.fi
355	.LP
356	Some notes on patterns:
357	.IP -
358	A negated character class such as the example "[^A-Z]"
359	above
360	.I will match a newline
361	unless "\\n" (or an equivalent escape sequence) is one of the
362	characters explicitly present in the negated character class
363	(e.g., "[^A-Z\\n]"). This is unlike how many other regular
364	expression tools treat negated character classes, but unfortunately
365	the inconsistency is historically entrenched.
366	Matching newlines means that a pattern like [^"]* can match an entire
367	input (overflowing the scanner's input buffer) unless there's another
368	quote in the input.
369	.IP -
370	A rule can have at most one instance of trailing context (the '/' operator
371	or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
372	can only occur at the beginning of a pattern and, like '/' and '$',
373	cannot be grouped inside parentheses. A '^' which does not occur at
374	the beginning of a rule or a '$' which does not occur at the end of
375	a rule loses its special properties and is treated as a normal character.
376	.IP
377	The following are illegal:
378	.nf
379
380	foo/bar$
381	<sc1>foo<sc2>bar
382
383	.fi
384	Note that the first of these can be written "foo/bar\\n".
385	.IP
386	The following will result in '$' or '^' being treated as a normal character:
387	.nf
388
389	foo|(bar$)
390	foo|^bar
391
392	.fi
393	If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
394	could be used (the special '|' action is explained below):
395	.nf
396
397	foo |
398	bar$ /* action goes here */
399
400	.fi
401	A similar trick will work for matching a foo or a
402	bar-at-the-beginning-of-a-line.
403	.SH HOW THE INPUT IS MATCHED
404	When the generated scanner is run, it analyzes its input looking
405	for strings which match any of its patterns. If it finds more than
406	one match, it takes the one matching the most text (for trailing
407	context rules, this includes the length of the trailing part, even
408	though it will then be returned to the input). If it finds two
409	or more matches of the same length, the
410	rule listed first in the
411	.I flex
412	input file is chosen.
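.LP
The selection rule above (longest match first, earliest rule on a tie)
can be sketched in C. This is only a model under stated assumptions: the
name choose_rule and the per-rule array of match lengths are hypothetical,
introduced for illustration, and a generated scanner does not necessarily
compute its matches this way.

```c
#include <stddef.h>

/*
 * Pick the winning rule from per-rule match lengths (hypothetical model).
 * match_len[i] is the number of characters rule i matched at the current
 * input position (0 means "no match"); rules are indexed in the order in
 * which they appear in the flex input.
 * Returns the index of the chosen rule, or -1 if no rule matched, in
 * which case the default rule would apply.
 */
int choose_rule(const size_t *match_len, int n_rules)
{
    int best = -1;
    size_t best_len = 0;
    int i;

    for (i = 0; i < n_rules; ++i) {
        /* strictly greater: on a tie the earlier rule is kept */
        if (match_len[i] > best_len) {
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

For instance, with a keyword rule listed before an identifier rule, the
input "if" gives both rules a length of 2 and the keyword rule (the lower
index) is chosen, while "iffy" gives the identifier rule a length of 4 and
it wins.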
413	.LP
414	Once the match is determined, the text corresponding to the match
415	(called the
416	.I token)
417	is made available in the global character pointer
418	.B yytext,
419	and its length in the global integer
420	.B yyleng.
421	The
422	.I action
423	corresponding to the matched pattern is then executed (a more
424	detailed description of actions follows), and then the remaining
425	input is scanned for another match.
426	.LP
427	If no match is found, then the
428	.I default rule
429	is executed: the next character in the input is considered matched and
430	copied to the standard output. Thus, the simplest legal
431	.I flex
432	input is:
433	.nf
434
435	%%
436
437	.fi
438	which generates a scanner that simply copies its input (one character
439	at a time) to its output.
440	.SH ACTIONS
441	Each pattern in a rule has a corresponding action, which can be any
442	arbitrary C statement. The pattern ends at the first non-escaped
443	whitespace character; the remainder of the line is its action. If the
444	action is empty, then when the pattern is matched the input token
445	is simply discarded. For example, here is the specification for a program
446	which deletes all occurrences of "zap me" from its input:
447	.nf
448
449	%%
450	"zap me"
451
452	.fi
453	(It will copy all other characters in the input to the output since
454	they will be matched by the default rule.)
455	.LP
456	Here is a program which compresses multiple blanks and tabs down to
457	a single blank, and throws away whitespace found at the end of a line:
458	.nf
459
460	%%
461	[ \\t]+ putchar( ' ' );
462	[ \\t]+$ /* ignore this token */
463
464	.fi
465	.LP
466	If the action contains a '{', then the action spans till the balancing '}'
467	is found, and the action may cross multiple lines.
468	.I flex
469	knows about C strings and comments and won't be fooled by braces found
470	within them, but also allows actions to begin with
471	.B %{
472	and will consider the action to be all the text up to the next
473	.B %}
474	(regardless of ordinary braces inside the action).
475	.LP
476	An action consisting solely of a vertical bar ('|') means "same as
477	the action for the next rule." See below for an illustration.
478	.LP
479	Actions can include arbitrary C code, including
480	.B return
481	statements to return a value to whatever routine called
482	.B yylex().
483	Each time
484	.B yylex()
485	is called it continues processing tokens from where it last left
486	off until it either reaches
487	the end of the file or executes a return. Once it reaches an end-of-file,
488	however, then any subsequent call to
489	.B yylex()
490	will simply immediately return, unless
491	.B yyrestart()
492	is first called (see below).
493	.LP
494	Actions are not allowed to modify yytext or yyleng.
495	.LP
496	There are a number of special directives which can be included within
497	an action:
498	.IP -
499	.B ECHO
500	copies yytext to the scanner's output.
501	.IP -
502	.B BEGIN
503	followed by the name of a start condition places the scanner in the
504	corresponding start condition (see below).
505	.IP -
506	.B REJECT
507	directs the scanner to proceed on to the "second best" rule which matched the
508	input (or a prefix of the input). The rule is chosen as described
509	above in "How the Input is Matched", and
510	.B yytext
511	and
512	.B yyleng
513	set up appropriately.
514	It may either be one which matched as much text
515	as the originally chosen rule but came later in the
516	.I flex
517	input file, or one which matched less text.
518	For example, the following will both count the
519	words in the input and call the routine special() whenever "frob" is seen:
520	.nf
521
522	    int word_count = 0;
523	%%
524
525	frob special(); REJECT;
526	[^ \\t\\n]+ ++word_count;
527
528	.fi
529	Without the
530	.B REJECT,
531	any "frob"'s in the input would not be counted as words, since the
532	scanner normally executes only one action per token.
533	Multiple
534	.B REJECT's
535	are allowed, each one finding the next best choice to the currently
536	active rule. For example, when the following scanner scans the token
537	"abcd", it will write "abcdabcaba" to the output:
538	.nf
539
540	%%
541	a |
542	ab |
543	abc |
544	abcd ECHO; REJECT;
545	.|\\n /* eat up any unmatched character */
546
547	.fi
548	(The first three rules share the fourth's action since they use
549	the special '|' action.)
550	.B REJECT
551	is a particularly expensive feature in terms of scanner performance;
552	if it is used in
553	.I any
554	of the scanner's actions it will slow down
555	.I all
556	of the scanner's matching. Furthermore,
557	.B REJECT
558	cannot be used with the
559	.I -f
560	or
561	.I -F
562	options (see below).
563	.IP
564	Note also that unlike the other special actions,
565	.B REJECT
566	is a
567	.I branch;
568	code immediately following it in the action will
569	.I not
570	be executed.
571	.IP -
572	.B yymore()
573	tells the scanner that the next time it matches a rule, the corresponding
574	token should be
575	.I appended
576	onto the current value of
577	.B yytext
578	rather than replacing it. For example, given the input "mega-kludge"
579	the following will write "mega-mega-kludge" to the output:
580	.nf
581
582	%%
583	mega- ECHO; yymore();
584	kludge ECHO;
585
586	.fi
587	First "mega-" is matched and echoed to the output. Then "kludge"
588	is matched, but the previous "mega-" is still hanging around at the
589	beginning of
590	.B yytext
591	so the
592	.B ECHO
593	for the "kludge" rule will actually write "mega-kludge".
594	The presence of
595	.B yymore()
596	in the scanner's action entails a minor performance penalty in the
597	scanner's matching speed.
598	.IP -
599	.B yyless(n)
600	returns all but the first
601	.I n
602	characters of the current token back to the input stream, where they
603	will be rescanned when the scanner looks for the next match.
604	.B yytext
605	and
606	.B yyleng
607	are adjusted appropriately (e.g.,
608	.B yyleng
609	will now be equal to
610	.I n
611	). For example, on the input "foobar" the following will write out
612	"foobarbar":
613	.nf
614
615	%%
616	foobar ECHO; yyless(3);
617	[a-z]+ ECHO;
618
619	.fi
620	An argument of 0 to
621	.B yyless
622	will cause the entire current input string to be scanned again. Unless you've
623	changed how the scanner will subsequently process its input (using
624	.B BEGIN,
625	for example), this will result in an endless loop.
626	.IP -
627	.B unput(c)
628	puts the character
629	.I c
630	back onto the input stream. It will be the next character scanned.
631	The following action will take the current token and cause it
632	to be rescanned enclosed in parentheses.
633	.nf
634
635	{
636	int i;
637	unput( ')' );
638	for ( i = yyleng - 1; i >= 0; --i )
639	unput( yytext[i] );
640	unput( '(' );
641	}
642
643	.fi
644	Note that since each
645	.B unput()
646	puts the given character back at the
647	.I beginning
648	of the input stream, pushing back strings must be done back-to-front.
649	.IP -
650	.B input()
651	reads the next character from the input stream. For example,
652	the following is one way to eat up C comments:
653	.nf
654
655	%%
656	"/*" {
657	register int c;
658
659	for ( ; ; )
660	{
661	while ( (c = input()) != '*' &&
662	c != EOF )
663	; /* eat up text of comment */
664
665	if ( c == '*' )
666	{
667	while ( (c = input()) == '*' )
668	;
669	if ( c == '/' )
670	break; /* found the end */
671	}
672
673	if ( c == EOF )
674	{
675	error( "EOF in comment" );
676	break;
677	}
678	}
679	}
680
681	.fi
682	(Note that if the scanner is compiled using
683	.B C++,
684	then
685	.B input()
686	is instead referred to as
687	.B yyinput(),
688	in order to avoid a name clash with the
689	.B C++
690	stream by the name of
691	.I input.)
692	.IP -
693	.B yyterminate()
694	can be used in lieu of a return statement in an action. It terminates
695	the scanner and returns a 0 to the scanner's caller, indicating "all done".
696	Subsequent calls to the scanner will immediately return unless preceded
697	by a call to
698	.B yyrestart()
699	(see below).
700	By default,
701	.B yyterminate()
702	is also called when an end-of-file is encountered. It is a macro and
703	may be redefined.
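.LP
The back-to-front ordering required when pushing a string back with
unput() can be illustrated with a small stand-alone model. Everything
named here (model_unput, model_input, unput_parenthesized, read_back and
the fixed-size pushback array) is hypothetical and merely imitates the
described behavior of unput() and input(); it is not code taken from a
generated scanner.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 64

/* Hypothetical stand-in for the scanner's input: a stack of pushed-back
 * characters; the character on top of the stack is read next. */
static char pushback[PUSHBACK_MAX];
static size_t pushback_len = 0;

/* Model of unput(c): c becomes the next character to be read.
 * Characters are silently dropped if the model's stack is full. */
static void model_unput(char c)
{
    if (pushback_len < PUSHBACK_MAX)
        pushback[pushback_len++] = c;
}

/* Model of input(): returns the next character, or -1 if none is left. */
static int model_input(void)
{
    return pushback_len > 0 ? (unsigned char) pushback[--pushback_len] : -1;
}

/* Push back "(token)" so that it is later read in normal order:
 * the closing parenthesis goes first, the opening one last. */
void unput_parenthesized(const char *token)
{
    size_t i = strlen(token);

    model_unput(')');
    while (i > 0)
        model_unput(token[--i]);
    model_unput('(');
}

/* Drain the pushed-back characters into out (always NUL-terminated). */
void read_back(char *out, size_t out_size)
{
    size_t n = 0;
    int c;

    if (out_size == 0)
        return;
    while (n + 1 < out_size && (c = model_input()) != -1)
        out[n++] = (char) c;
    out[n] = '\0';
}
```

Calling unput_parenthesized( "abc" ) and then read_back() yields the
characters in reading order, because the last character pushed is the
first one read.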
704	.SH THE GENERATED SCANNER
705	The output of
706	.I flex
707	is the file
708	.B lex.yy.c,
709	which contains the scanning routine
710	.B yylex(),
711	a number of tables used by it for matching tokens, and a number
712	of auxiliary routines and macros. By default,
713	.B yylex()
714	is declared as follows:
715	.nf
716
717	int yylex()
718	{
719	... various definitions and the actions in here ...
720	}
721
722	.fi
723	(If your environment supports function prototypes, then it will
724	be "int yylex( void )".) This definition may be changed by redefining
725	the "YY_DECL" macro. For example, you could use:
726	.nf
727
728	#undef YY_DECL
729	#define YY_DECL float lexscan( a, b ) float a, b;
730
731	.fi
732	to give the scanning routine the name
733	.I lexscan,
734	returning a float, and taking two floats as arguments. Note that
735	if you give arguments to the scanning routine using a
736	K&R-style/non-prototyped function declaration, you must terminate
737	the definition with a semi-colon (;).
738	.LP
739	Whenever
740	.B yylex()
741	is called, it scans tokens from the global input file
742	.I yyin
743	(which defaults to stdin). It continues until it either reaches
744	an end-of-file (at which point it returns the value 0) or
745	one of its actions executes a
746	.I return
747	statement.
748	In the former case, when called again the scanner will immediately
749	return unless
750	.B yyrestart()
751	is called to point
752	.I yyin
753	at the new input file. (
754	.B yyrestart()
755	takes one argument, a
756	.B FILE *
757	pointer.)
758	In the latter case (i.e., when an action
759	executes a return), the scanner may then be called again and it
760	will resume scanning where it left off.
761	.LP
762	By default (and for purposes of efficiency), the scanner uses
763	block-reads rather than simple
764	.I getc()
765	calls to read characters from
766	.I yyin.
767	The nature of how it gets its input can be controlled by redefining the
768	.B YY_INPUT
769	macro.
770	YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
771	action is to place up to
772	.I max_size
773	characters in the character array
774	.I buf
775	and return in the integer variable
776	.I result
777	either the
778	number of characters read or the constant YY_NULL (0 on Unix systems)
779	to indicate EOF. The default YY_INPUT reads from the
780	global file-pointer "yyin".
781	.LP
782	A sample redefinition of YY_INPUT (in the definitions
783	section of the input file):
784	.nf
785
786	%{
787	#undef YY_INPUT
788	#define YY_INPUT(buf,result,max_size) \\
789	{ \\
790	int c = getchar(); \\
791	result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
792	}
793	%}
794
795	.fi
796	This definition will change the input processing to occur
797	one character at a time.
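.LP
As a further sketch of the same calling convention, the routine below
fills the buffer from an in-memory string instead of a file. The names
set_source and read_from_string are assumptions made for this example
only; the sketch follows the contract described above (place up to
max_size characters in buf, report the count, report 0 at end of input).

```c
#include <string.h>

/* Hypothetical in-memory input source; not part of flex itself. */
static const char *src_text = "";
static size_t src_pos = 0;

/* Select the string to be scanned and rewind to its start. */
void set_source(const char *text)
{
    src_text = text;
    src_pos = 0;
}

/*
 * Copy up to max_size characters into buf and return the number copied.
 * A return value of 0 plays the role of YY_NULL (end of input).
 */
int read_from_string(char *buf, int max_size)
{
    size_t left = strlen(src_text) - src_pos;
    size_t n;

    if (max_size <= 0)
        return 0;
    n = left < (size_t) max_size ? left : (size_t) max_size;
    memcpy(buf, src_text + src_pos, n);
    src_pos += n;
    return (int) n;
}

/* Assumed hook-up, placed in the definitions section of a flex input:
 *   #undef YY_INPUT
 *   #define YY_INPUT(buf,result,max_size) { result = read_from_string( buf, max_size ); }
 */
```

Because whole blocks are copied at once, this variant keeps the
efficiency of block reads while still replacing the input source.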
798	.LP
799	You also can add in things like keeping track of the
800	input line number this way; but don't expect your scanner to
801	go very fast.
802	.LP
803	When the scanner receives an end-of-file indication from YY_INPUT,
804	it then checks the
805	.B yywrap()
806	function. If
807	.B yywrap()
808	returns false (zero), then it is assumed that the
809	function has gone ahead and set up
810	.I yyin
811	to point to another input file, and scanning continues. If it returns
812	true (non-zero), then the scanner terminates, returning 0 to its
813	caller.
814	.LP
815	The default
816	.B yywrap()
817	always returns 1. Presently, to redefine it you must first
818	"#undef yywrap", as it is currently implemented as a macro. As indicated
819	by the hedging in the previous sentence, it may be changed to
820	a true function in the near future.
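.LP
The return convention of yywrap() can be sketched with a routine that
steps through a list of file names. This is a stand-alone model, not a
drop-in replacement: yyin_model stands in for the scanner's yyin, and
set_file_list and wrap_to_next_file are hypothetical names. Wiring such
a routine up as yywrap (after the "#undef yywrap" mentioned above) is
left out.

```c
#include <stdio.h>

/* Stand-in for the scanner's global input pointer, declared here only so
 * that the sketch compiles on its own; a real scanner provides yyin. */
static FILE *yyin_model = NULL;

/* Hypothetical NULL-terminated list of input file names still to scan. */
static const char **next_file = NULL;

void set_file_list(const char **names)
{
    next_file = names;
}

/*
 * Follows the convention described above: return 0 after pointing the
 * input at another file, or 1 when there is nothing left to scan.
 * Files that cannot be opened are skipped.
 */
int wrap_to_next_file(void)
{
    while (next_file != NULL && *next_file != NULL) {
        FILE *f = fopen(*next_file++, "r");

        if (f != NULL) {
            /* release the previous input unless it was stdin */
            if (yyin_model != NULL && yyin_model != stdin)
                fclose(yyin_model);
            yyin_model = f;
            return 0;
        }
    }
    return 1;
}
```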
821	.LP
822	The scanner writes its
823	.B ECHO
824	output to the
825	.I yyout
826	global (default, stdout), which may be redefined by the user simply
827	by assigning it to some other
828	.B FILE
829	pointer.
830	.SH START CONDITIONS
831	.I flex
832	provides a mechanism for conditionally activating rules. Any rule
833	whose pattern is prefixed with "<sc>" will only be active when
834	the scanner is in the start condition named "sc". For example,
835	.nf
836
837	<STRING>[^"]* { /* eat up the string body ... */
838	...
839	}
840
841	.fi
842	will be active only when the scanner is in the "STRING" start
843	condition, and
844	.nf
845
846	<INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
847	...
848	}
849
850	.fi
851	will be active only when the current start condition is
852	either "INITIAL", "STRING", or "QUOTE".
853	.LP
854	Start conditions
855	are declared in the definitions (first) section of the input
856	using unindented lines beginning with either
857	.B %s
858	or
859	.B %x
860	followed by a list of names.
861	The former declares
862	.I inclusive
863	start conditions, the latter
864	.I exclusive
865	start conditions. A start condition is activated using the
866	.B BEGIN
867	action. Until the next
868	.B BEGIN
869	action is executed, rules with the given start
870	condition will be active and
871	rules with other start conditions will be inactive.
872	If the start condition is
873	.I inclusive,
874	then rules with no start conditions at all will also be active.
875	If it is
876	.I exclusive,
877	then
878	.I only
879	rules qualified with the start condition will be active.
880	A set of rules contingent on the same exclusive start condition
881	describes a scanner which is independent of any of the other rules in the
882	.I flex
883	input. Because of this,
884	exclusive start conditions make it easy to specify "mini-scanners"
885	which scan portions of the input that are syntactically different
886	from the rest (e.g., comments).
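.LP
The activation rules just described can be summarized as a small
predicate. The struct start_cond type and the function rule_is_active are
hypothetical, written only to restate the prose; they do not describe how
a generated scanner represents start conditions internally.

```c
#include <stddef.h>

/* Hypothetical description of a start condition. */
struct start_cond {
    int id;         /* value that would be used with BEGIN */
    int exclusive;  /* nonzero if declared with %x, zero for %s */
};

/*
 * Decide whether a rule is active in the current start condition.
 * rule_conds/n_conds list the start conditions the rule is qualified
 * with; n_conds == 0 means the rule has no <...> prefix.
 */
int rule_is_active(const struct start_cond *current,
                   const int *rule_conds, size_t n_conds)
{
    size_t i;

    /* unqualified rules are active unless the condition is exclusive */
    if (n_conds == 0)
        return !current->exclusive;

    /* qualified rules are active only in the conditions they name */
    for (i = 0; i < n_conds; ++i)
        if (rule_conds[i] == current->id)
            return 1;
    return 0;
}
```

In this model the initial state is treated as an inclusive condition with
id 0, so unqualified rules are active in it.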
887	.LP
888	If the distinction between inclusive and exclusive start conditions
889	is still a little vague, here's a simple example illustrating the
890	connection between the two. The set of rules:
891	.nf
892
893	%s example
894	%%
895	<example>foo /* do something */
896
897	.fi
898	is equivalent to
899	.nf
900
901	%x example
902	%%
903	<INITIAL,example>foo /* do something */
904
905	.fi
906	.LP
907	The default rule (to
908	.B ECHO
909	any unmatched character) remains active in start conditions.
910	.LP
911	.B BEGIN(0)
912	returns to the original state where only the rules with
913	no start conditions are active. This state can also be
914	referred to as the start-condition "INITIAL", so
915	.B BEGIN(INITIAL)
916	is equivalent to
917	.B BEGIN(0).
918	(The parentheses around the start condition name are not required but
919	are considered good style.)
920	.LP
921	.B BEGIN
922	actions can also be given as indented code at the beginning
923	of the rules section. For example, the following will cause
924	the scanner to enter the "SPECIAL" start condition whenever
925	.I yylex()
926	is called and the global variable
927	.I enter_special
928	is true:
929	.nf
930
931	    int enter_special;
932
933	%x SPECIAL
934	%%
935	    if ( enter_special )
936	        BEGIN(SPECIAL);
937
938	<SPECIAL>blahblahblah
939	...more rules follow...
940
941	.fi
942	.LP
943	To illustrate the uses of start conditions,
944	here is a scanner which provides two different interpretations
945	of a string like "123.456". By default it will treat it
946	as three tokens, the integer "123", a dot ('.'), and the integer "456".
947	But if the string is preceded earlier in the line by the string
948	"expect-floats"
949	it will treat it as a single token, the floating-point number
950	123.456:
951	.nf
952
953	%{
954	#include <stdlib.h>
955	%}
956	%s expect
957
958	%%
959	expect-floats BEGIN(expect);
960
961	<expect>[0-9]+"."[0-9]+ {
962	printf( "found a float, = %f\\n",
963	atof( yytext ) );
964	}
965	<expect>\\n {
966	/* that's the end of the line, so
967	* we need another "expect-floats"
968	* before we'll recognize any more
969	* numbers
970	*/
971	BEGIN(INITIAL);
972	}
973
974	[0-9]+ {
975	printf( "found an integer, = %d\\n",
976	atoi( yytext ) );
977	}
978
979	"." printf( "found a dot\\n" );
980
981	.fi
982	Here is a scanner which recognizes (and discards) C comments while
983	maintaining a count of the current input line.
984	.nf
985
986	%x comment
987	%%
988	    int line_num = 1;
989
990	"/*" BEGIN(comment);
991
992	<comment>[^*\\n]* /* eat anything that's not a '*' */
993	<comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
994	<comment>\\n ++line_num;
995	<comment>"*"+"/" BEGIN(INITIAL);
996
997	.fi
998	Note that start-condition names are really integer values and
999	can be stored as such. Thus, the above could be extended in the
1000	following fashion:
1001	.nf
1002
1003	%x comment foo
1004	%%
1005	    int line_num = 1;
1006	    int comment_caller;
1007
1008	"/*" {
1009	comment_caller = INITIAL;
1010	BEGIN(comment);
1011	}
1012
1013	...
1014
1015	<foo>"/*" {
1016	comment_caller = foo;
1017	BEGIN(comment);
1018	}
1019
1020	<comment>[^*\\n]* /* eat anything that's not a '*' */
1021	<comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1022	<comment>\\n ++line_num;
1023	<comment>"*"+"/" BEGIN(comment_caller);
1024
1025	.fi
1026	One can then implement a "stack" of start conditions using an
1027	array of integers. (It is likely that such stacks will become
1028	a full-fledged
1029	.I flex
1030	feature in the future.) Note, though, that
1031	start conditions do not have their own name-space; %s's and %x's
1032	declare names in the same fashion as #define's.
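.LP
A "stack" of start conditions kept in an array of integers might look
like the following. This is a sketch under stated assumptions: sc_push,
sc_pop, sc_current and SC_STACK_MAX are hypothetical names, and the
variable current_sc stands in for the scanner's own state, which a real
action would change with BEGIN.

```c
#define SC_STACK_MAX 16

/* Stand-in for the scanner's current start condition; in a real scanner
 * the switch itself would be made with BEGIN. */
static int current_sc = 0;      /* 0 corresponds to INITIAL */

static int sc_stack[SC_STACK_MAX];
static int sc_depth = 0;

/* Remember the current start condition and enter new_sc.
 * Returns 0 on success, -1 if the stack is full. */
int sc_push(int new_sc)
{
    if (sc_depth >= SC_STACK_MAX)
        return -1;
    sc_stack[sc_depth++] = current_sc;
    current_sc = new_sc;
    return 0;
}

/* Return to the most recently saved start condition.
 * Returns 0 on success, -1 if the stack is empty. */
int sc_pop(void)
{
    if (sc_depth == 0)
        return -1;
    current_sc = sc_stack[--sc_depth];
    return 0;
}

/* Report the start condition the model is currently in. */
int sc_current(void)
{
    return current_sc;
}
```

With such helpers, a single saved variable like comment_caller is no
longer needed: the comment rules would push on "/*" and pop on the
closing delimiter, whatever condition they were entered from.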
1033	.SH MULTIPLE INPUT BUFFERS
1034	Some scanners (such as those which support "include" files)
1035	require reading from several input streams. As
1036	.I flex
1037	scanners do a large amount of buffering, one cannot control
1038	where the next input will be read from by simply writing a
1039	.B YY_INPUT
1040	which is sensitive to the scanning context.
1041	.B YY_INPUT
1042	is only called when the scanner reaches the end of its buffer, which
1043	may be a long time after scanning a statement such as an "include"
1044	which requires switching the input source.
1045	.LP
1046	To negotiate these sorts of problems,
1047	.I flex
1048	provides a mechanism for creating and switching between multiple
1049	input buffers. An input buffer is created by using:
1050	.nf
1051
1052	YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1053
1054	.fi
1055	which takes a
1056	.I FILE
1057	pointer and a size and creates a buffer associated with the given
1058	file and large enough to hold
1059	.I size
1060	characters (when in doubt, use
1061	.B YY_BUF_SIZE
1062	for the size). It returns a
1063	.B YY_BUFFER_STATE
1064	handle, which may then be passed to other routines:
1065	.nf
1066
1067	void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1068
1069	.fi
1070	switches the scanner's input buffer so subsequent tokens will
1071	come from
1072	.I new_buffer.
1073	Note that
1074	.B yy_switch_to_buffer()
1075	may be used by yywrap() to set things up for continued scanning, instead
1076	of opening a new file and pointing
1077	.I yyin
1078	at it.
1079	.nf
1080
1081	void yy_delete_buffer( YY_BUFFER_STATE buffer )
1082
1083	.fi
1084	is used to reclaim the storage associated with a buffer.
1085	.LP
1086	.B yy_new_buffer()
1087	is an alias for
1088	.B yy_create_buffer(),
1089	provided for compatibility with the C++ use of
1090	.I new
1091	and
1092	.I delete
1093	for creating and destroying dynamic objects.
1094	.LP
1095	Finally, the
1096	.B YY_CURRENT_BUFFER
1097	macro returns a
1098	.B YY_BUFFER_STATE
1099	handle to the current buffer.
1100	.LP
1101	Here is an example of using these features for writing a scanner
1102	which expands include files (the
1103	.B <<EOF>>
1104	feature is discussed below):
1105	.nf
1106
1107	/* the "incl" state is used for picking up the name
1108	* of an include file
1109	*/
1110	%x incl
1111
1112	%{
1113	#define MAX_INCLUDE_DEPTH 10
1114	YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1115	int include_stack_ptr = 0;
1116	%}
1117
1118	%%
1119	include BEGIN(incl);
1120
1121	[a-z]+ ECHO;
1122	[^a-z\\n]*\\n? ECHO;
1123
1124	<incl>[ \\t]* /* eat the whitespace */
1125	<incl>[^ \\t\\n]+ { /* got the include file name */
1126	        if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1127	            {
1128	            fprintf( stderr, "Includes nested too deeply" );
1129	            exit( 1 );
1130	            }
1131
1132	        include_stack[include_stack_ptr++] =
1133	            YY_CURRENT_BUFFER;
1134
1135	        yyin = fopen( yytext, "r" );
1136
1137	        if ( ! yyin )
1138	            error( ... );
1139
1140	        yy_switch_to_buffer(
1141	            yy_create_buffer( yyin, YY_BUF_SIZE ) );
1142
1143	        BEGIN(INITIAL);
1144	        }
1145
1146	<<EOF>> {
1147	        if ( --include_stack_ptr < 0 )
1148	            {
1149	            yyterminate();
1150	            }
1151
1152	        else
1153	            yy_switch_to_buffer(
1154	                include_stack[include_stack_ptr] );
1155	        }
1156
1157	.fi
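The stack discipline used in the example above can be looked at separately from flex. The following is a minimal C sketch, not part of flex: buffer_handle is a stand-in for YY_BUFFER_STATE, and push_include() and pop_include() are hypothetical helper names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INCLUDE_DEPTH 10

/* Stand-in for YY_BUFFER_STATE so the sketch compiles without flex. */
typedef void *buffer_handle;

static buffer_handle include_stack[MAX_INCLUDE_DEPTH];
static int include_stack_ptr = 0;

/* Save the current buffer before switching to an included file.
 * Returns 0 on success, -1 if includes are nested too deeply. */
int push_include(buffer_handle current)
{
    if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
        return -1;
    include_stack[include_stack_ptr++] = current;
    return 0;
}

/* At end-of-file: return the buffer to resume, or NULL if the
 * outermost file has ended and the scanner should terminate. */
buffer_handle pop_include(void)
{
    assert(include_stack_ptr >= 0);
    if (include_stack_ptr == 0)
        return NULL;
    return include_stack[--include_stack_ptr];
}
```

In a scanner, the include rule would call push_include( YY_CURRENT_BUFFER ) before switching buffers, and the <<EOF>> rule would hand a non-NULL result of pop_include() to yy_switch_to_buffer(). Note that the example above never calls yy_delete_buffer() on the buffer of a finished include file; a scanner that processes many includes may want to do so before switching back.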
1158	.SH END-OF-FILE RULES
1159	The special rule "<<EOF>>" indicates
1160	actions which are to be taken when an end-of-file is
1161	encountered and yywrap() returns non-zero (i.e., indicates
1162	no further files to process). The action must finish
1163	by doing one of four things:
1164	.IP -
1165	the special
1166	.B YY_NEW_FILE
1167	action, if
1168	.I yyin
1169	has been pointed at a new file to process;
1170	.IP -
1171	a
1172	.I return
1173	statement;
1174	.IP -
1175	the special
1176	.B yyterminate()
1177	action;
1178	.IP -
1179	or, switching to a new buffer using
1180	.B yy_switch_to_buffer()
1181	as shown in the example above.
1182	.LP
1183	<<EOF>> rules may not be used with other
1184	patterns; they may only be qualified with a list of start
1185	conditions. If an unqualified <<EOF>> rule is given, it
1186	applies to
1187	.I all
1188	start conditions which do not already have <<EOF>> actions. To
1189	specify an <<EOF>> rule for only the initial start condition, use
1190	.nf
1191
1192	<INITIAL><<EOF>>
1193
1194	.fi
1195	.LP
1196	These rules are useful for catching things like unclosed comments.
1197	An example:
1198	.nf
1199
1200	%x quote
1201	%%
1202
1203	...other rules for dealing with quotes...
1204
1205	<quote><<EOF>> {
1206	        error( "unterminated quote" );
1207	        yyterminate();
1208	        }
1209	<<EOF>> {
1210	        if ( *++filelist )
1211	            {
1212	            yyin = fopen( *filelist, "r" );
1213	            YY_NEW_FILE;
1214	            }
1215	        else
1216	            yyterminate();
1217	        }
1218
1219	.fi
1220	.SH MISCELLANEOUS MACROS
1221	The macro
1222	.B YY_USER_ACTION
1223	can be redefined to provide an action
1224	which is always executed prior to the matched rule's action. For example,
1225	it could be #define'd to call a routine to convert yytext to lower-case.
1226	.LP
1227	The macro
1228	.B YY_USER_INIT
1229	may be redefined to provide an action which is always executed before
1230	the first scan (and before the scanner's internal initializations are done).
1231	For example, it could be used to call a routine to read
1232	in a data table or open a logging file.
1233	.LP
1234	In the generated scanner, the actions are all gathered in one large
1235	switch statement and separated using
1236	.B YY_BREAK,
1237	which may be redefined. By default, it is simply a "break", to separate
1238	each rule's action from the following rule's.
1239	Redefining
1240	.B YY_BREAK
1241	allows, for example, C++ users to
1242	#define YY_BREAK to do nothing (while being very careful that every
1243	rule ends with a "break" or a "return"!) and so avoid
1244	unreachable-statement warnings where a rule's action ends with
1245	"return", which leaves the
1246	.B YY_BREAK
1247	inaccessible.
1248	.SH INTERFACING WITH YACC
1249	One of the main uses of
1250	.I flex
1251	is as a companion to the
1252	.I yacc
1253	parser-generator.
1254	.I yacc
1255	parsers expect to call a routine named
1256	.B yylex()
1257	to find the next input token. The routine is supposed to
1258	return the type of the next token as well as putting any associated
1259	value in the global
1260	.B yylval.
1261	To use
1262	.I flex
1263	with
1264	.I yacc,
1265	one specifies the
1266	.B -d
1267	option to
1268	.I yacc
1269	to instruct it to generate the file
1270	.B y.tab.h
1271	containing definitions of all the
1272	.B %tokens
1273	appearing in the
1274	.I yacc
1275	input. This file is then included in the
1276	.I flex
1277	scanner. For example, if one of the tokens is "TOK_NUMBER",
1278	part of the scanner might look like:
1279	.nf
1280
1281	%{
1282	#include "y.tab.h"
1283	%}
1284
1285	%%
1286
1287	[0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
1288
1289	.fi
1290	.SH TRANSLATION TABLE
1291	In the name of POSIX compliance,
1292	.I flex
1293	supports a
1294	.I translation table
1295	for mapping input characters into groups.
1296	The table is specified in the first section, and its format looks like:
1297	.nf
1298
1299	%t
1300	1 abcd
1301	2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1302	52 0123456789
1303	6 \\t\\ \\n
1304	%t
1305
1306	.fi
1307	This example specifies that the characters 'a', 'b', 'c', and 'd'
1308	are to all be lumped into group #1, upper-case letters
1309	in group #2, digits in group #52, tabs, blanks, and newlines into
1310	group #6, and
1311	.I
1312	no other characters will appear in the patterns.
1313	The group numbers are actually disregarded by
1314	.I flex;
1315	.B %t
1316	serves, though, to lump characters together. Given the above
1317	table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
1318	They both say, "match any character in group #1, followed by
1319	zero-or-more pairs of characters
1320	from group #2, followed by a character from group #52." Thus
1321	.B %t
1322	provides a crude way for introducing equivalence classes into
1323	the scanner specification.
1324	.LP
1325	Note that the
1326	.B -i
1327	option (see below) coupled with the equivalence classes which
1328	.I flex
1329	automatically generates take care of virtually all the instances
1330	when one might consider using
1331	.B %t.
1332	But what the hell, it's there if you want it.
1333	.SH OPTIONS
1334	.I flex
1335	has the following options:
1336	.TP
1337	.B -b
1338	Generate backtracking information to
1339	.I lex.backtrack.
1340	This is a list of scanner states which require backtracking
1341	and the input characters on which they do so. By adding rules one
1342	can remove backtracking states. If all backtracking states
1343	are eliminated and
1344	.B -f
1345	or
1346	.B -F
1347	is used, the generated scanner will run faster (see the
1348	.B -p
1349	flag). Only users who wish to squeeze every last cycle out of their
1350	scanners need worry about this option. (See the section on PERFORMANCE
1351	CONSIDERATIONS below.)
1352	.TP
1353	.B -c
1354	is a do-nothing, deprecated option included for POSIX compliance.
1355	.IP
1356	.B NOTE:
1357	in previous releases of
1358	.I flex
1359	.B -c
1360	specified table-compression options. This functionality is
1361	now given by the
1362	.B -C
1363	flag. To ease the impact of this change, when
1364	.I flex
1365	encounters
1366	.B -c,
1367	it currently issues a warning message and assumes that
1368	.B -C
1369	was desired instead. In the future this "promotion" of
1370	.B -c
1371	to
1372	.B -C
1373	will go away in the name of full POSIX compliance (unless
1374	the POSIX meaning is removed first).
1375	.TP
1376	.B -d
1377	makes the generated scanner run in
1378	.I debug
1379	mode. Whenever a pattern is recognized and the global
1380	.B yy_flex_debug
1381	is non-zero (which is the default),
1382	the scanner will write to
1383	.I stderr
1384	a line of the form:
1385	.nf
1386
1387	--accepting rule at line 53 ("the matched text")
1388
1389	.fi
1390	The line number refers to the location of the rule in the file
1391	defining the scanner (i.e., the file that was fed to flex). Messages
1392	are also generated when the scanner backtracks, accepts the
1393	default rule, reaches the end of its input buffer (or encounters
1394	a NUL; at this point, the two look the same as far as the scanner's concerned),
1395	or reaches an end-of-file.
1396	.TP
1397	.B -f
1398	specifies (take your pick)
1399	.I full table
1400	or
1401	.I fast scanner.
1402	No table compression is done. The result is large but fast.
1403	This option is equivalent to
1404	.B -Cf
1405	(see below).
1406	.TP
1407	.B -i
1408	instructs
1409	.I flex
1410	to generate a
1411	.I case-insensitive
1412	scanner. The case of letters given in the
1413	.I flex
1414	input patterns will
1415	be ignored, and tokens in the input will be matched regardless of case. The
1416	matched text given in
1417	.I yytext
1418	will preserve the case of the input (i.e., it will not be folded).
1419	.TP
1420	.B -n
1421	is another do-nothing, deprecated option included only for
1422	POSIX compliance.
1423	.TP
1424	.B -p
1425	generates a performance report to stderr. The report
1426	consists of comments regarding features of the
1427	.I flex
1428	input file which will cause a loss of performance in the resulting scanner.
1429	Note that the use of
1430	.I REJECT
1431	and variable trailing context (see the BUGS section in flex(1))
1432	entails a substantial performance penalty; use of
1433	.I yymore(),
1434	the
1435	.B ^
1436	operator,
1437	and the
1438	.B -I
1439	flag entail minor performance penalties.
1440	.TP
1441	.B -s
1442	causes the
1443	.I default rule
1444	(that unmatched scanner input is echoed to
1445	.I stdout)
1446	to be suppressed. If the scanner encounters input that does not
1447	match any of its rules, it aborts with an error. This option is
1448	useful for finding holes in a scanner's rule set.
1449	.TP
1450	.B -t
1451	instructs
1452	.I flex
1453	to write the scanner it generates to standard output instead
1454	of
1455	.B lex.yy.c.
1456	.TP
1457	.B -v
1458	specifies that
1459	.I flex
1460	should write to
1461	.I stderr
1462	a summary of statistics regarding the scanner it generates.
1463	Most of the statistics are meaningless to the casual
1464	.I flex
1465	user, but the
1466	first line identifies the version of
1467	.I flex,
1468	which is useful for figuring
1469	out where you stand with respect to patches and new releases,
1470	and the next two lines give the date when the scanner was created
1471	and a summary of the flags which were in effect.
1472	.TP
1473	.B -F
1474	specifies that the
1475	.I fast
1476	scanner table representation should be used. This representation is
1477	about as fast as the full table representation
1478	.RB ( \-f ),
1479	and for some sets of patterns will be considerably smaller (and for
1480	others, larger). In general, if the pattern set contains both "keywords"
1481	and a catch-all, "identifier" rule, such as in the set:
1482	.nf
1483
1484	"case" return TOK_CASE;
1485	"switch" return TOK_SWITCH;
1486	...
1487	"default" return TOK_DEFAULT;
1488	[a-z]+ return TOK_ID;
1489
1490	.fi
1491	then you're better off using the full table representation. If only
1492	the "identifier" rule is present and you then use a hash table or some such
1493	to detect the keywords, you're better off using
1494	.BR \-F .
1495	.IP
1496	This option is equivalent to
1497	.B -CF
1498	(see below).
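The second approach above, a lone "identifier" rule plus a separate keyword check, might look roughly like the following C sketch. It uses a sorted table and bsearch() in place of a hash table (the "some such" of the text). The token values and the name classify_identifier() are hypothetical; in practice the token codes would come from y.tab.h.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes, for illustration only. */
enum { TOK_ID = 256, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
    const char *name;
    int token;
};

/* Must stay sorted by name for bsearch(). */
static const struct keyword keywords[] = {
    { "case",    TOK_CASE },
    { "default", TOK_DEFAULT },
    { "switch",  TOK_SWITCH },
};

static int compare_keyword(const void *key, const void *elem)
{
    return strcmp((const char *) key,
                  ((const struct keyword *) elem)->name);
}

/* Return the keyword's token, or TOK_ID if text is not a keyword. */
int classify_identifier(const char *text)
{
    const struct keyword *kw;

    assert(text != NULL);
    kw = bsearch(text, keywords,
                 sizeof keywords / sizeof keywords[0],
                 sizeof keywords[0], compare_keyword);
    return kw ? kw->token : TOK_ID;
}
```

The scanner's only rule would then be something like "[a-z]+ return classify_identifier( yytext );".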
1499	.TP
1500	.B -I
1501	instructs
1502	.I flex
1503	to generate an
1504	.I interactive
1505	scanner. Normally, scanners generated by
1506	.I flex
1507	always look ahead one
1508	character before deciding that a rule has been matched. At the cost of
1509	some scanning overhead,
1510	.I flex
1511	will generate a scanner which only looks ahead
1512	when needed. Such scanners are called
1513	.I interactive
1514	because if you want to write a scanner for an interactive system such as a
1515	command shell, you will probably want the user's input to be terminated
1516	with a newline, and without
1517	.B -I
1518	the user will have to type a character in addition to the newline in order
1519	to have the newline recognized. This leads to dreadful interactive
1520	performance.
1521	.IP
1522	If all this seems too confusing, here's the general rule: if a human will
1523	be typing in input to your scanner, use
1524	.B -I,
1525	otherwise don't; if you don't care about squeezing the utmost performance
1526	from your scanner and you
1527	don't want to make any assumptions about the input to your scanner,
1528	use
1529	.B -I.
1530	.IP
1531	Note,
1532	.B -I
1533	cannot be used in conjunction with
1534	.I full
1535	or
1536	.I fast tables,
1537	i.e., the
1538	.B -f, -F, -Cf,
1539	or
1540	.B -CF
1541	flags.
1542	.TP
1543	.B -L
1544	instructs
1545	.I flex
1546	not to generate
1547	.B #line
1548	directives. Without this option,
1549	.I flex
1550	peppers the generated scanner
1551	with #line directives so error messages in the actions will be correctly
1552	located with respect to the original
1553	.I flex
1554	input file, and not to
1555	the fairly meaningless line numbers of
1556	.B lex.yy.c.
1557	(Unfortunately
1558	.I flex
1559	does not presently generate the necessary directives
1560	to "retarget" the line numbers for those parts of
1561	.B lex.yy.c
1562	which it generated. So if there is an error in the generated code,
1563	a meaningless line number is reported.)
1564	.TP
1565	.B -T
1566	makes
1567	.I flex
1568	run in
1569	.I trace
1570	mode. It will generate a lot of messages to
1571	.I stdout
1572	concerning
1573	the form of the input and the resultant non-deterministic and deterministic
1574	finite automata. This option is mostly for use in maintaining
1575	.I flex.
1576	.TP
1577	.B -8
1578	instructs
1579	.I flex
1580	to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1581	characters. On some sites,
1582	.I flex
1583	is installed with this option as the default. On others, the default
1584	is 7-bit characters. To see which is the case, check the verbose
1585	.B (-v)
1586	output for "equivalence classes created". If the denominator of
1587	the number shown is 128, then by default
1588	.I flex
1589	is generating 7-bit scanners. If it is 256, then the default is
1590	8-bit characters and the
1591	.B -8
1592	flag is not required (but may be a good idea to keep the scanner
1593	specification portable). Feeding a 7-bit scanner 8-bit characters
1594	will result in infinite loops, bus errors, or other such fireworks,
1595	so when in doubt, use the flag. Note that if equivalence classes
1596	are used, 8-bit scanners take only slightly more table space than
1597	7-bit scanners (128 bytes, to be exact); if equivalence classes are
1598	not used, however, then the tables may grow up to twice their
1599	7-bit size.
1600	.TP
1601	.B -C[efmF]
1602	controls the degree of table compression.
1603	.IP
1604	.B -Ce
1605	directs
1606	.I flex
1607	to construct
1608	.I equivalence classes,
1609	i.e., sets of characters
1610	which have identical lexical properties (for example, if the only
1611	appearance of digits in the
1612	.I flex
1613	input is in the character class
1614	"[0-9]" then the digits '0', '1', ..., '9' will all be put
1615	in the same equivalence class). Equivalence classes usually give
1616	dramatic reductions in the final table/object file sizes (typically
1617	a factor of 2-5) and are pretty cheap performance-wise (one array
1618	look-up per character scanned).
1619	.IP
1620	.B -Cf
1621	specifies that the
1622	.I full
1623	scanner tables should be generated -
1624	.I flex
1625	should not compress the
1626	tables by taking advantage of similar transition functions for
1627	different states.
1628	.IP
1629	.B -CF
1630	specifies that the alternate fast scanner representation (described
1631	above under the
1632	.B -F
1633	flag)
1634	should be used.
1635	.IP
1636	.B -Cm
1637	directs
1638	.I flex
1639	to construct
1640	.I meta-equivalence classes,
1641	which are sets of equivalence classes (or characters, if equivalence
1642	classes are not being used) that are commonly used together. Meta-equivalence
1643	classes are often a big win when using compressed tables, but they
1644	have a moderate performance impact (one or two "if" tests and one
1645	array look-up per character scanned).
1646	.IP
1647	A lone
1648	.B -C
1649	specifies that the scanner tables should be compressed but neither
1650	equivalence classes nor meta-equivalence classes should be used.
1651	.IP
1652	The options
1653	.B -Cf
1654	or
1655	.B -CF
1656	and
1657	.B -Cm
1658	do not make sense together - there is no opportunity for meta-equivalence
1659	classes if the table is not being compressed. Otherwise the options
1660	may be freely mixed.
1661	.IP
1662	The default setting is
1663	.B -Cem,
1664	which specifies that
1665	.I flex
1666	should generate equivalence classes
1667	and meta-equivalence classes. This setting provides the highest
1668	degree of table compression. You can obtain
1669	faster-executing scanners at the cost of larger tables, with
1670	the following generally being true:
1671	.nf
1672
1673	slowest & smallest
1674	-Cem
1675	-Cm
1676	-Ce
1677	-C
1678	-C{f,F}e
1679	-C{f,F}
1680	fastest & largest
1681
1682	.fi
1683	Note that scanners with the smallest tables are usually generated and
1684	compiled the quickest, so
1685	during development you will usually want to use the default, maximal
1686	compression.
1687	.IP
1688	.B -Cfe
1689	is often a good compromise between speed and size for production
1690	scanners.
1691	.IP
1692	.B -C
1693	options are not cumulative; whenever the flag is encountered, the
1694	previous -C settings are forgotten.
1695	.TP
1696	.B -Sskeleton_file
1697	overrides the default skeleton file from which
1698	.I flex
1699	constructs its scanners. You'll never need this option unless you are doing
1700	.I flex
1701	maintenance or development.
1702	.SH PERFORMANCE CONSIDERATIONS
1703	The main design goal of
1704	.I flex
1705	is that it generate high-performance scanners. It has been optimized
1706	for dealing well with large sets of rules. Aside from the effects
1707	of table compression on scanner speed outlined above,
1708	there are a number of options/actions which degrade performance. These
1709	are, from most expensive to least:
1710	.nf
1711
1712	REJECT
1713
1714	pattern sets that require backtracking
1715	arbitrary trailing context
1716
1717	'^' beginning-of-line operator
1718	yymore()
1719
1720	.fi
1721	with the first three all being quite expensive and the last two
1722	being quite cheap.
1723	.LP
1724	.B REJECT
1725	should be avoided at all costs when performance is important.
1726	It is a particularly expensive option.
1727	.LP
1728	Getting rid of backtracking is messy and often may be an enormous
1729	amount of work for a complicated scanner. In principle, one begins
1730	by using the
1731	.B -b
1732	flag to generate a
1733	.I lex.backtrack
1734	file. For example, on the input
1735	.nf
1736
1737	%%
1738	foo return TOK_KEYWORD;
1739	foobar return TOK_KEYWORD;
1740
1741	.fi
1742	the file looks like:
1743	.nf
1744
1745	State #6 is non-accepting -
1746	associated rule line numbers:
1747	2 3
1748	out-transitions: [ o ]
1749	jam-transitions: EOF [ \\001-n p-\\177 ]
1750
1751	State #8 is non-accepting -
1752	associated rule line numbers:
1753	3
1754	out-transitions: [ a ]
1755	jam-transitions: EOF [ \\001-` b-\\177 ]
1756
1757	State #9 is non-accepting -
1758	associated rule line numbers:
1759	3
1760	out-transitions: [ r ]
1761	jam-transitions: EOF [ \\001-q s-\\177 ]
1762
1763	Compressed tables always backtrack.
1764
1765	.fi
1766	The first few lines tell us that there's a scanner state in
1767	which it can make a transition on an 'o' but not on any other
1768	character, and that in that state the currently scanned text does not match
1769	any rule. The state occurs when trying to match the rules found
1770	at lines 2 and 3 in the input file.
1771	If the scanner is in that state and then reads
1772	something other than an 'o', it will have to backtrack to find
1773	a rule which is matched. With
1774	a bit of headscratching one can see that this must be the
1775	state it's in when it has seen "fo". When this has happened,
1776	if anything other than another 'o' is seen, the scanner will
1777	have to back up to simply match the 'f' (by the default rule).
1778	.LP
1779	The comment regarding State #8 indicates there's a problem
1780	when "foob" has been scanned. Indeed, on any character other
1781	than an 'a', the scanner will have to back up to accept "foo".
1782	Similarly, the comment for State #9 concerns when "fooba" has
1783	been scanned.
1784	.LP
1785	The final comment reminds us that there's no point going to
1786	all the trouble of removing backtracking from the rules unless
1787	we're using
1788	.B -f
1789	or
1790	.B -F,
1791	since there's no performance gain doing so with compressed scanners.
1792	.LP
1793	The way to remove the backtracking is to add "error" rules:
1794	.nf
1795
1796	%%
1797	foo return TOK_KEYWORD;
1798	foobar return TOK_KEYWORD;
1799
1800	fooba |
1801	foob |
1802	fo {
1803	/* false alarm, not really a keyword */
1804	return TOK_ID;
1805	}
1806
1807	.fi
1808	.LP
1809	Eliminating backtracking among a list of keywords can also be
1810	done using a "catch-all" rule:
1811	.nf
1812
1813	%%
1814	foo return TOK_KEYWORD;
1815	foobar return TOK_KEYWORD;
1816
1817	[a-z]+ return TOK_ID;
1818
1819	.fi
1820	This is usually the best solution when appropriate.
1821	.LP
1822	Backtracking messages tend to cascade.
1823	With a complicated set of rules it's not uncommon to get hundreds
1824	of messages. If one can decipher them, though, it often
1825	only takes a dozen or so rules to eliminate the backtracking (though
1826	it's easy to make a mistake and have an error rule accidentally match
1827	a valid token. A possible future
1828	.I flex
1829	feature will be to automatically add rules to eliminate backtracking).
1830	.LP
1831	.I Variable
1832	trailing context (where both the leading and trailing parts do not have
1833	a fixed length) entails almost the same performance loss as
1834	.I REJECT
1835	(i.e., substantial). So when possible a rule like:
1836	.nf
1837
1838	%%
1839	mouse|rat/(cat|dog) run();
1840
1841	.fi
1842	is better written:
1843	.nf
1844
1845	%%
1846	mouse/cat|dog run();
1847	rat/cat|dog run();
1848
1849	.fi
1850	or as
1851	.nf
1852
1853	%%
1854	mouse|rat/cat run();
1855	mouse|rat/dog run();
1856
1857	.fi
1858	Note that here the special '|' action does
1859	.I not
1860	provide any savings, and can even make things worse (see
1861	.B BUGS
1862	in flex(1)).
1863	.LP
1864	Another area where the user can increase a scanner's performance
1865	(and one that's easier to implement) arises from the fact that
1866	the longer the tokens matched, the faster the scanner will run.
1867	This is because with long tokens the processing of most input
1868	characters takes place in the (short) inner scanning loop, and
1869	does not often have to go through the additional work of setting up
1870	the scanning environment (e.g.,
1871	.B yytext)
1872	for the action. Recall the scanner for C comments:
1873	.nf
1874
1875	%x comment
1876	%%
1877	    int line_num = 1;
1878
1879	"/*" BEGIN(comment);
1880
1881	<comment>[^*\\n]*
1882	<comment>"*"+[^*/\\n]*
1883	<comment>\\n ++line_num;
1884	<comment>"*"+"/" BEGIN(INITIAL);
1885
1886	.fi
1887	This could be sped up by writing it as:
1888	.nf
1889
1890	%x comment
1891	%%
1892	    int line_num = 1;
1893
1894	"/*" BEGIN(comment);
1895
1896	<comment>[^*\\n]*
1897	<comment>[^*\\n]*\\n ++line_num;
1898	<comment>"*"+[^*/\\n]*
1899	<comment>"*"+[^*/\\n]*\\n ++line_num;
1900	<comment>"*"+"/" BEGIN(INITIAL);
1901
1902	.fi
1903	Now instead of each newline requiring the processing of another
1904	action, recognizing the newlines is "distributed" over the other rules
1905	to keep the matched text as long as possible. Note that
1906	.I adding
1907	rules does
1908	.I not
1909	slow down the scanner! The speed of the scanner is independent
1910	of the number of rules or (modulo the considerations given at the
1911	beginning of this section) how complicated the rules are with
1912	regard to operators such as '*' and '|'.
1913	.LP
1914	A final example in speeding up a scanner: suppose you want to scan
1915	through a file containing identifiers and keywords, one per line
1916	and with no other extraneous characters, and recognize all the
1917	keywords. A natural first approach is:
1918	.nf
1919
1920	%%
1921	asm |
1922	auto |
1923	break |
1924	... etc ...
1925	volatile |
1926	while /* it's a keyword */
1927
1928	.|\\n /* it's not a keyword */
1929
1930	.fi
1931	To eliminate the backtracking, introduce a catch-all rule:
1932	.nf
1933
1934	%%
1935	asm |
1936	auto |
1937	break |
1938	... etc ...
1939	volatile |
1940	while /* it's a keyword */
1941
1942	[a-z]+ |
1943	.|\\n /* it's not a keyword */
1944
1945	.fi
1946	Now, if it's guaranteed that there's exactly one word per line,
1947	then we can reduce the total number of matches by a half by
1948	merging in the recognition of newlines with that of the other
1949	tokens:
1950	.nf
1951
1952	%%
1953	asm\\n |
1954	auto\\n |
1955	break\\n |
1956	... etc ...
1957	volatile\\n |
1958	while\\n /* it's a keyword */
1959
1960	[a-z]+\\n |
1961	.|\\n /* it's not a keyword */
1962
1963	.fi
1964	One has to be careful here, as we have now reintroduced backtracking
1965	into the scanner. In particular, while
1966	.I we
1967	know that there will never be any characters in the input stream
1968	other than letters or newlines,
1969	.I flex
1970	can't figure this out, and it will plan for possibly needing backtracking
1971	when it has scanned a token like "auto" and then the next character
1972	is something other than a newline or a letter. Previously it would
1973	then just match the "auto" rule and be done, but now it has no "auto"
1974	rule, only an "auto\\n" rule. To eliminate the possibility of backtracking,
1975	we could either duplicate all rules but without final newlines, or,
1976	since we never expect to encounter such an input and therefore don't
1977	care how it's classified, we can introduce one more catch-all rule, this
1978	one which doesn't include a newline:
1979	.nf
1980
1981	%%
1982	asm\\n |
1983	auto\\n |
1984	break\\n |
1985	... etc ...
1986	volatile\\n |
1987	while\\n /* it's a keyword */
1988
1989	[a-z]+\\n |
1990	[a-z]+ |
1991	.|\\n /* it's not a keyword */
1992
1993	.fi
1994	Compiled with
1995	.B -Cf,
1996	this is about as fast as one can get a
1997	.I flex
1998	scanner to go for this particular problem.
1999	.LP
2000	A final note:
2001	.I flex
2002	is slow when matching NUL's, particularly when a token contains
2003	multiple NUL's.
2004	It's best to write rules which match
2005	.I short
2006	amounts of text if it's anticipated that the text will often include NUL's.
2007	.SH INCOMPATIBILITIES WITH LEX AND POSIX
2008	.I flex
2009	is a rewrite of the Unix
2010	.I lex
2011	tool (the two implementations do not share any code, though),
2012	with some extensions and incompatibilities, both of which
2013	are of concern to those who wish to write scanners acceptable
2014	to either implementation. At present, the POSIX
2015	.I lex
2016	draft is
2017	very close to the original
2018	.I lex
2019	implementation, so some of these
2020	incompatibilities are also in conflict with the POSIX draft. But
2021	the intent is that except as noted below,
2022	.I flex
2023	as it presently stands will
2024	ultimately be POSIX conformant (i.e., that those areas of conflict with
2025	the POSIX draft will be resolved in
2026	.I flex's
2027	favor). Please bear in
2028	mind that all the comments which follow are with regard to the POSIX
2029	.I draft
2030	standard of Summer 1989, and not the final document (or subsequent
2031	drafts); they are included so
2032	.I flex
2033	users can be aware of the standardization issues and those areas where
2034	.I flex
2035	may in the near future undergo changes incompatible with
2036	its current definition.
2037	.LP
2038	.I flex
2039	is fully compatible with
2040	.I lex
2041	with the following exceptions:
2042	.IP -
2043	The undocumented
2044	.I lex
2045	scanner internal variable
2046	.B yylineno
2047	is not supported. It is difficult to support this variable efficiently,
2048	since it requires examining every character scanned and reexamining
2049	the characters when the scanner backs up.
2050	Things get more complicated when the end of buffer or file is reached or a
2051	NUL is scanned (since the scan must then be restarted with the proper line
2052	number count), or the user uses the yyless(), unput(), or REJECT actions,
2053	or the multiple input buffer functions.
2054	.IP
2055	The fix is to add rules which, upon seeing a newline, increment
2056	yylineno. This is usually an easy process, though it can be a drag if some
2057	of the patterns can match multiple newlines along with other characters.
2058	.IP
2059	yylineno is not part of the POSIX draft.
2060	.IP -
2061	The
2062	.B input()
2063	routine is not redefinable, though it may be called to read characters
2064	following whatever has been matched by a rule. If
2065	.B input()
2066	encounters an end-of-file the normal
2067	.B yywrap()
2068	processing is done. A ``real'' end-of-file is returned by
2069	.B input()
2070	as
2071	.I EOF.
2072	.IP
2073	Input is instead controlled by redefining the
2074	.B YY_INPUT
2075	macro.
2076	.IP
2077	The
2078	.I flex
2079	restriction that
2080	.B input()
2081	cannot be redefined is in accordance with the POSIX draft, but
2082	.B YY_INPUT
2083	has not yet been accepted into the draft (and probably won't; it looks
2084	like the draft will simply not specify any way of controlling the
2085	scanner's input other than by making an initial assignment to
2086	.I yyin).
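As an illustration of controlling input through YY_INPUT, the following C sketch feeds the scanner from an in-memory string. The names set_input_string() and read_from_string() are hypothetical; the macro form YY_INPUT(buf,result,max_size), with a result of 0 signalling end of input, is assumed to be as described elsewhere in this manual.

```c
#include <assert.h>
#include <string.h>

static const char *input_text = "";   /* remaining unread input */

/* Point the reader at a NUL-terminated string to scan. */
void set_input_string(const char *text)
{
    assert(text != NULL);
    input_text = text;
}

/* Copy up to max_size characters into buf.
 * Returns the number copied, or 0 when the input is exhausted. */
int read_from_string(char *buf, int max_size)
{
    size_t remaining = strlen(input_text);
    size_t n;

    assert(buf != NULL && max_size >= 0);
    n = remaining < (size_t) max_size ? remaining : (size_t) max_size;
    memcpy(buf, input_text, n);
    input_text += n;
    return (int) n;
}

/* In the definitions section of the flex input one might then write:
 *
 *   #undef YY_INPUT
 *   #define YY_INPUT(buf,result,max_size) \
 *       result = read_from_string( buf, max_size );
 */
```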
2087	.IP -
2088	.I flex
2089	scanners do not use stdio for input. Because of this, when writing an
2090	interactive scanner one must explicitly call fflush() on the
2091	stream associated with the terminal after writing out a prompt.
2092	With
2093	.I lex
2094	such writes are automatically flushed since
2095	.I lex
2096	scanners use
2097	.B getchar()
2098	for their input. Also, when writing interactive scanners with
2099	.I flex,
2100	the
2101	.B -I
2102	flag must be used.
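The explicit flush described above might be wrapped in a small helper such as the following C sketch. show_prompt() is a hypothetical name, not part of flex.

```c
#include <assert.h>
#include <stdio.h>

/* Write a prompt and flush it, so that it is visible before the
 * scanner blocks waiting for input.
 * Returns 0 on success, EOF on failure. */
int show_prompt(FILE *out, const char *prompt)
{
    assert(out != NULL && prompt != NULL);
    if (fputs(prompt, out) == EOF)
        return EOF;
    return fflush(out);
}
```

A caller would typically invoke show_prompt( stdout, "> " ) just before each call to yylex().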
2103	.IP -
2104	.I flex
2105	scanners are not as reentrant as
2106	.I lex
2107	scanners. In particular, if you have an interactive scanner and
2108	an interrupt handler which long-jumps out of the scanner, and
2109	the scanner is subsequently called again, you may get the following
2110	message:
2111	.nf
2112
2113	fatal flex scanner internal error--end of buffer missed
2114
2115	.fi
2116	To reenter the scanner, first use
2117	.nf
2118
2119	yyrestart( yyin );
2120
2121	.fi
2122	.IP -
2123	.B output()
2124	is not supported.
2125	Output from the
2126	.B ECHO
2127	macro is done to the file-pointer
2128	.I yyout
2129	(default
2130	.I stdout).
2131	.IP
2132	The POSIX draft mentions that an
2133	.B output()
2134	routine exists but currently gives no details as to what it does.
2135	.IP -
2136	.I lex
2137	does not support exclusive start conditions (%x), though they
2138	are in the current POSIX draft.
2139	.IP -
2140	When definitions are expanded,
2141	.I flex
2142	encloses them in parentheses.
2143	With lex, the following:
2144	.nf
2145
2146	NAME [A-Z][A-Z0-9]*
2147	%%
2148	foo{NAME}? printf( "Found it\\n" );
2149	%%
2150
2151	.fi
2152	will not match the string "foo" because when the macro
2153	is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
2154	and the precedence is such that the '?' is associated with
2155	"[A-Z0-9]*". With
2156	.I flex,
2157	the rule will be expanded to
2158	"foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
2159	Note that because of this, the
2160	.B ^, $, <s>, /,
2161	and
2162	.B <<EOF>>
2163	operators cannot be used in a
2164	.I flex
2165	definition.
2166	.IP
2167	The POSIX draft interpretation is the same as
2168	.I flex's.
2169	.IP -
2170	To specify a character class which matches anything but a right bracket (']'),
2171	in
2172	.I lex
2173	one can use "[^]]" but with
2174	.I flex
2175	one must use "[^\\]]". The latter works with
2176	.I lex,
2177	too.
2178	.IP -
2179	The
2180	.I lex
2181	.B %r
2182	(generate a Ratfor scanner) option is not supported. It is not part
2183	of the POSIX draft.
2184	.IP -
2185	If you are providing your own yywrap() routine, you must include a
2186	"#undef yywrap" in the definitions section (section 1). Note that
2187	the "#undef" will have to be enclosed in %{}'s.
2188	.IP
2189	The POSIX draft
2190	specifies that yywrap() is a function and this is very unlikely to change; so
.I flex
users are warned
2192	that
2193	.B yywrap()
2194	is likely to be changed to a function in the near future.
2195	.IP -
2196	After a call to
2197	.B unput(),
2198	.I yytext
2199	and
2200	.I yyleng
2201	are undefined until the next token is matched. This is not the case with
2202	.I lex
2203	or the present POSIX draft.
2204	.IP -
2205	The precedence of the
2206	.B {}
2207	(numeric range) operator is different.
2208	.I lex
2209	interprets "abc{1,3}" as "match one, two, or
2210	three occurrences of 'abc'", whereas
2211	.I flex
2212	interprets it as "match 'ab'
2213	followed by one, two, or three occurrences of 'c'". The latter is
2214	in agreement with the current POSIX draft.
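.IP
The two readings can be compared outside of
.I flex.
The sketch below is only an illustration: it uses the POSIX regcomp()
extended regular expression routines as a stand-in, not a
.I flex
scanner, and assumes that they parse the pattern the way the text above
says the draft does. The helper full_match() and the two macro names are
hypothetical, and the
.I lex
reading is spelled out with explicit parentheses.
.nf

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if the whole of text matches the POSIX extended regular
   expression pattern, 0 if it does not, and -1 if pattern is invalid. */
int full_match(const char *pattern, const char *text)
{
    regex_t re;
    int status;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    status = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return status == 0;
}

/* flex reading: 'ab' followed by one, two, or three 'c' */
#define FLEX_READING "^abc{1,3}$"
/* lex reading, written out with explicit grouping */
#define LEX_READING "^(abc){1,3}$"
```

.fi
Under these assumptions "abccc" satisfies only the first pattern and
"abcabc" only the second. Writing the parentheses explicitly, as in the
second pattern, avoids depending on either precedence.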
2215	.IP -
2216	The precedence of the
2217	.B ^
2218	operator is different.
2219	.I lex
interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
2221	or 'bar' anywhere", whereas
2222	.I flex
2223	interprets it as "match either 'foo' or 'bar' if they come at the beginning
2224	of a line". The latter is in agreement with the current POSIX draft.
2225	.IP -
2226	To refer to yytext outside of the scanner source file,
2227	the correct definition with
2228	.I flex
2229	is "extern char *yytext" rather than "extern char yytext[]".
2230	This is contrary to the current POSIX draft but a point on which
2231	.I flex
2232	will not be changing, as the array representation entails a
2233	serious performance penalty. It is hoped that the POSIX draft will
2234	be emended to support the
2235	.I flex
2236	variety of declaration (as this is a fairly painless change to
2237	require of
2238	.I lex
2239	users).
2240	.IP -
2241	.I yyin
2242	is
2243	.I initialized
2244	by
2245	.I lex
2246	to be
2247	.I stdin;
2248	.I flex,
2249	on the other hand,
2250	initializes
2251	.I yyin
2252	to NULL
2253	and then
2254	.I assigns
2255	it to
2256	.I stdin
the first time the scanner is called, provided
2258	.I yyin
2259	has not already been assigned to a non-NULL value. The difference is
2260	subtle, but the net effect is that with
2261	.I flex
2262	scanners,
2263	.I yyin
2264	does not have a valid value until the scanner has been called.
2265	.IP -
2266	The special table-size declarations such as
2267	.B %a
2268	supported by
2269	.I lex
2270	are not required by
2271	.I flex
2272	scanners;
2273	.I flex
2274	ignores them.
2275	.IP -
2276	The name
2277	.B FLEX_SCANNER
2278	is #define'd so scanners may be written for use with either
2279	.I flex
2280	or
2281	.I lex.
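.IP
As a rough sketch of how
.B FLEX_SCANNER
might be used together with the yywrap() item above, the following C
fragment could be placed between %{ and %} in the definitions section.
The guard is not strictly required, since an #undef of a name that is
not a macro is harmless, but it records why the line is there. The body
of yywrap() shown here is an assumption made only for illustration.
.nf

```c
/* These lines would sit between %{ and %} in the definitions section. */
#ifdef FLEX_SCANNER
#undef yywrap           /* remove the yywrap macro supplied by flex */
#endif

/* User-supplied yywrap(): returning 1 reports that no further input
   follows. This behaviour is chosen only for illustration. */
int yywrap(void)
{
    return 1;
}
```

.fi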
2282	.LP
2283	The following
2284	.I flex
2285	features are not included in
2286	.I lex
2287	or the POSIX draft standard:
2288	.nf
2289
2290	yyterminate()
2291	<<EOF>>
2292	YY_DECL
2293	#line directives
2294	%{}'s around actions
2295	yyrestart()
2296	comments beginning with '#' (deprecated)
2297	multiple actions on a line
2298
2299	.fi
2300	This last feature refers to the fact that with
2301	.I flex
2302	you can put multiple actions on the same line, separated with
semicolons, while with
2304	.I lex,
2305	the following
2306	.nf
2307
2308	foo handle_foo(); ++num_foos_seen;
2309
2310	.fi
2311	is (rather surprisingly) truncated to
2312	.nf
2313
2314	foo handle_foo();
2315
2316	.fi
2317	.I flex
2318	does not truncate the action. Actions that are not enclosed in
2319	braces are simply terminated at the end of the line.
2320	.SH DIAGNOSTICS
2321	.I reject_used_but_not_detected undefined
2322	or
2323	.I yymore_used_but_not_detected undefined -
2324	These errors can occur at compile time. They indicate that the
2325	scanner uses
2326	.B REJECT
2327	or
2328	.B yymore()
2329	but that
2330	.I flex
2331	failed to notice the fact, meaning that
2332	.I flex
2333	scanned the first two sections looking for occurrences of these actions
2334	and failed to find any, but somehow you snuck some in (via a #include
2335	file, for example). Make an explicit reference to the action in your
2336	.I flex
2337	input file. (Note that previously
2338	.I flex
2339	supported a
2340	.B %used/%unused
2341	mechanism for dealing with this problem; this feature is still supported
2342	but now deprecated, and will go away soon unless the author hears from
2343	people who can argue compellingly that they need it.)
2344	.LP
2345	.I flex scanner jammed -
2346	a scanner compiled with
2347	.B -s
2348	has encountered an input string which wasn't matched by
2349	any of its rules.
2350	.LP
2351	.I flex input buffer overflowed -
2352	a scanner rule matched a string long enough to overflow the
2353	scanner's internal input buffer (16K bytes by default - controlled by
2354	.B YY_BUF_SIZE
2355	in "flex.skel". Note that to redefine this macro, you must first
.B #undef
2357	it).
2358	.LP
2359	.I scanner requires -8 flag -
2360	Your scanner specification includes recognizing 8-bit characters and
2361	you did not specify the -8 flag (and your site has not installed flex
2362	with -8 as the default).
2363	.LP
2364	.I
2365	fatal flex scanner internal error--end of buffer missed -
This can occur in a scanner which is reentered after a long-jump
2367	has jumped out (or over) the scanner's activation frame. Before
2368	reentering the scanner, use:
2369	.nf
2370
2371	yyrestart( yyin );
2372
2373	.fi
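.LP
The order of operations can be sketched in plain C. This is a minimal
illustration under stated assumptions, not scanner code: scan() is a
hypothetical stand-in for yylex() whose action long-jumps out on the
first attempt, and restart_scanner() is a hypothetical stand-in for the
call to yyrestart( yyin ), so that the fragment can be compiled without
a generated scanner.
.nf

```c
#include <setjmp.h>

static jmp_buf recovery_point;  /* hypothetical recovery point in the caller */
int restart_calls = 0;          /* how often the stand-in restart ran */

/* Stand-in for yyrestart( yyin ); a real program calls that instead. */
void restart_scanner(void)
{
    ++restart_calls;
}

/* Stand-in for yylex(): on the first attempt an action long-jumps out
   of the scanner; on later attempts it returns a token code. */
int scan(int attempt)
{
    if (attempt == 0)
        longjmp(recovery_point, 1);
    return 42;
}

/* Call the scanner, recovering once if it is abandoned by a long-jump. */
int scan_with_recovery(void)
{
    volatile int attempt = 0;

    if (setjmp(recovery_point) != 0) {
        restart_scanner();      /* must precede reentering the scanner */
        attempt = 1;
    }
    return scan(attempt);
}
```

.fi
The point of the sketch is only the ordering: the restart call runs
after the long-jump has landed and before the scanner is entered again.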
2374	.LP
2375	.I too many %t classes! -
2376	You managed to put every single character into its own %t class.
2377	.I flex
2378	requires that at least one of the classes share characters.
2379	.SH DEFICIENCIES / BUGS
2380	See flex(1).
2381	.SH "SEE ALSO"
2382	.LP
2383	flex(1), lex(1), yacc(1), sed(1), awk(9).
2384	.LP
2385	M. E. Lesk and E. Schmidt,
2386	.I LEX - Lexical Analyzer Generator
2387	.SH AUTHOR
2388	Vern Paxson, with the help of many ideas and much inspiration from
2389	Van Jacobson. Original version by Jef Poskanzer. The fast table
2390	representation is a partial implementation of a design done by Van
2391	Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
2392	.LP
2393	Thanks to the many
2394	.I flex
2395	beta-testers, feedbackers, and contributors, especially Casey
2396	Leedom, benson@odi.com, Keith Bostic,
2397	Frederic Brehm, Nick Christopher, Jason Coughlin,
2398	Scott David Daniels, Leo Eskin,
2399	Chris Faylor, Eric Goldman, Eric
2400	Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
2401	Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
2402	Jef Poskanzer, Jim Roskind,
2403	Dave Tallman, Frank Whaley, Ken Yap, and those whose names
2404	have slipped my marginal mail-archiving skills but whose contributions
2405	are appreciated all the same.
2406	.LP
2407	Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
2408	Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
2409	headaches.
2410	.LP
2411	Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
2412	to Benson Margulies and Fred
2413	Burke for C++ support; to Ove Ewerlid for the basics of support for
2414	NUL's; and to Eric Hughes for the basics of support for multiple buffers.
2415	.LP
2416	Work is being done on extending
2417	.I flex
2418	to generate scanners in which the
2419	state machine is directly represented in C code rather than tables.
2420	These scanners may well be substantially faster than those generated
2421	using -f or -F. If you are working in this area and are interested
2422	in comparing notes and seeing whether redundant work can be avoided,
2423	contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE).
2424	.LP
2425	This work was primarily done when I was at the Real Time Systems Group
2426	at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
2427	for the support I received.
2428	.LP
2429	Send comments to:
2430	.nf
2431
2432	Vern Paxson
2433	Computer Science Department
2434	4126 Upson Hall
2435	Cornell University
2436	Ithaca, NY 14853-7501
2437
2438	vern@cs.cornell.edu
2439	decvax!cornell!vern
2440
2441	.fi
2442	.\" ref. to awk(9) man page corrected -- ASW 2005-01-15
