.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\"
.TH REGEX 3 "March 20, 1994"
.de ZR
.\" one other place knows this name: the SEE ALSO section
.BR re_format (7) \\$1
..
.SH NAME
regex, regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.sp
.in +.5i
.ti -.5i
int regcomp(regex_t *\fIpreg\fP, const char *\fIpattern\fP, int \fIcflags\fP);
.ti -.5i
int regexec(const regex_t *\fIpreg\fP, const char *\fIstring\fP,
size_t \fInmatch\fP, regmatch_t \fIpmatch\fP[], int \fIeflags\fP);
.ti -.5i
size_t regerror(int \fIerrcode\fP, const regex_t *\fIpreg\fP,
char *\fIerrbuf\fP, size_t \fIerrbuf_size\fP);
.ti -.5i
void regfree(regex_t *\fIpreg\fP);
.in -.5i
.ft R
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.B Regcomp
compiles an RE written as a string into an internal form,
.B regexec
matches that internal form against a string and reports results,
.B regerror
transforms error codes from either into human-readable messages,
and
.B regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.B regex_t
and
.BR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.BR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.B Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.B regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.B re_endp
member of the structure pointed to by
.IR preg .
The
.B re_endp
member is of type
.BR "const\ char\ *" .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
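.PP
As an illustrative sketch only,
the effect of ORing flags into
.I cflags
can be observed with a small wrapper that compiles a pattern,
tests a single string, and frees the compiled form.
The helper name
.B matches_with_flags
is hypothetical and not part of this library,
and the sketch assumes only the flags specified by POSIX 1003.2,
so the REG_NOSPEC and REG_PEND extensions are deliberately not exercised.
.nf

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper (not part of the library): compile `pattern` with
 * the caller's cflags, test it against `text`, and free the compiled form.
 * REG_NOSUB is added because only success or failure is wanted.
 * Returns 1 on a match, 0 on REG_NOMATCH, -1 on any other error.
 */
int matches_with_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;              /* compilation failed; nothing to free */

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

.fi
.PP
Under this sketch, the RE `^b$' applied to a three-line string whose
middle line is `b' should match only when REG_NEWLINE is included in
.IR cflags ,
and `hello' should match `Say HELLO' only when REG_ICASE is included.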
.PP
When successful,
.B regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.BR re_endp )
is publicized:
.BR re_nsub ,
of type
.BR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.B regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
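.PP
A minimal sketch of reading
.B re_nsub
after a successful compilation follows.
The helper name
.B count_subexpressions
is an assumption made for illustration;
REG_NOSUB is not passed, since the member's value would then be undefined.
.nf

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: report how many parenthesized subexpressions an
 * extended RE contains, or -1 if the RE does not compile.
 */
long count_subexpressions(const char *pattern)
{
    regex_t re;
    long n;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;              /* non-zero return: see DIAGNOSTICS */

    n = (long)re.re_nsub;       /* valid because REG_NOSUB was not used */
    regfree(&re);
    return n;
}
```

.fi
.PP
A caller that wants every subexpression reported by
.B regexec
could size its
.B regmatch_t
array as this count plus one, the extra entry being for the whole match.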
.PP
.B Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.BR regcomp .
The compiled form is not altered during execution of
.BR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line, minus any terminating
newline.
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fBrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fBrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.IR pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fBrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
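.PP
One common use of REG_NOTBOL is repeated searching within a single line:
after the first match the search resumes part-way through the line,
so `^' must no longer be allowed to match at the resumption point.
The following is a sketch under that assumption;
the helper name
.B count_matches
is hypothetical,
and the rule of stepping one character past an empty match is a
design choice of the sketch, not something the library mandates.
.nf

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: count non-overlapping matches of an extended RE
 * in one line of text.  Returns -1 if the RE does not compile.
 */
int count_matches(const char *pattern, const char *line)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = line;
    int eflags = 0;             /* first search starts at a true line start */
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, m, eflags) == 0) {
        count++;
        if (m[0].rm_eo > m[0].rm_so)
            p += m[0].rm_eo;            /* resume just after the match */
        else if (p[m[0].rm_eo] != '\0')
            p += m[0].rm_eo + 1;        /* empty match: step one character */
        else
            break;                      /* empty match at end of line */
        eflags = REG_NOTBOL;            /* p is no longer a line start */
    }

    regfree(&re);
    return count;
}
```

.fi
.PP
Without REG_NOTBOL on the later calls, an anchored RE such as `^a'
would be counted once per leading `a' rather than once per line.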
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.B regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.B regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.BR regmatch_t .
Such a structure has at least the members
.B rm_so
and
.BR rm_eo ,
both of type
.B regoff_t
(a signed arithmetic type at least as large as an
.B off_t
and a
.BR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.BR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fBre_nsub\fR)\(emhave both
.B rm_so
and
.B rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
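.PP
The offset conventions above can be sketched as a helper that copies
the text of one subexpression out of the matched string.
The name
.B submatch
and the fixed limit
.B MAX_GROUPS
are assumptions of this sketch rather than anything the library provides;
an entry whose
.B rm_so
is \-1 is treated as unused.
.nf

```c
#include <string.h>
#include <sys/types.h>
#include <regex.h>

#define MAX_GROUPS 10   /* hypothetical limit chosen for this sketch */

/*
 * Hypothetical helper: match an extended RE against `string` and copy the
 * text of subexpression `i` (0 means the whole match) into `buf`.
 * Returns the length copied, or -1 if the RE fails to compile, nothing
 * matches, entry `i` is unused, or `buf` is too small.
 */
int submatch(const char *pattern, const char *string, size_t i,
             char *buf, size_t bufsize)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    size_t len;
    int result = -1;

    if (i >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* rm_so == -1 marks an unused entry */
    if (regexec(&re, string, MAX_GROUPS, m, 0) == 0 && m[i].rm_so != -1) {
        /* offsets are measured from the start of `string` */
        len = (size_t)(m[i].rm_eo - m[i].rm_so);
        if (len < bufsize) {
            memcpy(buf, string + m[i].rm_so, len);
            buf[len] = '\0';
            result = (int)len;
        }
    }

    regfree(&re);
    return result;
}
```

.fi
.PP
With the RE `(a)|(b)' and the string `xb', for instance,
entry 1 is unused because the first alternative took no part in the match,
while entry 2 covers the single `b'.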
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.B regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.BR regexec .
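.PP
As a hedged sketch of REG_STARTEND,
the helper below searches only a byte range of a larger buffer
by loading the range into
.IR pmatch [0]
before the call.
The name
.B search_range
is hypothetical.
Because REG_STARTEND is an extension,
the sketch guards its use with a preprocessor test and simply
reports failure on systems whose
.I <regex.h>
does not define it.
.nf

```c
#include <stddef.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: search bytes [start, end) of `buf` for an extended
 * RE.  Returns the offset of the match start, measured from the beginning
 * of `buf`, or -1 if there is no match, the RE does not compile, or
 * REG_STARTEND is unavailable.
 */
long search_range(const char *pattern, const char *buf,
                  size_t start, size_t end)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t m[1];
    long where = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    m[0].rm_so = (regoff_t)start;   /* input: where the string starts */
    m[0].rm_eo = (regoff_t)end;     /* input: where its NUL is imagined */
    if (regexec(&re, buf, 1, m, REG_STARTEND) == 0)
        where = (long)m[0].rm_so;   /* output: offset from start of buf */

    regfree(&re);
    return where;
#else
    (void)pattern; (void)buf; (void)start; (void)end;
    return -1;                      /* extension not provided here */
#endif
}
```

.fi
.PP
Since
.I nmatch
is 1 in this sketch,
.IR pmatch [0]
serves first as input and then as output of the same call.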
.PP
.B Regerror
maps a non-zero
.I errcode
from either
.B regcomp
or
.B regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.B regex_t
pointed to by
.IR preg ,
and if the error code came from
.BR regcomp ,
it should have been the result from the most recent
.B regcomp
using that
.BR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.BR regex_t .)
.B Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
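.PP
The sizing behavior just described permits a two-call idiom:
ask for the needed size with a zero-length buffer,
allocate, then fetch the message.
The following is one possible sketch;
the helper name
.B compile_error_message
is hypothetical,
and returning NULL both for a valid RE and for allocation failure
is a simplification made for brevity.
.nf

```c
#include <stdlib.h>
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: try to compile `pattern`.  If compilation fails,
 * return a malloc'd copy of the error message (the caller frees it).
 * Returns NULL if the RE is valid or if memory could not be obtained.
 */
char *compile_error_message(const char *pattern, int cflags)
{
    regex_t re;
    size_t need;
    char *msg;
    int rc = regcomp(&re, pattern, cflags);

    if (rc == 0) {
        regfree(&re);           /* valid RE: nothing to report */
        return NULL;
    }

    /* errbuf_size 0: errbuf is ignored, the needed size is returned */
    need = regerror(rc, &re, NULL, 0);
    if (need == 0)
        return NULL;

    msg = malloc(need);
    if (msg != NULL)
        (void)regerror(rc, &re, msg, need);     /* whole message now fits */
    return msg;
}
```

.fi
.PP
A fixed-size buffer is simpler when truncated messages are acceptable;
the two-call form is only needed when the whole message must be kept.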
.PP
If the
.I errcode
given to
.B regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.B re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.B Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.B regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.B regexec
or
.B regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
.SH SEE ALSO
.BR grep (1),
.BR re_format (7).
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.B regcomp
and
.B regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
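.PP
As a small sketch of how the codes listed above surface in practice,
the helper below returns the raw result of
.B regcomp
so that a caller can compare it against individual constants.
The name
.B compile_status
is hypothetical.
Only codes specified by POSIX 1003.2 are assumed to exist;
REG_EMPTY, REG_ASSERT and REG_INVARG are specific to this implementation
and may be absent elsewhere.
.nf

```c
#include <sys/types.h>
#include <regex.h>

/*
 * Hypothetical helper: compile `pattern` and return regcomp's result,
 * 0 for a valid RE or one of the non-zero REG_* error codes.
 * The compiled form is discarded immediately.
 */
int compile_status(const char *pattern, int cflags)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    if (rc == 0)
        regfree(&re);           /* only a successful compile needs freeing */
    return rc;
}
```

.fi
.PP
Which code a given malformed RE produces can differ between
implementations, so portable callers are usually better served by
testing for non-zero and passing the code to
.BR regerror .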
.SH HISTORY
Originally written by Henry Spencer.
Altered for inclusion in the 4.4BSD distribution.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.B Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.B Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.B Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.