.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)re_format.7 8.3 (Berkeley) 3/20/94
.\"
.TH RE_FORMAT 7 "March 20, 1994"
.SH NAME
re_format \- POSIX 1003.2 regular expressions
.SH DESCRIPTION
Regular expressions (``RE''s),
as defined in POSIX 1003.2, come in two forms:
modern REs (roughly those of
.BR egrep ;
1003.2 calls these ``extended'' REs)
and obsolete REs (roughly those of
.BR ed ;
1003.2 ``basic'' REs).
Obsolete REs mostly exist for backward compatibility in some old programs;
they will be discussed at the end.
1003.2 leaves some aspects of RE syntax and semantics open;
`\(dg' marks decisions on these aspects that
may not be fully portable to other 1003.2 implementations.
.PP
A (modern) RE is one\(dg or more non-empty\(dg \fIbranches\fR,
separated by `|'.
It matches anything that matches one of the branches.
.PP
A branch is one\(dg or more \fIpieces\fR, concatenated.
It matches a match for the first, followed by a match for the second, etc.
.PP
A piece is an \fIatom\fR possibly followed
by a single\(dg `*', `+', `?', or \fIbound\fR.
An atom followed by `*' matches a sequence of 0 or more matches of the atom.
An atom followed by `+' matches a sequence of 1 or more matches of the atom.
An atom followed by `?' matches a sequence of 0 or 1 matches of the atom.
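.PP
As a rough illustration of branches and the `*', `+', and `?' operators,
the following C sketch compiles a modern RE with the regex(3) interface
and reports whether it matches.
The helper name ere_matches is hypothetical and not part of any library;
only regcomp, regexec, and regfree come from regex(3).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): returns 1 if the
 * modern ("extended") RE `pattern` matches somewhere in `text`,
 * 0 if it does not, and -1 if the RE fails to compile. */
int ere_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
For instance, ere_matches("^(cat|dog)$", "dog") should report a match
through the second branch, while ere_matches("^ab+c$", "ac") should not,
because `+' requires at least one `b'.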
.PP
A \fIbound\fR is `{' followed by an unsigned decimal integer,
possibly followed by `,'
possibly followed by another unsigned decimal integer,
always followed by `}'.
The integers must lie between 0 and RE_DUP_MAX (255\(dg) inclusive,
and if there are two of them, the first may not exceed the second.
An atom followed by a bound containing one integer \fIi\fR
and no comma matches
a sequence of exactly \fIi\fR matches of the atom.
An atom followed by a bound
containing one integer \fIi\fR and a comma matches
a sequence of \fIi\fR or more matches of the atom.
An atom followed by a bound
containing two integers \fIi\fR and \fIj\fR matches
a sequence of \fIi\fR through \fIj\fR (inclusive) matches of the atom.
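.PP
The three forms of bound can be tried out with a small C sketch that
returns the length of the first match.
The helper name ere_match_len and its return conventions are assumptions
made for this example only.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: length in bytes of the first match of the
 * modern RE `pattern` in `text`; -1 if nothing matches, -2 if the
 * RE is rejected by regcomp (for example, a malformed bound). */
long ere_match_len(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    long len = -1;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -2;
    if (regexec(&re, text, 1, &m, 0) == 0)
        len = (long)(m.rm_eo - m.rm_so);
    regfree(&re);
    return len;
}
```

.PP
Against the text `aaaa', the bound `a{2}' should consume exactly two
characters, `a{2,}' all four, and `a{2,3}' three.
A bound whose first integer exceeds the second, such as `a{3,2}',
is expected to be rejected at compile time.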
.PP
An atom is a regular expression enclosed in `()' (matching a match for the
regular expression),
an empty set of `()' (matching the null string)\(dg,
a \fIbracket expression\fR (see below), `.'
(matching any single character), `^' (matching the null string at the
beginning of a line), `$' (matching the null string at the
end of a line), a `\e' followed by one of the characters
`^.[$()|*+?{\e'
(matching that character taken as an ordinary character),
a `\e' followed by any other character\(dg
(matching that character taken as an ordinary character,
as if the `\e' had not been present\(dg),
or a single character with no other significance (matching that character).
A `{' followed by a character other than a digit is an ordinary
character, not the beginning of a bound\(dg.
It is illegal to end an RE with `\e'.
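.PP
The effect of escaping a special character can be sketched in C as follows.
The helper name ere_check is hypothetical.
Note that each backslash of the RE must itself be doubled
inside a C string literal.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the modern RE `pattern` matches `text`,
 * 0 if it does not, -1 if regcomp rejects the RE. */
int ere_check(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
Here ere_check("^a.b$", "axb") should match because an unescaped `.'
matches any single character, whereas ere_check("^a\e\e.b$", "axb")
should not, since the escaped `.' is an ordinary character.
An RE ending in a lone backslash is expected to fail to compile.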
.PP
A \fIbracket expression\fR is a list of characters enclosed in `[]'.
It normally matches any single character from the list (but see below).
If the list begins with `^',
it matches any single character
(but see below) \fInot\fR from the rest of the list.
If two characters in the list are separated by `\-', this is shorthand
for the full \fIrange\fR of characters between those two (inclusive) in the
collating sequence,
e.g. `[0-9]' in ASCII matches any decimal digit.
It is illegal\(dg for two ranges to share an
endpoint, e.g. `a-c-e'.
Ranges are very collating-sequence-dependent,
and portable programs should avoid relying on them.
.PP
To include a literal `]' in the list, make it the first character
(following a possible `^').
To include a literal `\-', make it the first or last character,
or the second endpoint of a range.
To use a literal `\-' as the first endpoint of a range,
enclose it in `[.' and `.]' to make it a collating element (see below).
With the exception of these and some combinations using `[' (see next
paragraphs), all other special characters, including `\e', lose their
special significance within a bracket expression.
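.PP
The placement rules for literal `]', `\-', and `\e' inside a bracket
expression can be checked with a short C sketch.
The helper name bracket_matches is an assumption of this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if the modern RE `pattern` matches `text`,
 * 0 if it does not, -1 if the RE fails to compile. */
int bracket_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
With this helper, `^[]a]$' should match a lone `]' because the `]' comes
first in the list, `^[a-]$' should match a lone `\-' because the `\-'
comes last, and `^[.]$' should match only a literal `.' because `.'
loses its special meaning inside brackets.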
.PP
Within a bracket expression, a collating element (a character,
a multi-character sequence that collates as if it were a single character,
or a collating-sequence name for either)
enclosed in `[.' and `.]' stands for the
sequence of characters of that collating element.
The sequence is a single element of the bracket expression's list.
A bracket expression containing a multi-character collating element
can thus match more than one character,
e.g. if the collating sequence includes a `ch' collating element,
then the RE `[[.ch.]]*c' matches the first five characters
of `chchcc'.
.PP
Within a bracket expression, a collating element enclosed in `[=' and
`=]' is an equivalence class, standing for the sequences of characters
of all collating elements equivalent to that one, including itself.
(If there are no other equivalent collating elements,
the treatment is as if the enclosing delimiters were `[.' and `.]'.)
For example, if o and \o'o^' are the members of an equivalence class,
then `[[=o=]]', `[[=\o'o^'=]]', and `[o\o'o^']' are all synonymous.
An equivalence class may not\(dg be an endpoint
of a range.
.PP
Within a bracket expression, the name of a \fIcharacter class\fR enclosed
in `[:' and `:]' stands for the list of all characters belonging to that
class.
Standard character class names are:
.PP
.RS
.nf
.ta 3c 6c 9c
alnum	digit	punct
alpha	graph	space
blank	lower	upper
cntrl	print	xdigit
.fi
.RE
.PP
These stand for the character classes defined in
.BR ctype (3).
A locale may provide others.
A character class may not be used as an endpoint of a range.
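.PP
As one possible use of character classes, the C sketch below tests
whether a string looks like a C-style identifier.
The function name is_identifier and the identifier rule itself are
assumptions chosen for illustration; the classes are interpreted
according to the current locale.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical example: 1 if `s` is a letter or underscore followed
 * by any number of alphanumerics or underscores, 0 if not, and -1 if
 * the RE cannot be compiled. */
int is_identifier(const char *s)
{
    regex_t re;
    int rc;

    assert(s != NULL);
    if (regcomp(&re, "^[[:alpha:]_][[:alnum:]_]*$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
Note that the class name and its `[:' and `:]' delimiters sit inside the
outer brackets, so an underscore can be listed alongside the class.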
.PP
There are two special cases\(dg of bracket expressions:
the bracket expressions `[[:<:]]' and `[[:>:]]' match the null string at
the beginning and end of a word respectively.
A word is defined as a sequence of
word characters
which is neither preceded nor followed by
word characters.
A word character is an
.B alnum
character (as defined by
.BR ctype (3))
or an underscore.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
.PP
In the event that an RE could match more than one substring of a given
string,
the RE matches the one starting earliest in the string.
If the RE could match more than one substring starting at that point,
it matches the longest.
Subexpressions also match the longest possible substrings, subject to
the constraint that the whole match be as long as possible,
with subexpressions starting earlier in the RE taking priority over
ones starting later.
Note that higher-level subexpressions thus take priority over
their lower-level component subexpressions.
.PP
Match lengths are measured in characters, not collating elements.
A null string is considered longer than no match at all.
For example,
`bb*' matches the three middle characters of `abbbc',
`(wee|week)(knights|nights)' matches all ten characters of `weeknights',
when `(.*).*' is matched against `abc' the parenthesized subexpression
matches all three characters, and
when `(a*)*' is matched against `bc' both the whole RE and the parenthesized
subexpression match the null string.
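.PP
The earliest-then-longest rule can be observed through the offsets that
regexec reports.
The following C sketch is illustrative only: the struct span type, the
ere_span helper, and the MAX_GROUPS limit are hypothetical names
introduced here.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10   /* arbitrary limit chosen for this sketch */

/* Byte offsets of a match; start == -1 means "no match". */
struct span { long start; long end; };

/* Hypothetical helper: span of subexpression `group` (0 is the whole
 * match) when the modern RE `pattern` is matched against `text`. */
struct span ere_span(const char *pattern, const char *text, size_t group)
{
    struct span out = { -1, -1 };
    regex_t re;
    regmatch_t m[MAX_GROUPS];

    assert(pattern != NULL && text != NULL);
    if (group >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return out;
    if (regexec(&re, text, MAX_GROUPS, m, 0) == 0 && m[group].rm_so != -1) {
        out.start = (long)m[group].rm_so;
        out.end = (long)m[group].rm_eo;
    }
    regfree(&re);
    return out;
}
```

.PP
For the first example above, ere_span("bb*", "abbbc", 0) should yield
offsets 1 and 4, that is, the three middle characters.
For `(a*)*' against `bc' the whole match should be the null string at
offset 0.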
.PP
If case-independent matching is specified,
the effect is much as if all case distinctions had vanished from the
alphabet.
When an alphabetic that exists in multiple cases appears as an
ordinary character outside a bracket expression, it is effectively
transformed into a bracket expression containing both cases,
e.g. `x' becomes `[xX]'.
When it appears inside a bracket expression, all case counterparts
of it are added to the bracket expression, so that (e.g.) `[x]'
becomes `[xX]' and `[^x]' becomes `[^xX]'.
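.PP
With the regex(3) interface, case-independent matching is requested by
passing REG_ICASE to regcomp.
The C sketch below compares matching with and without that flag;
the helper name matches_with_flags is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` with the given regcomp
 * flags and returns 1 on a match in `text`, 0 on no match, and -1
 * if the RE fails to compile. */
int matches_with_flags(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
Under REG_ICASE the RE `^x$' should match `X', and `^[^x]$' should stop
matching `X', in line with `[^x]' becoming `[^xX]'.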
.PP
No particular limit is imposed on the length of REs\(dg.
Programs intended to be portable should not employ REs longer
than 256 bytes,
as an implementation can refuse to accept such REs and remain
POSIX-compliant.
.PP
Obsolete (``basic'') regular expressions differ in several respects.
`|', `+', and `?' are ordinary characters and there is no equivalent
for their functionality.
The delimiters for bounds are `\e{' and `\e}',
with `{' and `}' by themselves ordinary characters.
The parentheses for nested subexpressions are `\e(' and `\e)',
with `(' and `)' by themselves ordinary characters.
`^' is an ordinary character except at the beginning of the
RE or\(dg the beginning of a parenthesized subexpression,
`$' is an ordinary character except at the end of the
RE or\(dg the end of a parenthesized subexpression,
and `*' is an ordinary character if it appears at the beginning of the
RE or the beginning of a parenthesized subexpression
(after a possible leading `^').
Finally, there is one new type of atom, a \fIback reference\fR:
`\e' followed by a non-zero decimal digit \fId\fR
matches the same sequence of characters
matched by the \fId\fRth parenthesized subexpression
(numbering subexpressions by the positions of their opening parentheses,
left to right),
so that (e.g.) `\e([bc]\e)\e1' matches `bb' or `cc' but not `bc'.
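.PP
With the regex(3) interface, an RE is treated as obsolete (``basic'')
when REG_EXTENDED is not passed to regcomp.
The C sketch below exercises the differences described above;
the helper name bre_matches is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles `pattern` as an obsolete ("basic")
 * RE by omitting REG_EXTENDED. Returns 1 on a match in `text`,
 * 0 on no match, and -1 if the RE fails to compile. */
int bre_matches(const char *pattern, const char *text)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, 0) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

.PP
In this mode `^a+$' should match only the literal two-character text
`a+', bounds are written with escaped braces, and the back reference
example from the text should accept `bb' and `cc' but reject `bc'.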
.SH SEE ALSO
regex(3)
.PP
POSIX 1003.2, section 2.8 (Regular Expression Notation).
.SH BUGS
Having two kinds of REs is a botch.
.PP
The current 1003.2 spec says that `)' is an ordinary character in
the absence of an unmatched `(';
this was an unintentional result of a wording error,
and change is likely.
Avoid relying on it.
.PP
Back references are a dreadful botch,
posing major problems for efficient implementations.
They are also somewhat vaguely defined
(does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?).
Avoid using them.
.PP
1003.2's specification of case-independent matching is vague.
The ``one case implies all cases'' definition given above
is the current consensus among implementors as to the right interpretation.
.PP
The syntax for word boundaries is incredibly ugly.