reference/pcre/pattern.syntax.xml
77fe733a1ba9c961424adcb7c9af00c1f5443a77
...
...
@@ -8,21 +8,21 @@
8
8
<section xml:id="regexp.introduction">
9
9
<title>Introduction</title>
10
10
<para>
11
-
The syntax and semantics of the regular expressions
12
-
supported by PCRE are described below. Regular expressions are
13
-
also described in the Perl documentation and in a number of
14
-
other books, some of which have copious examples. Jeffrey
15
-
Friedl's "Mastering Regular Expressions", published by
16
-
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
11
+
The syntax and semantics of the regular expressions
12
+
supported by PCRE are described below. Regular expressions are
13
+
also described in the Perl documentation and in a number of
14
+
other books, some of which have copious examples. Jeffrey
15
+
Friedl's "Mastering Regular Expressions", published by
16
+
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
17
17
The description here is intended as reference documentation.
18
18
</para>
19
19
<para>
20
-
A regular expression is a pattern that is matched against a
20
+
A regular expression is a pattern that is matched against a
21
21
subject string from left to right. Most characters stand for
22
22
themselves in a pattern, and match the corresponding
23
23
characters in the subject. As a trivial example, the pattern
24
24
<literal>The quick brown fox</literal>
25
-
matches a portion of a subject string that is identical to
25
+
matches a portion of a subject string that is identical to
26
26
itself.
27
27
</para>
28
28
</section>
...
...
@@ -102,15 +102,15 @@
102
102
<section xml:id="regexp.reference.meta">
103
103
<title>Meta-characters</title>
104
104
<para>
105
-
The power of regular expressions comes from the
105
+
The power of regular expressions comes from the
106
106
ability to include alternatives and repetitions in the
107
-
pattern. These are encoded in the pattern by the use of
108
-
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
107
+
pattern. These are encoded in the pattern by the use of
108
+
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
109
109
are interpreted in some special way.
110
110
</para>
111
111
<para>
112
-
There are two different sets of meta-characters: those that
113
-
are recognized anywhere in the pattern except within square
112
+
There are two different sets of meta-characters: those that
113
+
are recognized anywhere in the pattern except within square
114
114
brackets, and those that are recognized in square brackets.
115
115
Outside square brackets, the meta-characters are as follows:
116
116

...
...
@@ -130,7 +130,8 @@
130
130
<entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
131
131
</row>
132
132
<row>
133
-
<entry>$</entry><entry>assert end of subject or before a terminating newline (or end of line, in multiline mode)</entry>
133
+
<entry>$</entry><entry>assert end of subject or before a terminating newline (or
134
+
end of line, in multiline mode)</entry>
134
135
</row>
135
136
<row>
136
137
<entry>.</entry><entry>match any character except newline (by default)</entry>
...
...
@@ -204,9 +205,9 @@
204
205
<section xml:id="regexp.reference.escape">
205
206
<title>Escape sequences</title>
206
207
<para>
207
-
The backslash character has several uses. Firstly, if it is
208
+
The backslash character has several uses. Firstly, if it is
208
209
followed by a non-alphanumeric character, it takes away any
209
-
special meaning that character may have. This use of
210
+
special meaning that character may have. This use of
210
211
backslash as an escape character applies both inside and
211
212
outside character classes.
212
213
</para>
...
...
@@ -215,7 +216,7 @@
215
216
"\*" in the pattern. This applies whether or not the
216
217
following character would otherwise be interpreted as a
217
218
meta-character, so it is always safe to precede a non-alphanumeric
218
-
with "\" to specify that it stands for itself. In
219
+
with "\" to specify that it stands for itself. In
219
220
particular, if you want to match a backslash, you write "\\".
220
221
</para>
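A minimal sketch of the escaping rule above. It uses Python's `re` module as a stand-in for PCRE (an assumption made for illustration; the manual itself targets PHP's preg functions), and `matches_whole` is a hypothetical helper name.

```python
import re


def matches_whole(pattern: str, subject: str) -> bool:
    """Return True when the pattern matches the entire subject (hypothetical helper)."""
    return re.fullmatch(pattern, subject) is not None


# "\*" stands for a literal asterisk.
star_literal = matches_whole(r"a\*b", "a*b")

# Unescaped, "*" is a quantifier: "zero or more a, then b".
star_quantifier = matches_whole(r"a*b", "a*b")

# Escaping a non-alphanumeric character that has no special meaning is harmless.
harmless = matches_whole(r"a\%b", "a%b")

# A literal backslash is written as "\\" in the pattern.
backslash = matches_whole(r"C:\\temp", "C:\\temp")

print(star_literal, star_quantifier, harmless, backslash)
```

Raw strings keep Python's own string escaping out of the way, so each pattern reaches the regex engine as written.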
221
222
<note>
...
...
@@ -237,10 +238,10 @@
237
238
<para>
238
239
A second use of backslash provides a way of encoding
239
240
non-printing characters in patterns in a visible manner. There
240
-
is no restriction on the appearance of non-printing characters,
241
+
is no restriction on the appearance of non-printing characters,
241
242
apart from the binary zero that terminates a pattern,
242
243
but when a pattern is being prepared by text editing, it is
243
-
usually easier to use one of the following escape sequences
244
+
usually easier to use one of the following escape sequences
244
245
than the binary character it represents:
245
246
</para>
246
247
<para>
...
...
@@ -331,9 +332,9 @@
331
332
</para>
332
333
<para>
333
334
The precise effect of "<literal>\cx</literal>" is as follows:
334
-
if "<literal>x</literal>" is a lower case letter, it is converted
335
+
if "<literal>x</literal>" is a lower case letter, it is converted
335
336
to upper case. Then bit 6 of the character (hex 40) is inverted.
336
-
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
+
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
338
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
338
339
becomes hex 7B.
339
340
</para>
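The rule for "\cx" is plain arithmetic, so it can be sketched without a regex engine. The function name `control_escape_value` is hypothetical and only mirrors the two steps described above (upper-case a lower case letter, then invert hex 40).

```python
def control_escape_value(x: str) -> int:
    """Character value that "\\cx" stands for, following the rule quoted above."""
    code = ord(x)
    if "a" <= x <= "z":
        # Lower case letters are converted to upper case first.
        code = ord(x.upper())
    # Invert bit 6 (hex 40) of the character.
    return code ^ 0x40


for ch in ("z", "{", ";"):
    print(ch, hex(control_escape_value(ch)))
```

The three characters are the ones used in the paragraph above; the function is a reading of that rule, not PCRE's actual implementation.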
...
...
@@ -349,7 +350,7 @@
349
350
</para>
350
351
<para>
351
352
After "<literal>\0</literal>" up to two further octal digits are read.
352
-
In both cases, if there are fewer than two digits, just those that
353
+
In both cases, if there are fewer than two digits, just those that
353
354
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
354
355
specifies two binary zeros followed by a BEL character. Make sure you
355
356
supply two digits after the initial zero if the character
...
...
@@ -358,20 +359,20 @@
358
359
<para>
359
360
The handling of a backslash followed by a digit other than 0
360
361
is complicated. Outside a character class, PCRE reads it
361
-
and any following digits as a decimal number. If the number
362
-
is less than 10, or if there have been at least that many
363
-
previous capturing left parentheses in the expression, the
364
-
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
365
-
of how this works is given later, following the discussion
362
+
and any following digits as a decimal number. If the number
363
+
is less than 10, or if there have been at least that many
364
+
previous capturing left parentheses in the expression, the
365
+
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
366
+
of how this works is given later, following the discussion
366
367
of parenthesized subpatterns.
367
368
</para>
368
369
<para>
369
-
Inside a character class, or if the decimal number is
370
+
Inside a character class, or if the decimal number is
370
371
greater than 9 and there have not been that many capturing
371
372
subpatterns, PCRE re-reads up to three octal digits following
372
373
the backslash, and generates a single byte from the
373
374
least significant 8 bits of the value. Any subsequent digits
374
-
stand for themselves. For example:
375
+
stand for themselves. For example:
375
376
</para>
376
377
<para>
377
378
<variablelist>
...
...
@@ -439,7 +440,7 @@
439
440
digits are ever read.
440
441
</para>
441
442
<para>
442
-
All the sequences that define a single byte value can be
443
+
All the sequences that define a single byte value can be
443
444
used both inside and outside character classes. In addition,
444
445
inside a character class, the sequence "<literal>\b</literal>"
445
446
is interpreted as the backspace character (hex 08). Outside a character
...
...
@@ -506,7 +507,7 @@
506
507
</para>
507
508
<para>
508
509
A "word" character is any letter or digit or the underscore
509
-
character, that is, any character which can be part of a
510
+
character, that is, any character which can be part of a
510
511
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
511
512
controlled by PCRE's character tables, and may vary if locale-specific
512
513
matching is taking place. For example, in the "fr" (French) locale, some
...
...
@@ -515,15 +516,15 @@
515
516
</para>
516
517
<para>
517
518
These character type sequences can appear both inside and
518
-
outside character classes. They each match one character of
519
-
the appropriate type. If the current matching point is at
519
+
outside character classes. They each match one character of
520
+
the appropriate type. If the current matching point is at
520
521
the end of the subject string, all of them fail, since there
521
522
is no character to match.
522
523
</para>
523
524
<para>
524
-
The fourth use of backslash is for certain simple
525
+
The fourth use of backslash is for certain simple
525
526
assertions. An assertion specifies a condition that has to be met
526
-
at a particular point in a match, without consuming any
527
+
at a particular point in a match, without consuming any
527
528
characters from the subject string. The use of subpatterns
528
529
for more complicated assertions is described below. The
529
530
backslashed assertions are
...
...
@@ -562,7 +563,7 @@
562
563
</variablelist>
563
564
</para>
564
565
<para>
565
-
These assertions may not appear in character classes (but
566
+
These assertions may not appear in character classes (but
566
567
note that "<literal>\b</literal>" has a different meaning, namely the backspace
567
568
character, inside a character class).
568
569
</para>
...
...
@@ -570,20 +571,20 @@
570
571
A word boundary is a position in the subject string where
571
572
the current character and the previous character do not both
572
573
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
573
-
<literal>\w</literal> and the other matches
574
+
<literal>\w</literal> and the other matches
574
575
<literal>\W</literal>), or the start or end of the string if the first
575
576
or last character matches <literal>\w</literal>, respectively.
576
577
</para>
577
578
<para>
578
579
The <literal>\A</literal>, <literal>\Z</literal>, and
579
-
<literal>\z</literal> assertions differ from the traditional
580
-
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link> ) in that they only
581
-
ever match at the very start and end of the subject string,
582
-
whatever options are set. They are not affected by the
580
+
<literal>\z</literal> assertions differ from the traditional
581
+
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)
582
+
in that they only ever match at the very start and end of the subject string,
583
+
whatever options are set. They are not affected by the
583
584
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
584
585
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
585
-
options. The difference between <literal>\Z</literal> and
586
-
<literal>\z</literal> is that <literal>\Z</literal> matches before a
586
+
options. The difference between <literal>\Z</literal> and
587
+
<literal>\z</literal> is that <literal>\Z</literal> matches before a
587
588
newline that is the last character of the string as well as at the end of
588
589
the string, whereas <literal>\z</literal> matches only at the end.
589
590
</para>
...
...
@@ -873,8 +874,8 @@
873
874
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
874
875
</para>
875
876
<para>
876
-
Sets of Unicode characters are defined as belonging to certain scripts. A
877
-
character from one of these sets can be matched using a script name. For
877
+
Sets of Unicode characters are defined as belonging to certain scripts. A
878
+
character from one of these sets can be matched using a script name. For
878
879
example:
879
880
</para>
880
881
<itemizedlist>
...
...
@@ -886,7 +887,7 @@
886
887
</listitem>
887
888
</itemizedlist>
888
889
<para>
889
-
Those that are not part of an identified script are lumped together as
890
+
Those that are not part of an identified script are lumped together as
890
891
<literal>Common</literal>. The current list of scripts is:
891
892
</para>
892
893
<table>
...
...
@@ -1055,7 +1056,7 @@
1055
1056
<para>
1056
1057
In versions of PCRE older than 8.32 (which corresponds to PHP versions
1057
1058
before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>
1058
-
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1059
+
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1059
1060
character without the "mark" property, followed by zero or more characters
1060
1061
with the "mark" property, and treats the sequence as an atomic group (see
1061
1062
below). Characters with the "mark" property are typically accents that
...
...
@@ -1075,8 +1076,8 @@
1075
1076
<para>
1076
1077
Outside a character class, in the default matching mode, the
1077
1078
circumflex character (<literal>^</literal>) is an assertion which
1078
-
is true only if the current matching point is at the start of
1079
-
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1079
+
is true only if the current matching point is at the start of
1080
+
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1080
1081
has an entirely different meaning (see below).
1081
1082
</para>
1082
1083
<para>
...
...
@@ -1091,12 +1092,12 @@
1091
1092
</para>
1092
1093
<para>
1093
1094
A dollar character (<literal>$</literal>) is an assertion which is
1094
-
&true; only if the current matching point is at the end of the subject
1095
-
string, or immediately before a newline character that is the last
1095
+
&true; only if the current matching point is at the end of the subject
1096
+
string, or immediately before a newline character that is the last
1096
1097
character in the string (by default). Dollar (<literal>$</literal>)
1097
-
need not be the last character of the pattern if a number of
1098
-
alternatives are involved, but it should be the last item in any branch
1099
-
in which it appears. Dollar has no special meaning in a
1098
+
need not be the last character of the pattern if a number of
1099
+
alternatives are involved, but it should be the last item in any branch
1100
+
in which it appears. Dollar has no special meaning in a
1100
1101
character class.
1101
1102
</para>
1102
1103
<para>
...
...
@@ -1122,9 +1123,9 @@
1122
1123
set.
1123
1124
</para>
1124
1125
<para>
1125
-
Note that the sequences \A, \Z, and \z can be used to match
1126
-
the start and end of the subject in both modes, and if all
1127
-
branches of a pattern start with \A is it always anchored,
1126
+
Note that the sequences \A, \Z, and \z can be used to match
1127
+
the start and end of the subject in both modes, and if all
1128
+
branches of a pattern start with \A it is always anchored,
1128
1129
whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1129
1130
is set or not.
1130
1131
</para>
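The anchoring behaviour can be sketched as follows, assuming Python's `re` module as a stand-in for PCRE, with `re.MULTILINE` playing the role of PCRE_MULTILINE. The sketch leaves out `\Z` because Python's `\Z` does not behave like the `\Z` described here.

```python
import re

subject = "first\nsecond\n"

# Default mode: "$" matches at the end of the subject or before a final newline.
dollar_before_final_newline = re.search(r"second$", subject) is not None
dollar_mid_subject_default = re.search(r"first$", subject) is not None

# Multiline mode: "^" and "$" also match at line boundaries inside the subject.
dollar_mid_subject_multiline = re.search(r"first$", subject, re.MULTILINE) is not None
caret_second_line_multiline = re.search(r"^second", subject, re.MULTILINE) is not None

# "\A" only ever matches at the very start, whatever the mode.
start_only_second = re.search(r"\Asecond", subject, re.MULTILINE) is not None
start_only_first = re.search(r"\Afirst", subject, re.MULTILINE) is not None

print(dollar_before_final_newline, dollar_mid_subject_default)
print(dollar_mid_subject_multiline, caret_second_line_multiline)
print(start_only_second, start_only_first)
```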
...
...
@@ -1133,14 +1134,14 @@
1133
1134
<section xml:id="regexp.reference.dot">
1134
1135
<title>Dot</title>
1135
1136
<para>
1136
-
Outside a character class, a dot in the pattern matches any
1137
-
one character in the subject, including a non-printing
1138
-
character, but not (by default) newline. If the
1137
+
Outside a character class, a dot in the pattern matches any
1138
+
one character in the subject, including a non-printing
1139
+
character, but not (by default) newline. If the
1139
1140
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1140
-
option is set, then dots match newlines as well. The
1141
+
option is set, then dots match newlines as well. The
1141
1142
handling of dot is entirely independent of the handling of
1142
-
circumflex and dollar, the only relationship being that they
1143
-
both involve newline characters. Dot has no special meaning
1143
+
circumflex and dollar, the only relationship being that they
1144
+
both involve newline characters. Dot has no special meaning
1144
1145
in a character class.
1145
1146
</para>
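A short sketch of the dot rules above, again assuming Python's `re` module as a stand-in for PCRE; `re.DOTALL` corresponds to the PCRE_DOTALL option here.

```python
import re

# By default a dot does not match a newline ...
dot_rejects_newline = re.fullmatch(r"a.c", "a\nc") is None

# ... but it does match other non-printing characters.
dot_accepts_control_char = re.fullmatch(r"a.c", "a\x01c") is not None

# With the dot-all option set, dots match newlines as well.
dotall_accepts_newline = re.fullmatch(r"a.c", "a\nc", re.DOTALL) is not None

# Inside a character class, a dot is just a literal dot.
class_dot_literal = re.fullmatch(r"a[.]c", "a.c") is not None
class_dot_not_wildcard = re.fullmatch(r"a[.]c", "abc") is None

print(dot_rejects_newline, dot_accepts_control_char, dotall_accepts_newline)
print(class_dot_literal, class_dot_not_wildcard)
```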
1146
1147
<para>
...
...
@@ -1154,29 +1155,29 @@
1154
1155
<title>Character classes</title>
1155
1156
<para>
1156
1157
An opening square bracket introduces a character class,
1157
-
terminated by a closing square bracket. A closing square
1158
-
bracket on its own is not special. If a closing square
1159
-
bracket is required as a member of the class, it should be
1158
+
terminated by a closing square bracket. A closing square
1159
+
bracket on its own is not special. If a closing square
1160
+
bracket is required as a member of the class, it should be
1160
1161
the first data character in the class (after an initial
1161
1162
circumflex, if present) or escaped with a backslash.
1162
1163
</para>
1163
1164
<para>
1164
1165
A character class matches a single character in the subject;
1165
-
the character must be in the set of characters defined by
1166
+
the character must be in the set of characters defined by
1166
1167
the class, unless the first character in the class is a
1167
-
circumflex, in which case the subject character must not be in
1168
-
the set defined by the class. If a circumflex is actually
1169
-
required as a member of the class, ensure it is not the
1168
+
circumflex, in which case the subject character must not be in
1169
+
the set defined by the class. If a circumflex is actually
1170
+
required as a member of the class, ensure it is not the
1170
1171
first character, or escape it with a backslash.
1171
1172
</para>
1172
1173
<para>
1173
-
For example, the character class [aeiou] matches any lower
1174
+
For example, the character class [aeiou] matches any lower
1174
1175
case vowel, while [^aeiou] matches any character that is not
1175
-
a lower case vowel. Note that a circumflex is just a
1176
-
convenient notation for specifying the characters which are in
1177
-
the class by enumerating those that are not. It is not an
1178
-
assertion: it still consumes a character from the subject
1179
-
string, and fails if the current pointer is at the end of
1176
+
a lower case vowel. Note that a circumflex is just a
1177
+
convenient notation for specifying the characters which are in
1178
+
the class by enumerating those that are not. It is not an
1179
+
assertion: it still consumes a character from the subject
1180
+
string, and fails if the current pointer is at the end of
1180
1181
the string.
1181
1182
</para>
1182
1183
<para>
...
...
@@ -1188,61 +1189,62 @@
1188
1189
</para>
1189
1190
<para>
1190
1191
The newline character is never treated in any special way in
1191
-
character classes, whatever the setting of the <link
1192
+
character classes, whatever the setting of the <link
1192
1193
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1193
1194
or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1194
1195
options is. A class such as [^a] will always match a newline.
1195
1196
</para>
1196
1197
<para>
1197
-
The minus (hyphen) character can be used to specify a range
1198
-
of characters in a character class. For example, [d-m]
1199
-
matches any letter between d and m, inclusive. If a minus
1200
-
character is required in a class, it must be escaped with a
1198
+
The minus (hyphen) character can be used to specify a range
1199
+
of characters in a character class. For example, [d-m]
1200
+
matches any letter between d and m, inclusive. If a minus
1201
+
character is required in a class, it must be escaped with a
1201
1202
backslash or appear in a position where it cannot be
1202
1203
interpreted as indicating a range, typically as the first or last
1203
1204
character in the class.
1204
1205
</para>
1205
1206
<para>
1206
-
It is not possible to have the literal character "]" as the
1207
-
end character of a range. A pattern such as [W-]46] is
1207
+
It is not possible to have the literal character "]" as the
1208
+
end character of a range. A pattern such as [W-]46] is
1208
1209
interpreted as a class of two characters ("W" and "-")
1209
1210
followed by a literal string "46]", so it would match "W46]" or
1210
-
"-46]". However, if the "]" is escaped with a backslash it
1211
-
is interpreted as the end of range, so [W-\]46] is
1212
-
interpreted as a single class containing a range followed by two
1211
+
"-46]". However, if the "]" is escaped with a backslash it
1212
+
is interpreted as the end of range, so [W-\]46] is
1213
+
interpreted as a single class containing a range followed by two
1213
1214
separate characters. The octal or hexadecimal representation
1214
1215
of "]" can also be used to end a range.
1215
1216
</para>
1216
1217
<para>
1217
1218
Ranges operate in ASCII collating sequence. They can also be
1218
-
used for characters specified numerically, for example
1219
-
[\000-\037]. If a range that includes letters is used when
1220
-
case-insensitive (caseless) matching is set, it matches the
1221
-
letters in either case. For example, [W-c] is equivalent to
1219
+
used for characters specified numerically, for example
1220
+
[\000-\037]. If a range that includes letters is used when
1221
+
case-insensitive (caseless) matching is set, it matches the
1222
+
letters in either case. For example, [W-c] is equivalent to
1222
1223
[][\^_`wxyzabc], matched case-insensitively, and if character
1223
1224
tables for the "fr" locale are in use, [\xc8-\xcb] matches
1224
1225
accented E characters in both cases.
1225
1226
</para>
1226
1227
<para>
1227
-
The character types \d, \D, \s, \S, \w, and \W may also
1228
-
appear in a character class, and add the characters that
1228
+
The character types \d, \D, \s, \S, \w, and \W may also
1229
+
appear in a character class, and add the characters that
1229
1230
they match to the class. For example, [\dABCDEF] matches any
1230
-
hexadecimal digit. A circumflex can conveniently be used
1231
-
with the upper case character types to specify a more
1231
+
hexadecimal digit. A circumflex can conveniently be used
1232
+
with the upper case character types to specify a more
1232
1233
restricted set of characters than the matching lower case type.
1233
-
For example, the class [^\W_] matches any letter or digit,
1234
+
For example, the class [^\W_] matches any letter or digit,
1234
1235
but not underscore.
1235
1236
</para>
1236
1237
<para>
1237
-
All non-alphanumeric characters other than \, -, ^ (at the
1238
-
start) and the terminating ] are non-special in character
1238
+
All non-alphanumeric characters other than \, -, ^ (at the
1239
+
start) and the terminating ] are non-special in character
1239
1240
classes, but it does no harm if they are escaped. The pattern
1240
1241
terminator is always special and must be escaped when used
1241
1242
within an expression.
1242
1243
</para>
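The character class rules in the preceding paragraphs can be checked with a small sketch. It assumes Python's `re` module as a stand-in for PCRE; `in_class` is a hypothetical helper, and `re.ASCII` is passed where needed so that `\W` uses ASCII-only definitions.

```python
import re


def in_class(pattern: str, ch: str, flags: int = 0) -> bool:
    """True when the single character ch is matched by the class (hypothetical helper)."""
    return re.fullmatch(pattern, ch, flags) is not None


# Plain and negated classes; a negated class still consumes a character,
# and that character may be a newline.
vowel = in_class(r"[aeiou]", "e")
upper_vowel = in_class(r"[aeiou]", "E")
negated_matches_newline = in_class(r"[^a]", "\n")

# Ranges are inclusive.
range_end = in_class(r"[d-m]", "m")
past_range = in_class(r"[d-m]", "n")

# [W-]46] is a two-character class ("W" and "-") followed by the literal "46]".
literal_tail = [re.fullmatch(r"[W-]46]", s) is not None for s in ("W46]", "-46]")]

# Escaping "]" makes it the end of a range: W..] plus "4" and "6".
escaped_range = [in_class(r"[W-\]46]", c) for c in ("[", "4", "5")]

# A range containing letters matches either case under caseless matching.
caseless = [in_class(r"[W-c]", c, re.IGNORECASE) for c in ("x", "C", "d")]

# Character types inside a class.
hex_digit = [in_class(r"[\dABCDEF]", c) for c in ("7", "C", "G")]
letter_or_digit = [in_class(r"[^\W_]", c, re.ASCII) for c in ("a", "5", "_")]

print(vowel, upper_vowel, negated_matches_newline, range_end, past_range)
print(literal_tail, escaped_range, caseless, hex_digit, letter_or_digit)
```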
1243
1244
<para>
1244
1245
Perl supports the POSIX notation for character classes. This uses names
1245
-
enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also
1246
+
enclosed by <literal>[:</literal> and <literal>:]</literal> within
1247
+
the enclosing square brackets. PCRE also
1246
1248
supports this notation. For example, <literal>[01[:alpha:]%]</literal>
1247
1249
matches "0", "1", any alphabetic character, or "%". The supported class
1248
1250
names are:
...
...
@@ -1297,16 +1299,16 @@
1297
1299
<section xml:id="regexp.reference.alternation">
1298
1300
<title>Alternation</title>
1299
1301
<para>
1300
-
Vertical bar characters are used to separate alternative
1302
+
Vertical bar characters are used to separate alternative
1301
1303
patterns. For example, the pattern
1302
1304
<literal>gilbert|sullivan</literal>
1303
1305
matches either "gilbert" or "sullivan". Any number of alternatives
1304
-
may appear, and an empty alternative is permitted
1305
-
(matching the empty string). The matching process tries
1306
-
each alternative in turn, from left to right, and the first
1307
-
one that succeeds is used. If the alternatives are within a
1308
-
subpattern (defined below), "succeeds" means matching the
1309
-
rest of the main pattern as well as the alternative in the
1306
+
may appear, and an empty alternative is permitted
1307
+
(matching the empty string). The matching process tries
1308
+
each alternative in turn, from left to right, and the first
1309
+
one that succeeds is used. If the alternatives are within a
1310
+
subpattern (defined below), "succeeds" means matching the
1311
+
rest of the main pattern as well as the alternative in the
1310
1312
subpattern.
1311
1313
</para>
1312
1314
</section>
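A brief sketch of the alternation rules above, assuming Python's `re` module as a stand-in for PCRE. It shows the left-to-right choice and what "succeeds" means inside a subpattern.

```python
import re

# Either alternative matches, but not both run together.
either = [
    re.fullmatch(r"gilbert|sullivan", s) is not None
    for s in ("gilbert", "sullivan", "gilbertsullivan")
]

# An empty alternative is permitted and matches the empty string.
empty_alternative = re.fullmatch(r"cat|", "") is not None

# Alternatives are tried from left to right; the first one that succeeds is used.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the second alternative is chosen when the first leaves "c" unmatched.
rest_must_match = re.match(r"(a|ab)c", "abc").group(1)

print(either, empty_alternative, first_wins, rest_must_match)
```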
...
...
@@ -1321,7 +1323,7 @@
1321
1323
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,
1322
1324
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1323
1325
and PCRE_DUPNAMES can be changed from within the pattern by
1324
-
a sequence of Perl option letters enclosed between "(?" and
1326
+
a sequence of Perl option letters enclosed between "(?" and
1325
1327
")". The option letters are:
1326
1328

1327
1329
<table>
...
...
@@ -1350,7 +1352,8 @@
1350
1352
</row>
1351
1353
<row>
1352
1354
<entry><literal>X</literal></entry>
1353
-
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> (no longer supported as of PHP 7.3.0)</entry>
1355
+
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>
1356
+
(no longer supported as of PHP 7.3.0)</entry>
1354
1357
</row>
1355
1358
<row>
1356
1359
<entry><literal>J</literal></entry>
...
...
@@ -1361,16 +1364,16 @@
1361
1364
</table>
1362
1365
</para>
1363
1366
<para>
1364
-
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1367
+
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1365
1368
also possible to unset these options by preceding the letter
1366
-
with a hyphen, and a combined setting and unsetting such as
1367
-
(?im-sx), which sets <link
1369
+
with a hyphen, and a combined setting and unsetting such as
1370
+
(?im-sx), which sets <link
1368
1371
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and
1369
1372
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1370
1373
while unsetting <link
1371
1374
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and
1372
1375
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,
1373
-
is also permitted. If a letter appears both before and after the
1376
+
is also permitted. If a letter appears both before and after the
1374
1377
hyphen, the option is unset.
1375
1378
</para>
1376
1379
<para>
...
...
@@ -1380,14 +1383,14 @@
1380
1383
and "abC".
1381
1384
</para>
1382
1385
<para>
1383
-
If an option change occurs inside a subpattern, the effect
1384
-
is different. This is a change of behaviour in Perl 5.005.
1385
-
An option change inside a subpattern affects only that part
1386
+
If an option change occurs inside a subpattern, the effect
1387
+
is different. This is a change of behaviour in Perl 5.005.
1388
+
An option change inside a subpattern affects only that part
1386
1389
of the subpattern that follows it, so
1387
1390

1388
1391
<literal>(a(?i)b)c</literal>
1389
1392

1390
-
matches abc and aBc and no other strings (assuming <link
1393
+
matches "abc" and "aBc" and no other strings (assuming <link
1391
1394
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not
1392
1395
used). By this means, options can be made to have different settings in
1393
1396
different parts of the pattern. Any changes made in one alternative do
...
...
@@ -1396,18 +1399,18 @@
1396
1399

1397
1400
<literal>(a(?i)b|c)</literal>
1398
1401

1399
-
matches "ab", "aB", "c", and "C", even though when matching
1402
+
matches "ab", "aB", "c", and "C", even though when matching
1400
1403
"C" the first branch is abandoned before the option setting.
1401
-
This is because the effects of option settings happen at
1402
-
compile time. There would be some very weird behaviour otherwise.
1404
+
This is because the effects of option settings happen at
1405
+
compile time. There would be some very weird behaviour otherwise.
1403
1406
</para>
1404
1407
<para>
1405
1408
The PCRE-specific options <link
1406
-
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1407
-
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1409
+
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1410
+
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1408
1411
be changed in the same way as the Perl-compatible options by
1409
-
using the characters U and X respectively. The (?X) flag
1410
-
setting is special in that it must always occur earlier in
1412
+
using the characters U and X respectively. The (?X) flag
1413
+
setting is special in that it must always occur earlier in
1411
1414
the pattern than any of the additional features it turns on,
1412
1415
even when it is at top level. It is best put at the start.
1413
1416
</para>
...
...
@@ -1416,8 +1419,8 @@
1416
1419
<section xml:id="regexp.reference.subpatterns">
1417
1420
<title>Subpatterns</title>
1418
1421
<para>
1419
-
Subpatterns are delimited by parentheses (round brackets),
1420
-
which can be nested. Marking part of a pattern as a subpattern
1422
+
Subpatterns are delimited by parentheses (round brackets),
1423
+
which can be nested. Marking part of a pattern as a subpattern
1421
1424
does two things:
1422
1425
</para>
1423
1426
<orderedlist>
...
...
@@ -1446,30 +1449,30 @@
1446
1449

1447
1450
<literal>the ((red|white) (king|queen))</literal>
1448
1451

1449
-
the captured substrings are "red king", "red", and "king",
1452
+
the captured substrings are "red king", "red", and "king",
1450
1453
and are numbered 1, 2, and 3.
1451
1454
</para>
1452
1455
<para>
1453
-
The fact that plain parentheses fulfill two functions is not
1454
-
always helpful. There are often times when a grouping subpattern
1455
-
is required without a capturing requirement. If an
1456
+
The fact that plain parentheses fulfill two functions is not
1457
+
always helpful. There are often times when a grouping subpattern
1458
+
is required without a capturing requirement. If an
1456
1459
opening parenthesis is followed by "?:", the subpattern does
1457
-
not do any capturing, and is not counted when computing the
1460
+
not do any capturing, and is not counted when computing the
1458
1461
number of any subsequent capturing subpatterns. For example,
1459
-
if the string "the white queen" is matched against the
1462
+
if the string "the white queen" is matched against the
1460
1463
pattern
1461
1464

1462
1465
<literal>the ((?:red|white) (king|queen))</literal>
1463
1466

1464
-
the captured substrings are "white queen" and "queen", and
1465
-
are numbered 1 and 2. The maximum number of captured substrings
1467
+
the captured substrings are "white queen" and "queen", and
1468
+
are numbered 1 and 2. The maximum number of captured substrings
1466
1469
is 65535. It may not be possible to compile such large patterns,
1467
1470
however, depending on the configuration options of libpcre.
1468
1471
</para>
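The numbering of captured substrings described above can be reproduced with a short sketch. It assumes Python's `re` module as a stand-in for PCRE; the patterns and subject strings are the ones quoted in the text.

```python
import re

# Capturing parentheses are numbered by their opening parenthesis, left to right.
capturing = re.search(
    r"the ((red|white) (king|queen))", "the red king"
).groups()

# "(?:" groups without capturing and is skipped when numbering later groups.
non_capturing = re.search(
    r"the ((?:red|white) (king|queen))", "the white queen"
).groups()

print(capturing)
print(non_capturing)
```

`groups()` returns the captures in order starting at number 1, which lines up with the numbering used in the paragraphs above.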
1469
1472
<para>
1470
-
As a convenient shorthand, if any option settings are
1471
-
required at the start of a non-capturing subpattern, the
1472
-
option letters may appear between the "?" and the ":". Thus
1473
+
As a convenient shorthand, if any option settings are
1474
+
required at the start of a non-capturing subpattern, the
1475
+
option letters may appear between the "?" and the ":". Thus
1473
1476
the two patterns
1474
1477
</para>
1475
1478

...
...
@@ -1483,10 +1486,10 @@
1483
1486
</informalexample>
1484
1487

1485
1488
<para>
1486
-
match exactly the same set of strings. Because alternative
1487
-
branches are tried from left to right, and options are not
1488
-
reset until the end of the subpattern is reached, an option
1489
-
setting in one branch does affect subsequent branches, so
1489
+
match exactly the same set of strings. Because alternative
1490
+
branches are tried from left to right, and options are not
1491
+
reset until the end of the subpattern is reached, an option
1492
+
setting in one branch does affect subsequent branches, so
1490
1493
the above patterns match "SUNDAY" as well as "Saturday".
1491
1494
</para>
1492
1495

...
...
@@ -1515,9 +1518,10 @@
1515
1518

1516
1519
<para>
1517
1520
Here <literal>Sun</literal> is stored in backreference 2, while
1518
-
backreference 1 is empty. Matching yields <literal>Sat</literal> in
1519
-
backreference 1 while backreference 2 does not exist. Changing the pattern
1520
-
to use the <literal>(?|</literal> fixes this problem:
1521
+
backreference 1 is empty. Matching <literal>Saturday</literal> yields
1522
+
<literal>Sat</literal> in backreference 1 while backreference 2 does
1523
+
not exist. Changing the pattern to use the <literal>(?|</literal> fixes
1524
+
this problem:
1521
1525
</para>
1522
1526

1523
1527
<informalexample>
...
...
@@ -1543,45 +1547,45 @@
1543
1547
<listitem><simpara>the . metacharacter</simpara></listitem>
1544
1548
<listitem><simpara>a character class</simpara></listitem>
1545
1549
<listitem><simpara>a back reference (see next section)</simpara></listitem>
1546
-
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1550
+
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1547
1551
see below)</simpara></listitem>
1548
1552
</itemizedlist>
1549
1553
</para>
1550
1554
<para>
1551
-
The general repetition quantifier specifies a minimum and
1552
-
maximum number of permitted matches, by giving the two
1553
-
numbers in curly brackets (braces), separated by a comma.
1554
-
The numbers must be less than 65536, and the first must be
1555
+
The general repetition quantifier specifies a minimum and
1556
+
maximum number of permitted matches, by giving the two
1557
+
numbers in curly brackets (braces), separated by a comma.
1558
+
The numbers must be less than 65536, and the first must be
1555
1559
less than or equal to the second. For example:
1556
1560

1557
1561
<literal>z{2,4}</literal>
1558
1562

1559
-
matches "zz", "zzz", or "zzzz". A closing brace on its own
1563
+
matches "zz", "zzz", or "zzzz". A closing brace on its own
1560
1564
is not a special character. If the second number is omitted,
1561
-
but the comma is present, there is no upper limit; if the
1565
+
but the comma is present, there is no upper limit; if the
1562
1566
second number and the comma are both omitted, the quantifier
1563
1567
specifies an exact number of required matches. Thus
1564
1568

1565
1569
<literal>[aeiou]{3,}</literal>
1566
1570

1567
-
matches at least 3 successive vowels, but may match many
1571
+
matches at least 3 successive vowels, but may match many
1568
1572
more, while
1569
1573

1570
1574
<literal>\d{8}</literal>
1571
1575

1572
-
matches exactly 8 digits. An opening curly bracket that
1573
-
appears in a position where a quantifier is not allowed, or
1576
+
matches exactly 8 digits. An opening curly bracket that
1577
+
appears in a position where a quantifier is not allowed, or
1574
1578
one that does not match the syntax of a quantifier, is taken
1575
-
as a literal character. For example, {,6} is not a quantifier,
1579
+
as a literal character. For example, {,6} is not a quantifier,
1576
1580
but a literal string of four characters.
1577
1581
</para>
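<para>
 The three quantifier forms above can be exercised with a minimal sketch.
 Python's <literal>re</literal> module is used here as an assumption-laden
 stand-in for PCRE, and the helper name <literal>fully_matches</literal> and
 the sample subjects are hypothetical.
</para>

```python
import re

def fully_matches(pattern, subject):
    """Return True when the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: a minimum of two and a maximum of four repetitions.
candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
between_two_and_four = [s for s in candidates if fully_matches(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
eight_digits = fully_matches(r"\d{8}", "20240131")
seven_digits = fully_matches(r"\d{8}", "2024013")

print(between_two_and_four, vowel_run, eight_digits, seven_digits)
```

<para>
 The engines differ in one detail: Python reads {,6} as a quantifier
 meaning zero to six repetitions, so the literal-string behaviour described
 above cannot be reproduced with this sketch.
</para>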
1578
1582
<para>
1579
-
The quantifier {0} is permitted, causing the expression to
1580
-
behave as if the previous item and the quantifier were not
1583
+
The quantifier {0} is permitted, causing the expression to
1584
+
behave as if the previous item and the quantifier were not
1581
1585
present.
1582
1586
</para>
1583
1587
<para>
1584
-
For convenience (and historical compatibility) the three
1588
+
For convenience (and historical compatibility) the three
1585
1589
most common quantifiers have single-character abbreviations:
1586
1590

1587
1591
<table>
...
...
@@ -1605,63 +1609,63 @@
1605
1609
</table>
1606
1610
</para>
1607
1611
<para>
1608
-
It is possible to construct infinite loops by following a
1609
-
subpattern that can match no characters with a quantifier
1612
+
It is possible to construct infinite loops by following a
1613
+
subpattern that can match no characters with a quantifier
1610
1614
that has no upper limit, for example:
1611
1615

1612
1616
<literal>(a?)*</literal>
1613
1617
</para>
1614
1618
<para>
1615
-
Earlier versions of Perl and PCRE used to give an error at
1616
-
compile time for such patterns. However, because there are
1617
-
cases where this can be useful, such patterns are now
1618
-
accepted, but if any repetition of the subpattern does in
1619
+
Earlier versions of Perl and PCRE used to give an error at
1620
+
compile time for such patterns. However, because there are
1621
+
cases where this can be useful, such patterns are now
1622
+
accepted, but if any repetition of the subpattern does in
1619
1623
fact match no characters, the loop is forcibly broken.
1620
1624
</para>
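<para>
 As a rough illustration, the sketch below runs the pattern through Python's
 <literal>re</literal> module, which is assumed here to guard against empty
 iterations much as PCRE does. It only shows that matching terminates; it
 does not show how either engine breaks the loop internally.
</para>

```python
import re

# The subpattern a? can match no characters, and * has no upper limit.
pattern = re.compile(r"(a?)*")

# Matching still terminates, both when there is something to consume...
whole = pattern.fullmatch("aaa")
# ...and when every iteration would match the empty string.
empty = pattern.match("bbb")

print(repr(whole.group(0)), repr(empty.group(0)))
```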
1621
1625
<para>
1622
-
By default, the quantifiers are "greedy", that is, they
1623
-
match as much as possible (up to the maximum number of permitted
1624
-
times), without causing the rest of the pattern to
1626
+
By default, the quantifiers are "greedy", that is, they
1627
+
match as much as possible (up to the maximum number of permitted
1628
+
times), without causing the rest of the pattern to
1625
1629
fail. The classic example of where this gives problems is in
1626
1630
trying to match comments in C programs. These appear between
1627
-
the sequences /* and */ and within the sequence, individual
1628
-
* and / characters may appear. An attempt to match C comments
1631
+
the sequences /* and */ and within the sequence, individual
1632
+
* and / characters may appear. An attempt to match C comments
1629
1633
by applying the pattern
1630
1634

1631
1635
<literal>/\*.*\*/</literal>
1632
1636

1633
1637
to the string
1634
1638

1635
-
<literal>/* first comment */ not comment /* second comment */</literal>
1639
+
<literal>/* first comment */ not comment /* second comment */</literal>
1636
1640

1637
-
fails, because it matches the entire string due to the
1638
-
greediness of the .* item.
1641
+
fails, because it matches the entire string due to the
1642
+
greediness of the .* item.
1639
1643
</para>
1640
1644
<para>
1641
-
However, if a quantifier is followed by a question mark,
1645
+
However, if a quantifier is followed by a question mark,
1642
1646
then it becomes lazy, and instead matches the minimum
1643
1647
number of times possible, so the pattern
1644
1648

1645
1649
<literal>/\*.*?\*/</literal>
1646
1650

1647
1651
does the right thing with the C comments. The meaning of the
1648
-
various quantifiers is not otherwise changed, just the preferred
1649
-
number of matches. Do not confuse this use of
1650
-
question mark with its use as a quantifier in its own right.
1652
+
various quantifiers is not otherwise changed, just the preferred
1653
+
number of matches. Do not confuse this use of
1654
+
question mark with its use as a quantifier in its own right.
1651
1655
Because it has two uses, it can sometimes appear doubled, as
1652
1656
in
1653
1657

1654
1658
<literal>\d??\d</literal>
1655
1659

1656
-
which matches one digit by preference, but can match two if
1660
+
which matches one digit by preference, but can match two if
1657
1661
that is the only way the rest of the pattern matches.
1658
1662
</para>
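<para>
 The difference between the greedy and lazy forms can be seen in a short
 sketch. It uses Python's <literal>re</literal> module, assumed to follow
 the same greedy and lazy rules as PCRE for these patterns; the variable
 names are illustrative.
</para>

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject, swallowing both comments.
greedy = re.findall(r"/\*.*\*/", subject)

# Lazy .*? stops at the first */ that lets the rest of the pattern match.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the whole subject must match.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(one_digit, two_digits)
```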
1659
1663
<para>
1660
1664
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>
1661
-
option is set (an option which is not
1662
-
available in Perl) then the quantifiers are not greedy by
1665
+
option is set (an option which is not
1666
+
available in Perl) then the quantifiers are not greedy by
1663
1667
default, but individual ones can be made greedy by following
1664
-
them with a question mark. In other words, it inverts the
1668
+
them with a question mark. In other words, it inverts the
1665
1669
default behaviour.
1666
1670
</para>
1667
1671
<para>
...
...
@@ -1673,41 +1677,41 @@
1673
1677
</para>
1674
1678
<para>
1675
1679
When a parenthesized subpattern is quantified with a minimum
1676
-
repeat count that is greater than 1 or with a limited maximum,
1677
-
more store is required for the compiled pattern, in
1680
+
repeat count that is greater than 1 or with a limited maximum,
1681
+
more memory is required for the compiled pattern, in
1678
1682
proportion to the size of the minimum or maximum.
1679
1683
</para>
1680
1684
<para>
1681
-
If a pattern starts with .* or .{0,} and the <link
1685
+
If a pattern starts with .* or .{0,} and the <link
1682
1686
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1683
1687
option (equivalent to Perl's /s) is set, thus allowing the .
1684
-
to match newlines, then the pattern is implicitly anchored,
1688
+
to match newlines, then the pattern is implicitly anchored,
1685
1689
because whatever follows will be tried against every character
1686
-
position in the subject string, so there is no point in
1687
-
retrying the overall match at any position after the first.
1690
+
position in the subject string, so there is no point in
1691
+
retrying the overall match at any position after the first.
1688
1692
PCRE treats such a pattern as though it were preceded by \A.
1689
-
In cases where it is known that the subject string contains
1693
+
In cases where it is known that the subject string contains
1690
1694
no newlines, it is worth setting <link
1691
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1695
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1692
1696
pattern begins with .* in order to
1693
1697
obtain this optimization, or
1694
1698
alternatively using ^ to indicate anchoring explicitly.
1695
1699
</para>
1696
1700
<para>
1697
-
When a capturing subpattern is repeated, the value captured
1701
+
When a capturing subpattern is repeated, the value captured
1698
1702
is the substring that matched the final iteration. For example, after
1699
1703

1700
1704
<literal>(tweedle[dume]{3}\s*)+</literal>
1701
1705

1702
-
has matched "tweedledum tweedledee" the value of the captured
1703
-
substring is "tweedledee". However, if there are
1704
-
nested capturing subpatterns, the corresponding captured
1705
-
values may have been set in previous iterations. For example,
1706
+
has matched "tweedledum tweedledee" the value of the captured
1707
+
substring is "tweedledee". However, if there are
1708
+
nested capturing subpatterns, the corresponding captured
1709
+
values may have been set in previous iterations. For example,
1706
1710
after
1707
1711

1708
1712
<literal>/(a|(b))+/</literal>
1709
1713

1710
-
matches "aba" the value of the second captured substring is
1714
+
matches "aba" the value of the second captured substring is
1711
1715
"b".
1712
1716
</para>
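<para>
 Both cases can be reproduced with a small sketch in Python's
 <literal>re</literal> module, which is assumed to keep captured values
 across iterations in the same way as PCRE. The pattern delimiters are
 dropped because Python takes the pattern as a plain string.
</para>

```python
import re

# A repeated capturing subpattern keeps the text of its final iteration.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# The nested group 2 is not touched by the final iteration, which matches
# "a", so it still holds the value set in the previous iteration.
nested = re.match(r"(a|(b))+", "aba")
outer, inner = nested.group(1), nested.group(2)

print(last_iteration, outer, inner)
```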
1713
1717
</section>
...
...
@@ -1715,74 +1719,74 @@
1715
1719
<section xml:id="regexp.reference.back-references">
1716
1720
<title>Back references</title>
1717
1721
<para>
1718
-
Outside a character class, a backslash followed by a digit
1719
-
greater than 0 (and possibly further digits) is a back
1720
-
reference to a capturing subpattern earlier (i.e. to its
1721
-
left) in the pattern, provided there have been that many
1722
+
Outside a character class, a backslash followed by a digit
1723
+
greater than 0 (and possibly further digits) is a back
1724
+
reference to a capturing subpattern earlier (i.e. to its
1725
+
left) in the pattern, provided there have been that many
1722
1726
previous capturing left parentheses.
1723
1727
</para>
1724
1728
<para>
1725
-
However, if the decimal number following the backslash is
1726
-
less than 10, it is always taken as a back reference, and
1727
-
causes an error only if there are not that many capturing
1728
-
left parentheses in the entire pattern. In other words, the
1729
-
parentheses that are referenced need not be to the left of
1730
-
the reference for numbers less than 10.
1729
+
However, if the decimal number following the backslash is
1730
+
less than 10, it is always taken as a back reference, and
1731
+
causes an error only if there are not that many capturing
1732
+
left parentheses in the entire pattern. In other words, the
1733
+
parentheses that are referenced need not be to the left of
1734
+
the reference for numbers less than 10.
1731
1735
A "forward back reference" can make sense when a repetition
1732
1736
is involved and the subpattern to the right has participated
1733
1737
in an earlier iteration. See the section
1734
-
entitled "Backslash" above for further details of the handling
1738
+
on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling
1735
1739
of digits following a backslash.
1736
1740
</para>
1737
1741
<para>
1738
-
A back reference matches whatever actually matched the capturing
1742
+
A back reference matches whatever actually matched the capturing
1739
1743
subpattern in the current subject string, rather than
1740
1744
anything matching the subpattern itself. So the pattern
1741
1745

1742
1746
<literal>(sens|respons)e and \1ibility</literal>
1743
1747

1744
-
matches "sense and sensibility" and "response and responsibility",
1745
-
but not "sense and responsibility". If case-sensitive (caseful)
1748
+
matches "sense and sensibility" and "response and responsibility",
1749
+
but not "sense and responsibility". If case-sensitive (caseful)
1746
1750
matching is in force at the time of the back reference, then
1747
1751
the case of letters is relevant. For example,
1748
1752

1749
1753
<literal>((?i)rah)\s+\1</literal>
1750
1754

1751
-
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1752
-
though the original capturing subpattern is matched
1755
+
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1756
+
though the original capturing subpattern is matched
1753
1757
case-insensitively (caselessly).
1754
1758
</para>
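<para>
 A minimal sketch of the first example follows, using Python's
 <literal>re</literal> module as a stand-in for PCRE. The names
 <literal>subjects</literal> and <literal>results</literal> are hypothetical.
</para>

```python
import re

# \1 must repeat the text that group 1 actually captured,
# not merely something that the subpattern could match.
pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
results = {s: pattern.fullmatch(s) is not None for s in subjects}

for subject, matched in results.items():
    print(subject, "matches" if matched else "does not match")
```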
1755
1759
<para>
1756
-
There may be more than one back reference to the same subpattern.
1757
-
If a subpattern has not actually been used in a
1758
-
particular match, then any back references to it always
1760
+
There may be more than one back reference to the same subpattern.
1761
+
If a subpattern has not actually been used in a
1762
+
particular match, then any back references to it always
1759
1763
fail. For example, the pattern
1760
1764

1761
1765
<literal>(a|(bc))\2</literal>
1762
1766

1763
-
always fails if it starts to match "a" rather than "bc".
1764
-
Because there may be up to 99 back references, all digits
1765
-
following the backslash are taken as part of a potential
1767
+
always fails if it starts to match "a" rather than "bc".
1768
+
Because there may be up to 99 back references, all digits
1769
+
following the backslash are taken as part of a potential
1766
1770
back reference number. If the pattern continues with a digit
1767
1771
character, then some delimiter must be used to terminate the
1768
1772
back reference. If the <link
1769
-
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1770
-
is set, this can be whitespace. Otherwise an empty comment can be used.
1773
+
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1774
+
is set, this can be whitespace. Otherwise an empty comment can be used.
1771
1775
</para>
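<para>
 The failing case for an unused subpattern can be sketched as follows.
 Python's <literal>re</literal> module is used on the assumption that, like
 PCRE, it fails a back reference to a group that took no part in the match.
</para>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is only set when the second alternative is taken.
unused_group = pattern.match("a")    # group 2 never set, so \2 fails
also_unused = pattern.match("abc")   # "a" is taken first, \2 still unset
used_group = pattern.match("bcbc")   # group 2 is "bc", and \2 repeats it

print(unused_group, also_unused, used_group.group(0))
```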
1772
1776
<para>
1773
1777
A back reference that occurs inside the parentheses to which
1774
-
it refers fails when the subpattern is first used, so, for
1775
-
example, (a\1) never matches. However, such references can
1778
+
it refers fails when the subpattern is first used, so, for
1779
+
example, (a\1) never matches. However, such references can
1776
1780
be useful inside repeated subpatterns. For example, the pattern
1777
1781

1778
1782
<literal>(a|b\1)+</literal>
1779
1783

1780
-
matches any number of "a"s and also "aba", "ababba" etc. At
1784
+
matches any number of "a"s and also "aba", "ababba" etc. At
1781
1785
each iteration of the subpattern, the back reference matches
1782
-
the character string corresponding to the previous iteration.
1786
+
the character string corresponding to the previous iteration.
1783
1787
In order for this to work, the pattern must be such
1784
-
that the first iteration does not need to match the back
1785
-
reference. This can be done using alternation, as in the
1788
+
that the first iteration does not need to match the back
1789
+
reference. This can be done using alternation, as in the
1786
1790
example above, or by a quantifier with a minimum of zero.
1787
1791
</para>
1788
1792
<para>
...
...
@@ -1817,18 +1821,18 @@
1817
1821
<section xml:id="regexp.reference.assertions">
1818
1822
<title>Assertions</title>
1819
1823
<para>
1820
-
An assertion is a test on the characters following or
1821
-
preceding the current matching point that does not actually
1822
-
consume any characters. The simple assertions coded as \b,
1823
-
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1824
-
assertions are coded as subpatterns. There are two
1825
-
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1824
+
An assertion is a test on the characters following or
1825
+
preceding the current matching point that does not actually
1826
+
consume any characters. The simple assertions coded as \b,
1827
+
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1828
+
assertions are coded as subpatterns. There are two
1829
+
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1826
1830
subject string, and those that <emphasis>look behind</emphasis> it.
1827
1831
</para>
1828
1832
<para>
1829
1833
An assertion subpattern is matched in the normal way, except
1830
-
that it does not cause the current matching position to be
1831
-
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1834
+
that it does not cause the current matching position to be
1835
+
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1832
1836
assertions and (?! for negative assertions. For example,
1833
1837

1834
1838
<literal>\w+(?=;)</literal>
...
...
@@ -1838,27 +1842,27 @@
1838
1842

1839
1843
<literal>foo(?!bar)</literal>
1840
1844

1841
-
matches any occurrence of "foo" that is not followed by
1845
+
matches any occurrence of "foo" that is not followed by
1842
1846
"bar". Note that the apparently similar pattern
1843
1847

1844
1848
<literal>(?!foo)bar</literal>
1845
1849

1846
-
does not find an occurrence of "bar" that is preceded by
1850
+
does not find an occurrence of "bar" that is preceded by
1847
1851
something other than "foo"; it finds any occurrence of "bar"
1848
-
whatsoever, because the assertion (?!foo) is always &true;
1849
-
when the next three characters are "bar". A lookbehind
1852
+
whatsoever, because the assertion (?!foo) is always &true;
1853
+
when the next three characters are "bar". A lookbehind
1850
1854
assertion is needed to achieve this effect.
1851
1855
</para>
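<para>
 The lookahead examples can be tried with a short sketch. It relies on
 Python's <literal>re</literal> module, assumed to implement lookahead in
 the same way as PCRE; the subject strings are made up for illustration.
</para>

```python
import re

# Positive lookahead: the word is matched, the semicolon is not consumed.
word = re.search(r"\w+(?=;)", "word; next").group()

# Negative lookahead: the first "foo" is followed by "bar", so it is skipped.
not_followed = re.search(r"foo(?!bar)", "foobar foobaz")
position = not_followed.start()

# (?!foo)bar still finds "bar" after "foo": at the point where "bar"
# starts, the next three characters are "bar", so (?!foo) succeeds.
any_bar = re.search(r"(?!foo)bar", "foobar")

print(word, position, any_bar.group(), any_bar.start())
```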
1852
1856
<para>
1853
-
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1857
+
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1854
1858
and (?&lt;! for negative assertions. For example,
1855
1859

1856
1860
<literal>(?&lt;!foo)bar</literal>
1857
1861

1858
-
does find an occurrence of "bar" that is not preceded by
1862
+
does find an occurrence of "bar" that is not preceded by
1859
1863
"foo". The contents of a lookbehind assertion are restricted
1860
-
such that all the strings it matches must have a fixed
1861
-
length. However, if there are several alternatives, they do
1864
+
such that all the strings it matches must have a fixed
1865
+
length. However, if there are several alternatives, they do
1862
1866
not all have to have the same fixed length. Thus
1863
1867

1864
1868
<literal>(?&lt;=bullock|donkey)</literal>
...
...
@@ -1867,51 +1871,51 @@
1867
1871

1868
1872
<literal>(?&lt;!dogs?|cats?)</literal>
1869
1873

1870
-
causes an error at compile time. Branches that match different
1874
+
causes an error at compile time. Branches that match different
1871
1875
length strings are permitted only at the top level of
1872
-
a lookbehind assertion. This is an extension compared with
1873
-
Perl 5.005, which requires all branches to match the same
1876
+
a lookbehind assertion. This is an extension compared with
1877
+
Perl 5.005, which requires all branches to match the same
1874
1878
length of string. An assertion such as
1875
1879

1876
1880
<literal>(?&lt;=ab(c|de))</literal>
1877
1881

1878
-
is not permitted, because its single top-level branch can
1882
+
is not permitted, because its single top-level branch can
1879
1883
match two different lengths, but it is acceptable if rewritten
1880
1884
to use two top-level branches:
1881
1885

1882
1886
<literal>(?&lt;=abc|abde)</literal>
1883
1887

1884
-
The implementation of lookbehind assertions is, for each
1885
-
alternative, to temporarily move the current position back
1886
-
by the fixed width and then try to match. If there are
1887
-
insufficient characters before the current position, the
1888
-
match is deemed to fail. Lookbehinds in conjunction with
1889
-
once-only subpatterns can be particularly useful for matching
1890
-
at the ends of strings; an example is given at the end
1888
+
The implementation of lookbehind assertions is, for each
1889
+
alternative, to temporarily move the current position back
1890
+
by the fixed width and then try to match. If there are
1891
+
insufficient characters before the current position, the
1892
+
match is deemed to fail. Lookbehinds in conjunction with
1893
+
once-only subpatterns can be particularly useful for matching
1894
+
at the ends of strings; an example is given at the end
1891
1895
of the section on once-only subpatterns.
1892
1896
</para>
1893
1897
<para>
1894
-
Several assertions (of any sort) may occur in succession.
1898
+
Several assertions (of any sort) may occur in succession.
1895
1899
For example,
1896
1900

1897
1901
<literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
1898
1902

1899
-
matches "foo" preceded by three digits that are not "999".
1900
-
Notice that each of the assertions is applied independently
1901
-
at the same point in the subject string. First there is a
1902
-
check that the previous three characters are all digits,
1903
+
matches "foo" preceded by three digits that are not "999".
1904
+
Notice that each of the assertions is applied independently
1905
+
at the same point in the subject string. First there is a
1906
+
check that the previous three characters are all digits,
1903
1907
then there is a check that the same three characters are not
1904
-
"999". This pattern does not match "foo" preceded by six
1908
+
"999". This pattern does not match "foo" preceded by six
1905
1909
characters, the first three of which are digits and the last three
1906
-
of which are not "999". For example, it doesn't match
1910
+
of which are not "999". For example, it doesn't match
1907
1911
"123abcfoo". A pattern to do that is
1908
1912

1909
1913
<literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
1910
1914
</para>
1911
1915
<para>
1912
-
This time the first assertion looks at the preceding six
1913
-
characters, checking that the first three are digits, and
1914
-
then the second assertion checks that the preceding three
1916
+
This time the first assertion looks at the preceding six
1917
+
characters, checking that the first three are digits, and
1918
+
then the second assertion checks that the preceding three
1915
1919
characters are not "999".
1916
1920
</para>
1917
1921
<para>
...
...
@@ -1919,26 +1923,26 @@
1919
1923

1920
1924
<literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
1921
1925

1922
-
matches an occurrence of "baz" that is preceded by "bar"
1926
+
matches an occurrence of "baz" that is preceded by "bar"
1923
1927
which in turn is not preceded by "foo", while
1924
1928

1925
1929
<literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>
1926
1930

1927
-
is another pattern which matches "foo" preceded by three
1931
+
is another pattern which matches "foo" preceded by three
1928
1932
digits and any three characters that are not "999".
1929
1933
</para>
1930
1934
<para>
1931
1935
Assertion subpatterns are not capturing subpatterns, and may
1932
-
not be repeated, because it makes no sense to assert the
1933
-
same thing several times. If any kind of assertion contains
1934
-
capturing subpatterns within it, these are counted for the
1936
+
not be repeated, because it makes no sense to assert the
1937
+
same thing several times. If any kind of assertion contains
1938
+
capturing subpatterns within it, these are counted for the
1935
1939
purposes of numbering the capturing subpatterns in the whole
1936
-
pattern. However, substring capturing is carried out only
1937
-
for positive assertions, because it does not make sense for
1940
+
pattern. However, substring capturing is carried out only
1941
+
for positive assertions, because it does not make sense for
1938
1942
negative assertions.
1939
1943
</para>
1940
1944
<para>
1941
-
Assertions count towards the maximum of 200 parenthesized
1945
+
Assertions count towards the maximum of 200 parenthesized
1942
1946
subpatterns.
1943
1947
</para>
1944
1948
</section>
...
...
@@ -1946,17 +1950,17 @@
1946
1950
<section xml:id="regexp.reference.onlyonce">
1947
1951
<title>Once-only subpatterns</title>
1948
1952
<para>
1949
-
With both maximizing and minimizing repetition, failure of
1950
-
what follows normally causes the repeated item to be
1953
+
With both maximizing and minimizing repetition, failure of
1954
+
what follows normally causes the repeated item to be
1951
1955
re-evaluated to see if a different number of repeats allows the
1952
-
rest of the pattern to match. Sometimes it is useful to
1953
-
prevent this, either to change the nature of the match, or
1954
-
to cause it fail earlier than it otherwise might, when the
1955
-
author of the pattern knows there is no point in carrying
1956
+
rest of the pattern to match. Sometimes it is useful to
1957
+
prevent this, either to change the nature of the match, or
1958
+
to cause it to fail earlier than it otherwise might, when the
1959
+
author of the pattern knows there is no point in carrying
1956
1960
on.
1957
1961
</para>
1958
1962
<para>
1959
-
Consider, for example, the pattern \d+foo when applied to
1963
+
Consider, for example, the pattern \d+foo when applied to
1960
1964
the subject line
1961
1965

1962
1966
<literal>123456bar</literal>
...
...
@@ -1964,108 +1968,108 @@
1964
1968
<para>
1965
1969
After matching all 6 digits and then failing to match "foo",
1966
1970
the normal action of the matcher is to try again with only 5
1967
-
digits matching the \d+ item, and then with 4, and so on,
1971
+
digits matching the \d+ item, and then with 4, and so on,
1968
1972
before ultimately failing. Once-only subpatterns provide the
1969
-
means for specifying that once a portion of the pattern has
1970
-
matched, it is not to be re-evaluated in this way, so the
1971
-
matcher would give up immediately on failing to match "foo"
1972
-
the first time. The notation is another kind of special
1973
+
means for specifying that once a portion of the pattern has
1974
+
matched, it is not to be re-evaluated in this way, so the
1975
+
matcher would give up immediately on failing to match "foo"
1976
+
the first time. The notation is another kind of special
1973
1977
parenthesis, starting with (?&gt; as in this example:
1974
1978

1975
1979
<literal>(?&gt;\d+)foo</literal>
1976
1980
</para>
1977
1981
<para>
1978
-
This kind of parenthesis "locks up" the part of the pattern
1979
-
it contains once it has matched, and a failure further into
1980
-
the pattern is prevented from backtracking into it.
1981
-
Backtracking past it to previous items, however, works as normal.
1982
+
This kind of parenthesis "locks up" the part of the pattern
1983
+
it contains once it has matched, and a failure further into
1984
+
the pattern is prevented from backtracking into it.
1985
+
Backtracking past it to previous items, however, works as normal.
1982
1986
</para>
1983
1987
<para>
1984
1988
An alternative description is that a subpattern of this type
1985
-
matches the string of characters that an identical standalone
1989
+
matches the string of characters that an identical standalone
1986
1990
pattern would match, if anchored at the current point
1987
1991
in the subject string.
1988
1992
</para>
1989
1993
<para>
1990
-
Once-only subpatterns are not capturing subpatterns. Simple
1991
-
cases such as the above example can be thought of as a maximizing
1992
-
repeat that must swallow everything it can. So,
1994
+
Once-only subpatterns are not capturing subpatterns. Simple
1995
+
cases such as the above example can be thought of as a maximizing
1996
+
repeat that must swallow everything it can. So,
1993
1997
while both \d+ and \d+? are prepared to adjust the number of
1994
-
digits they match in order to make the rest of the pattern
1998
+
digits they match in order to make the rest of the pattern
1995
1999
match, (?&gt;\d+) can only match an entire sequence of digits.
1996
2000
</para>
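<para>
 The effect on backtracking can be sketched with Python's
 <literal>re</literal> module. This rests on two assumptions: that Python's
 (?&gt; groups behave like PCRE's once-only subpatterns, and that the
 interpreter is version 3.11 or newer, where this syntax is accepted. The
 helper <literal>finds</literal> and the pattern ending in "5" are
 hypothetical additions.
</para>

```python
import re
import sys

# Assumption: atomic groups are only accepted by Python 3.11 or newer.
ATOMIC_SUPPORTED = sys.version_info >= (3, 11)

def finds(pattern, subject):
    """Return True when the pattern matches somewhere in the subject."""
    return re.search(pattern, subject) is not None

# An ordinary \d+ gives back the final digit so that the literal 5 can match.
backtracking_ok = finds(r"\d+5", "12345")

if ATOMIC_SUPPORTED:
    # The once-only group keeps every digit, so nothing is left for the 5.
    atomic_blocked = not finds(r"(?>\d+)5", "12345")
    # It still matches when the digits are followed by the expected text.
    atomic_ok = finds(r"(?>\d+)foo", "123456foo")
else:
    atomic_blocked = atomic_ok = None

print(backtracking_ok, atomic_blocked, atomic_ok)
```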
1997
2001
<para>
1998
-
This construction can of course contain arbitrarily complicated
2002
+
This construction can of course contain arbitrarily complicated
1999
2003
subpatterns, and it can be nested.
2000
2004
</para>
2001
2005
<para>
2002
2006
Once-only subpatterns can be used in conjunction with
2003
-
lookbehind assertions to specify efficient matching at the end
2007
+
lookbehind assertions to specify efficient matching at the end
2004
2008
of the subject string. Consider a simple pattern such as
2005
2009

2006
2010
<literal>abcd$</literal>
2007
2011

2008
-
when applied to a long string which does not match. Because
2009
-
matching proceeds from left to right, PCRE will look for
2012
+
when applied to a long string which does not match. Because
2013
+
matching proceeds from left to right, PCRE will look for
2010
2014
each "a" in the subject and then see if what follows matches
2011
2015
the rest of the pattern. If the pattern is specified as
2012
2016

2013
2017
<literal>^.*abcd$</literal>
2014
2018

2015
-
then the initial .* matches the entire string at first, but
2016
-
when this fails (because there is no following "a"), it
2019
+
then the initial .* matches the entire string at first, but
2020
+
when this fails (because there is no following "a"), it
2017
2021
backtracks to match all but the last character, then all but
2018
-
the last two characters, and so on. Once again the search
2019
-
for "a" covers the entire string, from right to left, so we
2022
+
the last two characters, and so on. Once again the search
2023
+
for "a" covers the entire string, from right to left, so we
2020
2024
are no better off. However, if the pattern is written as
2021
2025

2022
2026
<literal>^(?>.*)(?&lt;=abcd)</literal>
2023
2027

2024
-
then there can be no backtracking for the .* item; it can
2025
-
match only the entire string. The subsequent lookbehind
2028
+
then there can be no backtracking for the .* item; it can
2029
+
match only the entire string. The subsequent lookbehind
2026
2030
assertion does a single test on the last four characters. If
2027
-
it fails, the match fails immediately. For long strings,
2031
+
it fails, the match fails immediately. For long strings,
2028
2032
this approach makes a significant difference to the processing time.
2029
2033
</para>
2030
2034
<para>
2031
2035
When a pattern contains an unlimited repeat inside a subpattern
2032
2036
that can itself be repeated an unlimited number of
2033
-
times, the use of a once-only subpattern is the only way to
2034
-
avoid some failing matches taking a very long time indeed.
2037
+
times, the use of a once-only subpattern is the only way to
2038
+
avoid some failing matches taking a very long time indeed.
2035
2039
The pattern
2036
2040

2037
2041
<literal>(\D+|&lt;\d+>)*[!?]</literal>
2038
2042

2039
-
matches an unlimited number of substrings that either consist
2040
-
of non-digits, or digits enclosed in &lt;>, followed by
2043
+
matches an unlimited number of substrings that either consist
2044
+
of non-digits, or digits enclosed in &lt;>, followed by
2041
2045
either ! or ?. When it matches, it runs quickly. However, if
2042
2046
it is applied to
2043
2047

2044
2048
<literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
2045
2049

2046
-
it takes a long time before reporting failure. This is
2050
+
it takes a long time before reporting failure. This is
2047
2051
because the string can be divided between the two repeats in
2048
2052
a large number of ways, and all have to be tried. (The example
2049
-
used [!?] rather than a single character at the end,
2050
-
because both PCRE and Perl have an optimization that allows
2051
-
for fast failure when a single character is used. They
2052
-
remember the last single character that is required for a
2053
-
match, and fail early if it is not present in the string.)
2053
+
used [!?] rather than a single character at the end,
2054
+
because both PCRE and Perl have an optimization that allows
2055
+
for fast failure when a single character is used. They
2056
+
remember the last single character that is required for a
2057
+
match, and fail early if it is not present in the string.)
2054
2058
If the pattern is changed to
2055
2059

2056
2060
<literal>((?>\D+)|&lt;\d+>)*[!?]</literal>
2057
2061

2058
-
sequences of non-digits cannot be broken, and failure happens quickly.
2062
+
sequences of non-digits cannot be broken, and failure happens quickly.
2059
2063
</para>
2060
2064
</section>
2061
2065

2062
2066
<section xml:id="regexp.reference.conditional">
2063
2067
<title>Conditional subpatterns</title>
2064
2068
<para>
2065
-
It is possible to cause the matching process to obey a subpattern
2066
-
conditionally or to choose between two alternative
2067
-
subpatterns, depending on the result of an assertion, or
2068
-
whether a previous capturing subpattern matched or not. The
2069
+
It is possible to cause the matching process to obey a subpattern
2070
+
conditionally or to choose between two alternative
2071
+
subpatterns, depending on the result of an assertion, or
2072
+
whether a previous capturing subpattern matched or not. The
2069
2073
two possible forms of conditional subpattern are
2070
2074
</para>
2071
2075

...
...
@@ -2079,39 +2083,39 @@
2079
2083
</informalexample>
2080
2084
<para>
2081
2085
If the condition is satisfied, the yes-pattern is used; otherwise
2082
-
the no-pattern (if present) is used. If there are
2086
+
the no-pattern (if present) is used. If there are
2083
2087
more than two alternatives in the subpattern, a compile-time
2084
2088
error occurs.
2085
2089
</para>
2086
2090
<para>
2087
-
There are two kinds of condition. If the text between the
2088
-
parentheses consists of a sequence of digits, then the
2089
-
condition is satisfied if the capturing subpattern of that
2090
-
number has previously matched. Consider the following pattern,
2091
-
which contains non-significant white space to make it
2092
-
more readable (assume the <link
2091
+
There are two kinds of condition. If the text between the
2092
+
parentheses consists of a sequence of digits, then the
2093
+
condition is satisfied if the capturing subpattern of that
2094
+
number has previously matched. Consider the following pattern,
2095
+
which contains non-significant white space to make it
2096
+
more readable (assume the <link
2093
2097
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2094
-
option) and to divide it into three parts for ease of discussion:
2098
+
option) and to divide it into three parts for ease of discussion:
2095
2099
</para>
2096
2100
<informalexample>
2097
2101
<programlisting>
2098
2102
<![CDATA[
2099
-
( \( )? [^()]+ (?(1) \) )
2103
+
( \( )? [^()]+ (?(1) \) )
2100
2104
]]>
2101
2105
</programlisting>
2102
2106
</informalexample>
2103
2107
<para>
2104
-
The first part matches an optional opening parenthesis, and
2105
-
if that character is present, sets it as the first captured
2106
-
substring. The second part matches one or more characters
2107
-
that are not parentheses. The third part is a conditional
2108
-
subpattern that tests whether the first set of parentheses
2109
-
matched or not. If they did, that is, if the subject started
2110
-
with an opening parenthesis, the condition is &true;, and so
2111
-
the yes-pattern is executed and a closing parenthesis is
2112
-
required. Otherwise, since no-pattern is not present, the
2113
-
subpattern matches nothing. In other words, this pattern
2114
-
matches a sequence of non-parentheses, optionally enclosed
2108
+
The first part matches an optional opening parenthesis, and
2109
+
if that character is present, sets it as the first captured
2110
+
substring. The second part matches one or more characters
2111
+
that are not parentheses. The third part is a conditional
2112
+
subpattern that tests whether the first set of parentheses
2113
+
matched or not. If they did, that is, if the subject started
2114
+
with an opening parenthesis, the condition is &true;, and so
2115
+
the yes-pattern is executed and a closing parenthesis is
2116
+
required. Otherwise, since no-pattern is not present, the
2117
+
subpattern matches nothing. In other words, this pattern
2118
+
matches a sequence of non-parentheses, optionally enclosed
2115
2119
in parentheses.
2116
2120
</para>
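<para>
As a minimal sketch of the three-part pattern described above, the following Python snippet uses the re module, which is assumed here to treat a numeric condition the same way. The names OPTIONALLY_PARENTHESIZED and is_optionally_parenthesized are hypothetical helpers introduced only for this illustration.
</para>

```python
import re

# Same three parts as the pattern in the text; re.VERBOSE plays the role of
# the extended option, so the white space in the pattern is ignored.
OPTIONALLY_PARENTHESIZED = re.compile(
    r"( \( )? [^()]+ (?(1) \) )",
    re.VERBOSE,
)


def is_optionally_parenthesized(subject):
    """True if the whole subject is non-parentheses, optionally in ( )."""
    return OPTIONALLY_PARENTHESIZED.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ("abc", "(abc)", "(abc", "abc)"):
        print(subject, is_optionally_parenthesized(subject))
```

<para>
The closing parenthesis is demanded only when group 1 took part in the match, so an unbalanced subject is rejected in either direction.
</para>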
2117
2121
<para>
...
...
@@ -2120,10 +2124,10 @@
2120
2124
level", the condition is false.
2121
2125
</para>
2122
2126
<para>
2123
-
If the condition is not a sequence of digits or (R), it must be an
2124
-
assertion. This may be a positive or negative lookahead or
2125
-
lookbehind assertion. Consider this pattern, again containing
2126
-
non-significant white space, and with the two alternatives on
2127
+
If the condition is not a sequence of digits or (R), it must be an
2128
+
assertion. This may be a positive or negative lookahead or
2129
+
lookbehind assertion. Consider this pattern, again containing
2130
+
non-significant white space, and with the two alternatives on
2127
2131
the second line:
2128
2132
</para>
2129
2133

...
...
@@ -2131,18 +2135,18 @@
2131
2135
<programlisting>
2132
2136
<![CDATA[
2133
2137
(?(?=[^a-z]*[a-z])
2134
-
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2138
+
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2135
2139
]]>
2136
2140
</programlisting>
2137
2141
</informalexample>
2138
2142
<para>
2139
2143
The condition is a positive lookahead assertion that matches
2140
2144
an optional sequence of non-letters followed by a letter. In
2141
-
other words, it tests for the presence of at least one
2142
-
letter in the subject. If a letter is found, the subject is
2143
-
matched against the first alternative; otherwise it is
2144
-
matched against the second. This pattern matches strings in
2145
-
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2145
+
other words, it tests for the presence of at least one
2146
+
letter in the subject. If a letter is found, the subject is
2147
+
matched against the first alternative; otherwise it is
2148
+
matched against the second. This pattern matches strings in
2149
+
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2146
2150
letters and dd are digits.
2147
2151
</para>
2148
2152
</section>
...
...
@@ -2150,14 +2154,14 @@
2150
2154
<section xml:id="regexp.reference.comments">
2151
2155
<title>Comments</title>
2152
2156
<para>
2153
-
The sequence (?# marks the start of a comment which
2154
-
continues up to the next closing parenthesis. Nested
2157
+
The sequence (?# marks the start of a comment which
2158
+
continues up to the next closing parenthesis. Nested
2155
2159
parentheses are not permitted. The characters that make up a
2156
2160
comment play no part in the pattern matching at all.
2157
2161
</para>
2158
2162
<para>
2159
2163
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2160
-
option is set, an unescaped # character outside a character class
2164
+
option is set, an unescaped # character outside a character class
2161
2165
introduces a comment that continues up to the next newline character
2162
2166
in the pattern.
2163
2167
</para>
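<para>
Both comment forms can be illustrated with a short Python sketch. Python's re module is assumed to be a fair stand-in, since it accepts the same inline comment syntax and its re.VERBOSE flag corresponds to the extended option. The names INLINE and EXTENDED are illustrative only.
</para>

```python
import re

# Inline comment: everything from (?# up to the next ) is ignored.
INLINE = re.compile(r"ab(?#this text is not part of the match)cd")

# Extended-mode comments: an unescaped # outside a character class runs to
# the end of the line, and white space in the pattern is ignored as well.
EXTENDED = re.compile(
    r"""
    \d{2}   # two digits
    [#]     # a literal hash: inside a character class it is not a comment
    \d{2}   # two more digits
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    print(INLINE.fullmatch("abcd") is not None)
    print(EXTENDED.fullmatch("12#34") is not None)
    print(EXTENDED.fullmatch("1234") is not None)
```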
...
...
@@ -2201,15 +2205,15 @@ int(1)
2201
2205
<section xml:id="regexp.reference.recursive">
2202
2206
<title>Recursive patterns</title>
2203
2207
<para>
2204
-
Consider the problem of matching a string in parentheses,
2205
-
allowing for unlimited nested parentheses. Without the use
2206
-
of recursion, the best that can be done is to use a pattern
2207
-
that matches up to some fixed depth of nesting. It is not
2208
-
possible to handle an arbitrary nesting depth. Perl 5.6 has
2209
-
provided an experimental facility that allows regular
2210
-
expressions to recurse (among other things). The special
2211
-
item (?R) is provided for the specific case of recursion.
2212
-
This PCRE pattern solves the parentheses problem (assume
2208
+
Consider the problem of matching a string in parentheses,
2209
+
allowing for unlimited nested parentheses. Without the use
2210
+
of recursion, the best that can be done is to use a pattern
2211
+
that matches up to some fixed depth of nesting. It is not
2212
+
possible to handle an arbitrary nesting depth. Perl 5.6 has
2213
+
provided an experimental facility that allows regular
2214
+
expressions to recurse (among other things). The special
2215
+
item (?R) is provided for the specific case of recursion.
2216
+
This PCRE pattern solves the parentheses problem (assume
2213
2217
the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2214
2218
option is set so that white space is
2215
2219
ignored):
...
...
@@ -2218,45 +2222,45 @@ int(1)
2218
2222
</para>
2219
2223
<para>
2220
2224
First it matches an opening parenthesis. Then it matches any
2221
-
number of substrings which can either be a sequence of
2222
-
non-parentheses, or a recursive match of the pattern itself
2225
+
number of substrings which can either be a sequence of
2226
+
non-parentheses, or a recursive match of the pattern itself
2223
2227
(i.e. a correctly parenthesized substring). Finally there is
2224
2228
a closing parenthesis.
2225
2229
</para>
2226
2230
<para>
2227
-
This particular example pattern contains nested unlimited
2231
+
This particular example pattern contains nested unlimited
2228
2232
repeats, and so the use of a once-only subpattern for matching
2229
-
strings of non-parentheses is important when applying
2230
-
the pattern to strings that do not match. For example, when
2233
+
strings of non-parentheses is important when applying
2234
+
the pattern to strings that do not match. For example, when
2231
2235
it is applied to
2232
2236

2233
2237
<literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
2234
2238

2235
-
it yields "no match" quickly. However, if a once-only subpattern
2236
-
is not used, the match runs for a very long time
2237
-
indeed because there are so many different ways the + and *
2238
-
repeats can carve up the subject, and all have to be tested
2239
+
it yields "no match" quickly. However, if a once-only subpattern
2240
+
is not used, the match runs for a very long time
2241
+
indeed because there are so many different ways the + and *
2242
+
repeats can carve up the subject, and all have to be tested
2239
2243
before failure can be reported.
2240
2244
</para>
2241
2245
<para>
2242
-
The values set for any capturing subpatterns are those from
2246
+
The values set for any capturing subpatterns are those from
2243
2247
the outermost level of the recursion at which the subpattern
2244
2248
value is set. If the pattern above is matched against
2245
2249

2246
2250
<literal>(ab(cd)ef)</literal>
2247
2251

2248
-
the value for the capturing parentheses is "ef", which is
2249
-
the last value taken on at the top level. If additional
2252
+
the value for the capturing parentheses is "ef", which is
2253
+
the last value taken on at the top level. If additional
2250
2254
parentheses are added, giving
2251
2255

2252
2256
<literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>
2253
2257
then the string they capture
2254
2258
is "ab(cd)ef", the contents of the top level parentheses. If
2255
-
there are more than 15 capturing parentheses in a pattern,
2256
-
PCRE has to obtain extra memory to store data during a
2257
-
recursion, which it does by using pcre_malloc, freeing it
2258
-
via pcre_free afterwards. If no memory can be obtained, it
2259
-
saves data for the first 15 capturing parentheses only, as
2259
+
there are more than 15 capturing parentheses in a pattern,
2260
+
PCRE has to obtain extra memory to store data during a
2261
+
recursion, which it does by using pcre_malloc, freeing it
2262
+
via pcre_free afterwards. If no memory can be obtained, it
2263
+
saves data for the first 15 capturing parentheses only, as
2260
2264
there is no way to give an out-of-memory error from within a
2261
2265
recursion.
2262
2266
</para>
...
...
@@ -2295,75 +2299,75 @@ int(1)
2295
2299
<title>Performance</title>
2296
2300
<para>
2297
2301
Certain items that may appear in patterns are more efficient
2298
-
than others. It is more efficient to use a character class
2302
+
than others. It is more efficient to use a character class
2299
2303
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
2300
-
In general, the simplest construction that provides the
2301
-
required behaviour is usually the most efficient. Jeffrey
2302
-
Friedl's book contains a lot of discussion about optimizing
2304
+
In general, the simplest construction that provides the
2305
+
required behaviour is usually the most efficient. Jeffrey
2306
+
Friedl's book contains a lot of discussion about optimizing
2303
2307
regular expressions for efficient performance.
2304
2308
</para>
2305
2309
<para>
2306
2310
When a pattern begins with .* and the <link
2307
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2308
-
set, the pattern is implicitly anchored by PCRE, since it
2311
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2312
+
set, the pattern is implicitly anchored by PCRE, since it
2309
2313
can match only at the start of a subject string. However, if
2310
2314
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
2311
2315
is not set, PCRE cannot make this optimization,
2312
-
because the . metacharacter does not then match a newline,
2316
+
because the . metacharacter does not then match a newline,
2313
2317
and if the subject string contains newlines, the pattern may
2314
-
match from the character immediately following one of them
2318
+
match from the character immediately following one of them
2315
2319
instead of from the very start. For example, the pattern
2316
2320

2317
2321
<literal>(.*) second</literal>
2318
2322

2319
2323
matches the subject "first\nand second" (where \n stands for
2320
2324
a newline character) with the first captured substring being
2321
-
"and". In order to do this, PCRE has to retry the match
2325
+
"and". In order to do this, PCRE has to retry the match
2322
2326
starting after every newline in the subject.
2323
2327
</para>
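<para>
The effect of the dot-matches-newline option on this example can be checked with a small Python sketch, assuming that re.DOTALL corresponds to PCRE_DOTALL. SUBJECT and first_capture are hypothetical names used only for this illustration.
</para>

```python
import re

SUBJECT = "first\nand second"


def first_capture(pattern, flags=0):
    """Return the first captured substring, or None if there is no match."""
    match = re.search(pattern, SUBJECT, flags)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Without DOTALL the match has to be retried after the newline.
    print(repr(first_capture(r"(.*) second")))
    # With DOTALL the dot crosses the newline, so the match starts at 0.
    print(repr(first_capture(r"(.*) second", re.DOTALL)))
    # Explicit anchoring forbids the retry, so this subject no longer matches.
    print(repr(first_capture(r"^(.*) second")))
```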
2324
2328
<para>
2325
2329
If you are using such a pattern with subject strings that do
2326
-
not contain newlines, the best performance is obtained by
2330
+
not contain newlines, the best performance is obtained by
2327
2331
setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,
2328
-
or starting the pattern with ^.* to
2329
-
indicate explicit anchoring. That saves PCRE from having to
2332
+
or starting the pattern with ^.* to
2333
+
indicate explicit anchoring. That saves PCRE from having to
2330
2334
scan along the subject looking for a newline to restart at.
2331
2335
</para>
2332
2336
<para>
2333
-
Beware of patterns that contain nested indefinite repeats.
2334
-
These can take a long time to run when applied to a string
2337
+
Beware of patterns that contain nested indefinite repeats.
2338
+
These can take a long time to run when applied to a string
2335
2339
that does not match. Consider the pattern fragment
2336
2340

2337
2341
<literal>(a+)*</literal>
2338
2342
</para>
2339
2343
<para>
2340
-
This can match "aaaa" in 16 different ways, and this number
2341
-
increases very rapidly as the string gets longer. (The *
2342
-
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2343
-
those cases other than 0, the + repeats can match different
2344
+
This can match "aaaa" in 16 different ways, and this number
2345
+
increases very rapidly as the string gets longer. (The *
2346
+
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2347
+
those cases other than 0, the + repeats can match different
2344
2348
numbers of times.) When the remainder of the pattern is such
2345
-
that the entire match is going to fail, PCRE has in principle
2346
-
to try every possible variation, and this can take an
2349
+
that the entire match is going to fail, PCRE has in principle
2350
+
to try every possible variation, and this can take an
2347
2351
extremely long time.
2348
2352
</para>
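<para>
The growth described above can be made concrete by counting the possibilities directly. The sketch below does not run a regular expression at all. It assumes one particular counting convention: every distinct way the fragment can consume a prefix of a run of "a" characters, including the empty match, counts once. Under that convention the total doubles with each additional character. The function names are illustrative.
</para>

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def splits(length):
    """Ways to cut `length` characters into a sequence of non-empty pieces.

    Each piece stands for one iteration of the inner + repeat; a length of
    zero corresponds to the outer * repeat matching zero times.
    """
    if length == 0:
        return 1
    # Choose the size of the first piece, then split whatever remains.
    return sum(splits(length - first) for first in range(1, length + 1))


def total_ways(subject_length):
    """Ways the fragment can match any prefix of the subject."""
    return sum(splits(prefix) for prefix in range(subject_length + 1))


if __name__ == "__main__":
    for n in (4, 10, 20, 30):
        print(n, total_ways(n))
```

<para>
A failing match has in principle to visit all of these possibilities, which is why a subject of only a few dozen characters can already take a very long time.
</para>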
2349
2353
<para>
2350
-
An optimization catches some of the more simple cases such
2354
+
An optimization catches some of the more simple cases such
2351
2355
as
2352
2356

2353
2357
<literal>(a+)*b</literal>
2354
2358

2355
-
where a literal character follows. Before embarking on the
2359
+
where a literal character follows. Before embarking on the
2356
2360
standard matching procedure, PCRE checks that there is a "b"
2357
-
later in the subject string, and if there is not, it fails
2358
-
the match immediately. However, when there is no following
2359
-
literal this optimization cannot be used. You can see the
2361
+
later in the subject string, and if there is not, it fails
2362
+
the match immediately. However, when there is no following
2363
+
literal this optimization cannot be used. You can see the
2360
2364
difference by comparing the behaviour of
2361
2365

2362
2366
<literal>(a+)*\d</literal>
2363
2367

2364
-
with the pattern above. The pattern above fails almost
2365
-
instantly when applied to a whole line of "a" characters,
2366
-
whereas this one takes an appreciable time with strings
2368
+
with the pattern above. The pattern above fails almost
2369
+
instantly when applied to a whole line of "a" characters,
2370
+
whereas this one takes an appreciable time with strings
2367
2371
longer than about 20 characters.
2368
2372
</para>
2369
2373
</section>
2370
2374