reference/pcre/pattern.syntax.xml
77fe733a1ba9c961424adcb7c9af00c1f5443a77
...
...
@@ -8,21 +8,21 @@
8
8
<section xml:id="regexp.introduction">
9
9
<title>Introduction</title>
10
10
<para>
11
-
The syntax and semantics of the regular expressions
12
-
supported by PCRE are described below. Regular expressions are
13
-
also described in the Perl documentation and in a number of
14
-
other books, some of which have copious examples. Jeffrey
15
-
Friedl's "Mastering Regular Expressions", published by
16
-
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
11
+
The syntax and semantics of the regular expressions
12
+
supported by PCRE are described below. Regular expressions are
13
+
also described in the Perl documentation and in a number of
14
+
other books, some of which have copious examples. Jeffrey
15
+
Friedl's "Mastering Regular Expressions", published by
16
+
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
17
17
The description here is intended as reference documentation.
18
18
</para>
19
19
<para>
20
-
A regular expression is a pattern that is matched against a
20
+
A regular expression is a pattern that is matched against a
21
21
subject string from left to right. Most characters stand for
22
22
themselves in a pattern, and match the corresponding
23
23
characters in the subject. As a trivial example, the pattern
24
24
<literal>The quick brown fox</literal>
25
-
matches a portion of a subject string that is identical to
25
+
matches a portion of a subject string that is identical to
26
26
itself.
27
27
</para>
28
28
</section>
...
...
@@ -102,15 +102,15 @@
102
102
<section xml:id="regexp.reference.meta">
103
103
<title>Meta-characters</title>
104
104
<para>
105
-
The power of regular expressions comes from the
105
+
The power of regular expressions comes from the
106
106
ability to include alternatives and repetitions in the
107
-
pattern. These are encoded in the pattern by the use of
108
-
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
107
+
pattern. These are encoded in the pattern by the use of
108
+
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
109
109
are interpreted in some special way.
110
110
</para>
111
111
<para>
112
-
There are two different sets of meta-characters: those that
113
-
are recognized anywhere in the pattern except within square
112
+
There are two different sets of meta-characters: those that
113
+
are recognized anywhere in the pattern except within square
114
114
brackets, and those that are recognized in square brackets.
115
115
Outside square brackets, the meta-characters are as follows:
116
116

...
...
@@ -130,7 +130,8 @@
130
130
<entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
131
131
</row>
132
132
<row>
133
-
<entry>$</entry><entry>assert end of subject or before a terminating newline (or end of line, in multiline mode)</entry>
133
+
<entry>$</entry><entry>assert end of subject or before a terminating newline (or
134
+
end of line, in multiline mode)</entry>
134
135
</row>
135
136
<row>
136
137
<entry>.</entry><entry>match any character except newline (by default)</entry>
...
...
@@ -204,9 +205,9 @@
204
205
<section xml:id="regexp.reference.escape">
205
206
<title>Escape sequences</title>
206
207
<para>
207
-
The backslash character has several uses. Firstly, if it is
208
+
The backslash character has several uses. Firstly, if it is
208
209
followed by a non-alphanumeric character, it takes away any
209
-
special meaning that character may have. This use of
210
+
special meaning that character may have. This use of
210
211
backslash as an escape character applies both inside and
211
212
outside character classes.
212
213
</para>
...
...
@@ -215,7 +216,7 @@
215
216
"\*" in the pattern. This applies whether or not the
216
217
following character would otherwise be interpreted as a
217
218
meta-character, so it is always safe to precede a non-alphanumeric
218
-
with "\" to specify that it stands for itself. In
219
+
with "\" to specify that it stands for itself. In
219
220
particular, if you want to match a backslash, you write "\\".
220
221
</para>
221
222
<note>
...
...
@@ -237,10 +238,10 @@
237
238
<para>
238
239
A second use of backslash provides a way of encoding
239
240
non-printing characters in patterns in a visible manner. There
240
-
is no restriction on the appearance of non-printing characters,
241
+
is no restriction on the appearance of non-printing characters,
241
242
apart from the binary zero that terminates a pattern,
242
243
but when a pattern is being prepared by text editing, it is
243
-
usually easier to use one of the following escape sequences
244
+
usually easier to use one of the following escape sequences
244
245
than the binary character it represents:
245
246
</para>
246
247
<para>
...
...
@@ -331,9 +332,9 @@
331
332
</para>
332
333
<para>
333
334
The precise effect of "<literal>\cx</literal>" is as follows:
334
-
if "<literal>x</literal>" is a lower case letter, it is converted
335
+
if "<literal>x</literal>" is a lower case letter, it is converted
335
336
to upper case. Then bit 6 of the character (hex 40) is inverted.
336
-
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
+
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
338
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
338
339
becomes hex 7B.
339
340
</para>
...
...
@@ -349,7 +350,7 @@
349
350
</para>
350
351
<para>
351
352
After "<literal>\0</literal>" up to two further octal digits are read.
352
-
In both cases, if there are fewer than two digits, just those that
353
+
In both cases, if there are fewer than two digits, just those that
353
354
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
354
355
specifies two binary zeros followed by a BEL character. Make sure you
355
356
supply two digits after the initial zero if the character
...
...
@@ -358,20 +359,20 @@
358
359
<para>
359
360
The handling of a backslash followed by a digit other than 0
360
361
is complicated. Outside a character class, PCRE reads it
361
-
and any following digits as a decimal number. If the number
362
-
is less than 10, or if there have been at least that many
363
-
previous capturing left parentheses in the expression, the
364
-
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
365
-
of how this works is given later, following the discussion
362
+
and any following digits as a decimal number. If the number
363
+
is less than 10, or if there have been at least that many
364
+
previous capturing left parentheses in the expression, the
365
+
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
366
+
of how this works is given later, following the discussion
366
367
of parenthesized subpatterns.
367
368
</para>
368
369
<para>
369
-
Inside a character class, or if the decimal number is
370
+
Inside a character class, or if the decimal number is
370
371
greater than 9 and there have not been that many capturing
371
372
subpatterns, PCRE re-reads up to three octal digits following
372
373
the backslash, and generates a single byte from the
373
374
least significant 8 bits of the value. Any subsequent digits
374
-
stand for themselves. For example:
375
+
stand for themselves. For example:
375
376
</para>
376
377
<para>
377
378
<variablelist>
...
...
@@ -439,7 +440,7 @@
439
440
digits are ever read.
440
441
</para>
441
442
<para>
442
-
All the sequences that define a single byte value can be
443
+
All the sequences that define a single byte value can be
443
444
used both inside and outside character classes. In addition,
444
445
inside a character class, the sequence "<literal>\b</literal>"
445
446
is interpreted as the backspace character (hex 08). Outside a character
...
...
@@ -506,7 +507,7 @@
506
507
</para>
507
508
<para>
508
509
A "word" character is any letter or digit or the underscore
509
-
character, that is, any character which can be part of a
510
+
character, that is, any character which can be part of a
510
511
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
511
512
controlled by PCRE's character tables, and may vary if locale-specific
512
513
matching is taking place. For example, in the "fr" (French) locale, some
...
...
@@ -515,15 +516,15 @@
515
516
</para>
516
517
<para>
517
518
These character type sequences can appear both inside and
518
-
outside character classes. They each match one character of
519
-
the appropriate type. If the current matching point is at
519
+
outside character classes. They each match one character of
520
+
the appropriate type. If the current matching point is at
520
521
the end of the subject string, all of them fail, since there
521
522
is no character to match.
522
523
</para>
523
524
<para>
524
-
The fourth use of backslash is for certain simple
525
+
The fourth use of backslash is for certain simple
525
526
assertions. An assertion specifies a condition that has to be met
526
-
at a particular point in a match, without consuming any
527
+
at a particular point in a match, without consuming any
527
528
characters from the subject string. The use of subpatterns
528
529
for more complicated assertions is described below. The
529
530
backslashed assertions are
...
...
@@ -562,7 +563,7 @@
562
563
</variablelist>
563
564
</para>
564
565
<para>
565
-
These assertions may not appear in character classes (but
566
+
These assertions may not appear in character classes (but
566
567
note that "<literal>\b</literal>" has a different meaning, namely the backspace
567
568
character, inside a character class).
568
569
</para>
...
...
@@ -570,20 +571,20 @@
570
571
A word boundary is a position in the subject string where
571
572
the current character and the previous character do not both
572
573
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
573
-
<literal>\w</literal> and the other matches
574
+
<literal>\w</literal> and the other matches
574
575
<literal>\W</literal>), or the start or end of the string if the first
575
576
or last character matches <literal>\w</literal>, respectively.
576
577
</para>
577
578
<para>
578
579
The <literal>\A</literal>, <literal>\Z</literal>, and
579
-
<literal>\z</literal> assertions differ from the traditional
580
-
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link> ) in that they only
581
-
ever match at the very start and end of the subject string,
582
-
whatever options are set. They are not affected by the
580
+
<literal>\z</literal> assertions differ from the traditional
581
+
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)
582
+
in that they only ever match at the very start and end of the subject string,
583
+
whatever options are set. They are not affected by the
583
584
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
584
585
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
585
-
options. The difference between <literal>\Z</literal> and
586
-
<literal>\z</literal> is that <literal>\Z</literal> matches before a
586
+
options. The difference between <literal>\Z</literal> and
587
+
<literal>\z</literal> is that <literal>\Z</literal> matches before a
587
588
newline that is the last character of the string as well as at the end of
588
589
the string, whereas <literal>\z</literal> matches only at the end.
589
590
</para>
...
...
@@ -600,7 +601,11 @@
600
601
regexp metacharacters in the pattern. For example:
601
602
<literal>\w+\Q.$.\E$</literal> will match one or more word characters,
602
603
followed by literals <literal>.$.</literal> and anchored at the end of
603
-
the string.
604
+
the string. Note that this does not change the behavior of
605
+
delimiters; for instance the pattern <literal>#\Q#\E#$</literal>
606
+
is not valid, because the second <literal>#</literal> marks the end
607
+
of the pattern, and the <literal>\E#</literal> is interpreted as invalid
608
+
modifiers.
604
609
</para>
605
610

606
611
<para>
...
...
@@ -869,8 +874,8 @@
869
874
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
870
875
</para>
871
876
<para>
872
-
Sets of Unicode characters are defined as belonging to certain scripts. A
873
-
character from one of these sets can be matched using a script name. For
877
+
Sets of Unicode characters are defined as belonging to certain scripts. A
878
+
character from one of these sets can be matched using a script name. For
874
879
example:
875
880
</para>
876
881
<itemizedlist>
...
...
@@ -882,7 +887,7 @@
882
887
</listitem>
883
888
</itemizedlist>
884
889
<para>
885
-
Those that are not part of an identified script are lumped together as
890
+
Those that are not part of an identified script are lumped together as
886
891
<literal>Common</literal>. The current list of scripts is:
887
892
</para>
888
893
<table>
...
...
@@ -1051,7 +1056,7 @@
1051
1056
<para>
1052
1057
In versions of PCRE older than 8.32 (which corresponds to PHP versions
1053
1058
before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>
1054
-
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1059
+
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1055
1060
character without the "mark" property, followed by zero or more characters
1056
1061
with the "mark" property, and treats the sequence as an atomic group (see
1057
1062
below). Characters with the "mark" property are typically accents that
...
...
@@ -1071,8 +1076,8 @@
1071
1076
<para>
1072
1077
Outside a character class, in the default matching mode, the
1073
1078
circumflex character (<literal>^</literal>) is an assertion which
1074
-
is true only if the current matching point is at the start of
1075
-
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1079
+
is true only if the current matching point is at the start of
1080
+
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1076
1081
has an entirely different meaning (see below).
1077
1082
</para>
1078
1083
<para>
...
...
@@ -1087,12 +1092,12 @@
1087
1092
</para>
1088
1093
<para>
1089
1094
A dollar character (<literal>$</literal>) is an assertion which is
1090
-
&true; only if the current matching point is at the end of the subject
1091
-
string, or immediately before a newline character that is the last
1095
+
&true; only if the current matching point is at the end of the subject
1096
+
string, or immediately before a newline character that is the last
1092
1097
character in the string (by default). Dollar (<literal>$</literal>)
1093
-
need not be the last character of the pattern if a number of
1094
-
alternatives are involved, but it should be the last item in any branch
1095
-
in which it appears. Dollar has no special meaning in a
1098
+
need not be the last character of the pattern if a number of
1099
+
alternatives are involved, but it should be the last item in any branch
1100
+
in which it appears. Dollar has no special meaning in a
1096
1101
character class.
1097
1102
</para>
1098
1103
<para>
...
...
@@ -1118,9 +1123,9 @@
1118
1123
set.
1119
1124
</para>
1120
1125
<para>
1121
-
Note that the sequences \A, \Z, and \z can be used to match
1122
-
the start and end of the subject in both modes, and if all
1123
-
branches of a pattern start with \A is it always anchored,
1126
+
Note that the sequences \A, \Z, and \z can be used to match
1127
+
the start and end of the subject in both modes, and if all
1128
+
branches of a pattern start with \A it is always anchored,
1124
1129
whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1125
1130
is set or not.
1126
1131
</para>
...
...
@@ -1129,14 +1134,14 @@
1129
1134
<section xml:id="regexp.reference.dot">
1130
1135
<title>Dot</title>
1131
1136
<para>
1132
-
Outside a character class, a dot in the pattern matches any
1133
-
one character in the subject, including a non-printing
1134
-
character, but not (by default) newline. If the
1137
+
Outside a character class, a dot in the pattern matches any
1138
+
one character in the subject, including a non-printing
1139
+
character, but not (by default) newline. If the
1135
1140
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1136
-
option is set, then dots match newlines as well. The
1141
+
option is set, then dots match newlines as well. The
1137
1142
handling of dot is entirely independent of the handling of
1138
-
circumflex and dollar, the only relationship being that they
1139
-
both involve newline characters. Dot has no special meaning
1143
+
circumflex and dollar, the only relationship being that they
1144
+
both involve newline characters. Dot has no special meaning
1140
1145
in a character class.
1141
1146
</para>
1142
1147
<para>
...
...
@@ -1150,29 +1155,29 @@
1150
1155
<title>Character classes</title>
1151
1156
<para>
1152
1157
An opening square bracket introduces a character class,
1153
-
terminated by a closing square bracket. A closing square
1154
-
bracket on its own is not special. If a closing square
1155
-
bracket is required as a member of the class, it should be
1158
+
terminated by a closing square bracket. A closing square
1159
+
bracket on its own is not special. If a closing square
1160
+
bracket is required as a member of the class, it should be
1156
1161
the first data character in the class (after an initial
1157
1162
circumflex, if present) or escaped with a backslash.
1158
1163
</para>
1159
1164
<para>
1160
1165
A character class matches a single character in the subject;
1161
-
the character must be in the set of characters defined by
1166
+
the character must be in the set of characters defined by
1162
1167
the class, unless the first character in the class is a
1163
-
circumflex, in which case the subject character must not be in
1164
-
the set defined by the class. If a circumflex is actually
1165
-
required as a member of the class, ensure it is not the
1168
+
circumflex, in which case the subject character must not be in
1169
+
the set defined by the class. If a circumflex is actually
1170
+
required as a member of the class, ensure it is not the
1166
1171
first character, or escape it with a backslash.
1167
1172
</para>
1168
1173
<para>
1169
-
For example, the character class [aeiou] matches any lower
1174
+
For example, the character class [aeiou] matches any lower
1170
1175
case vowel, while [^aeiou] matches any character that is not
1171
-
a lower case vowel. Note that a circumflex is just a
1172
-
convenient notation for specifying the characters which are in
1173
-
the class by enumerating those that are not. It is not an
1174
-
assertion: it still consumes a character from the subject
1175
-
string, and fails if the current pointer is at the end of
1176
+
a lower case vowel. Note that a circumflex is just a
1177
+
convenient notation for specifying the characters which are in
1178
+
the class by enumerating those that are not. It is not an
1179
+
assertion: it still consumes a character from the subject
1180
+
string, and fails if the current pointer is at the end of
1176
1181
the string.
1177
1182
</para>
1178
1183
<para>
...
...
@@ -1184,61 +1189,62 @@
1184
1189
</para>
1185
1190
<para>
1186
1191
The newline character is never treated in any special way in
1187
-
character classes, whatever the setting of the <link
1192
+
character classes, whatever the setting of the <link
1188
1193
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1189
1194
or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1190
1195
options is. A class such as [^a] will always match a newline.
1191
1196
</para>
1192
1197
<para>
1193
-
The minus (hyphen) character can be used to specify a range
1194
-
of characters in a character class. For example, [d-m]
1195
-
matches any letter between d and m, inclusive. If a minus
1196
-
character is required in a class, it must be escaped with a
1198
+
The minus (hyphen) character can be used to specify a range
1199
+
of characters in a character class. For example, [d-m]
1200
+
matches any letter between d and m, inclusive. If a minus
1201
+
character is required in a class, it must be escaped with a
1197
1202
backslash or appear in a position where it cannot be
1198
1203
interpreted as indicating a range, typically as the first or last
1199
1204
character in the class.
1200
1205
</para>
1201
1206
<para>
1202
-
It is not possible to have the literal character "]" as the
1203
-
end character of a range. A pattern such as [W-]46] is
1207
+
It is not possible to have the literal character "]" as the
1208
+
end character of a range. A pattern such as [W-]46] is
1204
1209
interpreted as a class of two characters ("W" and "-")
1205
1210
followed by a literal string "46]", so it would match "W46]" or
1206
-
"-46]". However, if the "]" is escaped with a backslash it
1207
-
is interpreted as the end of range, so [W-\]46] is
1208
-
interpreted as a single class containing a range followed by two
1211
+
"-46]". However, if the "]" is escaped with a backslash it
1212
+
is interpreted as the end of range, so [W-\]46] is
1213
+
interpreted as a single class containing a range followed by two
1209
1214
separate characters. The octal or hexadecimal representation
1210
1215
of "]" can also be used to end a range.
1211
1216
</para>
1212
1217
<para>
1213
1218
Ranges operate in ASCII collating sequence. They can also be
1214
-
used for characters specified numerically, for example
1215
-
[\000-\037]. If a range that includes letters is used when
1216
-
case-insensitive (caseless) matching is set, it matches the
1217
-
letters in either case. For example, [W-c] is equivalent to
1219
+
used for characters specified numerically, for example
1220
+
[\000-\037]. If a range that includes letters is used when
1221
+
case-insensitive (caseless) matching is set, it matches the
1222
+
letters in either case. For example, [W-c] is equivalent to
1218
1223
[][\^_`wxyzabc], matched case-insensitively, and if character
1219
1224
tables for the "fr" locale are in use, [\xc8-\xcb] matches
1220
1225
accented E characters in both cases.
1221
1226
</para>
1222
1227
<para>
1223
-
The character types \d, \D, \s, \S, \w, and \W may also
1224
-
appear in a character class, and add the characters that
1228
+
The character types \d, \D, \s, \S, \w, and \W may also
1229
+
appear in a character class, and add the characters that
1225
1230
they match to the class. For example, [\dABCDEF] matches any
1226
-
hexadecimal digit. A circumflex can conveniently be used
1227
-
with the upper case character types to specify a more
1231
+
hexadecimal digit. A circumflex can conveniently be used
1232
+
with the upper case character types to specify a more
1228
1233
restricted set of characters than the matching lower case type.
1229
-
For example, the class [^\W_] matches any letter or digit,
1234
+
For example, the class [^\W_] matches any letter or digit,
1230
1235
but not underscore.
1231
1236
</para>
1232
1237
<para>
1233
-
All non-alphanumeric characters other than \, -, ^ (at the
1234
-
start) and the terminating ] are non-special in character
1238
+
All non-alphanumeric characters other than \, -, ^ (at the
1239
+
start) and the terminating ] are non-special in character
1235
1240
classes, but it does no harm if they are escaped. The pattern
1236
1241
terminator is always special and must be escaped when used
1237
1242
within an expression.
1238
1243
</para>
1239
1244
<para>
1240
1245
Perl supports the POSIX notation for character classes. This uses names
1241
-
enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also
1246
+
enclosed by <literal>[:</literal> and <literal>:]</literal> within
1247
+
the enclosing square brackets. PCRE also
1242
1248
supports this notation. For example, <literal>[01[:alpha:]%]</literal>
1243
1249
matches "0", "1", any alphabetic character, or "%". The supported class
1244
1250
names are:
...
...
@@ -1293,16 +1299,16 @@
1293
1299
<section xml:id="regexp.reference.alternation">
1294
1300
<title>Alternation</title>
1295
1301
<para>
1296
-
Vertical bar characters are used to separate alternative
1302
+
Vertical bar characters are used to separate alternative
1297
1303
patterns. For example, the pattern
1298
1304
<literal>gilbert|sullivan</literal>
1299
1305
matches either "gilbert" or "sullivan". Any number of alternatives
1300
-
may appear, and an empty alternative is permitted
1301
-
(matching the empty string). The matching process tries
1302
-
each alternative in turn, from left to right, and the first
1303
-
one that succeeds is used. If the alternatives are within a
1304
-
subpattern (defined below), "succeeds" means matching the
1305
-
rest of the main pattern as well as the alternative in the
1306
+
may appear, and an empty alternative is permitted
1307
+
(matching the empty string). The matching process tries
1308
+
each alternative in turn, from left to right, and the first
1309
+
one that succeeds is used. If the alternatives are within a
1310
+
subpattern (defined below), "succeeds" means matching the
1311
+
rest of the main pattern as well as the alternative in the
1306
1312
subpattern.
1307
1313
</para>
1308
1314
</section>
...
...
@@ -1317,7 +1323,7 @@
1317
1323
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,
1318
1324
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1319
1325
and PCRE_DUPNAMES can be changed from within the pattern by
1320
-
a sequence of Perl option letters enclosed between "(?" and
1326
+
a sequence of Perl option letters enclosed between "(?" and
1321
1327
")". The option letters are:
1322
1328

1323
1329
<table>
...
...
@@ -1346,7 +1352,8 @@
1346
1352
</row>
1347
1353
<row>
1348
1354
<entry><literal>X</literal></entry>
1349
-
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> (no longer supported as of PHP 7.3.0)</entry>
1355
+
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>
1356
+
(no longer supported as of PHP 7.3.0)</entry>
1350
1357
</row>
1351
1358
<row>
1352
1359
<entry><literal>J</literal></entry>
...
...
@@ -1357,16 +1364,16 @@
1357
1364
</table>
1358
1365
</para>
1359
1366
<para>
1360
-
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1367
+
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1361
1368
also possible to unset these options by preceding the letter
1362
-
with a hyphen, and a combined setting and unsetting such as
1363
-
(?im-sx), which sets <link
1369
+
with a hyphen, and a combined setting and unsetting such as
1370
+
(?im-sx), which sets <link
1364
1371
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and
1365
1372
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1366
1373
while unsetting <link
1367
1374
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and
1368
1375
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,
1369
-
is also permitted. If a letter appears both before and after the
1376
+
is also permitted. If a letter appears both before and after the
1370
1377
hyphen, the option is unset.
1371
1378
</para>
1372
1379
<para>
...
...
@@ -1376,14 +1383,14 @@
1376
1383
and "abC".
1377
1384
</para>
1378
1385
<para>
1379
-
If an option change occurs inside a subpattern, the effect
1380
-
is different. This is a change of behaviour in Perl 5.005.
1381
-
An option change inside a subpattern affects only that part
1386
+
If an option change occurs inside a subpattern, the effect
1387
+
is different. This is a change of behaviour in Perl 5.005.
1388
+
An option change inside a subpattern affects only that part
1382
1389
of the subpattern that follows it, so
1383
1390

1384
1391
<literal>(a(?i)b)c</literal>
1385
1392

1386
-
matches abc and aBc and no other strings (assuming <link
1393
+
matches "abc" and "aBc" and no other strings (assuming <link
1387
1394
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not
1388
1395
used). By this means, options can be made to have different settings in
1389
1396
different parts of the pattern. Any changes made in one alternative do
...
...
@@ -1392,18 +1399,18 @@
1392
1399

1393
1400
<literal>(a(?i)b|c)</literal>
1394
1401

1395
-
matches "ab", "aB", "c", and "C", even though when matching
1402
+
matches "ab", "aB", "c", and "C", even though when matching
1396
1403
"C" the first branch is abandoned before the option setting.
1397
-
This is because the effects of option settings happen at
1398
-
compile time. There would be some very weird behaviour otherwise.
1404
+
This is because the effects of option settings happen at
1405
+
compile time. There would be some very weird behaviour otherwise.
1399
1406
</para>
1400
1407
<para>
1401
1408
The PCRE-specific options <link
1402
-
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1403
-
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1409
+
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1410
+
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1404
1411
be changed in the same way as the Perl-compatible options by
1405
-
using the characters U and X respectively. The (?X) flag
1406
-
setting is special in that it must always occur earlier in
1412
+
using the characters U and X respectively. The (?X) flag
1413
+
setting is special in that it must always occur earlier in
1407
1414
the pattern than any of the additional features it turns on,
1408
1415
even when it is at top level. It is best put at the start.
1409
1416
</para>
...
...
@@ -1412,8 +1419,8 @@
1412
1419
<section xml:id="regexp.reference.subpatterns">
1413
1420
<title>Subpatterns</title>
1414
1421
<para>
1415
-
Subpatterns are delimited by parentheses (round brackets),
1416
-
which can be nested. Marking part of a pattern as a subpattern
1422
+
Subpatterns are delimited by parentheses (round brackets),
1423
+
which can be nested. Marking part of a pattern as a subpattern
1417
1424
does two things:
1418
1425
</para>
1419
1426
<orderedlist>
...
...
@@ -1442,30 +1449,30 @@
1442
1449

1443
1450
<literal>the ((red|white) (king|queen))</literal>
1444
1451

1445
-
the captured substrings are "red king", "red", and "king",
1452
+
the captured substrings are "red king", "red", and "king",
1446
1453
and are numbered 1, 2, and 3.
1447
1454
</para>
1448
1455
<para>
1449
-
The fact that plain parentheses fulfill two functions is not
1450
-
always helpful. There are often times when a grouping subpattern
1451
-
is required without a capturing requirement. If an
1456
+
The fact that plain parentheses fulfill two functions is not
1457
+
always helpful. There are often times when a grouping subpattern
1458
+
is required without a capturing requirement. If an
1452
1459
opening parenthesis is followed by "?:", the subpattern does
1453
-
not do any capturing, and is not counted when computing the
1460
+
not do any capturing, and is not counted when computing the
1454
1461
number of any subsequent capturing subpatterns. For example,
1455
-
if the string "the white queen" is matched against the
1462
+
if the string "the white queen" is matched against the
1456
1463
pattern
1457
1464

1458
1465
<literal>the ((?:red|white) (king|queen))</literal>
1459
1466

1460
-
the captured substrings are "white queen" and "queen", and
1461
-
are numbered 1 and 2. The maximum number of captured substrings
1467
+
the captured substrings are "white queen" and "queen", and
1468
+
are numbered 1 and 2. The maximum number of captured substrings
1462
1469
is 65535. It may not be possible to compile such large patterns,
1463
1470
however, depending on the configuration options of libpcre.
1464
1471
</para>
1465
1472
<para>
1466
-
As a convenient shorthand, if any option settings are
1467
-
required at the start of a non-capturing subpattern, the
1468
-
option letters may appear between the "?" and the ":". Thus
1473
+
As a convenient shorthand, if any option settings are
1474
+
required at the start of a non-capturing subpattern, the
1475
+
option letters may appear between the "?" and the ":". Thus
1469
1476
the two patterns
1470
1477
</para>
1471
1478

...
...
@@ -1479,10 +1486,10 @@
1479
1486
</informalexample>
1480
1487

1481
1488
<para>
1482
-
match exactly the same set of strings. Because alternative
1483
-
branches are tried from left to right, and options are not
1484
-
reset until the end of the subpattern is reached, an option
1485
-
setting in one branch does affect subsequent branches, so
1489
+
match exactly the same set of strings. Because alternative
1490
+
branches are tried from left to right, and options are not
1491
+
reset until the end of the subpattern is reached, an option
1492
+
setting in one branch does affect subsequent branches, so
1486
1493
the above patterns match "SUNDAY" as well as "Saturday".
1487
1494
</para>
1488
1495

...
...
@@ -1511,9 +1518,10 @@
1511
1518

1512
1519
<para>
1513
1520
Here <literal>Sun</literal> is stored in backreference 2, while
1514
-
backreference 1 is empty. Matching yields <literal>Sat</literal> in
1515
-
backreference 1 while backreference 2 does not exist. Changing the pattern
1516
-
to use the <literal>(?|</literal> fixes this problem:
1521
+
backreference 1 is empty. Matching <literal>Saturday</literal> yields
1522
+
<literal>Sat</literal> in backreference 1 while backreference 2 does
1523
+
not exist. Changing the pattern to use the <literal>(?|</literal> fixes
1524
+
this problem:
1517
1525
</para>
1518
1526

1519
1527
<informalexample>
...
...
@@ -1539,45 +1547,45 @@
1539
1547
<listitem><simpara>the . metacharacter</simpara></listitem>
1540
1548
<listitem><simpara>a character class</simpara></listitem>
1541
1549
<listitem><simpara>a back reference (see next section)</simpara></listitem>
1542
-
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1550
+
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1543
1551
see below)</simpara></listitem>
1544
1552
</itemizedlist>
1545
1553
</para>
1546
1554
<para>
1547
-
The general repetition quantifier specifies a minimum and
1548
-
maximum number of permitted matches, by giving the two
1549
-
numbers in curly brackets (braces), separated by a comma.
1550
-
The numbers must be less than 65536, and the first must be
1555
+
The general repetition quantifier specifies a minimum and
1556
+
maximum number of permitted matches, by giving the two
1557
+
numbers in curly brackets (braces), separated by a comma.
1558
+
The numbers must be less than 65536, and the first must be
1551
1559
less than or equal to the second. For example:
1552
1560

1553
1561
<literal>z{2,4}</literal>
1554
1562

1555
-
matches "zz", "zzz", or "zzzz". A closing brace on its own
1563
+
matches "zz", "zzz", or "zzzz". A closing brace on its own
1556
1564
is not a special character. If the second number is omitted,
1557
-
but the comma is present, there is no upper limit; if the
1565
+
but the comma is present, there is no upper limit; if the
1558
1566
second number and the comma are both omitted, the quantifier
1559
1567
specifies an exact number of required matches. Thus
1560
1568

1561
1569
<literal>[aeiou]{3,}</literal>
1562
1570

1563
-
matches at least 3 successive vowels, but may match many
1571
+
matches at least 3 successive vowels, but may match many
1564
1572
more, while
1565
1573

1566
1574
<literal>\d{8}</literal>
1567
1575

1568
-
matches exactly 8 digits. An opening curly bracket that
1569
-
appears in a position where a quantifier is not allowed, or
1576
+
matches exactly 8 digits. An opening curly bracket that
1577
+
appears in a position where a quantifier is not allowed, or
1570
1578
one that does not match the syntax of a quantifier, is taken
1571
-
as a literal character. For example, {,6} is not a quantifier,
1579
+
as a literal character. For example, {,6} is not a quantifier,
1572
1580
but a literal string of four characters.
1573
1581
</para>
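A short sketch of the three quantifier forms described above, again using Python's `re` module as an assumed stand-in for PCRE. The sample subjects are made up for illustration. Note that Python treats `{,6}` as a quantifier with a minimum of zero, unlike the rule stated above, so that case is deliberately left out.

```python
import re

# {2,4}: between two and four repetitions of the preceding item.
candidates = ["z", "zz", "zzz", "zzzz", "zzzzz"]
z_matches = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three repetitions, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group(0)

# {8}: exactly eight repetitions.
eight_digits = re.fullmatch(r"\d{8}", "20240131") is not None
seven_digits = re.fullmatch(r"\d{8}", "2024013") is not None

print(z_matches)     # ['zz', 'zzz', 'zzzz']
print(vowel_run)     # ueuei
print(eight_digits)  # True
print(seven_digits)  # False
```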
1574
1582
<para>
1575
-
The quantifier {0} is permitted, causing the expression to
1576
-
behave as if the previous item and the quantifier were not
1583
+
The quantifier {0} is permitted, causing the expression to
1584
+
behave as if the previous item and the quantifier were not
1577
1585
present.
1578
1586
</para>
1579
1587
<para>
1580
-
For convenience (and historical compatibility) the three
1588
+
For convenience (and historical compatibility) the three
1581
1589
most common quantifiers have single-character abbreviations:
1582
1590

1583
1591
<table>
...
...
@@ -1601,63 +1609,63 @@
1601
1609
</table>
1602
1610
</para>
1603
1611
<para>
1604
-
It is possible to construct infinite loops by following a
1605
-
subpattern that can match no characters with a quantifier
1612
+
It is possible to construct infinite loops by following a
1613
+
subpattern that can match no characters with a quantifier
1606
1614
that has no upper limit, for example:
1607
1615

1608
1616
<literal>(a?)*</literal>
1609
1617
</para>
1610
1618
<para>
1611
-
Earlier versions of Perl and PCRE used to give an error at
1612
-
compile time for such patterns. However, because there are
1613
-
cases where this can be useful, such patterns are now
1614
-
accepted, but if any repetition of the subpattern does in
1619
+
Earlier versions of Perl and PCRE used to give an error at
1620
+
compile time for such patterns. However, because there are
1621
+
cases where this can be useful, such patterns are now
1622
+
accepted, but if any repetition of the subpattern does in
1615
1623
fact match no characters, the loop is forcibly broken.
1616
1624
</para>
1617
1625
<para>
1618
-
By default, the quantifiers are "greedy", that is, they
1619
-
match as much as possible (up to the maximum number of permitted
1620
-
times), without causing the rest of the pattern to
1626
+
By default, the quantifiers are "greedy", that is, they
1627
+
match as much as possible (up to the maximum number of permitted
1628
+
times), without causing the rest of the pattern to
1621
1629
fail. The classic example of where this gives problems is in
1622
1630
trying to match comments in C programs. These appear between
1623
-
the sequences /* and */ and within the sequence, individual
1624
-
* and / characters may appear. An attempt to match C comments
1631
+
the sequences /* and */ and within the sequence, individual
1632
+
* and / characters may appear. An attempt to match C comments
1625
1633
by applying the pattern
1626
1634

1627
1635
<literal>/\*.*\*/</literal>
1628
1636

1629
1637
to the string
1630
1638

1631
-
<literal>/* first comment */ not comment /* second comment */</literal>
1639
+
<literal>/* first comment */ not comment /* second comment */</literal>
1632
1640

1633
-
fails, because it matches the entire string due to the
1634
-
greediness of the .* item.
1641
+
fails, because it matches the entire string due to the
1642
+
greediness of the .* item.
1635
1643
</para>
1636
1644
<para>
1637
-
However, if a quantifier is followed by a question mark,
1645
+
However, if a quantifier is followed by a question mark,
1638
1646
then it becomes lazy, and instead matches the minimum
1639
1647
number of times possible, so the pattern
1640
1648

1641
1649
<literal>/\*.*?\*/</literal>
1642
1650

1643
1651
does the right thing with the C comments. The meaning of the
1644
-
various quantifiers is not otherwise changed, just the preferred
1645
-
number of matches. Do not confuse this use of
1646
-
question mark with its use as a quantifier in its own right.
1652
+
various quantifiers is not otherwise changed, just the preferred
1653
+
number of matches. Do not confuse this use of
1654
+
question mark with its use as a quantifier in its own right.
1647
1655
Because it has two uses, it can sometimes appear doubled, as
1648
1656
in
1649
1657

1650
1658
<literal>\d??\d</literal>
1651
1659

1652
-
which matches one digit by preference, but can match two if
1660
+
which matches one digit by preference, but can match two if
1653
1661
that is the only way the rest of the pattern matches.
1654
1662
</para>
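The difference between greedy and lazy quantifiers on the C comment example can be sketched like this. Python's `re` module is used as an assumed stand-in for PCRE; the variable names are illustrative.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the end of the subject and backtracks only as far
# as the last "*/", so the whole string is reported as one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Lazy: .*? stops at the first "*/" that lets the pattern succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# A doubled question mark: the first "?" is the quantifier, the second
# makes it lazy. One digit is preferred, two are matched only if needed.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.fullmatch(r"\d??\d", "12").group(0)

print(greedy)      # ['/* first comment */ not comment /* second comment */']
print(lazy)        # ['/* first comment */', '/* second comment */']
print(one_digit)   # 1
print(two_digits)  # 12
```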
1655
1663
<para>
1656
1664
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>
1657
-
option is set (an option which is not
1658
-
available in Perl) then the quantifiers are not greedy by
1665
+
option is set (an option which is not
1666
+
available in Perl) then the quantifiers are not greedy by
1659
1667
default, but individual ones can be made greedy by following
1660
-
them with a question mark. In other words, it inverts the
1668
+
them with a question mark. In other words, it inverts the
1661
1669
default behaviour.
1662
1670
</para>
1663
1671
<para>
...
...
@@ -1669,41 +1677,41 @@
1669
1677
</para>
1670
1678
<para>
1671
1679
When a parenthesized subpattern is quantified with a minimum
1672
-
repeat count that is greater than 1 or with a limited maximum,
1673
-
more store is required for the compiled pattern, in
1680
+
repeat count that is greater than 1 or with a limited maximum,
1681
+
more store is required for the compiled pattern, in
1674
1682
proportion to the size of the minimum or maximum.
1675
1683
</para>
1676
1684
<para>
1677
-
If a pattern starts with .* or .{0,} and the <link
1685
+
If a pattern starts with .* or .{0,} and the <link
1678
1686
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1679
1687
option (equivalent to Perl's /s) is set, thus allowing the .
1680
-
to match newlines, then the pattern is implicitly anchored,
1688
+
to match newlines, then the pattern is implicitly anchored,
1681
1689
because whatever follows will be tried against every character
1682
-
position in the subject string, so there is no point in
1683
-
retrying the overall match at any position after the first.
1690
+
position in the subject string, so there is no point in
1691
+
retrying the overall match at any position after the first.
1684
1692
PCRE treats such a pattern as though it were preceded by \A.
1685
-
In cases where it is known that the subject string contains
1693
+
In cases where it is known that the subject string contains
1686
1694
no newlines, it is worth setting <link
1687
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1695
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1688
1696
pattern begins with .* in order to
1689
1697
obtain this optimization, or
1690
1698
alternatively using ^ to indicate anchoring explicitly.
1691
1699
</para>
1692
1700
<para>
1693
-
When a capturing subpattern is repeated, the value captured
1701
+
When a capturing subpattern is repeated, the value captured
1694
1702
is the substring that matched the final iteration. For example, after
1695
1703

1696
1704
<literal>(tweedle[dume]{3}\s*)+</literal>
1697
1705

1698
-
has matched "tweedledum tweedledee" the value of the captured
1699
-
substring is "tweedledee". However, if there are
1700
-
nested capturing subpatterns, the corresponding captured
1701
-
values may have been set in previous iterations. For example,
1706
+
has matched "tweedledum tweedledee" the value of the captured
1707
+
substring is "tweedledee". However, if there are
1708
+
nested capturing subpatterns, the corresponding captured
1709
+
values may have been set in previous iterations. For example,
1702
1710
after
1703
1711

1704
1712
<literal>/(a|(b))+/</literal>
1705
1713

1706
-
matches "aba" the value of the second captured substring is
1714
+
matches "aba" the value of the second captured substring is
1707
1715
"b".
1708
1716
</para>
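What a repeated capturing subpattern reports can be checked with a small sketch. It assumes Python's `re` module behaves like PCRE here, which holds for these two patterns as far as we know. The delimiters around the second pattern are dropped because Python does not use them.

```python
import re

# The group is repeated twice; only the last iteration is kept.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = repeated.group(1)

# The outer group is set again on the final "a", but the nested group
# keeps the value it received in the earlier iteration that matched "b".
nested = re.match(r"(a|(b))+", "aba")
outer, inner = nested.groups()

print(last_iteration)  # tweedledee
print(outer, inner)    # a b
```

To collect every iteration instead of only the last one, a common approach is to match the repeated item on its own with a find-all style call.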
1709
1717
</section>
...
...
@@ -1711,74 +1719,74 @@
1711
1719
<section xml:id="regexp.reference.back-references">
1712
1720
<title>Back references</title>
1713
1721
<para>
1714
-
Outside a character class, a backslash followed by a digit
1715
-
greater than 0 (and possibly further digits) is a back
1716
-
reference to a capturing subpattern earlier (i.e. to its
1717
-
left) in the pattern, provided there have been that many
1722
+
Outside a character class, a backslash followed by a digit
1723
+
greater than 0 (and possibly further digits) is a back
1724
+
reference to a capturing subpattern earlier (i.e. to its
1725
+
left) in the pattern, provided there have been that many
1718
1726
previous capturing left parentheses.
1719
1727
</para>
1720
1728
<para>
1721
-
However, if the decimal number following the backslash is
1722
-
less than 10, it is always taken as a back reference, and
1723
-
causes an error only if there are not that many capturing
1724
-
left parentheses in the entire pattern. In other words, the
1725
-
parentheses that are referenced need not be to the left of
1726
-
the reference for numbers less than 10.
1729
+
However, if the decimal number following the backslash is
1730
+
less than 10, it is always taken as a back reference, and
1731
+
causes an error only if there are not that many capturing
1732
+
left parentheses in the entire pattern. In other words, the
1733
+
parentheses that are referenced need not be to the left of
1734
+
the reference for numbers less than 10.
1727
1735
A "forward back reference" can make sense when a repetition
1728
1736
is involved and the subpattern to the right has participated
1729
1737
in an earlier iteration. See the section
1730
-
entitled "Backslash" above for further details of the handling
1738
+
<link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling
1731
1739
of digits following a backslash.
1732
1740
</para>
1733
1741
<para>
1734
-
A back reference matches whatever actually matched the capturing
1742
+
A back reference matches whatever actually matched the capturing
1735
1743
subpattern in the current subject string, rather than
1736
1744
anything matching the subpattern itself. So the pattern
1737
1745

1738
1746
<literal>(sens|respons)e and \1ibility</literal>
1739
1747

1740
-
matches "sense and sensibility" and "response and responsibility",
1741
-
but not "sense and responsibility". If case-sensitive (caseful)
1748
+
matches "sense and sensibility" and "response and responsibility",
1749
+
but not "sense and responsibility". If case-sensitive (caseful)
1742
1750
matching is in force at the time of the back reference, then
1743
1751
the case of letters is relevant. For example,
1744
1752

1745
1753
<literal>((?i)rah)\s+\1</literal>
1746
1754

1747
-
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1748
-
though the original capturing subpattern is matched
1755
+
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1756
+
though the original capturing subpattern is matched
1749
1757
case-insensitively (caselessly).
1750
1758
</para>
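The point that a back reference matches the captured text, not the subpattern, can be sketched as follows. Python's `re` module stands in for PCRE as an assumption, and the case-sensitivity example with `(?i)` is not reproduced because Python handles inline options differently.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat exactly what group 1 captured in this particular match.
results = {s: pattern.fullmatch(s) is not None for s in subjects}

for subject, matched in results.items():
    print(f"{subject!r}: {matched}")
```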
1751
1759
<para>
1752
-
There may be more than one back reference to the same subpattern.
1753
-
If a subpattern has not actually been used in a
1754
-
particular match, then any back references to it always
1760
+
There may be more than one back reference to the same subpattern.
1761
+
If a subpattern has not actually been used in a
1762
+
particular match, then any back references to it always
1755
1763
fail. For example, the pattern
1756
1764

1757
1765
<literal>(a|(bc))\2</literal>
1758
1766

1759
-
always fails if it starts to match "a" rather than "bc".
1760
-
Because there may be up to 99 back references, all digits
1761
-
following the backslash are taken as part of a potential
1767
+
always fails if it starts to match "a" rather than "bc".
1768
+
Because there may be up to 99 back references, all digits
1769
+
following the backslash are taken as part of a potential
1762
1770
back reference number. If the pattern continues with a digit
1763
1771
character, then some delimiter must be used to terminate the
1764
1772
back reference. If the <link
1765
-
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1766
-
is set, this can be whitespace. Otherwise an empty comment can be used.
1773
+
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1774
+
is set, this can be whitespace. Otherwise an empty comment can be used.
1767
1775
</para>
1768
1776
<para>
1769
1777
A back reference that occurs inside the parentheses to which
1770
-
it refers fails when the subpattern is first used, so, for
1771
-
example, (a\1) never matches. However, such references can
1778
+
it refers fails when the subpattern is first used, so, for
1779
+
example, (a\1) never matches. However, such references can
1772
1780
be useful inside repeated subpatterns. For example, the pattern
1773
1781

1774
1782
<literal>(a|b\1)+</literal>
1775
1783

1776
-
matches any number of "a"s and also "aba", "ababba" etc. At
1784
+
matches any number of "a"s and also "aba", "ababba" etc. At
1777
1785
each iteration of the subpattern, the back reference matches
1778
-
the character string corresponding to the previous iteration.
1786
+
the character string corresponding to the previous iteration.
1779
1787
In order for this to work, the pattern must be such
1780
-
that the first iteration does not need to match the back
1781
-
reference. This can be done using alternation, as in the
1788
+
that the first iteration does not need to match the back
1789
+
reference. This can be done using alternation, as in the
1782
1790
example above, or by a quantifier with a minimum of zero.
1783
1791
</para>
1784
1792
<para>
...
...
@@ -1813,18 +1821,18 @@
1813
1821
<section xml:id="regexp.reference.assertions">
1814
1822
<title>Assertions</title>
1815
1823
<para>
1816
-
An assertion is a test on the characters following or
1817
-
preceding the current matching point that does not actually
1818
-
consume any characters. The simple assertions coded as \b,
1819
-
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1820
-
assertions are coded as subpatterns. There are two
1821
-
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1824
+
An assertion is a test on the characters following or
1825
+
preceding the current matching point that does not actually
1826
+
consume any characters. The simple assertions coded as \b,
1827
+
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1828
+
assertions are coded as subpatterns. There are two
1829
+
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1822
1830
subject string, and those that <emphasis>look behind</emphasis> it.
1823
1831
</para>
1824
1832
<para>
1825
1833
An assertion subpattern is matched in the normal way, except
1826
-
that it does not cause the current matching position to be
1827
-
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1834
+
that it does not cause the current matching position to be
1835
+
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1828
1836
assertions and (?! for negative assertions. For example,
1829
1837

1830
1838
<literal>\w+(?=;)</literal>
...
...
@@ -1834,27 +1842,27 @@
1834
1842

1835
1843
<literal>foo(?!bar)</literal>
1836
1844

1837
-
matches any occurrence of "foo" that is not followed by
1845
+
matches any occurrence of "foo" that is not followed by
1838
1846
"bar". Note that the apparently similar pattern
1839
1847

1840
1848
<literal>(?!foo)bar</literal>
1841
1849

1842
-
does not find an occurrence of "bar" that is preceded by
1850
+
does not find an occurrence of "bar" that is preceded by
1843
1851
something other than "foo"; it finds any occurrence of "bar"
1844
-
whatsoever, because the assertion (?!foo) is always &true;
1845
-
when the next three characters are "bar". A lookbehind
1852
+
whatsoever, because the assertion (?!foo) is always &true;
1853
+
when the next three characters are "bar". A lookbehind
1846
1854
assertion is needed to achieve this effect.
1847
1855
</para>
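The contrast between the two lookahead patterns above, and the lookbehind that gives the intended effect, can be sketched like this. Python's `re` module is an assumed stand-in for PCRE, and the subject strings are made up for illustration.

```python
import re

# foo(?!bar): "foo" at offset 0 is followed by "bar" and is rejected,
# "foo" at offset 7 is followed by "baz" and is accepted.
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar: the assertion is tested where "bar" starts, and the next
# three characters there are "bar", never "foo", so it always succeeds.
lookahead_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar")]

# (?<!foo)bar: the lookbehind inspects the text before "bar".
after_foo = re.search(r"(?<!foo)bar", "foobar")
after_baz = re.search(r"(?<!foo)bar", "bazbar")

print(not_followed)             # [7]
print(lookahead_bar)            # [3]
print(after_foo is None)        # True
print(after_baz.start())        # 3
```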
1848
1856
<para>
1849
-
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1857
+
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1850
1858
and (?&lt;! for negative assertions. For example,
1851
1859

1852
1860
<literal>(?&lt;!foo)bar</literal>
1853
1861

1854
-
does find an occurrence of "bar" that is not preceded by
1862
+
does find an occurrence of "bar" that is not preceded by
1855
1863
"foo". The contents of a lookbehind assertion are restricted
1856
-
such that all the strings it matches must have a fixed
1857
-
length. However, if there are several alternatives, they do
1864
+
such that all the strings it matches must have a fixed
1865
+
length. However, if there are several alternatives, they do
1858
1866
not all have to have the same fixed length. Thus
1859
1867

1860
1868
<literal>(?&lt;=bullock|donkey)</literal>
...
...
@@ -1863,51 +1871,51 @@
1863
1871

1864
1872
<literal>(?&lt;!dogs?|cats?)</literal>
1865
1873

1866
-
causes an error at compile time. Branches that match different
1874
+
causes an error at compile time. Branches that match different
1867
1875
length strings are permitted only at the top level of
1868
-
a lookbehind assertion. This is an extension compared with
1869
-
Perl 5.005, which requires all branches to match the same
1876
+
a lookbehind assertion. This is an extension compared with
1877
+
Perl 5.005, which requires all branches to match the same
1870
1878
length of string. An assertion such as
1871
1879

1872
1880
<literal>(?&lt;=ab(c|de))</literal>
1873
1881

1874
-
is not permitted, because its single top-level branch can
1882
+
is not permitted, because its single top-level branch can
1875
1883
match two different lengths, but it is acceptable if rewritten
1876
1884
to use two top-level branches:
1877
1885

1878
1886
<literal>(?&lt;=abc|abde)</literal>
1879
1887

1880
-
The implementation of lookbehind assertions is, for each
1881
-
alternative, to temporarily move the current position back
1882
-
by the fixed width and then try to match. If there are
1883
-
insufficient characters before the current position, the
1884
-
match is deemed to fail. Lookbehinds in conjunction with
1885
-
once-only subpatterns can be particularly useful for matching
1886
-
at the ends of strings; an example is given at the end
1888
+
The implementation of lookbehind assertions is, for each
1889
+
alternative, to temporarily move the current position back
1890
+
by the fixed width and then try to match. If there are
1891
+
insufficient characters before the current position, the
1892
+
match is deemed to fail. Lookbehinds in conjunction with
1893
+
once-only subpatterns can be particularly useful for matching
1894
+
at the ends of strings; an example is given at the end
1887
1895
of the section on once-only subpatterns.
1888
1896
</para>
1889
1897
<para>
1890
-
Several assertions (of any sort) may occur in succession.
1898
+
Several assertions (of any sort) may occur in succession.
1891
1899
For example,
1892
1900

1893
1901
<literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
1894
1902

1895
-
matches "foo" preceded by three digits that are not "999".
1896
-
Notice that each of the assertions is applied independently
1897
-
at the same point in the subject string. First there is a
1898
-
check that the previous three characters are all digits,
1903
+
matches "foo" preceded by three digits that are not "999".
1904
+
Notice that each of the assertions is applied independently
1905
+
at the same point in the subject string. First there is a
1906
+
check that the previous three characters are all digits,
1899
1907
then there is a check that the same three characters are not
1900
-
"999". This pattern does not match "foo" preceded by six
1908
+
"999". This pattern does not match "foo" preceded by six
1901
1909
characters, the first of which are digits and the last three
1902
-
of which are not "999". For example, it doesn't match
1910
+
of which are not "999". For example, it doesn't match
1903
1911
"123abcfoo". A pattern to do that is
1904
1912

1905
1913
<literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
1906
1914
</para>
1907
1915
<para>
1908
-
This time the first assertion looks at the preceding six
1909
-
characters, checking that the first three are digits, and
1910
-
then the second assertion checks that the preceding three
1916
+
This time the first assertion looks at the preceding six
1917
+
characters, checking that the first three are digits, and
1918
+
then the second assertion checks that the preceding three
1911
1919
characters are not "999".
1912
1920
</para>
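How successive assertions are each applied at the same point can be sketched with the two patterns above. Python's `re` module is an assumed stand-in for PCRE; both lookbehinds here are fixed width, which Python also requires. The subject strings other than "123abcfoo" are illustrative.

```python
import re

# Both assertions look at the three characters immediately before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def matches(pattern, subject):
    """Return True if the pattern is found anywhere in the subject."""
    return pattern.search(subject) is not None

print(matches(three_back, "123foo"))     # True
print(matches(three_back, "999foo"))     # False
print(matches(three_back, "123abcfoo"))  # False
print(matches(six_back, "123abcfoo"))    # True
print(matches(six_back, "123999foo"))    # False
```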
1913
1921
<para>
...
...
@@ -1915,26 +1923,26 @@
1915
1923

1916
1924
<literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
1917
1925

1918
-
matches an occurrence of "baz" that is preceded by "bar"
1926
+
matches an occurrence of "baz" that is preceded by "bar"
1919
1927
which in turn is not preceded by "foo", while
1920
1928

1921
1929
<literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>
1922
1930

1923
-
is another pattern which matches "foo" preceded by three
1931
+
is another pattern which matches "foo" preceded by three
1924
1932
digits and any three characters that are not "999".
1925
1933
</para>
1926
1934
<para>
1927
1935
Assertion subpatterns are not capturing subpatterns, and may
1928
-
not be repeated, because it makes no sense to assert the
1929
-
same thing several times. If any kind of assertion contains
1930
-
capturing subpatterns within it, these are counted for the
1936
+
not be repeated, because it makes no sense to assert the
1937
+
same thing several times. If any kind of assertion contains
1938
+
capturing subpatterns within it, these are counted for the
1931
1939
purposes of numbering the capturing subpatterns in the whole
1932
-
pattern. However, substring capturing is carried out only
1933
-
for positive assertions, because it does not make sense for
1940
+
pattern. However, substring capturing is carried out only
1941
+
for positive assertions, because it does not make sense for
1934
1942
negative assertions.
1935
1943
</para>
1936
1944
<para>
1937
-
Assertions count towards the maximum of 200 parenthesized
1945
+
Assertions count towards the maximum of 200 parenthesized
1938
1946
subpatterns.
1939
1947
</para>
1940
1948
</section>
...
...
@@ -1942,17 +1950,17 @@
1942
1950
<section xml:id="regexp.reference.onlyonce">
1943
1951
<title>Once-only subpatterns</title>
1944
1952
<para>
1945
-
With both maximizing and minimizing repetition, failure of
1946
-
what follows normally causes the repeated item to be
1953
+
With both maximizing and minimizing repetition, failure of
1954
+
what follows normally causes the repeated item to be
1947
1955
re-evaluated to see if a different number of repeats allows the
1948
-
rest of the pattern to match. Sometimes it is useful to
1949
-
prevent this, either to change the nature of the match, or
1950
-
to cause it fail earlier than it otherwise might, when the
1951
-
author of the pattern knows there is no point in carrying
1956
+
rest of the pattern to match. Sometimes it is useful to
1957
+
prevent this, either to change the nature of the match, or
1958
+
to cause it to fail earlier than it otherwise might, when the
1959
+
author of the pattern knows there is no point in carrying
1952
1960
on.
1953
1961
</para>
1954
1962
<para>
1955
-
Consider, for example, the pattern \d+foo when applied to
1963
+
Consider, for example, the pattern \d+foo when applied to
1956
1964
the subject line
1957
1965

1958
1966
<literal>123456bar</literal>
...
...
@@ -1960,108 +1968,108 @@
1960
1968
<para>
1961
1969
After matching all 6 digits and then failing to match "foo",
1962
1970
the normal action of the matcher is to try again with only 5
1963
-
digits matching the \d+ item, and then with 4, and so on,
1971
+
digits matching the \d+ item, and then with 4, and so on,
1964
1972
before ultimately failing. Once-only subpatterns provide the
1965
-
means for specifying that once a portion of the pattern has
1966
-
matched, it is not to be re-evaluated in this way, so the
1967
-
matcher would give up immediately on failing to match "foo"
1968
-
the first time. The notation is another kind of special
1973
+
means for specifying that once a portion of the pattern has
1974
+
matched, it is not to be re-evaluated in this way, so the
1975
+
matcher would give up immediately on failing to match "foo"
1976
+
the first time. The notation is another kind of special
1969
1977
parenthesis, starting with (?&gt; as in this example:
1970
1978

1971
1979
<literal>(?&gt;\d+)bar</literal>
1972
1980
</para>
1973
1981
<para>
1974
-
This kind of parenthesis "locks up" the part of the pattern
1975
-
it contains once it has matched, and a failure further into
1976
-
the pattern is prevented from backtracking into it.
1977
-
Backtracking past it to previous items, however, works as normal.
1982
+
This kind of parenthesis "locks up" the part of the pattern
1983
+
it contains once it has matched, and a failure further into
1984
+
the pattern is prevented from backtracking into it.
1985
+
Backtracking past it to previous items, however, works as normal.
1978
1986
</para>
1979
1987
<para>
1980
1988
An alternative description is that a subpattern of this type
1981
-
matches the string of characters that an identical standalone
1989
+
matches the string of characters that an identical standalone
1982
1990
pattern would match, if anchored at the current point
1983
1991
in the subject string.
1984
1992
</para>
1985
1993
<para>
1986
-
Once-only subpatterns are not capturing subpatterns. Simple
1987
-
cases such as the above example can be thought of as a maximizing
1988
-
repeat that must swallow everything it can. So,
1994
+
Once-only subpatterns are not capturing subpatterns. Simple
1995
+
cases such as the above example can be thought of as a maximizing
1996
+
repeat that must swallow everything it can. So,
1989
1997
while both \d+ and \d+? are prepared to adjust the number of
1990
-
digits they match in order to make the rest of the pattern
1998
+
digits they match in order to make the rest of the pattern
1991
1999
match, (?&gt;\d+) can only match an entire sequence of digits.
1992
2000
</para>
<para>
This construction can of course contain arbitrarily complicated
subpatterns, and it can be nested.
</para>
<para>
Once-only subpatterns can be used in conjunction with
lookbehind assertions to specify efficient matching at the end
of the subject string. Consider a simple pattern such as

<literal>abcd$</literal>

when applied to a long string which does not match. Because
matching proceeds from left to right, PCRE will look for
each "a" in the subject and then see if what follows matches
the rest of the pattern. If the pattern is specified as

<literal>^.*abcd$</literal>

then the initial .* matches the entire string at first, but
when this fails (because there is no following "a"), it
backtracks to match all but the last character, then all but
the last two characters, and so on. Once again the search
for "a" covers the entire string, from right to left, so we
are no better off. However, if the pattern is written as

<literal>^(?>.*)(?&lt;=abcd)</literal>

then there can be no backtracking for the .* item; it can
match only the entire string. The subsequent lookbehind
assertion does a single test on the last four characters. If
it fails, the match fails immediately. For long strings,
this approach makes a significant difference to the processing time.
</para>
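<para>
The last form might be used in PHP roughly as follows. The helper name
and the test subjects are hypothetical, and the subjects are assumed to
be single lines, because without PCRE_DOTALL the dot stops at a newline:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// Hypothetical helper: reports whether a single-line subject ends with "abcd".
function endsWithAbcd(string $subject): bool
{
    // (?>.*) consumes the whole line once and never gives anything back;
    // the lookbehind then tests only the last four characters.
    return preg_match('/^(?>.*)(?<=abcd)/', $subject) === 1;
}

var_dump(endsWithAbcd(str_repeat('x', 10000) . 'abcd')); // bool(true)
var_dump(endsWithAbcd(str_repeat('x', 10000) . 'abce')); // bool(false)
]]>
</programlisting>
</informalexample>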
<para>
When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of
times, the use of a once-only subpattern is the only way to
avoid some failing matches taking a very long time indeed.
The pattern

<literal>(\D+|&lt;\d+>)*[!?]</literal>

matches an unlimited number of substrings that either consist
of non-digits, or digits enclosed in &lt;>, followed by
either ! or ?. When it matches, it runs quickly. However, if
it is applied to

<literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>

it takes a long time before reporting failure. This is
because the string can be divided between the two repeats in
a large number of ways, and all have to be tried. (The example
used [!?] rather than a single character at the end,
because both PCRE and Perl have an optimization that allows
for fast failure when a single character is used. They
remember the last single character that is required for a
match, and fail early if it is not present in the string.)
If the pattern is changed to

<literal>((?>\D+)|&lt;\d+>)*[!?]</literal>

sequences of non-digits cannot be broken, and failure happens quickly.
</para>
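<para>
The two patterns can be compared in PHP along these lines. This is only
a sketch: the variable names and subjects are assumptions, and what the
first pattern reports for the failing subject depends on the PCRE
version and on settings such as pcre.backtrack_limit, so no result is
claimed for it here.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

$plain    = '/(\D+|<\d+>)*[!?]/';
$onceOnly = '/((?>\D+)|<\d+>)*[!?]/';

// Both patterns find a match in a subject that ends with "!".
var_dump(preg_match($plain, 'abc<123>!'));    // int(1)
var_dump(preg_match($onceOnly, 'abc<123>!')); // int(1)

// A run of "a" characters with no "!" or "?" anywhere cannot match.
$subject = str_repeat('a', 52);

// The once-only version cannot split the run of non-digits, so it fails quickly.
var_dump(preg_match($onceOnly, $subject));    // int(0)

// The plain version may be slow, or may return false once a limit is reached;
// preg_last_error() tells the two outcomes apart.
$result = preg_match($plain, $subject);
var_dump($result, preg_last_error());
]]>
</programlisting>
</informalexample>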
</section>

<section xml:id="regexp.reference.conditional">
<title>Conditional subpatterns</title>
<para>
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or
whether a previous capturing subpattern matched or not. The
two possible forms of conditional subpattern are
</para>

...
...
@@ -2075,39 +2083,39 @@
</informalexample>
<para>
If the condition is satisfied, the yes-pattern is used; otherwise
the no-pattern (if present) is used. If there are
more than two alternatives in the subpattern, a compile-time
error occurs.
</para>
<para>
There are two kinds of condition. If the text between the
parentheses consists of a sequence of digits, then the
condition is satisfied if the capturing subpattern of that
number has previously matched. Consider the following pattern,
which contains non-significant white space to make it
more readable (assume the <link
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
option) and to divide it into three parts for ease of discussion:
</para>
<informalexample>
<programlisting>
<![CDATA[
( \( )?    [^()]+    (?(1) \) )
]]>
</programlisting>
</informalexample>
<para>
The first part matches an optional opening parenthesis, and
if that character is present, sets it as the first captured
substring. The second part matches one or more characters
that are not parentheses. The third part is a conditional
subpattern that tests whether the first set of parentheses
matched or not. If they did, that is, if the subject started
with an opening parenthesis, the condition is &true;, and so
the yes-pattern is executed and a closing parenthesis is
required. Otherwise, since the no-pattern is not present, the
subpattern matches nothing. In other words, this pattern
matches a sequence of non-parentheses, optionally enclosed
in parentheses.
</para>
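<para>
A short PHP sketch of this behaviour follows. The ^ and $ anchors are
an addition made for the illustration, so that the whole subject has to
match; the subjects are likewise assumptions.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// The x modifier enables PCRE_EXTENDED, so the white space is ignored.
$pattern = '/^ ( \( )?  [^()]+  (?(1) \) ) $/x';

var_dump(preg_match($pattern, 'abc'));   // int(1): no parentheses at all
var_dump(preg_match($pattern, '(abc)')); // int(1): both parentheses present
var_dump(preg_match($pattern, '(abc'));  // int(0): closing parenthesis required
var_dump(preg_match($pattern, 'abc)'));  // int(0): closing parenthesis not allowed
]]>
</programlisting>
</informalexample>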
<para>
...
...
@@ -2116,10 +2124,10 @@
level", the condition is false.
</para>
<para>
If the condition is not a sequence of digits or (R), it must be an
assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives on
the second line:
</para>

...
...
@@ -2127,18 +2135,18 @@
<programlisting>
<![CDATA[
(?(?=[^a-z]*[a-z])
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
]]>
</programlisting>
</informalexample>
<para>
The condition is a positive lookahead assertion that matches
an optional sequence of non-letters followed by a letter. In
other words, it tests for the presence of at least one
letter in the subject. If a letter is found, the subject is
matched against the first alternative; otherwise it is
matched against the second. This pattern matches strings in
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.
</para>
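<para>
The pattern might be exercised from PHP as in the sketch below. As
before, the ^ and $ anchors and the sample subjects are assumptions
added for the illustration.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// The x modifier enables PCRE_EXTENDED, so spaces and the line break are ignored.
$pattern = '/^(?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

var_dump(preg_match($pattern, '12-abc-34')); // int(1): letter found, first alternative
var_dump(preg_match($pattern, '12-34-56'));  // int(1): no letter, second alternative
var_dump(preg_match($pattern, '12-ab-34'));  // int(0): letter found, but only two letters
var_dump(preg_match($pattern, '12-345-67')); // int(0): no letter, but three middle digits
]]>
</programlisting>
</informalexample>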
</section>
...
...
@@ -2146,31 +2154,66 @@
<section xml:id="regexp.reference.comments">
<title>Comments</title>
<para>
The sequence (?# marks the start of a comment which
continues up to the next closing parenthesis. Nested
parentheses are not permitted. The characters that make up a
comment play no part in the pattern matching at all.
</para>
<para>
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
option is set, an unescaped # character outside a character class
introduces a comment that continues up to the next newline character
in the pattern.
</para>
<para>
<example>
<title>Using comments in a PCRE pattern</title>
<programlisting role="php">
<![CDATA[
<?php

$subject = 'test';

/* (?# can be used to add comments without enabling PCRE_EXTENDED */
$match = preg_match('/te(?# this is a comment)st/', $subject);
var_dump($match);

/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */
$match = preg_match('/te #~~~~
st/', $subject);
var_dump($match);

/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything
   that follows an unescaped # on the same line are ignored */
$match = preg_match('/te #~~~~
st/x', $subject);
var_dump($match);
]]>
</programlisting>
&example.outputs;
<screen>
<![CDATA[
int(1)
int(0)
int(1)
]]>
</screen>
</example>
</para>
</section>

<section xml:id="regexp.reference.recursive">
<title>Recursive patterns</title>
<para>
Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use
of recursion, the best that can be done is to use a pattern
that matches up to some fixed depth of nesting. It is not
possible to handle an arbitrary nesting depth. Perl 5.6 has
provided an experimental facility that allows regular
expressions to recurse (among other things). The special
item (?R) is provided for the specific case of recursion.
This PCRE pattern solves the parentheses problem (assume
the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
option is set so that white space is
ignored):
...
...
@@ -2179,45 +2222,45 @@
</para>
<para>
First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(i.e. a correctly parenthesized substring). Finally there is
a closing parenthesis.
</para>
<para>
This particular example pattern contains nested unlimited
repeats, and so the use of a once-only subpattern for matching
strings of non-parentheses is important when applying
the pattern to strings that do not match. For example, when
it is applied to

<literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>

it yields "no match" quickly. However, if a once-only subpattern
is not used, the match runs for a very long time
indeed because there are so many different ways the + and *
repeats can carve up the subject, and all have to be tested
before failure can be reported.
</para>
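<para>
A recursive pattern of this kind, with a once-only subpattern for the
runs of non-parentheses, could be wrapped in PHP as sketched below. The
helper name is hypothetical. Because preg_match() searches anywhere in
the subject, the sketch compares the matched text with the whole subject
rather than adding anchors, which (?R) would repeat at every level of
recursion.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// Hypothetical helper: true if $subject is one correctly parenthesized string.
function isParenthesized(string $subject): bool
{
    // An opening parenthesis, then any number of runs of non-parentheses
    // or recursive matches of the whole pattern, then a closing parenthesis.
    $pattern = '/\( ( (?>[^()]+) | (?R) )* \)/x';

    return preg_match($pattern, $subject, $m) === 1 && $m[0] === $subject;
}

var_dump(isParenthesized('(ab(cd)ef)'));                     // bool(true)
var_dump(isParenthesized('(' . str_repeat('a', 53) . '()')); // bool(false)
]]>
</programlisting>
</informalexample>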
<para>
The values set for any capturing subpatterns are those from
the outermost level of the recursion at which the subpattern
value is set. If the pattern above is matched against

<literal>(ab(cd)ef)</literal>

the value for the capturing parentheses is "ef", which is
the last value taken on at the top level. If additional
parentheses are added, giving

<literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>

then the string they capture
is "ab(cd)ef", the contents of the top level parentheses. If
there are more than 15 capturing parentheses in a pattern,
PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it
via pcre_free afterwards. If no memory can be obtained, it
saves data for the first 15 capturing parentheses only, as
there is no way to give an out-of-memory error from within a
recursion.
</para>
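<para>
The captured values described above can be inspected from PHP, for
instance with the following sketch (the variable name is an assumption):
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// The pattern with the additional parentheses; x enables PCRE_EXTENDED.
preg_match('/\( ( ( (?>[^()]+) | (?R) )* ) \)/x', '(ab(cd)ef)', $matches);

var_dump($matches[0]); // string(10) "(ab(cd)ef)", the whole match
var_dump($matches[1]); // string(8) "ab(cd)ef", the added parentheses
var_dump($matches[2]); // string(2) "ef", the last value set at the top level
]]>
</programlisting>
</informalexample>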
...
...
@@ -2256,75 +2299,75 @@
<title>Performance</title>
<para>
Certain items that may appear in patterns are more efficient
than others. It is more efficient to use a character class
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
In general, the simplest construction that provides the
required behaviour is usually the most efficient. Jeffrey
Friedl's book contains a lot of discussion about optimizing
regular expressions for efficient performance.
</para>
<para>
When a pattern begins with .* and the <link
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
set, the pattern is implicitly anchored by PCRE, since it
can match only at the start of a subject string. However, if
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
is not set, PCRE cannot make this optimization,
because the . metacharacter does not then match a newline,
and if the subject string contains newlines, the pattern may
match from the character immediately following one of them
instead of from the very start. For example, the pattern

<literal>(.*) second</literal>

matches the subject "first\nand second" (where \n stands for
a newline character) with the first captured substring being
"and". In order to do this, PCRE has to retry the match
starting after every newline in the subject.
</para>
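<para>
The effect of PCRE_DOTALL on this example can be seen from PHP, where
the option corresponds to the s modifier. The following is a small
sketch using the subject from the text:
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

$subject = "first\nand second";

// Without PCRE_DOTALL the dot stops at the newline, so the match starts after it.
preg_match('/(.*) second/', $subject, $m);
var_dump($m[1] === 'and');        // bool(true)

// With PCRE_DOTALL the dot also matches the newline, so the match starts
// at the very beginning of the subject.
preg_match('/(.*) second/s', $subject, $m);
var_dump($m[1] === "first\nand"); // bool(true)
]]>
</programlisting>
</informalexample>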
<para>
If you are using such a pattern with subject strings that do
not contain newlines, the best performance is obtained by
setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,
or starting the pattern with ^.* to
indicate explicit anchoring. That saves PCRE from having to
scan along the subject looking for a newline to restart at.
</para>
<para>
Beware of patterns that contain nested indefinite repeats.
These can take a long time to run when applied to a string
that does not match. Consider the pattern fragment

<literal>(a+)*</literal>
</para>
<para>
Starting from the first character, this can match "aaaa" in 16
different ways, and this number
increases very rapidly as the string gets longer. (The *
repeat can match 0, 1, 2, 3, or 4 times, and for each of
those cases other than 0 or 4, the + repeats can match different
numbers of times.) When the remainder of the pattern is such
that the entire match is going to fail, PCRE has in principle
to try every possible variation, and this can take an
extremely long time.
</para>
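<para>
Counting only the matches that start at the first character, the
possibilities can be enumerated directly. The helper below is a
hypothetical illustration of that arithmetic and not part of PCRE: each
iteration of the inner repeat takes a non-empty piece, and the match may
stop after any number of characters.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// Hypothetical helper: counts the ways (a+)* can match, starting at the
// first character, within a run of $length "a" characters.
function countWays(int $length): int
{
    // $exact[$n] is the number of ways to split exactly $n characters into
    // zero or more non-empty pieces, one piece per iteration of a+.
    $exact = [0 => 1]; // zero iterations match the empty string in one way
    for ($n = 1; $n <= $length; $n++) {
        $exact[$n] = 0;
        for ($first = 1; $first <= $n; $first++) {
            // Choose the size of the first piece, then split the remainder.
            $exact[$n] += $exact[$n - $first];
        }
    }

    // The match may end after any number of characters from 0 to $length.
    return array_sum($exact);
}

var_dump(countWays(4));  // int(16)
var_dump(countWays(20)); // int(1048576)
]]>
</programlisting>
</informalexample>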
<para>
An optimization catches some of the simpler cases such
as

<literal>(a+)*b</literal>

where a literal character follows. Before embarking on the
standard matching procedure, PCRE checks that there is a "b"
later in the subject string, and if there is not, it fails
the match immediately. However, when there is no following
literal this optimization cannot be used. You can see the
difference by comparing the behaviour of

<literal>(a+)*\d</literal>

with the pattern above. The pattern above gives a failure almost
instantly when applied to a whole line of "a" characters,
whereas (a+)*\d takes an appreciable time with strings
longer than about 20 characters.
</para>
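<para>
A rough way to observe the difference from PHP is to time both patterns
against a run of "a" characters, as in the sketch below. The subject
length and variable names are assumptions, and the timings and the
result of the second call depend on the PCRE version and on settings
such as pcre.backtrack_limit, so none are claimed here.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

$subject = str_repeat('a', 30);

// The required literal "b" is absent, so this is expected to fail quickly.
$start = microtime(true);
$withLiteral = preg_match('/(a+)*b/', $subject);
printf("(a+)*b   took %.4f s\n", microtime(true) - $start);

// No literal is required here, so the early check does not apply. This may
// take noticeably longer, or stop with an error once a limit is reached.
$start = microtime(true);
$withDigit = preg_match('/(a+)*\d/', $subject);
printf("(a+)*\\d  took %.4f s\n", microtime(true) - $start);

// Neither call can report a match; preg_last_error() shows whether the
// second one gave up early.
var_dump($withLiteral, $withDigit, preg_last_error());
]]>
</programlisting>
</informalexample>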
</section>
2331
2374