reference/pcre/pattern.syntax.xml
bb4abab22bf0204b4dba0140ac5fc9daa6888e0f
...
...
@@ -8,21 +8,21 @@
8
8
<section xml:id="regexp.introduction">
9
9
<title>Introduction</title>
10
10
<para>
11
-
The syntax and semantics of the regular expressions
12
-
supported by PCRE are described below. Regular expressions are
13
-
also described in the Perl documentation and in a number of
14
-
other books, some of which have copious examples. Jeffrey
15
-
Friedl's "Mastering Regular Expressions", published by
16
-
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
11
+
The syntax and semantics of the regular expressions
12
+
supported by PCRE are described in this section. Regular expressions are
13
+
also described in the Perl documentation and in a number of
14
+
other books, some of which have copious examples. Jeffrey
15
+
Friedl's "Mastering Regular Expressions", published by
16
+
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
17
17
The description here is intended as reference documentation.
18
18
</para>
19
19
<para>
20
-
A regular expression is a pattern that is matched against a
20
+
A regular expression is a pattern that is matched against a
21
21
subject string from left to right. Most characters stand for
22
22
themselves in a pattern, and match the corresponding
23
23
characters in the subject. As a trivial example, the pattern
24
24
<literal>The quick brown fox</literal>
25
-
matches a portion of a subject string that is identical to
25
+
matches a portion of a subject string that is identical to
26
26
itself.
27
27
</para>
28
28
</section>
...
...
@@ -102,15 +102,15 @@
102
102
<section xml:id="regexp.reference.meta">
103
103
<title>Meta-characters</title>
104
104
<para>
105
-
The power of regular expressions comes from the
105
+
The power of regular expressions comes from the
106
106
ability to include alternatives and repetitions in the
107
-
pattern. These are encoded in the pattern by the use of
108
-
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
107
+
pattern. These are encoded in the pattern by the use of
108
+
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
109
109
are interpreted in some special way.
110
110
</para>
111
111
<para>
112
-
There are two different sets of meta-characters: those that
113
-
are recognized anywhere in the pattern except within square
112
+
There are two different sets of meta-characters: those that
113
+
are recognized anywhere in the pattern except within square
114
114
brackets, and those that are recognized in square brackets.
115
115
Outside square brackets, the meta-characters are as follows:
116
116

...
...
@@ -130,7 +130,8 @@
130
130
<entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
131
131
</row>
132
132
<row>
133
-
<entry>$</entry><entry>assert end of subject or before a terminating newline (or end of line, in multiline mode)</entry>
133
+
<entry>$</entry><entry>assert end of subject or before a terminating newline (or
134
+
end of line, in multiline mode)</entry>
134
135
</row>
135
136
<row>
136
137
<entry>.</entry><entry>match any character except newline (by default)</entry>
...
...
@@ -204,9 +205,9 @@
204
205
<section xml:id="regexp.reference.escape">
205
206
<title>Escape sequences</title>
206
207
<para>
207
-
The backslash character has several uses. Firstly, if it is
208
+
The backslash character has several uses. Firstly, if it is
208
209
followed by a non-alphanumeric character, it takes away any
209
-
special meaning that character may have. This use of
210
+
special meaning that character may have. This use of
210
211
backslash as an escape character applies both inside and
211
212
outside character classes.
212
213
</para>
...
...
@@ -215,7 +216,7 @@
215
216
"\*" in the pattern. This applies whether or not the
216
217
following character would otherwise be interpreted as a
217
218
meta-character, so it is always safe to precede a non-alphanumeric
218
-
with "\" to specify that it stands for itself. In
219
+
with "\" to specify that it stands for itself. In
219
220
particular, if you want to match a backslash, you write "\\".
220
221
</para>
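<para>
 As a minimal sketch of the escaping rule above, assuming the pattern is
 passed to PHP's <function>preg_match</function> in a single-quoted string
 (the variable names are illustrative only):
</para>

```php
<?php
// "\*" in the pattern matches a literal asterisk.
$star = preg_match('/2\*3/', 'area = 2*3');

// Without the backslash, "*" is a quantifier applied to "2",
// so the subject "2*3" does not match as a whole.
$unescaped = preg_match('/^2*3$/', '2*3');

// The pattern needs "\\" to match one backslash. Inside a single-quoted
// PHP string each of those two backslashes is itself written as "\\".
$backslash = preg_match('/C:\\\\temp/', 'C:\temp');

var_dump($star, $unescaped, $backslash); // int(1), int(0), int(1)
```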
221
222
<note>
...
...
@@ -237,10 +238,10 @@
237
238
<para>
238
239
A second use of backslash provides a way of encoding
239
240
non-printing characters in patterns in a visible manner. There
240
-
is no restriction on the appearance of non-printing characters,
241
+
is no restriction on the appearance of non-printing characters,
241
242
apart from the binary zero that terminates a pattern,
242
243
but when a pattern is being prepared by text editing, it is
243
-
usually easier to use one of the following escape sequences
244
+
usually easier to use one of the following escape sequences
244
245
than the binary character it represents:
245
246
</para>
246
247
<para>
...
...
@@ -331,9 +332,9 @@
331
332
</para>
332
333
<para>
333
334
The precise effect of "<literal>\cx</literal>" is as follows:
334
-
if "<literal>x</literal>" is a lower case letter, it is converted
335
+
if "<literal>x</literal>" is a lower case letter, it is converted
335
336
to upper case. Then bit 6 of the character (hex 40) is inverted.
336
-
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
+
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
338
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
338
339
becomes hex 7B.
339
340
</para>
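<para>
 The arithmetic described above can be checked with a short sketch. It
 assumes PHP's <function>preg_match</function>; the variable names are
 illustrative only.
</para>

```php
<?php
// "\cz": "z" is converted to "Z" (hex 5A), then hex 40 is inverted: hex 1A.
$ctrlZ = preg_match('/^\cz$/', "\x1A");

// "\c{": "{" is hex 7B, so inverting hex 40 gives hex 3B, which is ";".
$brace = preg_match('/^\c{$/', ';');

// The same calculation done directly on the character code.
$computed = sprintf('%02X', ord(strtoupper('z')) ^ 0x40);

var_dump($ctrlZ, $brace, $computed);
```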
...
...
@@ -349,7 +350,7 @@
349
350
</para>
350
351
<para>
351
352
After "<literal>\0</literal>" up to two further octal digits are read.
352
-
In both cases, if there are fewer than two digits, just those that
353
+
In both cases, if there are fewer than two digits, just those that
353
354
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
354
355
specifies two binary zeros followed by a BEL character. Make sure you
355
356
supply two digits after the initial zero if the character
...
...
@@ -358,20 +359,20 @@
358
359
<para>
359
360
The handling of a backslash followed by a digit other than 0
360
361
is complicated. Outside a character class, PCRE reads it
361
-
and any following digits as a decimal number. If the number
362
-
is less than 10, or if there have been at least that many
363
-
previous capturing left parentheses in the expression, the
364
-
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
365
-
of how this works is given later, following the discussion
362
+
and any following digits as a decimal number. If the number
363
+
is less than 10, or if there have been at least that many
364
+
previous capturing left parentheses in the expression, the
365
+
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
366
+
of how this works is given later, following the discussion
366
367
of parenthesized subpatterns.
367
368
</para>
368
369
<para>
369
-
Inside a character class, or if the decimal number is
370
+
Inside a character class, or if the decimal number is
370
371
greater than 9 and there have not been that many capturing
371
372
subpatterns, PCRE re-reads up to three octal digits following
372
373
the backslash, and generates a single byte from the
373
374
least significant 8 bits of the value. Any subsequent digits
374
-
stand for themselves. For example:
375
+
stand for themselves. For example:
375
376
</para>
376
377
<para>
377
378
<variablelist>
...
...
@@ -439,7 +440,7 @@
439
440
digits are ever read.
440
441
</para>
441
442
<para>
442
-
All the sequences that define a single byte value can be
443
+
All the sequences that define a single byte value can be
443
444
used both inside and outside character classes. In addition,
444
445
inside a character class, the sequence "<literal>\b</literal>"
445
446
is interpreted as the backspace character (hex 08). Outside a character
...
...
@@ -506,7 +507,7 @@
506
507
</para>
507
508
<para>
508
509
A "word" character is any letter or digit or the underscore
509
-
character, that is, any character which can be part of a
510
+
character, that is, any character which can be part of a
510
511
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
511
512
controlled by PCRE's character tables, and may vary if locale-specific
512
513
matching is taking place. For example, in the "fr" (French) locale, some
...
...
@@ -515,15 +516,15 @@
515
516
</para>
516
517
<para>
517
518
These character type sequences can appear both inside and
518
-
outside character classes. They each match one character of
519
-
the appropriate type. If the current matching point is at
519
+
outside character classes. They each match one character of
520
+
the appropriate type. If the current matching point is at
520
521
the end of the subject string, all of them fail, since there
521
522
is no character to match.
522
523
</para>
523
524
<para>
524
-
The fourth use of backslash is for certain simple
525
+
The fourth use of backslash is for certain simple
525
526
assertions. An assertion specifies a condition that has to be met
526
-
at a particular point in a match, without consuming any
527
+
at a particular point in a match, without consuming any
527
528
characters from the subject string. The use of subpatterns
528
529
for more complicated assertions is described below. The
529
530
backslashed assertions are
...
...
@@ -562,7 +563,7 @@
562
563
</variablelist>
563
564
</para>
564
565
<para>
565
-
These assertions may not appear in character classes (but
566
+
These assertions may not appear in character classes (but
566
567
note that "<literal>\b</literal>" has a different meaning, namely the backspace
567
568
character, inside a character class).
568
569
</para>
...
...
@@ -570,20 +571,20 @@
570
571
A word boundary is a position in the subject string where
571
572
the current character and the previous character do not both
572
573
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
573
-
<literal>\w</literal> and the other matches
574
+
<literal>\w</literal> and the other matches
574
575
<literal>\W</literal>), or the start or end of the string if the first
575
576
or last character matches <literal>\w</literal>, respectively.
576
577
</para>
577
578
<para>
578
579
The <literal>\A</literal>, <literal>\Z</literal>, and
579
-
<literal>\z</literal> assertions differ from the traditional
580
-
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link> ) in that they only
581
-
ever match at the very start and end of the subject string,
582
-
whatever options are set. They are not affected by the
580
+
<literal>\z</literal> assertions differ from the traditional
581
+
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)
582
+
in that they only ever match at the very start and end of the subject string,
583
+
whatever options are set. They are not affected by the
583
584
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
584
585
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
585
-
options. The difference between <literal>\Z</literal> and
586
-
<literal>\z</literal> is that <literal>\Z</literal> matches before a
586
+
options. The difference between <literal>\Z</literal> and
587
+
<literal>\z</literal> is that <literal>\Z</literal> matches before a
587
588
newline that is the last character of the string as well as at the end of
588
589
the string, whereas <literal>\z</literal> matches only at the end.
589
590
</para>
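<para>
 A small sketch of these differences, assuming PHP's
 <function>preg_match</function> and the <literal>m</literal> modifier for
 PCRE_MULTILINE (the subject string and variable names are illustrative only):
</para>

```php
<?php
$subject = "first\nlast\n";

// In multiline mode "^" also matches immediately after a newline.
$caretMultiline = preg_match('/^last/m', $subject);

// "\A" still matches only at the very start of the subject.
$startOnly = preg_match('/\Alast/m', $subject);

// "\Z" matches before a newline that is the last character.
$upperZ = preg_match('/last\Z/', $subject);

// "\z" matches only at the very end, after the final newline.
$lowerZ = preg_match('/last\z/', $subject);

var_dump($caretMultiline, $startOnly, $upperZ, $lowerZ); // 1, 0, 1, 0
```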
...
...
@@ -600,7 +601,11 @@
600
601
regexp metacharacters in the pattern. For example:
601
602
<literal>\w+\Q.$.\E$</literal> will match one or more word characters,
602
603
followed by literals <literal>.$.</literal> and anchored at the end of
603
-
the string.
604
+
the string. Note that this does not change the behavior of
605
+
delimiters; for instance the pattern <literal>#\Q#\E#$</literal>
606
+
is not valid, because the second <literal>#</literal> marks the end
607
+
of the pattern, and the remaining <literal>\E#$</literal> is interpreted as invalid
608
+
modifiers.
604
609
</para>
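<para>
 The following sketch illustrates both points, assuming PHP's
 <function>preg_match</function>. The <literal>@</literal> operator is used
 here only to silence the warning raised for the invalid pattern; the
 variable names are illustrative.
</para>

```php
<?php
// Between \Q and \E the characters ".$." are taken literally.
$quoted = preg_match('/\w+\Q.$.\E$/', 'abc.$.');

// The literal text is required, so other characters in its place fail.
$notQuoted = preg_match('/\w+\Q.$.\E$/', 'abcx$x');

// \Q...\E does not protect the delimiter: the second "#" ends the
// pattern and the rest is parsed as modifiers, so the call fails.
$badDelimiter = @preg_match('#\Q#\E#$', 'a#');

// Choosing a delimiter that does not occur in the pattern avoids this.
$otherDelimiter = preg_match('/\Q#\E$/', 'a#');

var_dump($quoted, $notQuoted, $badDelimiter, $otherDelimiter);
```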
605
610

606
611
<para>
...
...
@@ -835,7 +840,7 @@
835
840
<row rowsep="1">
836
841
<entry><literal>So</literal></entry>
837
842
<entry>Other symbol</entry>
838
-
<entry></entry>
843
+
<entry>Includes emojis</entry>
839
844
</row>
840
845
<row>
841
846
<entry><literal>Z</literal></entry>
...
...
@@ -869,8 +874,8 @@
869
874
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
870
875
</para>
871
876
<para>
872
-
Sets of Unicode characters are defined as belonging to certain scripts. A
873
-
character from one of these sets can be matched using a script name. For
877
+
Sets of Unicode characters are defined as belonging to certain scripts. A
878
+
character from one of these sets can be matched using a script name. For
874
879
example:
875
880
</para>
876
881
<itemizedlist>
...
...
@@ -882,7 +887,7 @@
882
887
</listitem>
883
888
</itemizedlist>
884
889
<para>
885
-
Those that are not part of an identified script are lumped together as
890
+
Those that are not part of an identified script are lumped together as
886
891
<literal>Common</literal>. The current list of scripts is:
887
892
</para>
888
893
<table>
...
...
@@ -1051,7 +1056,7 @@
1051
1056
<para>
1052
1057
In versions of PCRE older than 8.32 (which corresponds to PHP versions
1053
1058
before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>
1054
-
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1059
+
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1055
1060
character without the "mark" property, followed by zero or more characters
1056
1061
with the "mark" property, and treats the sequence as an atomic group (see
1057
1062
below). Characters with the "mark" property are typically accents that
...
...
@@ -1071,8 +1076,8 @@
1071
1076
<para>
1072
1077
Outside a character class, in the default matching mode, the
1073
1078
circumflex character (<literal>^</literal>) is an assertion which
1074
-
is true only if the current matching point is at the start of
1075
-
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1079
+
is true only if the current matching point is at the start of
1080
+
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1076
1081
has an entirely different meaning (see below).
1077
1082
</para>
1078
1083
<para>
...
...
@@ -1087,12 +1092,12 @@
1087
1092
</para>
1088
1093
<para>
1089
1094
A dollar character (<literal>$</literal>) is an assertion which is
1090
-
&true; only if the current matching point is at the end of the subject
1091
-
string, or immediately before a newline character that is the last
1095
+
&true; only if the current matching point is at the end of the subject
1096
+
string, or immediately before a newline character that is the last
1092
1097
character in the string (by default). Dollar (<literal>$</literal>)
1093
-
need not be the last character of the pattern if a number of
1094
-
alternatives are involved, but it should be the last item in any branch
1095
-
in which it appears. Dollar has no special meaning in a
1098
+
need not be the last character of the pattern if a number of
1099
+
alternatives are involved, but it should be the last item in any branch
1100
+
in which it appears. Dollar has no special meaning in a
1096
1101
character class.
1097
1102
</para>
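<para>
 As an illustration of the default behaviour of dollar, here is a minimal
 sketch assuming PHP's <function>preg_match</function>, where the
 <literal>D</literal> modifier selects PCRE_DOLLAR_ENDONLY (variable names
 are illustrative only):
</para>

```php
<?php
$subject = "abc\n";

// By default "$" also matches just before a final newline.
$default = preg_match('/abc$/', $subject);

// With PCRE_DOLLAR_ENDONLY it matches only at the very end.
$endOnly = preg_match('/abc$/D', $subject);

// "$" is not special inside a character class: this matches a literal "$".
$inClass = preg_match('/^[$]$/', '$');

var_dump($default, $endOnly, $inClass); // 1, 0, 1
```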
1098
1103
<para>
...
...
@@ -1118,9 +1123,9 @@
1118
1123
set.
1119
1124
</para>
1120
1125
<para>
1121
-
Note that the sequences \A, \Z, and \z can be used to match
1122
-
the start and end of the subject in both modes, and if all
1123
-
branches of a pattern start with \A is it always anchored,
1126
+
Note that the sequences \A, \Z, and \z can be used to match
1127
+
the start and end of the subject in both modes, and if all
1128
+
branches of a pattern start with \A it is always anchored,
1124
1129
whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1125
1130
is set or not.
1126
1131
</para>
...
...
@@ -1129,14 +1134,14 @@
1129
1134
<section xml:id="regexp.reference.dot">
1130
1135
<title>Dot</title>
1131
1136
<para>
1132
-
Outside a character class, a dot in the pattern matches any
1133
-
one character in the subject, including a non-printing
1134
-
character, but not (by default) newline. If the
1137
+
Outside a character class, a dot in the pattern matches any
1138
+
one character in the subject, including a non-printing
1139
+
character, but not (by default) newline. If the
1135
1140
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1136
-
option is set, then dots match newlines as well. The
1141
+
option is set, then dots match newlines as well. The
1137
1142
handling of dot is entirely independent of the handling of
1138
-
circumflex and dollar, the only relationship being that they
1139
-
both involve newline characters. Dot has no special meaning
1143
+
circumflex and dollar, the only relationship being that they
1144
+
both involve newline characters. Dot has no special meaning
1140
1145
in a character class.
1141
1146
</para>
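<para>
 A brief sketch of the behaviour of dot, assuming PHP's
 <function>preg_match</function> and the <literal>s</literal> modifier for
 PCRE_DOTALL (variable names are illustrative only):
</para>

```php
<?php
// A non-printing character such as a tab is matched by dot.
$tab = preg_match('/a.b/', "a\tb");

// By default dot does not match a newline.
$newline = preg_match('/a.b/', "a\nb");

// With PCRE_DOTALL it does.
$dotall = preg_match('/a.b/s', "a\nb");

// Inside a character class the dot is an ordinary character.
$inClass = preg_match('/^[.]$/', 'x');

var_dump($tab, $newline, $dotall, $inClass); // 1, 0, 1, 0
```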
1142
1147
<para>
...
...
@@ -1150,29 +1155,29 @@
1150
1155
<title>Character classes</title>
1151
1156
<para>
1152
1157
An opening square bracket introduces a character class,
1153
-
terminated by a closing square bracket. A closing square
1154
-
bracket on its own is not special. If a closing square
1155
-
bracket is required as a member of the class, it should be
1158
+
terminated by a closing square bracket. A closing square
1159
+
bracket on its own is not special. If a closing square
1160
+
bracket is required as a member of the class, it should be
1156
1161
the first data character in the class (after an initial
1157
1162
circumflex, if present) or escaped with a backslash.
1158
1163
</para>
1159
1164
<para>
1160
1165
A character class matches a single character in the subject;
1161
-
the character must be in the set of characters defined by
1166
+
the character must be in the set of characters defined by
1162
1167
the class, unless the first character in the class is a
1163
-
circumflex, in which case the subject character must not be in
1164
-
the set defined by the class. If a circumflex is actually
1165
-
required as a member of the class, ensure it is not the
1168
+
circumflex, in which case the subject character must not be in
1169
+
the set defined by the class. If a circumflex is actually
1170
+
required as a member of the class, ensure it is not the
1166
1171
first character, or escape it with a backslash.
1167
1172
</para>
1168
1173
<para>
1169
-
For example, the character class [aeiou] matches any lower
1174
+
For example, the character class [aeiou] matches any lower
1170
1175
case vowel, while [^aeiou] matches any character that is not
1171
-
a lower case vowel. Note that a circumflex is just a
1172
-
convenient notation for specifying the characters which are in
1173
-
the class by enumerating those that are not. It is not an
1174
-
assertion: it still consumes a character from the subject
1175
-
string, and fails if the current pointer is at the end of
1176
+
a lower case vowel. Note that a circumflex is just a
1177
+
convenient notation for specifying the characters which are in
1178
+
the class by enumerating those that are not. It is not an
1179
+
assertion: it still consumes a character from the subject
1180
+
string, and fails if the current pointer is at the end of
1176
1181
the string.
1177
1182
</para>
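<para>
 The point that a negated class still consumes a character can be seen in
 this minimal sketch, which assumes PHP's <function>preg_match</function>
 (variable names are illustrative only):
</para>

```php
<?php
$vowel     = preg_match('/^[aeiou]$/', 'e');   // a lower case vowel
$consonant = preg_match('/^[^aeiou]$/', 'k');  // not a lower case vowel
$upperE    = preg_match('/^[^aeiou]$/', 'E');  // upper case is not in the set either

// After "k" there is no character left, so the negated class fails
// even though "nothing" is certainly not a vowel.
$atEnd = preg_match('/k[^aeiou]/', 'k');

var_dump($vowel, $consonant, $upperE, $atEnd); // 1, 1, 1, 0
```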
1178
1183
<para>
...
...
@@ -1184,61 +1189,62 @@
1184
1189
</para>
1185
1190
<para>
1186
1191
The newline character is never treated in any special way in
1187
-
character classes, whatever the setting of the <link
1192
+
character classes, whatever the setting of the <link
1188
1193
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1189
1194
or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1190
1195
options is. A class such as [^a] will always match a newline.
1191
1196
</para>
1192
1197
<para>
1193
-
The minus (hyphen) character can be used to specify a range
1194
-
of characters in a character class. For example, [d-m]
1195
-
matches any letter between d and m, inclusive. If a minus
1196
-
character is required in a class, it must be escaped with a
1198
+
The minus (hyphen) character can be used to specify a range
1199
+
of characters in a character class. For example, [d-m]
1200
+
matches any letter between d and m, inclusive. If a minus
1201
+
character is required in a class, it must be escaped with a
1197
1202
backslash or appear in a position where it cannot be
1198
1203
interpreted as indicating a range, typically as the first or last
1199
1204
character in the class.
1200
1205
</para>
1201
1206
<para>
1202
-
It is not possible to have the literal character "]" as the
1203
-
end character of a range. A pattern such as [W-]46] is
1207
+
It is not possible to have the literal character "]" as the
1208
+
end character of a range. A pattern such as [W-]46] is
1204
1209
interpreted as a class of two characters ("W" and "-")
1205
1210
followed by a literal string "46]", so it would match "W46]" or
1206
-
"-46]". However, if the "]" is escaped with a backslash it
1207
-
is interpreted as the end of range, so [W-\]46] is
1208
-
interpreted as a single class containing a range followed by two
1211
+
"-46]". However, if the "]" is escaped with a backslash it
1212
+
is interpreted as the end of range, so [W-\]46] is
1213
+
interpreted as a single class containing a range followed by two
1209
1214
separate characters. The octal or hexadecimal representation
1210
1215
of "]" can also be used to end a range.
1211
1216
</para>
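<para>
 The two interpretations can be compared with a short sketch, assuming
 PHP's <function>preg_match</function> and single-quoted pattern strings
 (variable names are illustrative only):
</para>

```php
<?php
// [W-] is a class of "W" and "-", followed by the literal string "46]".
$classThenLiteral = preg_match('/^[W-]46]$/', 'W46]');
$hyphenMember     = preg_match('/^[W-]46]$/', '-46]');

// With the "]" escaped, the class is the range W to ] plus "4" and "6".
$inRange   = preg_match('/^[W-\]46]$/', 'Z');    // "Z" lies between "W" and "]"
$digit     = preg_match('/^[W-\]46]$/', '6');
$wholeText = preg_match('/^[W-\]46]$/', 'W46]'); // the class matches one character only

var_dump($classThenLiteral, $hyphenMember, $inRange, $digit, $wholeText);
```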
1212
1217
<para>
1213
1218
Ranges operate in ASCII collating sequence. They can also be
1214
-
used for characters specified numerically, for example
1215
-
[\000-\037]. If a range that includes letters is used when
1216
-
case-insensitive (caseless) matching is set, it matches the
1217
-
letters in either case. For example, [W-c] is equivalent to
1219
+
used for characters specified numerically, for example
1220
+
[\000-\037]. If a range that includes letters is used when
1221
+
case-insensitive (caseless) matching is set, it matches the
1222
+
letters in either case. For example, [W-c] is equivalent to
1218
1223
[][\^_`wxyzabc], matched case-insensitively, and if character
1219
1224
tables for the "fr" locale are in use, [\xc8-\xcb] matches
1220
1225
accented E characters in both cases.
1221
1226
</para>
1222
1227
<para>
1223
-
The character types \d, \D, \s, \S, \w, and \W may also
1224
-
appear in a character class, and add the characters that
1228
+
The character types \d, \D, \s, \S, \w, and \W may also
1229
+
appear in a character class, and add the characters that
1225
1230
they match to the class. For example, [\dABCDEF] matches any
1226
-
hexadecimal digit. A circumflex can conveniently be used
1227
-
with the upper case character types to specify a more
1231
+
hexadecimal digit. A circumflex can conveniently be used
1232
+
with the upper case character types to specify a more
1228
1233
restricted set of characters than the matching lower case type.
1229
-
For example, the class [^\W_] matches any letter or digit,
1234
+
For example, the class [^\W_] matches any letter or digit,
1230
1235
but not underscore.
1231
1236
</para>
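<para>
 A minimal sketch of character types used inside classes, assuming PHP's
 <function>preg_match</function> and the default character tables (variable
 names are illustrative only):
</para>

```php
<?php
// \d adds the digits to the listed letters A to F.
$hex      = preg_match('/^[\dABCDEF]+$/', '7F3A');
$lowerHex = preg_match('/^[\dABCDEF]+$/', '7f3a'); // lower case letters are not listed

// [^\W_] is "a word character, but not underscore".
$alnum          = preg_match('/^[^\W_]+$/', 'abc123');
$withUnderscore = preg_match('/^[^\W_]+$/', 'abc_123');

var_dump($hex, $lowerHex, $alnum, $withUnderscore); // 1, 0, 1, 0
```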
1232
1237
<para>
1233
-
All non-alphanumeric characters other than \, -, ^ (at the
1234
-
start) and the terminating ] are non-special in character
1238
+
All non-alphanumeric characters other than \, -, ^ (at the
1239
+
start) and the terminating ] are non-special in character
1235
1240
classes, but it does no harm if they are escaped. The pattern
1236
1241
terminator is always special and must be escaped when used
1237
1242
within an expression.
1238
1243
</para>
1239
1244
<para>
1240
1245
Perl supports the POSIX notation for character classes. This uses names
1241
-
enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also
1246
+
enclosed by <literal>[:</literal> and <literal>:]</literal> within
1247
+
the enclosing square brackets. PCRE also
1242
1248
supports this notation. For example, <literal>[01[:alpha:]%]</literal>
1243
1249
matches "0", "1", any alphabetic character, or "%". The supported class
1244
1250
names are:
...
...
@@ -1293,16 +1299,16 @@
1293
1299
<section xml:id="regexp.reference.alternation">
1294
1300
<title>Alternation</title>
1295
1301
<para>
1296
-
Vertical bar characters are used to separate alternative
1302
+
Vertical bar characters are used to separate alternative
1297
1303
patterns. For example, the pattern
1298
1304
<literal>gilbert|sullivan</literal>
1299
1305
matches either "gilbert" or "sullivan". Any number of alternatives
1300
-
may appear, and an empty alternative is permitted
1301
-
(matching the empty string). The matching process tries
1302
-
each alternative in turn, from left to right, and the first
1303
-
one that succeeds is used. If the alternatives are within a
1304
-
subpattern (defined below), "succeeds" means matching the
1305
-
rest of the main pattern as well as the alternative in the
1306
+
may appear, and an empty alternative is permitted
1307
+
(matching the empty string). The matching process tries
1308
+
each alternative in turn, from left to right, and the first
1309
+
one that succeeds is used. If the alternatives are within a
1310
+
subpattern (defined below), "succeeds" means matching the
1311
+
rest of the main pattern as well as the alternative in the
1306
1312
subpattern.
1307
1313
</para>
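<para>
 The left-to-right rule can be sketched as follows, assuming PHP's
 <function>preg_match</function>. The patterns other than
 <literal>gilbert|sullivan</literal> are illustrative additions.
</para>

```php
<?php
preg_match('/gilbert|sullivan/', 'arthur sullivan', $name);

// An empty alternative matches the empty string.
$emptyBranch = preg_match('/^(?:un|)do$/', 'do');

// The first alternative that succeeds is used, even if a later one is longer.
preg_match('/a|ab/', 'ab', $first);

// Inside a subpattern, "succeeds" includes the rest of the pattern:
// "a" alone cannot be followed by "c" here, so "ab" is used instead.
preg_match('/^(a|ab)c$/', 'abc', $withRest);

var_dump($name[0], $emptyBranch, $first[0], $withRest[1]);
```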
1308
1314
</section>
...
...
@@ -1317,7 +1323,7 @@
1317
1323
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,
1318
1324
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1319
1325
and PCRE_DUPNAMES can be changed from within the pattern by
1320
-
a sequence of Perl option letters enclosed between "(?" and
1326
+
a sequence of Perl option letters enclosed between "(?" and
1321
1327
")". The option letters are:
1322
1328

1323
1329
<table>
...
...
@@ -1346,7 +1352,8 @@
1346
1352
</row>
1347
1353
<row>
1348
1354
<entry><literal>X</literal></entry>
1349
-
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> (no longer supported as of PHP 7.3.0)</entry>
1355
+
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>
1356
+
(no longer supported as of PHP 7.3.0)</entry>
1350
1357
</row>
1351
1358
<row>
1352
1359
<entry><literal>J</literal></entry>
...
...
@@ -1357,16 +1364,16 @@
1357
1364
</table>
1358
1365
</para>
1359
1366
<para>
1360
-
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1367
+
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1361
1368
also possible to unset these options by preceding the letter
1362
-
with a hyphen, and a combined setting and unsetting such as
1363
-
(?im-sx), which sets <link
1369
+
with a hyphen, and a combined setting and unsetting such as
1370
+
(?im-sx), which sets <link
1364
1371
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and
1365
1372
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1366
1373
while unsetting <link
1367
1374
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and
1368
1375
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,
1369
-
is also permitted. If a letter appears both before and after the
1376
+
is also permitted. If a letter appears both before and after the
1370
1377
hyphen, the option is unset.
1371
1378
</para>
1372
1379
<para>
...
...
@@ -1376,14 +1383,14 @@
1376
1383
and "abC".
1377
1384
</para>
1378
1385
<para>
1379
-
If an option change occurs inside a subpattern, the effect
1380
-
is different. This is a change of behaviour in Perl 5.005.
1381
-
An option change inside a subpattern affects only that part
1386
+
If an option change occurs inside a subpattern, the effect
1387
+
is different. This is a change of behaviour in Perl 5.005.
1388
+
An option change inside a subpattern affects only that part
1382
1389
of the subpattern that follows it, so
1383
1390

1384
1391
<literal>(a(?i)b)c</literal>
1385
1392

1386
-
matches abc and aBc and no other strings (assuming <link
1393
+
matches "abc" and "aBc" and no other strings (assuming <link
1387
1394
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not
1388
1395
used). By this means, options can be made to have different settings in
1389
1396
different parts of the pattern. Any changes made in one alternative do
...
...
@@ -1392,18 +1399,18 @@
1392
1399

1393
1400
<literal>(a(?i)b|c)</literal>
1394
1401

1395
-
matches "ab", "aB", "c", and "C", even though when matching
1402
+
matches "ab", "aB", "c", and "C", even though when matching
1396
1403
"C" the first branch is abandoned before the option setting.
1397
-
This is because the effects of option settings happen at
1398
-
compile time. There would be some very weird behaviour otherwise.
1404
+
This is because the effects of option settings happen at
1405
+
compile time. Otherwise the behaviour would be very hard to predict.
1399
1406
</para>
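<para>
 The scope of an option change inside a subpattern can be checked with a
 sketch like the following, assuming PHP's <function>preg_match</function>.
 The anchors are added here only so that each subject is tested as a whole.
</para>

```php
<?php
// (?i) applies only to the part of the subpattern that follows it.
$abc = preg_match('/^(a(?i)b)c$/', 'abc');
$aBc = preg_match('/^(a(?i)b)c$/', 'aBc');
$aBC = preg_match('/^(a(?i)b)c$/', 'aBC'); // "c" is outside the subpattern
$Abc = preg_match('/^(a(?i)b)c$/', 'Abc'); // "a" precedes the option change

// The setting carries into the next branch of the same subpattern.
$upperC = preg_match('/^(a(?i)b|c)$/', 'C');

var_dump($abc, $aBc, $aBC, $Abc, $upperC); // 1, 1, 0, 0, 1
```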
1400
1407
<para>
1401
1408
The PCRE-specific options <link
1402
-
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1403
-
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1409
+
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1410
+
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1404
1411
be changed in the same way as the Perl-compatible options by
1405
-
using the characters U and X respectively. The (?X) flag
1406
-
setting is special in that it must always occur earlier in
1412
+
using the characters U and X respectively. The (?X) flag
1413
+
setting is special in that it must always occur earlier in
1407
1414
the pattern than any of the additional features it turns on,
1408
1415
even when it is at top level. It is best put at the start.
1409
1416
</para>
...
...
@@ -1412,8 +1419,8 @@
1412
1419
<section xml:id="regexp.reference.subpatterns">
1413
1420
<title>Subpatterns</title>
1414
1421
<para>
1415
-
Subpatterns are delimited by parentheses (round brackets),
1416
-
which can be nested. Marking part of a pattern as a subpattern
1422
+
Subpatterns are delimited by parentheses (round brackets),
1423
+
which can be nested. Marking part of a pattern as a subpattern
1417
1424
does two things:
1418
1425
</para>
1419
1426
<orderedlist>
...
...
@@ -1442,30 +1449,30 @@
1442
1449

1443
1450
<literal>the ((red|white) (king|queen))</literal>
1444
1451

1445
-
the captured substrings are "red king", "red", and "king",
1452
+
the captured substrings are "red king", "red", and "king",
1446
1453
and are numbered 1, 2, and 3.
1447
1454
</para>
1448
1455
<para>
1449
-
The fact that plain parentheses fulfill two functions is not
1450
-
always helpful. There are often times when a grouping subpattern
1451
-
is required without a capturing requirement. If an
1456
+
The fact that plain parentheses fulfill two functions is not
1457
+
always helpful. Often a grouping subpattern
1458
+
is required without a capturing requirement. If an
1452
1459
opening parenthesis is followed by "?:", the subpattern does
1453
-
not do any capturing, and is not counted when computing the
1460
+
not do any capturing, and is not counted when computing the
1454
1461
number of any subsequent capturing subpatterns. For example,
1455
-
if the string "the white queen" is matched against the
1462
+
if the string "the white queen" is matched against the
1456
1463
pattern
1457
1464

1458
1465
<literal>the ((?:red|white) (king|queen))</literal>
1459
1466

1460
-
the captured substrings are "white queen" and "queen", and
1461
-
are numbered 1 and 2. The maximum number of captured substrings
1467
+
the captured substrings are "white queen" and "queen", and
1468
+
are numbered 1 and 2. The maximum number of captured substrings
1462
1469
is 65535. It may not be possible to compile such large patterns,
1463
1470
however, depending on the configuration options of libpcre.
1464
1471
</para>
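The difference between capturing and non-capturing parentheses can be checked with a short sketch. It uses Python's <literal>re</literal> module as a stand-in for PCRE, which is an assumption of this example rather than part of the PHP API; the two patterns are the ones shown above.

```python
import re

# Plain parentheses both group and capture, numbered by opening parenthesis.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())

# "?:" keeps the grouping but drops the capture, so later groups renumber.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())
```

The variable names <literal>capturing</literal> and <literal>non_capturing</literal> are illustrative only. The second result has two entries instead of three because the <literal>(?:red|white)</literal> group is not counted.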
1465
1472
<para>
1466
-
As a convenient shorthand, if any option settings are
1467
-
required at the start of a non-capturing subpattern, the
1468
-
option letters may appear between the "?" and the ":". Thus
1473
+
As a convenient shorthand, if any option settings are
1474
+
required at the start of a non-capturing subpattern, the
1475
+
option letters may appear between the "?" and the ":". Thus
1469
1476
the two patterns
1470
1477
</para>
1471
1478

...
...
@@ -1479,10 +1486,10 @@
1479
1486
</informalexample>
1480
1487

1481
1488
<para>
1482
-
match exactly the same set of strings. Because alternative
1483
-
branches are tried from left to right, and options are not
1484
-
reset until the end of the subpattern is reached, an option
1485
-
setting in one branch does affect subsequent branches, so
1489
+
match exactly the same set of strings. Because alternative
1490
+
branches are tried from left to right, and options are not
1491
+
reset until the end of the subpattern is reached, an option
1492
+
setting in one branch does affect subsequent branches, so
1486
1493
the above patterns match "SUNDAY" as well as "Saturday".
1487
1494
</para>
1488
1495

...
...
@@ -1511,9 +1518,10 @@
1511
1518

1512
1519
<para>
1513
1520
Here <literal>Sun</literal> is stored in backreference 2, while
1514
-
backreference 1 is empty. Matching yields <literal>Sat</literal> in
1515
-
backreference 1 while backreference 2 does not exist. Changing the pattern
1516
-
to use the <literal>(?|</literal> fixes this problem:
1521
+
backreference 1 is empty. Matching <literal>Saturday</literal> yields
1522
+
<literal>Sat</literal> in backreference 1 while backreference 2 does
1523
+
not exist. Changing the pattern to use <literal>(?|</literal> fixes
1524
+
this problem:
1517
1525
</para>
1518
1526

1519
1527
<informalexample>
...
...
@@ -1539,45 +1547,56 @@
1539
1547
<listitem><simpara>the . metacharacter</simpara></listitem>
1540
1548
<listitem><simpara>a character class</simpara></listitem>
1541
1549
<listitem><simpara>a back reference (see next section)</simpara></listitem>
1542
-
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1550
+
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1543
1551
see below)</simpara></listitem>
1544
1552
</itemizedlist>
1545
1553
</para>
1546
1554
<para>
1547
-
The general repetition quantifier specifies a minimum and
1548
-
maximum number of permitted matches, by giving the two
1549
-
numbers in curly brackets (braces), separated by a comma.
1550
-
The numbers must be less than 65536, and the first must be
1555
+
The general repetition quantifier specifies a minimum and
1556
+
maximum number of permitted matches, by giving the two
1557
+
numbers in curly brackets (braces), separated by a comma.
1558
+
The numbers must be less than 65536, and the first must be
1551
1559
less than or equal to the second. For example:
1552
1560

1553
1561
<literal>z{2,4}</literal>
1554
1562

1555
-
matches "zz", "zzz", or "zzzz". A closing brace on its own
1563
+
matches "zz", "zzz", or "zzzz". A closing brace on its own
1556
1564
is not a special character. If the second number is omitted,
1557
-
but the comma is present, there is no upper limit; if the
1565
+
but the comma is present, there is no upper limit; if the
1558
1566
second number and the comma are both omitted, the quantifier
1559
1567
specifies an exact number of required matches. Thus
1560
1568

1561
1569
<literal>[aeiou]{3,}</literal>
1562
1570

1563
-
matches at least 3 successive vowels, but may match many
1571
+
matches at least 3 successive vowels, but may match many
1564
1572
more, while
1565
1573

1566
1574
<literal>\d{8}</literal>
1567
1575

1568
-
matches exactly 8 digits. An opening curly bracket that
1569
-
appears in a position where a quantifier is not allowed, or
1570
-
one that does not match the syntax of a quantifier, is taken
1571
-
as a literal character. For example, {,6} is not a quantifier,
1572
-
but a literal string of four characters.
1576
+
matches exactly 8 digits.
1577
+

1573
1578
</para>
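The three quantifier forms described above (bounded, open-ended, and exact) can be exercised as follows. This is a minimal sketch that assumes Python's <literal>re</literal> module, whose handling of these particular forms agrees with the description; the helper <literal>full()</literal> is hypothetical and exists only for this example.

```python
import re

def full(pattern, subject):
    """Return True when the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]
print(bounded)

# {3,}: at least three repetitions, no upper limit.
open_ended = full(r"[aeiou]{3,}", "aeiou")

# {8}: exactly eight repetitions.
exact = full(r"\d{8}", "20240131")
too_short = full(r"\d{8}", "2024013")
print(open_ended, exact, too_short)
```

Whole-string matching is used here so that the upper bound is visible; with an unanchored search, <literal>z{2,4}</literal> would still find a match inside "zzzzz".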
1579
+
<simpara>
1580
+
Prior to PHP 8.4.0, an opening curly bracket that
1581
+
appears in a position where a quantifier is not allowed, or
1582
+
one that does not match the syntax of a quantifier, was taken
1583
+
as a literal character. For example, <literal>{,6}</literal>
1584
+
was not a quantifier, but a literal string of four characters.
1585
+

1586
+
As of PHP 8.4.0, the PCRE extension is bundled with PCRE2 version 10.44,
1587
+
which allows patterns such as <literal>\d{,8}</literal> and interprets
1588
+
them as <literal>\d{0,8}</literal>.
1589
+

1590
+
Further, as of PHP 8.4.0, space characters around quantifiers such as
1591
+
<literal>\d{0 , 8}</literal> and <literal>\d{ 0 , 8 }</literal> are allowed.
1592
+
</simpara>
1574
1593
<para>
1575
-
The quantifier {0} is permitted, causing the expression to
1576
-
behave as if the previous item and the quantifier were not
1594
+
The quantifier {0} is permitted, causing the expression to
1595
+
behave as if the previous item and the quantifier were not
1577
1596
present.
1578
1597
</para>
1579
1598
<para>
1580
-
For convenience (and historical compatibility) the three
1599
+
For convenience (and historical compatibility) the three
1581
1600
most common quantifiers have single-character abbreviations:
1582
1601

1583
1602
<table>
...
...
@@ -1601,63 +1620,63 @@
1601
1620
</table>
1602
1621
</para>
1603
1622
<para>
1604
-
It is possible to construct infinite loops by following a
1605
-
subpattern that can match no characters with a quantifier
1623
+
It is possible to construct infinite loops by following a
1624
+
subpattern that can match no characters with a quantifier
1606
1625
that has no upper limit, for example:
1607
1626

1608
1627
<literal>(a?)*</literal>
1609
1628
</para>
1610
1629
<para>
1611
-
Earlier versions of Perl and PCRE used to give an error at
1612
-
compile time for such patterns. However, because there are
1613
-
cases where this can be useful, such patterns are now
1614
-
accepted, but if any repetition of the subpattern does in
1630
+
Earlier versions of Perl and PCRE used to give an error at
1631
+
compile time for such patterns. However, because there are
1632
+
cases where this can be useful, such patterns are now
1633
+
accepted, but if any repetition of the subpattern does in
1615
1634
fact match no characters, the loop is forcibly broken.
1616
1635
</para>
1617
1636
<para>
1618
-
By default, the quantifiers are "greedy", that is, they
1619
-
match as much as possible (up to the maximum number of permitted
1620
-
times), without causing the rest of the pattern to
1637
+
By default, the quantifiers are "greedy", that is, they
1638
+
match as much as possible (up to the maximum number of permitted
1639
+
times), without causing the rest of the pattern to
1621
1640
fail. The classic example of where this gives problems is in
1622
1641
trying to match comments in C programs. These appear between
1623
-
the sequences /* and */ and within the sequence, individual
1624
-
* and / characters may appear. An attempt to match C comments
1642
+
the sequences /* and */ and within the sequence, individual
1643
+
* and / characters may appear. An attempt to match C comments
1625
1644
by applying the pattern
1626
1645

1627
1646
<literal>/\*.*\*/</literal>
1628
1647

1629
1648
to the string
1630
1649

1631
-
<literal>/* first comment */ not comment /* second comment */</literal>
1650
+
<literal>/* first comment */ not comment /* second comment */</literal>
1632
1651

1633
-
fails, because it matches the entire string due to the
1634
-
greediness of the .* item.
1652
+
fails, because it matches the entire string due to the
1653
+
greediness of the .* item.
1635
1654
</para>
1636
1655
<para>
1637
-
However, if a quantifier is followed by a question mark,
1656
+
However, if a quantifier is followed by a question mark,
1638
1657
then it becomes lazy, and instead matches the minimum
1639
1658
number of times possible, so the pattern
1640
1659

1641
1660
<literal>/\*.*?\*/</literal>
1642
1661

1643
1662
does the right thing with the C comments. The meaning of the
1644
-
various quantifiers is not otherwise changed, just the preferred
1645
-
number of matches. Do not confuse this use of
1646
-
question mark with its use as a quantifier in its own right.
1663
+
various quantifiers is not otherwise changed, just the preferred
1664
+
number of matches. Do not confuse this use of
1665
+
question mark with its use as a quantifier in its own right.
1647
1666
Because it has two uses, it can sometimes appear doubled, as
1648
1667
in
1649
1668

1650
1669
<literal>\d??\d</literal>
1651
1670

1652
-
which matches one digit by preference, but can match two if
1671
+
which matches one digit by preference, but can match two if
1653
1672
that is the only way the rest of the pattern matches.
1654
1673
</para>
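The greedy and lazy behaviour described above can be reproduced with the C comment example. The sketch assumes Python's <literal>re</literal> module, which treats <literal>*</literal>, <literal>*?</literal> and <literal>??</literal> the same way for these patterns; the variable names are illustrative.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/", so the whole subject is one match.
greedy = re.search(r"/\*.*\*/", subject).group()

# Lazy: .*? stops at the first "*/", so each comment is matched separately.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject, lazy, one_digit, two_digits)
```

The last two lines differ only in whether the match must reach the end of the subject, which is what forces the lazy <literal>\d??</literal> to consume a digit.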
1655
1674
<para>
1656
1675
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>
1657
-
option is set (an option which is not
1658
-
available in Perl) then the quantifiers are not greedy by
1676
+
option is set (an option which is not
1677
+
available in Perl) then the quantifiers are not greedy by
1659
1678
default, but individual ones can be made greedy by following
1660
-
them with a question mark. In other words, it inverts the
1679
+
them with a question mark. In other words, it inverts the
1661
1680
default behaviour.
1662
1681
</para>
1663
1682
<para>
...
...
@@ -1669,41 +1688,41 @@
1669
1688
</para>
1670
1689
<para>
1671
1690
When a parenthesized subpattern is quantified with a minimum
1672
-
repeat count that is greater than 1 or with a limited maximum,
1673
-
more store is required for the compiled pattern, in
1691
+
repeat count that is greater than 1 or with a limited maximum,
1692
+
more memory is required for the compiled pattern, in
1674
1693
proportion to the size of the minimum or maximum.
1675
1694
</para>
1676
1695
<para>
1677
-
If a pattern starts with .* or .{0,} and the <link
1696
+
If a pattern starts with .* or .{0,} and the <link
1678
1697
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1679
1698
option (equivalent to Perl's /s) is set, thus allowing the .
1680
-
to match newlines, then the pattern is implicitly anchored,
1699
+
to match newlines, then the pattern is implicitly anchored,
1681
1700
because whatever follows will be tried against every character
1682
-
position in the subject string, so there is no point in
1683
-
retrying the overall match at any position after the first.
1701
+
position in the subject string, so there is no point in
1702
+
retrying the overall match at any position after the first.
1684
1703
PCRE treats such a pattern as though it were preceded by \A.
1685
-
In cases where it is known that the subject string contains
1704
+
In cases where it is known that the subject string contains
1686
1705
no newlines, it is worth setting <link
1687
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1706
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1688
1707
pattern begins with .* in order to
1689
1708
obtain this optimization, or
1690
1709
alternatively using ^ to indicate anchoring explicitly.
1691
1710
</para>
1692
1711
<para>
1693
-
When a capturing subpattern is repeated, the value captured
1712
+
When a capturing subpattern is repeated, the value captured
1694
1713
is the substring that matched the final iteration. For example, after
1695
1714

1696
1715
<literal>(tweedle[dume]{3}\s*)+</literal>
1697
1716

1698
-
has matched "tweedledum tweedledee" the value of the captured
1699
-
substring is "tweedledee". However, if there are
1700
-
nested capturing subpatterns, the corresponding captured
1701
-
values may have been set in previous iterations. For example,
1717
+
has matched "tweedledum tweedledee" the value of the captured
1718
+
substring is "tweedledee". However, if there are
1719
+
nested capturing subpatterns, the corresponding captured
1720
+
values may have been set in previous iterations. For example,
1702
1721
after
1703
1722

1704
1723
<literal>/(a|(b))+/</literal>
1705
1724

1706
-
matches "aba" the value of the second captured substring is
1725
+
matches "aba" the value of the second captured substring is
1707
1726
"b".
1708
1727
</para>
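A short sketch of what a repeated capturing subpattern retains. It assumes Python's <literal>re</literal> module, which keeps the last value captured by a nested group in the same way as described above; other regex engines may reset nested groups on each iteration.

```python
import re

# The outer group is overwritten on every iteration; only the last one survives.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(repeated.group(1))

# Group 2 was set in the second iteration ("b") and is not cleared by the
# third iteration, in which only the "a" alternative matched.
nested = re.match(r"(a|(b))+", "aba")
print(nested.groups())
```

The delimiters of <literal>/(a|(b))+/</literal> are omitted because Python patterns are written without them.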
1709
1728
</section>
...
...
@@ -1711,74 +1730,74 @@
1711
1730
<section xml:id="regexp.reference.back-references">
1712
1731
<title>Back references</title>
1713
1732
<para>
1714
-
Outside a character class, a backslash followed by a digit
1715
-
greater than 0 (and possibly further digits) is a back
1716
-
reference to a capturing subpattern earlier (i.e. to its
1717
-
left) in the pattern, provided there have been that many
1733
+
Outside a character class, a backslash followed by a digit
1734
+
greater than 0 (and possibly further digits) is a back
1735
+
reference to a capturing subpattern earlier (i.e. to its
1736
+
left) in the pattern, provided there have been that many
1718
1737
previous capturing left parentheses.
1719
1738
</para>
1720
1739
<para>
1721
-
However, if the decimal number following the backslash is
1722
-
less than 10, it is always taken as a back reference, and
1723
-
causes an error only if there are not that many capturing
1724
-
left parentheses in the entire pattern. In other words, the
1725
-
parentheses that are referenced need not be to the left of
1726
-
the reference for numbers less than 10.
1740
+
However, if the decimal number following the backslash is
1741
+
less than 10, it is always taken as a back reference, and
1742
+
causes an error only if there are not that many capturing
1743
+
left parentheses in the entire pattern. In other words, the
1744
+
parentheses that are referenced need not be to the left of
1745
+
the reference for numbers less than 10.
1727
1746
A "forward back reference" can make sense when a repetition
1728
1747
is involved and the subpattern to the right has participated
1729
1748
in an earlier iteration. See the section
1730
-
entitled "Backslash" above for further details of the handling
1749
+
on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling
1731
1750
of digits following a backslash.
1732
1751
</para>
1733
1752
<para>
1734
-
A back reference matches whatever actually matched the capturing
1753
+
A back reference matches whatever actually matched the capturing
1735
1754
subpattern in the current subject string, rather than
1736
1755
anything matching the subpattern itself. So the pattern
1737
1756

1738
1757
<literal>(sens|respons)e and \1ibility</literal>
1739
1758

1740
-
matches "sense and sensibility" and "response and responsibility",
1741
-
but not "sense and responsibility". If case-sensitive (caseful)
1759
+
matches "sense and sensibility" and "response and responsibility",
1760
+
but not "sense and responsibility". If case-sensitive (caseful)
1742
1761
matching is in force at the time of the back reference, then
1743
1762
the case of letters is relevant. For example,
1744
1763

1745
1764
<literal>((?i)rah)\s+\1</literal>
1746
1765

1747
-
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1748
-
though the original capturing subpattern is matched
1766
+
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1767
+
though the original capturing subpattern is matched
1749
1768
case-insensitively (caselessly).
1750
1769
</para>
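The point that a back reference repeats the matched text, not the subpattern, can be verified with the first pattern above. This sketch assumes Python's <literal>re</literal> module; the dictionary <literal>results</literal> is an illustrative name.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)

# \1 must repeat exactly what group 1 matched in this subject.
results = {s: pattern.fullmatch(s) is not None for s in subjects}
for subject, matched in results.items():
    print(subject, "->", matched)
```

The third subject is rejected even though "respons" is a valid alternative of the subpattern, because group 1 matched "sens" in that attempt.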
1751
1770
<para>
1752
-
There may be more than one back reference to the same subpattern.
1753
-
If a subpattern has not actually been used in a
1754
-
particular match, then any back references to it always
1771
+
There may be more than one back reference to the same subpattern.
1772
+
If a subpattern has not actually been used in a
1773
+
particular match, then any back references to it always
1755
1774
fail. For example, the pattern
1756
1775

1757
1776
<literal>(a|(bc))\2</literal>
1758
1777

1759
-
always fails if it starts to match "a" rather than "bc".
1760
-
Because there may be up to 99 back references, all digits
1761
-
following the backslash are taken as part of a potential
1778
+
always fails if it starts to match "a" rather than "bc".
1779
+
Because there may be up to 99 back references, all digits
1780
+
following the backslash are taken as part of a potential
1762
1781
back reference number. If the pattern continues with a digit
1763
1782
character, then some delimiter must be used to terminate the
1764
1783
back reference. If the <link
1765
-
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1766
-
is set, this can be whitespace. Otherwise an empty comment can be used.
1784
+
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1785
+
is set, this can be whitespace. Otherwise an empty comment can be used.
1767
1786
</para>
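A back reference to a subpattern that did not take part in the match can be observed as follows. The sketch assumes Python's <literal>re</literal> module, where a reference to an unset group also fails to match.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 matched "bc", so \2 requires a second "bc".
took_bc = pattern.fullmatch("bcbc")

# Whenever the "a" alternative is taken, group 2 is unset and \2 fails;
# the later "bc" has nothing after it to repeat, so there is no match at all.
took_a = pattern.search("abc")

print(took_bc is not None, took_a is not None)
```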
1768
1787
<para>
1769
1788
A back reference that occurs inside the parentheses to which
1770
-
it refers fails when the subpattern is first used, so, for
1771
-
example, (a\1) never matches. However, such references can
1789
+
it refers fails when the subpattern is first used, so, for
1790
+
example, (a\1) never matches. However, such references can
1772
1791
be useful inside repeated subpatterns. For example, the pattern
1773
1792

1774
1793
<literal>(a|b\1)+</literal>
1775
1794

1776
-
matches any number of "a"s and also "aba", "ababba" etc. At
1795
+
matches any number of "a"s and also "aba", "ababba" etc. At
1777
1796
each iteration of the subpattern, the back reference matches
1778
-
the character string corresponding to the previous iteration.
1797
+
the character string corresponding to the previous iteration.
1779
1798
In order for this to work, the pattern must be such
1780
-
that the first iteration does not need to match the back
1781
-
reference. This can be done using alternation, as in the
1799
+
that the first iteration does not need to match the back
1800
+
reference. This can be done using alternation, as in the
1782
1801
example above, or by a quantifier with a minimum of zero.
1783
1802
</para>
1784
1803
<para>
...
...
@@ -1813,18 +1832,18 @@
1813
1832
<section xml:id="regexp.reference.assertions">
1814
1833
<title>Assertions</title>
1815
1834
<para>
1816
-
An assertion is a test on the characters following or
1817
-
preceding the current matching point that does not actually
1818
-
consume any characters. The simple assertions coded as \b,
1819
-
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1820
-
assertions are coded as subpatterns. There are two
1821
-
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1835
+
An assertion is a test on the characters following or
1836
+
preceding the current matching point that does not actually
1837
+
consume any characters. The simple assertions coded as \b,
1838
+
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1839
+
assertions are coded as subpatterns. There are two
1840
+
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1822
1841
subject string, and those that <emphasis>look behind</emphasis> it.
1823
1842
</para>
1824
1843
<para>
1825
1844
An assertion subpattern is matched in the normal way, except
1826
-
that it does not cause the current matching position to be
1827
-
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1845
+
that it does not cause the current matching position to be
1846
+
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1828
1847
assertions and (?! for negative assertions. For example,
1829
1848

1830
1849
<literal>\w+(?=;)</literal>
...
...
@@ -1834,27 +1853,27 @@
1834
1853

1835
1854
<literal>foo(?!bar)</literal>
1836
1855

1837
-
matches any occurrence of "foo" that is not followed by
1856
+
matches any occurrence of "foo" that is not followed by
1838
1857
"bar". Note that the apparently similar pattern
1839
1858

1840
1859
<literal>(?!foo)bar</literal>
1841
1860

1842
-
does not find an occurrence of "bar" that is preceded by
1861
+
does not find an occurrence of "bar" that is preceded by
1843
1862
something other than "foo"; it finds any occurrence of "bar"
1844
-
whatsoever, because the assertion (?!foo) is always &true;
1845
-
when the next three characters are "bar". A lookbehind
1863
+
whatsoever, because the assertion (?!foo) is always &true;
1864
+
when the next three characters are "bar". A lookbehind
1846
1865
assertion is needed to achieve this effect.
1847
1866
</para>
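The lookahead patterns in this paragraph can be tried out with a minimal sketch. It assumes Python's <literal>re</literal> module, which uses the same <literal>(?=</literal> and <literal>(?!</literal> syntax; the subject strings are made up for the example.

```python
import re

# Positive lookahead: words followed by a semicolon, without consuming it.
before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Negative lookahead: only the "foo" that is not followed by "bar".
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo": at the position where "bar"
# starts, the next three characters are "bar", so (?!foo) succeeds.
any_bar = re.search(r"(?!foo)bar", "foobar")

print(before_semicolon, not_followed, any_bar.start())
```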
1848
1867
<para>
1849
-
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1868
+
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1850
1869
and (?&lt;! for negative assertions. For example,
1851
1870

1852
1871
<literal>(?&lt;!foo)bar</literal>
1853
1872

1854
-
does find an occurrence of "bar" that is not preceded by
1873
+
does find an occurrence of "bar" that is not preceded by
1855
1874
"foo". The contents of a lookbehind assertion are restricted
1856
-
such that all the strings it matches must have a fixed
1857
-
length. However, if there are several alternatives, they do
1875
+
such that all the strings it matches must have a fixed
1876
+
length. However, if there are several alternatives, they do
1858
1877
not all have to have the same fixed length. Thus
1859
1878

1860
1879
<literal>(?&lt;=bullock|donkey)</literal>
...
...
@@ -1863,51 +1882,51 @@
1863
1882

1864
1883
<literal>(?&lt;!dogs?|cats?)</literal>
1865
1884

1866
-
causes an error at compile time. Branches that match different
1885
+
causes an error at compile time. Branches that match different
1867
1886
length strings are permitted only at the top level of
1868
-
a lookbehind assertion. This is an extension compared with
1869
-
Perl 5.005, which requires all branches to match the same
1887
+
a lookbehind assertion. This is an extension compared with
1888
+
Perl 5.005, which requires all branches to match the same
1870
1889
length of string. An assertion such as
1871
1890

1872
1891
<literal>(?&lt;=ab(c|de))</literal>
1873
1892

1874
-
is not permitted, because its single top-level branch can
1893
+
is not permitted, because its single top-level branch can
1875
1894
match two different lengths, but it is acceptable if rewritten
1876
1895
to use two top-level branches:
1877
1896

1878
1897
<literal>(?&lt;=abc|abde)</literal>
1879
1898

1880
-
The implementation of lookbehind assertions is, for each
1881
-
alternative, to temporarily move the current position back
1882
-
by the fixed width and then try to match. If there are
1883
-
insufficient characters before the current position, the
1884
-
match is deemed to fail. Lookbehinds in conjunction with
1885
-
once-only subpatterns can be particularly useful for matching
1886
-
at the ends of strings; an example is given at the end
1899
+
The implementation of lookbehind assertions is, for each
1900
+
alternative, to temporarily move the current position back
1901
+
by the fixed width and then try to match. If there are
1902
+
insufficient characters before the current position, the
1903
+
match is deemed to fail. Lookbehinds in conjunction with
1904
+
once-only subpatterns can be particularly useful for matching
1905
+
at the ends of strings; an example is given at the end
1887
1906
of the section on once-only subpatterns.
1888
1907
</para>
1889
1908
<para>
1890
-
Several assertions (of any sort) may occur in succession.
1909
+
Several assertions (of any sort) may occur in succession.
1891
1910
For example,
1892
1911

1893
1912
<literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
1894
1913

1895
-
matches "foo" preceded by three digits that are not "999".
1896
-
Notice that each of the assertions is applied independently
1897
-
at the same point in the subject string. First there is a
1898
-
check that the previous three characters are all digits,
1914
+
matches "foo" preceded by three digits that are not "999".
1915
+
Notice that each of the assertions is applied independently
1916
+
at the same point in the subject string. First there is a
1917
+
check that the previous three characters are all digits,
1899
1918
then there is a check that the same three characters are not
1900
-
"999". This pattern does not match "foo" preceded by six
1919
+
"999". This pattern does not match "foo" preceded by six
1901
1920
characters, the first three of which are digits and the last three
1902
-
of which are not "999". For example, it doesn't match
1921
+
of which are not "999". For example, it doesn't match
1903
1922
"123abcfoo". A pattern to do that is
1904
1923

1905
1924
<literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
1906
1925
</para>
1907
1926
<para>
1908
-
This time the first assertion looks at the preceding six
1909
-
characters, checking that the first three are digits, and
1910
-
then the second assertion checks that the preceding three
1927
+
This time the first assertion looks at the preceding six
1928
+
characters, checking that the first three are digits, and
1929
+
then the second assertion checks that the preceding three
1911
1930
characters are not "999".
1912
1931
</para>
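The two patterns with successive lookbehind assertions can be compared side by side. This is a sketch assuming Python's <literal>re</literal> module, which accepts these fixed-width lookbehinds; the compiled pattern names are illustrative.

```python
import re

# Both assertions inspect the same three characters before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks six characters back, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          three_back.search(subject) is not None,
          six_back.search(subject) is not None)
```

Only "123foo" satisfies the first pattern and only "123abcfoo" satisfies the second, since each assertion is tested independently at the position just before "foo".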
1913
1932
<para>
...
...
@@ -1915,26 +1934,26 @@
1915
1934

1916
1935
<literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
1917
1936

1918
-
matches an occurrence of "baz" that is preceded by "bar"
1937
+
matches an occurrence of "baz" that is preceded by "bar"
1919
1938
which in turn is not preceded by "foo", while
1920
1939

1921
1940
<literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>
1922
1941

1923
-
is another pattern which matches "foo" preceded by three
1942
+
is another pattern which matches "foo" preceded by three
1924
1943
digits and any three characters that are not "999".
1925
1944
</para>
1926
1945
<para>
1927
1946
Assertion subpatterns are not capturing subpatterns, and may
1928
-
not be repeated, because it makes no sense to assert the
1929
-
same thing several times. If any kind of assertion contains
1930
-
capturing subpatterns within it, these are counted for the
1947
+
not be repeated, because it makes no sense to assert the
1948
+
same thing several times. If any kind of assertion contains
1949
+
capturing subpatterns within it, these are counted for the
1931
1950
purposes of numbering the capturing subpatterns in the whole
1932
-
pattern. However, substring capturing is carried out only
1933
-
for positive assertions, because it does not make sense for
1951
+
pattern. However, substring capturing is carried out only
1952
+
for positive assertions, because it does not make sense for
1934
1953
negative assertions.
1935
1954
</para>
1936
1955
<para>
1937
-
Assertions count towards the maximum of 200 parenthesized
1956
+
Assertions count towards the maximum of 200 parenthesized
1938
1957
subpatterns.
1939
1958
</para>
1940
1959
</section>
...
...
@@ -1942,17 +1961,17 @@
1942
1961
<section xml:id="regexp.reference.onlyonce">
1943
1962
<title>Once-only subpatterns</title>
1944
1963
<para>
1945
-
With both maximizing and minimizing repetition, failure of
1946
-
what follows normally causes the repeated item to be
1964
+
With both maximizing and minimizing repetition, failure of
1965
+
what follows normally causes the repeated item to be
1947
1966
re-evaluated to see if a different number of repeats allows the
1948
-
rest of the pattern to match. Sometimes it is useful to
1949
-
prevent this, either to change the nature of the match, or
1950
-
to cause it fail earlier than it otherwise might, when the
1951
-
author of the pattern knows there is no point in carrying
1967
+
rest of the pattern to match. Sometimes it is useful to
1968
+
prevent this, either to change the nature of the match, or
1969
+
to cause it to fail earlier than it otherwise might, when the
1970
+
author of the pattern knows there is no point in carrying
1952
1971
on.
1953
1972
</para>
1954
1973
<para>
1955
-
Consider, for example, the pattern \d+foo when applied to
1974
+
Consider, for example, the pattern \d+foo when applied to
1956
1975
the subject line
1957
1976

1958
1977
<literal>123456bar</literal>
...
...
@@ -1960,108 +1979,108 @@
1960
1979
<para>
1961
1980
After matching all 6 digits and then failing to match "foo",
1962
1981
the normal action of the matcher is to try again with only 5
1963
-
digits matching the \d+ item, and then with 4, and so on,
1982
+
digits matching the \d+ item, and then with 4, and so on,
1964
1983
before ultimately failing. Once-only subpatterns provide the
1965
-
means for specifying that once a portion of the pattern has
1966
-
matched, it is not to be re-evaluated in this way, so the
1967
-
matcher would give up immediately on failing to match "foo"
1968
-
the first time. The notation is another kind of special
1984
+
means for specifying that once a portion of the pattern has
1985
+
matched, it is not to be re-evaluated in this way, so the
1986
+
matcher would give up immediately on failing to match "foo"
1987
+
the first time. The notation is another kind of special
1969
1988
parenthesis, starting with (?&gt; as in this example:
1970
1989

1971
1990
<literal>(?&gt;\d+)bar</literal>
1972
1991
</para>
1973
1992
<para>
1974
-
This kind of parenthesis "locks up" the part of the pattern
1975
-
it contains once it has matched, and a failure further into
1976
-
the pattern is prevented from backtracking into it.
1977
-
Backtracking past it to previous items, however, works as normal.
1993
+
This kind of parenthesis "locks up" the part of the pattern
1994
+
it contains once it has matched, and a failure further into
1995
+
the pattern is prevented from backtracking into it.
1996
+
Backtracking past it to previous items, however, works as normal.
1978
1997
</para>
1979
1998
<para>
1980
1999
An alternative description is that a subpattern of this type
1981
-
matches the string of characters that an identical standalone
2000
+
matches the string of characters that an identical standalone
1982
2001
pattern would match, if anchored at the current point
1983
2002
in the subject string.
1984
2003
</para>
1985
2004
<para>
1986
-
Once-only subpatterns are not capturing subpatterns. Simple
1987
-
cases such as the above example can be thought of as a maximizing
1988
-
repeat that must swallow everything it can. So,
2005
+
Once-only subpatterns are not capturing subpatterns. Simple
2006
+
cases such as the above example can be thought of as a maximizing
2007
+
repeat that must swallow everything it can. So,
1989
2008
while both \d+ and \d+? are prepared to adjust the number of
1990
-
digits they match in order to make the rest of the pattern
2009
+
digits they match in order to make the rest of the pattern
1991
2010
match, (?&gt;\d+) can only match an entire sequence of digits.
1992
2011
</para>
1993
2012
<para>
1994
-
This construction can of course contain arbitrarily complicated
2013
+
This construction can of course contain arbitrarily complicated
1995
2014
subpatterns, and it can be nested.
1996
2015
</para>
1997
2016
<para>
1998
2017
Once-only subpatterns can be used in conjunction with
1999
-
lookbehind assertions to specify efficient matching at the end
2018
+
lookbehind assertions to specify efficient matching at the end
2000
2019
of the subject string. Consider a simple pattern such as
2001
2020

2002
2021
<literal>abcd$</literal>
2003
2022

2004
-
when applied to a long string which does not match. Because
2005
-
matching proceeds from left to right, PCRE will look for
2023
+
when applied to a long string which does not match. Because
2024
+
matching proceeds from left to right, PCRE will look for
2006
2025
each "a" in the subject and then see if what follows matches
2007
2026
the rest of the pattern. If the pattern is specified as
2008
2027

2009
2028
<literal>^.*abcd$</literal>
2010
2029

2011
-
then the initial .* matches the entire string at first, but
2012
-
when this fails (because there is no following "a"), it
2030
+
then the initial .* matches the entire string at first, but
2031
+
when this fails (because there is no following "a"), it
2013
2032
backtracks to match all but the last character, then all but
2014
-
the last two characters, and so on. Once again the search
2015
-
for "a" covers the entire string, from right to left, so we
2033
+
the last two characters, and so on. Once again the search
2034
+
for "a" covers the entire string, from right to left, so we
2016
2035
are no better off. However, if the pattern is written as
2017
2036

2018
2037
<literal>^(?>.*)(?&lt;=abcd)</literal>
2019
2038

2020
-
then there can be no backtracking for the .* item; it can
2021
-
match only the entire string. The subsequent lookbehind
2039
+
then there can be no backtracking for the .* item; it can
2040
+
match only the entire string. The subsequent lookbehind
2022
2041
assertion does a single test on the last four characters. If
2023
-
it fails, the match fails immediately. For long strings,
2042
+
it fails, the match fails immediately. For long strings,
2024
2043
this approach makes a significant difference to the processing time.
2025
2044
</para>
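<para>
 As an illustration, the last pattern can be tried with
 <function>preg_match</function>. This is only a sketch; both subject strings
 are assumed sample data and contain no newlines.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php

// The once-only group consumes the whole line; the lookbehind then
// inspects the last four characters exactly once.
$pattern = '/^(?>.*)(?<=abcd)/';

var_dump(preg_match($pattern, 'xxxxabcd')); // int(1), subject ends with "abcd"
var_dump(preg_match($pattern, 'abcdxxxx')); // int(0), subject does not end with "abcd"
]]>
 </programlisting>
</informalexample>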
2026
2045
<para>
2027
2046
When a pattern contains an unlimited repeat inside a subpattern
2028
2047
that can itself be repeated an unlimited number of
2029
-
times, the use of a once-only subpattern is the only way to
2030
-
avoid some failing matches taking a very long time indeed.
2048
+
times, the use of a once-only subpattern is the only way to
2049
+
avoid some failing matches taking a very long time indeed.
2031
2050
The pattern
2032
2051

2033
2052
<literal>(\D+|&lt;\d+>)*[!?]</literal>
2034
2053

2035
-
matches an unlimited number of substrings that either consist
2036
-
of non-digits, or digits enclosed in &lt;>, followed by
2054
+
matches an unlimited number of substrings that either consist
2055
+
of non-digits, or digits enclosed in &lt;>, followed by
2037
2056
either ! or ?. When it matches, it runs quickly. However, if
2038
2057
it is applied to
2039
2058

2040
2059
<literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
2041
2060

2042
-
it takes a long time before reporting failure. This is
2061
+
it takes a long time before reporting failure. This is
2043
2062
because the string can be divided between the two repeats in
2044
2063
a large number of ways, and all have to be tried. (The example
2045
-
used [!?] rather than a single character at the end,
2046
-
because both PCRE and Perl have an optimization that allows
2047
-
for fast failure when a single character is used. They
2048
-
remember the last single character that is required for a
2049
-
match, and fail early if it is not present in the string.)
2064
+
used [!?] rather than a single character at the end,
2065
+
because both PCRE and Perl have an optimization that allows
2066
+
for fast failure when a single character is used. They
2067
+
remember the last single character that is required for a
2068
+
match, and fail early if it is not present in the string.)
2050
2069
If the pattern is changed to
2051
2070

2052
2071
<literal>((?>\D+)|&lt;\d+>)*[!?]</literal>
2053
2072

2054
-
sequences of non-digits cannot be broken, and failure happens quickly.
2073
+
sequences of non-digits cannot be broken, and failure happens quickly.
2055
2074
</para>
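<para>
 A minimal sketch of the once-only version follows. The subject is built with
 <function>str_repeat</function> to mirror the string of "a" characters shown
 above; the second subject, with a trailing "!", is an assumption added to
 show a successful match.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php

// Once-only version of the pattern: a run of non-digits cannot be split up.
$pattern = '/((?>\D+)|<\d+>)*[!?]/';
$subject = str_repeat('a', 52);

// No "!" or "?" in the subject: the match fails without excessive backtracking.
var_dump(preg_match($pattern, $subject));       // int(0)

// With a terminating "!" the pattern matches.
var_dump(preg_match($pattern, $subject . '!')); // int(1)
]]>
 </programlisting>
</informalexample>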
2056
2075
</section>
2057
2076

2058
2077
<section xml:id="regexp.reference.conditional">
2059
2078
<title>Conditional subpatterns</title>
2060
2079
<para>
2061
-
It is possible to cause the matching process to obey a subpattern
2062
-
conditionally or to choose between two alternative
2063
-
subpatterns, depending on the result of an assertion, or
2064
-
whether a previous capturing subpattern matched or not. The
2080
+
It is possible to cause the matching process to obey a subpattern
2081
+
conditionally or to choose between two alternative
2082
+
subpatterns, depending on the result of an assertion, or
2083
+
whether a previous capturing subpattern matched or not. The
2065
2084
two possible forms of conditional subpattern are
2066
2085
</para>
2067
2086

...
...
@@ -2075,39 +2094,39 @@
2075
2094
</informalexample>
2076
2095
<para>
2077
2096
If the condition is satisfied, the yes-pattern is used; otherwise
2078
-
the no-pattern (if present) is used. If there are
2097
+
the no-pattern (if present) is used. If there are
2079
2098
more than two alternatives in the subpattern, a compile-time
2080
2099
error occurs.
2081
2100
</para>
2082
2101
<para>
2083
-
There are two kinds of condition. If the text between the
2084
-
parentheses consists of a sequence of digits, then the
2085
-
condition is satisfied if the capturing subpattern of that
2086
-
number has previously matched. Consider the following pattern,
2087
-
which contains non-significant white space to make it
2088
-
more readable (assume the <link
2102
+
There are two kinds of condition. If the text between the
2103
+
parentheses consists of a sequence of digits, then the
2104
+
condition is satisfied if the capturing subpattern of that
2105
+
number has previously matched. Consider the following pattern,
2106
+
which contains non-significant white space to make it
2107
+
more readable (assume the <link
2089
2108
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2090
-
option) and to divide it into three parts for ease of discussion:
2109
+
option) and to divide it into three parts for ease of discussion:
2091
2110
</para>
2092
2111
<informalexample>
2093
2112
<programlisting>
2094
2113
<![CDATA[
2095
-
( \( )? [^()]+ (?(1) \) )
2114
+
( \( )? [^()]+ (?(1) \) )
2096
2115
]]>
2097
2116
</programlisting>
2098
2117
</informalexample>
2099
2118
<para>
2100
-
The first part matches an optional opening parenthesis, and
2101
-
if that character is present, sets it as the first captured
2102
-
substring. The second part matches one or more characters
2103
-
that are not parentheses. The third part is a conditional
2104
-
subpattern that tests whether the first set of parentheses
2105
-
matched or not. If they did, that is, if subject started
2106
-
with an opening parenthesis, the condition is &true;, and so
2107
-
the yes-pattern is executed and a closing parenthesis is
2108
-
required. Otherwise, since no-pattern is not present, the
2109
-
subpattern matches nothing. In other words, this pattern
2110
-
matches a sequence of non-parentheses, optionally enclosed
2119
+
The first part matches an optional opening parenthesis, and
2120
+
if that character is present, sets it as the first captured
2121
+
substring. The second part matches one or more characters
2122
+
that are not parentheses. The third part is a conditional
2123
+
subpattern that tests whether the first set of parentheses
2124
+
matched or not. If they did, that is, if the subject started
2125
+
with an opening parenthesis, the condition is &true;, and so
2126
+
the yes-pattern is executed and a closing parenthesis is
2127
+
required. Otherwise, since no-pattern is not present, the
2128
+
subpattern matches nothing. In other words, this pattern
2129
+
matches a sequence of non-parentheses, optionally enclosed
2111
2130
in parentheses.
2112
2131
</para>
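<para>
 The behaviour described above can be checked with a short script. This is a
 sketch under stated assumptions: the <literal>^</literal> and
 <literal>$</literal> anchors are not part of the pattern discussed above and
 are added here so that the whole subject has to match, and the subject
 strings are sample data.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php

// The x modifier (PCRE_EXTENDED) makes the white space in the pattern non-significant.
// The ^ and $ anchors are an addition for this sketch.
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

var_dump(preg_match($pattern, 'abc'));   // int(1), no parentheses at all
var_dump(preg_match($pattern, '(abc)')); // int(1), opening and closing parenthesis
var_dump(preg_match($pattern, '(abc'));  // int(0), closing parenthesis is required
var_dump(preg_match($pattern, 'abc)'));  // int(0), group 1 did not match, so ")" is not consumed
]]>
 </programlisting>
</informalexample>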
2113
2132
<para>
...
...
@@ -2116,10 +2135,10 @@
2116
2135
level", the condition is false.
2117
2136
</para>
2118
2137
<para>
2119
-
If the condition is not a sequence of digits or (R), it must be an
2120
-
assertion. This may be a positive or negative lookahead or
2121
-
lookbehind assertion. Consider this pattern, again containing
2122
-
non-significant white space, and with the two alternatives on
2138
+
If the condition is not a sequence of digits or (R), it must be an
2139
+
assertion. This may be a positive or negative lookahead or
2140
+
lookbehind assertion. Consider this pattern, again containing
2141
+
non-significant white space, and with the two alternatives on
2123
2142
the second line:
2124
2143
</para>
2125
2144

...
...
@@ -2127,18 +2146,18 @@
2127
2146
<programlisting>
2128
2147
<![CDATA[
2129
2148
(?(?=[^a-z]*[a-z])
2130
-
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2149
+
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2131
2150
]]>
2132
2151
</programlisting>
2133
2152
</informalexample>
2134
2153
<para>
2135
2154
The condition is a positive lookahead assertion that matches
2136
2155
an optional sequence of non-letters followed by a letter. In
2137
-
other words, it tests for the presence of at least one
2138
-
letter in the subject. If a letter is found, the subject is
2139
-
matched against the first alternative; otherwise it is
2140
-
matched against the second. This pattern matches strings in
2141
-
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2156
+
other words, it tests for the presence of at least one
2157
+
letter in the subject. If a letter is found, the subject is
2158
+
matched against the first alternative; otherwise it is
2159
+
matched against the second. This pattern matches strings in
2160
+
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2142
2161
letters and dd are digits.
2143
2162
</para>
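<para>
 The following sketch applies the pattern to a few sample subjects. The
 <literal>^</literal> and <literal>$</literal> anchors and the subject strings
 are assumptions made for illustration only.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php

// The lookahead decides which alternative is tried; there is no fallback
// to the other alternative once the condition has been evaluated.
$pattern = '/^(?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

var_dump(preg_match($pattern, '12-abc-34')); // int(1), form dd-aaa-dd
var_dump(preg_match($pattern, '12-34-56'));  // int(1), form dd-dd-dd
var_dump(preg_match($pattern, '12-abc-3'));  // int(0), a letter is present but the first form does not fit
]]>
 </programlisting>
</informalexample>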
2144
2163
</section>
...
...
@@ -2146,31 +2165,66 @@
2146
2165
<section xml:id="regexp.reference.comments">
2147
2166
<title>Comments</title>
2148
2167
<para>
2149
-
The sequence (?# marks the start of a comment which
2150
-
continues up to the next closing parenthesis. Nested
2168
+
The sequence (?# marks the start of a comment which
2169
+
continues up to the next closing parenthesis. Nested
2151
2170
parentheses are not permitted. The characters that make up a
2152
2171
comment play no part in the pattern matching at all.
2153
2172
</para>
2154
2173
<para>
2155
2174
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2156
-
option is set, an unescaped # character outside a character class
2175
+
option is set, an unescaped # character outside a character class
2157
2176
introduces a comment that continues up to the next newline character
2158
2177
in the pattern.
2159
2178
</para>
2179
+
<para>
2180
+
<example>
2181
+
<title>Usage of comments in PCRE pattern</title>
2182
+
<programlisting role="php">
2183
+
<![CDATA[
2184
+
<?php
2185
+

2186
+
$subject = 'test';
2187
+

2188
+
/* (?# can be used to add comments without enabling PCRE_EXTENDED */
2189
+
$match = preg_match('/te(?# this is a comment)st/', $subject);
2190
+
var_dump($match);
2191
+

2192
+
/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */
2193
+
$match = preg_match('/te #~~~~
2194
+
st/', $subject);
2195
+
var_dump($match);
2196
+

2197
+
/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything
2198
+
that follows an unescaped # on the same line are ignored */
2199
+
$match = preg_match('/te #~~~~
2200
+
st/x', $subject);
2201
+
var_dump($match);
2202
+
]]>
2203
+
</programlisting>
2204
+
&example.outputs;
2205
+
<screen>
2206
+
<![CDATA[
2207
+
int(1)
2208
+
int(0)
2209
+
int(1)
2210
+
]]>
2211
+
</screen>
2212
+
</example>
2213
+
</para>
2160
2214
</section>
2161
2215

2162
2216
<section xml:id="regexp.reference.recursive">
2163
2217
<title>Recursive patterns</title>
2164
2218
<para>
2165
-
Consider the problem of matching a string in parentheses,
2166
-
allowing for unlimited nested parentheses. Without the use
2167
-
of recursion, the best that can be done is to use a pattern
2168
-
that matches up to some fixed depth of nesting. It is not
2169
-
possible to handle an arbitrary nesting depth. Perl 5.6 has
2170
-
provided an experimental facility that allows regular
2171
-
expressions to recurse (among other things). The special
2172
-
item (?R) is provided for the specific case of recursion.
2173
-
This PCRE pattern solves the parentheses problem (assume
2219
+
Consider the problem of matching a string in parentheses,
2220
+
allowing for unlimited nested parentheses. Without the use
2221
+
of recursion, the best that can be done is to use a pattern
2222
+
that matches up to some fixed depth of nesting. It is not
2223
+
possible to handle an arbitrary nesting depth. Perl 5.6 has
2224
+
provided an experimental facility that allows regular
2225
+
expressions to recurse (among other things). The special
2226
+
item (?R) is provided for the specific case of recursion.
2227
+
This PCRE pattern solves the parentheses problem (assume
2174
2228
the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2175
2229
option is set so that white space is
2176
2230
ignored):
...
...
@@ -2179,45 +2233,45 @@
2179
2233
</para>
2180
2234
<para>
2181
2235
First it matches an opening parenthesis. Then it matches any
2182
-
number of substrings which can either be a sequence of
2183
-
non-parentheses, or a recursive match of the pattern itself
2236
+
number of substrings which can either be a sequence of
2237
+
non-parentheses, or a recursive match of the pattern itself
2184
2238
(i.e. a correctly parenthesized substring). Finally there is
2185
2239
a closing parenthesis.
2186
2240
</para>
2187
2241
<para>
2188
-
This particular example pattern contains nested unlimited
2242
+
This particular example pattern contains nested unlimited
2189
2243
repeats, and so the use of a once-only subpattern for matching
2190
-
strings of non-parentheses is important when applying
2191
-
the pattern to strings that do not match. For example, when
2244
+
strings of non-parentheses is important when applying
2245
+
the pattern to strings that do not match. For example, when
2192
2246
it is applied to
2193
2247

2194
2248
<literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
2195
2249

2196
-
it yields "no match" quickly. However, if a once-only subpattern
2197
-
is not used, the match runs for a very long time
2198
-
indeed because there are so many different ways the + and *
2199
-
repeats can carve up the subject, and all have to be tested
2250
+
it yields "no match" quickly. However, if a once-only subpattern
2251
+
is not used, the match runs for a very long time
2252
+
indeed because there are so many different ways the + and *
2253
+
repeats can carve up the subject, and all have to be tested
2200
2254
before failure can be reported.
2201
2255
</para>
2202
2256
<para>
2203
-
The values set for any capturing subpatterns are those from
2257
+
The values set for any capturing subpatterns are those from
2204
2258
the outermost level of the recursion at which the subpattern
2205
2259
value is set. If the pattern above is matched against
2206
2260

2207
2261
<literal>(ab(cd)ef)</literal>
2208
2262

2209
-
the value for the capturing parentheses is "ef", which is
2210
-
the last value taken on at the top level. If additional
2263
+
the value for the capturing parentheses is "ef", which is
2264
+
the last value taken on at the top level. If additional
2211
2265
parentheses are added, giving
2212
2266

2213
2267
<literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>
2214
2268
then the string they capture
2215
2269
is "ab(cd)ef", the contents of the top level parentheses. If
2216
-
there are more than 15 capturing parentheses in a pattern,
2217
-
PCRE has to obtain extra memory to store data during a
2218
-
recursion, which it does by using pcre_malloc, freeing it
2219
-
via pcre_free afterwards. If no memory can be obtained, it
2220
-
saves data for the first 15 capturing parentheses only, as
2270
+
there are more than 15 capturing parentheses in a pattern,
2271
+
PCRE has to obtain extra memory to store data during a
2272
+
recursion, which it does by using pcre_malloc, freeing it
2273
+
via pcre_free afterwards. If no memory can be obtained, it
2274
+
saves data for the first 15 capturing parentheses only, as
2221
2275
there is no way to give an out-of-memory error from within a
2222
2276
recursion.
2223
2277
</para>
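<para>
 The captured values for the pattern with the additional parentheses can be
 inspected as follows. This is a minimal sketch; the variable names are
 illustrative.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php

// Recursive pattern with the additional capturing parentheses; the x modifier
// (PCRE_EXTENDED) makes the white space non-significant.
$pattern = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';

preg_match($pattern, '(ab(cd)ef)', $matches);

var_dump($matches[0]); // "(ab(cd)ef)", the whole match
var_dump($matches[1]); // "ab(cd)ef", the contents of the top level parentheses
]]>
 </programlisting>
</informalexample>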
...
...
@@ -2256,75 +2310,75 @@
2256
2310
<title>Performance</title>
2257
2311
<para>
2258
2312
Certain items that may appear in patterns are more efficient
2259
-
than others. It is more efficient to use a character class
2313
+
than others. It is more efficient to use a character class
2260
2314
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
2261
-
In general, the simplest construction that provides the
2262
-
required behaviour is usually the most efficient. Jeffrey
2263
-
Friedl's book contains a lot of discussion about optimizing
2315
+
In general, the simplest construction that provides the
2316
+
required behaviour is usually the most efficient. Jeffrey
2317
+
Friedl's book contains a lot of discussion about optimizing
2264
2318
regular expressions for efficient performance.
2265
2319
</para>
2266
2320
<para>
2267
2321
When a pattern begins with .* and the <link
2268
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2269
-
set, the pattern is implicitly anchored by PCRE, since it
2322
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2323
+
set, the pattern is implicitly anchored by PCRE, since it
2270
2324
can match only at the start of a subject string. However, if
2271
2325
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
2272
2326
is not set, PCRE cannot make this optimization,
2273
-
because the . metacharacter does not then match a newline,
2327
+
because the . metacharacter does not then match a newline,
2274
2328
and if the subject string contains newlines, the pattern may
2275
-
match from the character immediately following one of them
2329
+
match from the character immediately following one of them
2276
2330
instead of from the very start. For example, the pattern
2277
2331

2278
2332
<literal>(.*) second</literal>
2279
2333

2280
2334
matches the subject "first\nand second" (where \n stands for
2281
2335
a newline character) with the first captured substring being
2282
-
"and". In order to do this, PCRE has to retry the match
2336
+
"and". In order to do this, PCRE has to retry the match
2283
2337
starting after every newline in the subject.
2284
2338
</para>
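<para>
 The effect of <literal>PCRE_DOTALL</literal> on this pattern can be seen by
 running it with and without the <literal>s</literal> modifier. A minimal
 sketch, using the subject from the text above:
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php

$subject = "first\nand second";

// Without PCRE_DOTALL the dot stops at the newline, so the match
// starts after the newline.
preg_match('/(.*) second/', $subject, $m);
var_dump($m[1]); // "and"

// With PCRE_DOTALL (the s modifier) the match starts at the very beginning.
preg_match('/(.*) second/s', $subject, $m);
var_dump($m[1]); // "first\nand" (containing a real newline)
]]>
 </programlisting>
</informalexample>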
2285
2339
<para>
2286
2340
If you are using such a pattern with subject strings that do
2287
-
not contain newlines, the best performance is obtained by
2341
+
not contain newlines, the best performance is obtained by
2288
2342
setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,
2289
-
or starting the pattern with ^.* to
2290
-
indicate explicit anchoring. That saves PCRE from having to
2343
+
or starting the pattern with ^.* to
2344
+
indicate explicit anchoring. That saves PCRE from having to
2291
2345
scan along the subject looking for a newline to restart at.
2292
2346
</para>
2293
2347
<para>
2294
-
Beware of patterns that contain nested indefinite repeats.
2295
-
These can take a long time to run when applied to a string
2348
+
Beware of patterns that contain nested indefinite repeats.
2349
+
These can take a long time to run when applied to a string
2296
2350
that does not match. Consider the pattern fragment
2297
2351

2298
2352
<literal>(a+)*</literal>
2299
2353
</para>
2300
2354
<para>
2301
-
This can match "aaaa" in 33 different ways, and this number
2302
-
increases very rapidly as the string gets longer. (The *
2303
-
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2304
-
those cases other than 0, the + repeats can match different
2355
+
This can match "aaaa" in 16 different ways, and this number
2356
+
increases very rapidly as the string gets longer. (The *
2357
+
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2358
+
those cases other than 0, the + repeats can match different
2305
2359
numbers of times.) When the remainder of the pattern is such
2306
-
that the entire match is going to fail, PCRE has in principle
2307
-
to try every possible variation, and this can take an
2360
+
that the entire match is going to fail, PCRE has in principle
2361
+
to try every possible variation, and this can take an
2308
2362
extremely long time.
2309
2363
</para>
2310
2364
<para>
2311
-
An optimization catches some of the more simple cases such
2365
+
An optimization catches some of the more simple cases such
2312
2366
as
2313
2367

2314
2368
<literal>(a+)*b</literal>
2315
2369

2316
-
where a literal character follows. Before embarking on the
2370
+
where a literal character follows. Before embarking on the
2317
2371
standard matching procedure, PCRE checks that there is a "b"
2318
-
later in the subject string, and if there is not, it fails
2319
-
the match immediately. However, when there is no following
2320
-
literal this optimization cannot be used. You can see the
2372
+
later in the subject string, and if there is not, it fails
2373
+
the match immediately. However, when there is no following
2374
+
literal this optimization cannot be used. You can see the
2321
2375
difference by comparing the behaviour of
2322
2376

2323
2377
<literal>(a+)*\d</literal>
2324
2378

2325
-
with the pattern above. The former gives a failure almost
2326
-
instantly when applied to a whole line of "a" characters,
2327
-
whereas the latter takes an appreciable time with strings
2379
+
with the pattern above. The latter gives a failure almost
2380
+
instantly when applied to a whole line of "a" characters,
2381
+
whereas the former takes an appreciable time with strings
2328
2382
longer than about 20 characters.
2329
2383
</para>
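<para>
 The two patterns can be compared with a rough timing script such as the
 sketch below. The subject length of 25 is an arbitrary assumption, and no
 output is shown because the timings and the result for the second pattern
 depend on the PCRE version, on whether JIT is enabled, and on the
 <literal>pcre.backtrack_limit</literal> setting. When a limit is reached,
 <function>preg_match</function> returns &false; and
 <function>preg_last_error</function> reports a non-zero error code.
</para>
<informalexample>
 <programlisting role="php">
<![CDATA[
<?php

// A whole line of "a" characters, so neither pattern can match.
$subject = str_repeat('a', 25);

foreach (['/(a+)*b/', '/(a+)*\d/'] as $pattern) {
    $start   = microtime(true);
    $result  = preg_match($pattern, $subject);
    $elapsed = (microtime(true) - $start) * 1000;

    // $result is 0 for "no match" or false if a matching limit was hit.
    printf(
        "%-10s result=%s error=%d time=%.3f ms\n",
        $pattern,
        var_export($result, true),
        preg_last_error(),
        $elapsed
    );
}
]]>
 </programlisting>
</informalexample>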
2330
2384
</section>
2331
2385