reference/pcre/pattern.syntax.xml
77fe733a1ba9c961424adcb7c9af00c1f5443a77
...
...
@@ -8,21 +8,21 @@
8
8
<section xml:id="regexp.introduction">
9
9
<title>Introduction</title>
10
10
<para>
11
-
The syntax and semantics of the regular expressions
12
-
supported by PCRE are described below. Regular expressions are
13
-
also described in the Perl documentation and in a number of
14
-
other books, some of which have copious examples. Jeffrey
15
-
Friedl's "Mastering Regular Expressions", published by
16
-
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
11
+
The syntax and semantics of the regular expressions
12
+
supported by PCRE are described below. Regular expressions are
13
+
also described in the Perl documentation and in a number of
14
+
other books, some of which have copious examples. Jeffrey
15
+
Friedl's "Mastering Regular Expressions", published by
16
+
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
17
17
The description here is intended as reference documentation.
18
18
</para>
19
19
<para>
20
-
A regular expression is a pattern that is matched against a
20
+
A regular expression is a pattern that is matched against a
21
21
subject string from left to right. Most characters stand for
22
22
themselves in a pattern, and match the corresponding
23
23
characters in the subject. As a trivial example, the pattern
24
24
<literal>The quick brown fox</literal>
25
-
matches a portion of a subject string that is identical to
25
+
matches a portion of a subject string that is identical to
26
26
itself.
27
27
</para>
28
28
</section>
...
...
@@ -32,6 +32,7 @@
32
32
When using the PCRE functions, it is required that the pattern is enclosed
33
33
by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,
34
34
non-backslash, non-whitespace character.
35
+
Leading whitespace before a valid delimiter is silently ignored.
35
36
</para>
36
37
<para>
37
38
Often used delimiters are forward slashes (<literal>/</literal>), hash
...
...
@@ -101,15 +102,15 @@
101
102
<section xml:id="regexp.reference.meta">
102
103
<title>Meta-characters</title>
103
104
<para>
104
-
The power of regular expressions comes from the
105
+
The power of regular expressions comes from the
105
106
ability to include alternatives and repetitions in the
106
-
pattern. These are encoded in the pattern by the use of
107
-
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
107
+
pattern. These are encoded in the pattern by the use of
108
+
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
108
109
are interpreted in some special way.
109
110
</para>
110
111
<para>
111
-
There are two different sets of meta-characters: those that
112
-
are recognized anywhere in the pattern except within square
112
+
There are two different sets of meta-characters: those that
113
+
are recognized anywhere in the pattern except within square
113
114
brackets, and those that are recognized in square brackets.
114
115
Outside square brackets, the meta-characters are as follows:
115
116

...
...
@@ -129,7 +130,8 @@
129
130
<entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
130
131
</row>
131
132
<row>
132
-
<entry>$</entry><entry>assert end of subject or before a terminating newline (or end of line, in multiline mode)</entry>
133
+
<entry>$</entry><entry>assert end of subject or before a terminating newline (or
134
+
end of line, in multiline mode)</entry>
133
135
</row>
134
136
<row>
135
137
<entry>.</entry><entry>match any character except newline (by default)</entry>
...
...
@@ -203,9 +205,9 @@
203
205
<section xml:id="regexp.reference.escape">
204
206
<title>Escape sequences</title>
205
207
<para>
206
-
The backslash character has several uses. Firstly, if it is
208
+
The backslash character has several uses. Firstly, if it is
207
209
followed by a non-alphanumeric character, it takes away any
208
-
special meaning that character may have. This use of
210
+
special meaning that character may have. This use of
209
211
backslash as an escape character applies both inside and
210
212
outside character classes.
211
213
</para>
...
...
@@ -214,7 +216,7 @@
214
216
"\*" in the pattern. This applies whether or not the
215
217
following character would otherwise be interpreted as a
216
218
meta-character, so it is always safe to precede a non-alphanumeric
217
-
with "\" to specify that it stands for itself. In
219
+
with "\" to specify that it stands for itself. In
218
220
particular, if you want to match a backslash, you write "\\".
219
221
</para>
220
222
<note>
...
...
@@ -236,10 +238,10 @@
236
238
<para>
237
239
A second use of backslash provides a way of encoding
238
240
non-printing characters in patterns in a visible manner. There
239
-
is no restriction on the appearance of non-printing characters,
241
+
is no restriction on the appearance of non-printing characters,
240
242
apart from the binary zero that terminates a pattern,
241
243
but when a pattern is being prepared by text editing, it is
242
-
usually easier to use one of the following escape sequences
244
+
usually easier to use one of the following escape sequences
243
245
than the binary character it represents:
244
246
</para>
245
247
<para>
...
...
@@ -330,9 +332,9 @@
330
332
</para>
331
333
<para>
332
334
The precise effect of "<literal>\cx</literal>" is as follows:
333
-
if "<literal>x</literal>" is a lower case letter, it is converted
335
+
if "<literal>x</literal>" is a lower case letter, it is converted
334
336
to upper case. Then bit 6 of the character (hex 40) is inverted.
335
-
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
+
Thus "<literal>\cz</literal>" becomes hex 1A, but
336
338
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
337
339
becomes hex 7B.
338
340
</para>
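The rule for "\cx" described above is plain bit arithmetic, so it can be checked without a regex engine. The following is a minimal sketch: the helper name `control_escape_value` is hypothetical (it is not part of PCRE or PHP), and Python is used only as a convenient calculator.

```python
def control_escape_value(x: str) -> int:
    """Return the character code that "\\cx" stands for, per the rule above."""
    if len(x) != 1:
        raise ValueError("expected exactly one character after \\c")
    # Step 1: a lower case ASCII letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return ord(x) ^ 0x40


if __name__ == "__main__":
    for ch in ("z", "{", ";"):
        print(f"\\c{ch} -> hex {control_escape_value(ch):02X}")
```

Under these assumptions the three inputs mentioned in the text, "z", "{" and ";", give hex 1A, 3B and 7B respectively.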
...
...
@@ -348,7 +350,7 @@
348
350
</para>
349
351
<para>
350
352
After "<literal>\0</literal>" up to two further octal digits are read.
351
-
In both cases, if there are fewer than two digits, just those that
353
+
In both cases, if there are fewer than two digits, just those that
352
354
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
353
355
specifies two binary zeros followed by a BEL character. Make sure you
354
356
supply two digits after the initial zero if the character
...
...
@@ -357,20 +359,20 @@
357
359
<para>
358
360
The handling of a backslash followed by a digit other than 0
359
361
is complicated. Outside a character class, PCRE reads it
360
-
and any following digits as a decimal number. If the number
361
-
is less than 10, or if there have been at least that many
362
-
previous capturing left parentheses in the expression, the
363
-
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
364
-
of how this works is given later, following the discussion
362
+
and any following digits as a decimal number. If the number
363
+
is less than 10, or if there have been at least that many
364
+
previous capturing left parentheses in the expression, the
365
+
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
366
+
of how this works is given later, following the discussion
365
367
of parenthesized subpatterns.
366
368
</para>
367
369
<para>
368
-
Inside a character class, or if the decimal number is
370
+
Inside a character class, or if the decimal number is
369
371
greater than 9 and there have not been that many capturing
370
372
subpatterns, PCRE re-reads up to three octal digits following
371
373
the backslash, and generates a single byte from the
372
374
least significant 8 bits of the value. Any subsequent digits
373
-
stand for themselves. For example:
375
+
stand for themselves. For example:
374
376
</para>
375
377
<para>
376
378
<variablelist>
...
...
@@ -438,7 +440,7 @@
438
440
digits are ever read.
439
441
</para>
440
442
<para>
441
-
All the sequences that define a single byte value can be
443
+
All the sequences that define a single byte value can be
442
444
used both inside and outside character classes. In addition,
443
445
inside a character class, the sequence "<literal>\b</literal>"
444
446
is interpreted as the backspace character (hex 08). Outside a character
...
...
@@ -460,11 +462,11 @@
460
462
</varlistentry>
461
463
<varlistentry>
462
464
<term><emphasis>\h</emphasis></term>
463
-
<listitem><simpara>any horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
465
+
<listitem><simpara>any horizontal whitespace character</simpara></listitem>
464
466
</varlistentry>
465
467
<varlistentry>
466
468
<term><emphasis>\H</emphasis></term>
467
-
<listitem><simpara>any character that is not a horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
469
+
<listitem><simpara>any character that is not a horizontal whitespace character</simpara></listitem>
468
470
</varlistentry>
469
471
<varlistentry>
470
472
<term><emphasis>\s</emphasis></term>
...
...
@@ -476,11 +478,11 @@
476
478
</varlistentry>
477
479
<varlistentry>
478
480
<term><emphasis>\v</emphasis></term>
479
-
<listitem><simpara>any vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
481
+
<listitem><simpara>any vertical whitespace character</simpara></listitem>
480
482
</varlistentry>
481
483
<varlistentry>
482
484
<term><emphasis>\V</emphasis></term>
483
-
<listitem><simpara>any character that is not a vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
485
+
<listitem><simpara>any character that is not a vertical whitespace character</simpara></listitem>
484
486
</varlistentry>
485
487
<varlistentry>
486
488
<term><emphasis>\w</emphasis></term>
...
...
@@ -505,7 +507,7 @@
505
507
</para>
506
508
<para>
507
509
A "word" character is any letter or digit or the underscore
508
-
character, that is, any character which can be part of a
510
+
character, that is, any character which can be part of a
509
511
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
510
512
controlled by PCRE's character tables, and may vary if locale-specific
511
513
matching is taking place. For example, in the "fr" (French) locale, some
...
...
@@ -514,15 +516,15 @@
514
516
</para>
515
517
<para>
516
518
These character type sequences can appear both inside and
517
-
outside character classes. They each match one character of
518
-
the appropriate type. If the current matching point is at
519
+
outside character classes. They each match one character of
520
+
the appropriate type. If the current matching point is at
519
521
the end of the subject string, all of them fail, since there
520
522
is no character to match.
521
523
</para>
522
524
<para>
523
-
The fourth use of backslash is for certain simple
525
+
The fourth use of backslash is for certain simple
524
526
assertions. An assertion specifies a condition that has to be met
525
-
at a particular point in a match, without consuming any
527
+
at a particular point in a match, without consuming any
526
528
characters from the subject string. The use of subpatterns
527
529
for more complicated assertions is described below. The
528
530
backslashed assertions are
...
...
@@ -561,7 +563,7 @@
561
563
</variablelist>
562
564
</para>
563
565
<para>
564
-
These assertions may not appear in character classes (but
566
+
These assertions may not appear in character classes (but
565
567
note that "<literal>\b</literal>" has a different meaning, namely the backspace
566
568
character, inside a character class).
567
569
</para>
...
...
@@ -569,20 +571,20 @@
569
571
A word boundary is a position in the subject string where
570
572
the current character and the previous character do not both
571
573
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
572
-
<literal>\w</literal> and the other matches
574
+
<literal>\w</literal> and the other matches
573
575
<literal>\W</literal>), or the start or end of the string if the first
574
576
or last character matches <literal>\w</literal>, respectively.
575
577
</para>
576
578
<para>
577
579
The <literal>\A</literal>, <literal>\Z</literal>, and
578
-
<literal>\z</literal> assertions differ from the traditional
579
-
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>) in that they only
580
-
ever match at the very start and end of the subject string,
581
-
whatever options are set. They are not affected by the
580
+
<literal>\z</literal> assertions differ from the traditional
581
+
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)
582
+
in that they only ever match at the very start and end of the subject string,
583
+
whatever options are set. They are not affected by the
582
584
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
583
585
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
584
-
options. The difference between <literal>\Z</literal> and
585
-
<literal>\z</literal> is that <literal>\Z</literal> matches before a
586
+
options. The difference between <literal>\Z</literal> and
587
+
<literal>\z</literal> is that <literal>\Z</literal> matches before a
586
588
newline that is the last character of the string as well as at the end of
587
589
the string, whereas <literal>\z</literal> matches only at the end.
588
590
</para>
...
...
@@ -599,12 +601,16 @@
599
601
regexp metacharacters in the pattern. For example:
600
602
<literal>\w+\Q.$.\E$</literal> will match one or more word characters,
601
603
followed by literals <literal>.$.</literal> and anchored at the end of
602
-
the string.
604
+
the string. Note that this does not change the behavior of
605
+
delimiters; for instance the pattern <literal>#\Q#\E#$</literal>
606
+
is not valid, because the second <literal>#</literal> marks the end
607
+
of the pattern, and the <literal>\E#</literal> is interpreted as invalid
608
+
modifiers.
603
609
</para>
604
610

605
611
<para>
606
-
<literal>\K</literal> can be used to reset the match start since
607
-
PHP 5.2.4. For example, the pattern <literal>foo\Kbar</literal> matches
612
+
<literal>\K</literal> can be used to reset the match start.
613
+
For example, the pattern <literal>foo\Kbar</literal> matches
608
614
"foobar", but reports that it has matched "bar". The use of
609
615
<literal>\K</literal> does not interfere with the setting of captured
610
616
substrings. For example, when the pattern <literal>(foo)\Kbar</literal>
...
...
@@ -868,8 +874,8 @@
868
874
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
869
875
</para>
870
876
<para>
871
-
Sets of Unicode characters are defined as belonging to certain scripts. A
872
-
character from one of these sets can be matched using a script name. For
877
+
Sets of Unicode characters are defined as belonging to certain scripts. A
878
+
character from one of these sets can be matched using a script name. For
873
879
example:
874
880
</para>
875
881
<itemizedlist>
...
...
@@ -881,7 +887,7 @@
881
887
</listitem>
882
888
</itemizedlist>
883
889
<para>
884
-
Those that are not part of an identified script are lumped together as
890
+
Those that are not part of an identified script are lumped together as
885
891
<literal>Common</literal>. The current list of scripts is:
886
892
</para>
887
893
<table>
...
...
@@ -1050,7 +1056,7 @@
1050
1056
<para>
1051
1057
In versions of PCRE older than 8.32 (which corresponds to PHP versions
1052
1058
before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>
1053
-
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1059
+
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1054
1060
character without the "mark" property, followed by zero or more characters
1055
1061
with the "mark" property, and treats the sequence as an atomic group (see
1056
1062
below). Characters with the "mark" property are typically accents that
...
...
@@ -1070,8 +1076,8 @@
1070
1076
<para>
1071
1077
Outside a character class, in the default matching mode, the
1072
1078
circumflex character (<literal>^</literal>) is an assertion which
1073
-
is true only if the current matching point is at the start of
1074
-
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1079
+
is true only if the current matching point is at the start of
1080
+
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1075
1081
has an entirely different meaning (see below).
1076
1082
</para>
1077
1083
<para>
...
...
@@ -1086,12 +1092,12 @@
1086
1092
</para>
1087
1093
<para>
1088
1094
A dollar character (<literal>$</literal>) is an assertion which is
1089
-
&true; only if the current matching point is at the end of the subject
1090
-
string, or immediately before a newline character that is the last
1095
+
&true; only if the current matching point is at the end of the subject
1096
+
string, or immediately before a newline character that is the last
1091
1097
character in the string (by default). Dollar (<literal>$</literal>)
1092
-
need not be the last character of the pattern if a number of
1093
-
alternatives are involved, but it should be the last item in any branch
1094
-
in which it appears. Dollar has no special meaning in a
1098
+
need not be the last character of the pattern if a number of
1099
+
alternatives are involved, but it should be the last item in any branch
1100
+
in which it appears. Dollar has no special meaning in a
1095
1101
character class.
1096
1102
</para>
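The default behaviour of the dollar assertion can be sketched as follows. Python's `re` module is used here purely as a stand-in engine (an assumption of convenience, not PHP's `preg_*` functions, which also require delimiters). Its default `$` follows the same rule as the one described above.

```python
import re

pattern = re.compile(r"fox$")

# True at the very end of the subject.
at_end = pattern.search("quick brown fox")

# Also true immediately before a newline that is the last character.
before_final_newline = pattern.search("brown fox\n")

# Not true when the newline after "fox" is not the last character.
not_last_newline = pattern.search("brown fox\n\n")

# Not true in the middle of the subject (multiline mode is not set).
mid_subject = pattern.search("fox\njumps")

# With alternatives, "$" is the last item of each branch it appears in.
branches = re.compile(r"cat$|dog$")
```

Here `at_end` and `before_final_newline` are match objects, while `not_last_newline` and `mid_subject` are `None`.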
1097
1103
<para>
...
...
@@ -1117,9 +1123,9 @@
1117
1123
set.
1118
1124
</para>
1119
1125
<para>
1120
-
Note that the sequences \A, \Z, and \z can be used to match
1121
-
the start and end of the subject in both modes, and if all
1122
-
branches of a pattern start with \A it is always anchored,
1126
+
Note that the sequences \A, \Z, and \z can be used to match
1127
+
the start and end of the subject in both modes, and if all
1128
+
branches of a pattern start with \A it is always anchored,
1123
1129
whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1124
1130
is set or not.
1125
1131
</para>
...
...
@@ -1128,14 +1134,14 @@
1128
1134
<section xml:id="regexp.reference.dot">
1129
1135
<title>Dot</title>
1130
1136
<para>
1131
-
Outside a character class, a dot in the pattern matches any
1132
-
one character in the subject, including a non-printing
1133
-
character, but not (by default) newline. If the
1137
+
Outside a character class, a dot in the pattern matches any
1138
+
one character in the subject, including a non-printing
1139
+
character, but not (by default) newline. If the
1134
1140
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1135
-
option is set, then dots match newlines as well. The
1141
+
option is set, then dots match newlines as well. The
1136
1142
handling of dot is entirely independent of the handling of
1137
-
circumflex and dollar, the only relationship being that they
1138
-
both involve newline characters. Dot has no special meaning
1143
+
circumflex and dollar, the only relationship being that they
1144
+
both involve newline characters. Dot has no special meaning
1139
1145
in a character class.
1140
1146
</para>
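A short sketch of the dot rules above, again using Python's `re` module as a stand-in engine (an assumption; `re.DOTALL` plays the role of PCRE_DOTALL here, and PHP patterns would additionally need delimiters):

```python
import re

subject = "a\nb"

# By default the dot does not match a newline.
default_match = re.fullmatch(r"a.b", subject)

# With the dot-all option set, the dot matches a newline as well.
dotall_match = re.fullmatch(r"a.b", subject, flags=re.DOTALL)

# A non-printing character such as a tab is matched like any other.
tab_match = re.fullmatch(r"a.b", "a\tb")

# Inside a character class the dot has no special meaning.
literal_dot_only = re.fullmatch(r"a[.]b", "axb")
```

Only `default_match` and `literal_dot_only` are `None`; the other two succeed.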
1141
1147
<para>
...
...
@@ -1149,29 +1155,29 @@
1149
1155
<title>Character classes</title>
1150
1156
<para>
1151
1157
An opening square bracket introduces a character class,
1152
-
terminated by a closing square bracket. A closing square
1153
-
bracket on its own is not special. If a closing square
1154
-
bracket is required as a member of the class, it should be
1158
+
terminated by a closing square bracket. A closing square
1159
+
bracket on its own is not special. If a closing square
1160
+
bracket is required as a member of the class, it should be
1155
1161
the first data character in the class (after an initial
1156
1162
circumflex, if present) or escaped with a backslash.
1157
1163
</para>
1158
1164
<para>
1159
1165
A character class matches a single character in the subject;
1160
-
the character must be in the set of characters defined by
1166
+
the character must be in the set of characters defined by
1161
1167
the class, unless the first character in the class is a
1162
-
circumflex, in which case the subject character must not be in
1163
-
the set defined by the class. If a circumflex is actually
1164
-
required as a member of the class, ensure it is not the
1168
+
circumflex, in which case the subject character must not be in
1169
+
the set defined by the class. If a circumflex is actually
1170
+
required as a member of the class, ensure it is not the
1165
1171
first character, or escape it with a backslash.
1166
1172
</para>
1167
1173
<para>
1168
-
For example, the character class [aeiou] matches any lower
1174
+
For example, the character class [aeiou] matches any lower
1169
1175
case vowel, while [^aeiou] matches any character that is not
1170
-
a lower case vowel. Note that a circumflex is just a
1171
-
convenient notation for specifying the characters which are in
1172
-
the class by enumerating those that are not. It is not an
1173
-
assertion: it still consumes a character from the subject
1174
-
string, and fails if the current pointer is at the end of
1176
+
a lower case vowel. Note that a circumflex is just a
1177
+
convenient notation for specifying the characters which are in
1178
+
the class by enumerating those that are not. It is not an
1179
+
assertion: it still consumes a character from the subject
1180
+
string, and fails if the current pointer is at the end of
1175
1181
the string.
1176
1182
</para>
1177
1183
<para>
...
...
@@ -1183,61 +1189,62 @@
1183
1189
</para>
1184
1190
<para>
1185
1191
The newline character is never treated in any special way in
1186
-
character classes, whatever the setting of the <link
1192
+
character classes, whatever the setting of the <link
1187
1193
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1188
1194
or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1189
1195
options is. A class such as [^a] will always match a newline.
1190
1196
</para>
1191
1197
<para>
1192
-
The minus (hyphen) character can be used to specify a range
1193
-
of characters in a character class. For example, [d-m]
1194
-
matches any letter between d and m, inclusive. If a minus
1195
-
character is required in a class, it must be escaped with a
1198
+
The minus (hyphen) character can be used to specify a range
1199
+
of characters in a character class. For example, [d-m]
1200
+
matches any letter between d and m, inclusive. If a minus
1201
+
character is required in a class, it must be escaped with a
1196
1202
backslash or appear in a position where it cannot be
1197
1203
interpreted as indicating a range, typically as the first or last
1198
1204
character in the class.
1199
1205
</para>
1200
1206
<para>
1201
-
It is not possible to have the literal character "]" as the
1202
-
end character of a range. A pattern such as [W-]46] is
1207
+
It is not possible to have the literal character "]" as the
1208
+
end character of a range. A pattern such as [W-]46] is
1203
1209
interpreted as a class of two characters ("W" and "-")
1204
1210
followed by a literal string "46]", so it would match "W46]" or
1205
-
"-46]". However, if the "]" is escaped with a backslash it
1206
-
is interpreted as the end of range, so [W-\]46] is
1207
-
interpreted as a single class containing a range followed by two
1211
+
"-46]". However, if the "]" is escaped with a backslash it
1212
+
is interpreted as the end of range, so [W-\]46] is
1213
+
interpreted as a single class containing a range followed by two
1208
1214
separate characters. The octal or hexadecimal representation
1209
1215
of "]" can also be used to end a range.
1210
1216
</para>
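The two readings of "]" near a hyphen can be compared side by side. This is a sketch using Python's `re` module as a stand-in engine (an assumption of convenience); it parses both classes in the way described above.

```python
import re

# "[W-]46]": the class ends at the first "]", so it holds just "W" and "-",
# and "46]" is a literal string that must follow.
two_char_class = re.compile(r"[W-]46]")

# "[W-\]46]": the escaped "]" ends the range W..], and "4" and "6" are
# two further members of the same single class.
range_class = re.compile(r"[W-\]46]")

if __name__ == "__main__":
    for text in ("W46]", "-46]", "Z", "]", "4", "5"):
        print(repr(text),
              bool(two_char_class.fullmatch(text)),
              bool(range_class.fullmatch(text)))
```

The first pattern needs a four-character subject such as "W46]", while the second always matches exactly one character.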
1211
1217
<para>
1212
1218
Ranges operate in ASCII collating sequence. They can also be
1213
-
used for characters specified numerically, for example
1214
-
[\000-\037]. If a range that includes letters is used when
1215
-
case-insensitive (caseless) matching is set, it matches the
1216
-
letters in either case. For example, [W-c] is equivalent to
1219
+
used for characters specified numerically, for example
1220
+
[\000-\037]. If a range that includes letters is used when
1221
+
case-insensitive (caseless) matching is set, it matches the
1222
+
letters in either case. For example, [W-c] is equivalent to
1217
1223
[][\^_`wxyzabc], matched case-insensitively, and if character
1218
1224
tables for the "fr" locale are in use, [\xc8-\xcb] matches
1219
1225
accented E characters in both cases.
1220
1226
</para>
1221
1227
<para>
1222
-
The character types \d, \D, \s, \S, \w, and \W may also
1223
-
appear in a character class, and add the characters that
1228
+
The character types \d, \D, \s, \S, \w, and \W may also
1229
+
appear in a character class, and add the characters that
1224
1230
they match to the class. For example, [\dABCDEF] matches any
1225
-
hexadecimal digit. A circumflex can conveniently be used
1226
-
with the upper case character types to specify a more
1231
+
hexadecimal digit. A circumflex can conveniently be used
1232
+
with the upper case character types to specify a more
1227
1233
restricted set of characters than the matching lower case type.
1228
-
For example, the class [^\W_] matches any letter or digit,
1234
+
For example, the class [^\W_] matches any letter or digit,
1229
1235
but not underscore.
1230
1236
</para>
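The two classes mentioned above can be tried out as follows. The sketch uses Python's `re` module as a stand-in engine, with `re.ASCII` set so that \d and \W are limited to ASCII; both choices are assumptions made for illustration, not part of PCRE or PHP.

```python
import re

# \d adds the ten digits to the class, alongside the listed letters.
hex_digit = re.compile(r"[\dABCDEF]", flags=re.ASCII)

# Negating \W leaves the word characters; listing "_" removes underscore too.
alnum_no_underscore = re.compile(r"[^\W_]", flags=re.ASCII)

if __name__ == "__main__":
    for ch in ("7", "C", "G", "a", "_", "-"):
        print(repr(ch),
              bool(hex_digit.fullmatch(ch)),
              bool(alnum_no_underscore.fullmatch(ch)))
```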
1231
1237
<para>
1232
-
All non-alphanumeric characters other than \, -, ^ (at the
1233
-
start) and the terminating ] are non-special in character
1238
+
All non-alphanumeric characters other than \, -, ^ (at the
1239
+
start) and the terminating ] are non-special in character
1234
1240
classes, but it does no harm if they are escaped. The pattern
1235
1241
terminator is always special and must be escaped when used
1236
1242
within an expression.
1237
1243
</para>
1238
1244
<para>
1239
1245
Perl supports the POSIX notation for character classes. This uses names
1240
-
enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also
1246
+
enclosed by <literal>[:</literal> and <literal>:]</literal> within
1247
+
the enclosing square brackets. PCRE also
1241
1248
supports this notation. For example, <literal>[01[:alpha:]%]</literal>
1242
1249
matches "0", "1", any alphabetic character, or "%". The supported class
1243
1250
names are:
...
...
@@ -1276,7 +1283,7 @@
1276
1283
<para>
1277
1284
In UTF-8 mode, characters with values greater than 128 do not match any
1278
1285
of the POSIX character classes.
1279
-
As of PHP 5.3.0 and libpcre 8.10 some character classes are changed to use
1286
+
As of libpcre 8.10 some character classes are changed to use
1280
1287
Unicode character properties, in which case the mentioned restriction does
1281
1288
not apply. Refer to the <link xlink:href="&url.pcre.man;">PCRE(3) manual</link>
1282
1289
for details.
...
...
@@ -1292,16 +1299,16 @@
1292
1299
<section xml:id="regexp.reference.alternation">
1293
1300
<title>Alternation</title>
1294
1301
<para>
1295
-
Vertical bar characters are used to separate alternative
1302
+
Vertical bar characters are used to separate alternative
1296
1303
patterns. For example, the pattern
1297
1304
<literal>gilbert|sullivan</literal>
1298
1305
matches either "gilbert" or "sullivan". Any number of alternatives
1299
-
may appear, and an empty alternative is permitted
1300
-
(matching the empty string). The matching process tries
1301
-
each alternative in turn, from left to right, and the first
1302
-
one that succeeds is used. If the alternatives are within a
1303
-
subpattern (defined below), "succeeds" means matching the
1304
-
rest of the main pattern as well as the alternative in the
1306
+
may appear, and an empty alternative is permitted
1307
+
(matching the empty string). The matching process tries
1308
+
each alternative in turn, from left to right, and the first
1309
+
one that succeeds is used. If the alternatives are within a
1310
+
subpattern (defined below), "succeeds" means matching the
1311
+
rest of the main pattern as well as the alternative in the
1305
1312
subpattern.
1306
1313
</para>
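The left-to-right rule and the meaning of "succeeds" inside a subpattern can be illustrated with a small sketch. Python's `re` module stands in for the engine here (an assumption of convenience); the patterns other than `gilbert|sullivan` are made up for the example.

```python
import re

either = re.compile(r"gilbert|sullivan")

# Alternatives are tried left to right; the first that succeeds is used,
# even when a later one would match more text.
first_wins = re.match(r"a|ab", "ab")

# Inside a subpattern, "succeeds" includes matching the rest of the pattern:
# "a" alone cannot be followed by "c" here, so the engine falls back to "ab".
with_tail = re.fullmatch(r"(a|ab)c", "abc")

# An empty alternative is permitted and matches the empty string.
optional_s = re.fullmatch(r"cat(s|)", "cat")
```

In this sketch `first_wins` matches only "a", whereas the group in `with_tail` captures "ab".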
1307
1314
</section>
...
...
@@ -1316,7 +1323,7 @@
1316
1323
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,
1317
1324
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1318
1325
and PCRE_DUPNAMES can be changed from within the pattern by
1319
-
a sequence of Perl option letters enclosed between "(?" and
1326
+
a sequence of Perl option letters enclosed between "(?" and
1320
1327
")". The option letters are:
1321
1328

1322
1329
<table>
...
...
@@ -1345,7 +1352,8 @@
1345
1352
</row>
1346
1353
<row>
1347
1354
<entry><literal>X</literal></entry>
1348
-
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> (no longer supported as of PHP 7.3.0)</entry>
1355
+
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>
1356
+
(no longer supported as of PHP 7.3.0)</entry>
1349
1357
</row>
1350
1358
<row>
1351
1359
<entry><literal>J</literal></entry>
...
...
@@ -1356,16 +1364,16 @@
1356
1364
</table>
1357
1365
</para>
1358
1366
<para>
1359
-
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1367
+
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1360
1368
also possible to unset these options by preceding the letter
1361
-
with a hyphen, and a combined setting and unsetting such as
1362
-
(?im-sx), which sets <link
1369
+
with a hyphen, and a combined setting and unsetting such as
1370
+
(?im-sx), which sets <link
1363
1371
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and
1364
1372
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1365
1373
while unsetting <link
1366
1374
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and
1367
1375
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,
1368
-
is also permitted. If a letter appears both before and after the
1376
+
is also permitted. If a letter appears both before and after the
1369
1377
hyphen, the option is unset.
1370
1378
</para>
1371
1379
<para>
...
...
@@ -1375,14 +1383,14 @@
1375
1383
and "abC".
1376
1384
</para>
1377
1385
<para>
1378
-
If an option change occurs inside a subpattern, the effect
1379
-
is different. This is a change of behaviour in Perl 5.005.
1380
-
An option change inside a subpattern affects only that part
1386
+
If an option change occurs inside a subpattern, the effect
1387
+
is different. This is a change of behaviour in Perl 5.005.
1388
+
An option change inside a subpattern affects only that part
1381
1389
of the subpattern that follows it, so
1382
1390

1383
1391
<literal>(a(?i)b)c</literal>
1384
1392

1385
-
matches abc and aBc and no other strings (assuming <link
1393
+
matches "abc" and "aBc" and no other strings (assuming <link
1386
1394
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not
1387
1395
used). By this means, options can be made to have different settings in
1388
1396
different parts of the pattern. Any changes made in one alternative do
...
...
@@ -1391,18 +1399,18 @@
1391
1399

1392
1400
<literal>(a(?i)b|c)</literal>
1393
1401

1394
-
matches "ab", "aB", "c", and "C", even though when matching
1402
+
matches "ab", "aB", "c", and "C", even though when matching
1395
1403
"C" the first branch is abandoned before the option setting.
1396
-
This is because the effects of option settings happen at
1397
-
compile time. There would be some very weird behaviour otherwise.
1404
+
This is because the effects of option settings happen at
1405
+
compile time. There would be some very weird behaviour otherwise.
1398
1406
</para>
1399
1407
<para>
1400
1408
The PCRE-specific options <link
1401
-
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1402
-
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1409
+
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1410
+
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1403
1411
be changed in the same way as the Perl-compatible options by
1404
-
using the characters U and X respectively. The (?X) flag
1405
-
setting is special in that it must always occur earlier in
1412
+
using the characters U and X respectively. The (?X) flag
1413
+
setting is special in that it must always occur earlier in
1406
1414
the pattern than any of the additional features it turns on,
1407
1415
even when it is at top level. It is best put at the start.
1408
1416
</para>
...
...
@@ -1411,8 +1419,8 @@
1411
1419
<section xml:id="regexp.reference.subpatterns">
1412
1420
<title>Subpatterns</title>
1413
1421
<para>
1414
-
Subpatterns are delimited by parentheses (round brackets),
1415
-
which can be nested. Marking part of a pattern as a subpattern
1422
+
Subpatterns are delimited by parentheses (round brackets),
1423
+
which can be nested. Marking part of a pattern as a subpattern
1416
1424
does two things:
1417
1425
</para>
1418
1426
<orderedlist>
...
...
@@ -1441,30 +1449,30 @@
1441
1449

1442
1450
<literal>the ((red|white) (king|queen))</literal>
1443
1451

1444
-
the captured substrings are "red king", "red", and "king",
1452
+
the captured substrings are "red king", "red", and "king",
1445
1453
and are numbered 1, 2, and 3.
1446
1454
</para>
1447
1455
<para>
1448
-
The fact that plain parentheses fulfill two functions is not
1449
-
always helpful. There are often times when a grouping subpattern
1450
-
is required without a capturing requirement. If an
1456
+
The fact that plain parentheses fulfill two functions is not
1457
+
always helpful. There are often times when a grouping subpattern
1458
+
is required without a capturing requirement. If an
1451
1459
opening parenthesis is followed by "?:", the subpattern does
1452
-
not do any capturing, and is not counted when computing the
1460
+
not do any capturing, and is not counted when computing the
1453
1461
number of any subsequent capturing subpatterns. For example,
1454
-
if the string "the white queen" is matched against the
1462
+
if the string "the white queen" is matched against the
1455
1463
pattern
1456
1464

1457
1465
<literal>the ((?:red|white) (king|queen))</literal>
1458
1466

1459
-
the captured substrings are "white queen" and "queen", and
1460
-
are numbered 1 and 2. The maximum number of captured substrings
1467
+
the captured substrings are "white queen" and "queen", and
1468
+
are numbered 1 and 2. The maximum number of captured substrings
1461
1469
is 65535. It may not be possible to compile such large patterns,
1462
1470
however, depending on the configuration options of libpcre.
1463
1471
</para>
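<para>
 The difference between capturing and non-capturing parentheses can be
 checked with a short script. This is only a sketch: it uses Python's
 <literal>re</literal> module as a stand-in for PCRE (an assumption; the
 two engines agree on this syntax), and the variable names are illustrative.
</para>

```python
import re

subject = "the white queen"

# Plain parentheses group and capture, so three substrings are recorded.
capturing = re.search(r"the ((red|white) (king|queen))", subject)
print(capturing.groups())  # ('white queen', 'white', 'queen')

# With (?:...) the colour alternatives are grouped but not captured,
# so "queen" moves up to group 2.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", subject)
print(non_capturing.groups())  # ('white queen', 'queen')
```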
1464
1472
<para>
1465
-
As a convenient shorthand, if any option settings are
1466
-
required at the start of a non-capturing subpattern, the
1467
-
option letters may appear between the "?" and the ":". Thus
1473
+
As a convenient shorthand, if any option settings are
1474
+
required at the start of a non-capturing subpattern, the
1475
+
option letters may appear between the "?" and the ":". Thus
1468
1476
the two patterns
1469
1477
</para>
1470
1478

...
...
@@ -1478,10 +1486,10 @@
1478
1486
</informalexample>
1479
1487

1480
1488
<para>
1481
-
match exactly the same set of strings. Because alternative
1482
-
branches are tried from left to right, and options are not
1483
-
reset until the end of the subpattern is reached, an option
1484
-
setting in one branch does affect subsequent branches, so
1489
+
match exactly the same set of strings. Because alternative
1490
+
branches are tried from left to right, and options are not
1491
+
reset until the end of the subpattern is reached, an option
1492
+
setting in one branch does affect subsequent branches, so
1485
1493
the above patterns match "SUNDAY" as well as "Saturday".
1486
1494
</para>
1487
1495

...
...
@@ -1489,7 +1497,7 @@
1489
1497
It is possible to name a subpattern using the syntax
1490
1498
<literal>(?P&lt;name&gt;pattern)</literal>. This subpattern will then
1491
1499
be indexed in the matches array by its normal numeric position and
1492
-
also by name. PHP 5.2.2 introduced two alternative syntaxes
1500
+
also by name. There are two alternative syntaxes:
1493
1501
<literal>(?&lt;name&gt;pattern)</literal> and <literal>(?'name'pattern)</literal>.
1494
1502
</para>
1495
1503

...
...
@@ -1510,9 +1518,10 @@
1510
1518

1511
1519
<para>
1512
1520
Here <literal>Sun</literal> is stored in backreference 2, while
1513
-
backreference 1 is empty. Matching yields <literal>Sat</literal> in
1514
-
backreference 1 while backreference 2 does not exist. Changing the pattern
1515
-
to use the <literal>(?|</literal> fixes this problem:
1521
+
backreference 1 is empty. Matching <literal>Saturday</literal> yields
1522
+
<literal>Sat</literal> in backreference 1 while backreference 2 does
1523
+
not exist. Changing the pattern to use the <literal>(?|</literal> fixes
1524
+
this problem:
1516
1525
</para>
1517
1526

1518
1527
<informalexample>
...
...
@@ -1538,45 +1547,45 @@
1538
1547
<listitem><simpara>the . metacharacter</simpara></listitem>
1539
1548
<listitem><simpara>a character class</simpara></listitem>
1540
1549
<listitem><simpara>a back reference (see next section)</simpara></listitem>
1541
-
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1550
+
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1542
1551
see below)</simpara></listitem>
1543
1552
</itemizedlist>
1544
1553
</para>
1545
1554
<para>
1546
-
The general repetition quantifier specifies a minimum and
1547
-
maximum number of permitted matches, by giving the two
1548
-
numbers in curly brackets (braces), separated by a comma.
1549
-
The numbers must be less than 65536, and the first must be
1555
+
The general repetition quantifier specifies a minimum and
1556
+
maximum number of permitted matches, by giving the two
1557
+
numbers in curly brackets (braces), separated by a comma.
1558
+
The numbers must be less than 65536, and the first must be
1550
1559
less than or equal to the second. For example:
1551
1560

1552
1561
<literal>z{2,4}</literal>
1553
1562

1554
-
matches "zz", "zzz", or "zzzz". A closing brace on its own
1563
+
matches "zz", "zzz", or "zzzz". A closing brace on its own
1555
1564
is not a special character. If the second number is omitted,
1556
-
but the comma is present, there is no upper limit; if the
1565
+
but the comma is present, there is no upper limit; if the
1557
1566
second number and the comma are both omitted, the quantifier
1558
1567
specifies an exact number of required matches. Thus
1559
1568

1560
1569
<literal>[aeiou]{3,}</literal>
1561
1570

1562
-
matches at least 3 successive vowels, but may match many
1571
+
matches at least 3 successive vowels, but may match many
1563
1572
more, while
1564
1573

1565
1574
<literal>\d{8}</literal>
1566
1575

1567
-
matches exactly 8 digits. An opening curly bracket that
1568
-
appears in a position where a quantifier is not allowed, or
1576
+
matches exactly 8 digits. An opening curly bracket that
1577
+
appears in a position where a quantifier is not allowed, or
1569
1578
one that does not match the syntax of a quantifier, is taken
1570
-
as a literal character. For example, {,6} is not a quantifier,
1579
+
as a literal character. For example, {,6} is not a quantifier,
1571
1580
but a literal string of four characters.
1572
1581
</para>
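<para>
 The three quantifier forms above can be exercised as follows. This is a
 minimal sketch using Python's <literal>re</literal> module rather than
 PCRE itself (an assumption). The two engines differ at some edges, for
 instance Python accepts {,6} as a quantifier, so only the common forms
 are shown. The subject strings are made up for illustration.
</para>

```python
import re

# {2,4}: between two and four repetitions of the preceding item.
z_results = [
    re.fullmatch(r"z{2,4}", s) is not None
    for s in ("z", "zz", "zzzz", "zzzzz")
]
print(z_results)  # [False, True, True, False]

# {3,}: at least three, no upper limit; greedy by default, so the
# whole run of vowels is taken.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()
print(vowel_run)  # ueuei

# {8}: exactly eight repetitions.
eight_digits = re.fullmatch(r"\d{8}", "20240131") is not None
print(eight_digits)  # True
```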
1573
1582
<para>
1574
-
The quantifier {0} is permitted, causing the expression to
1575
-
behave as if the previous item and the quantifier were not
1583
+
The quantifier {0} is permitted, causing the expression to
1584
+
behave as if the previous item and the quantifier were not
1576
1585
present.
1577
1586
</para>
1578
1587
<para>
1579
-
For convenience (and historical compatibility) the three
1588
+
For convenience (and historical compatibility) the three
1580
1589
most common quantifiers have single-character abbreviations:
1581
1590

1582
1591
<table>
...
...
@@ -1600,63 +1609,63 @@
1600
1609
</table>
1601
1610
</para>
1602
1611
<para>
1603
-
It is possible to construct infinite loops by following a
1604
-
subpattern that can match no characters with a quantifier
1612
+
It is possible to construct infinite loops by following a
1613
+
subpattern that can match no characters with a quantifier
1605
1614
that has no upper limit, for example:
1606
1615

1607
1616
<literal>(a?)*</literal>
1608
1617
</para>
1609
1618
<para>
1610
-
Earlier versions of Perl and PCRE used to give an error at
1611
-
compile time for such patterns. However, because there are
1612
-
cases where this can be useful, such patterns are now
1613
-
accepted, but if any repetition of the subpattern does in
1619
+
Earlier versions of Perl and PCRE used to give an error at
1620
+
compile time for such patterns. However, because there are
1621
+
cases where this can be useful, such patterns are now
1622
+
accepted, but if any repetition of the subpattern does in
1614
1623
fact match no characters, the loop is forcibly broken.
1615
1624
</para>
1616
1625
<para>
1617
-
By default, the quantifiers are "greedy", that is, they
1618
-
match as much as possible (up to the maximum number of permitted
1619
-
times), without causing the rest of the pattern to
1626
+
By default, the quantifiers are "greedy", that is, they
1627
+
match as much as possible (up to the maximum number of permitted
1628
+
times), without causing the rest of the pattern to
1620
1629
fail. The classic example of where this gives problems is in
1621
1630
trying to match comments in C programs. These appear between
1622
-
the sequences /* and */ and within the sequence, individual
1623
-
* and / characters may appear. An attempt to match C comments
1631
+
the sequences /* and */ and within the sequence, individual
1632
+
* and / characters may appear. An attempt to match C comments
1624
1633
by applying the pattern
1625
1634

1626
1635
<literal>/\*.*\*/</literal>
1627
1636

1628
1637
to the string
1629
1638

1630
-
<literal>/* first comment */ not comment /* second comment */</literal>
1639
+
<literal>/* first comment */ not comment /* second comment */</literal>
1631
1640

1632
-
fails, because it matches the entire string due to the
1633
-
greediness of the .* item.
1641
+
fails, because it matches the entire string due to the
1642
+
greediness of the .* item.
1634
1643
</para>
1635
1644
<para>
1636
-
However, if a quantifier is followed by a question mark,
1645
+
However, if a quantifier is followed by a question mark,
1637
1646
then it becomes lazy, and instead matches the minimum
1638
1647
number of times possible, so the pattern
1639
1648

1640
1649
<literal>/\*.*?\*/</literal>
1641
1650

1642
1651
does the right thing with the C comments. The meaning of the
1643
-
various quantifiers is not otherwise changed, just the preferred
1644
-
number of matches. Do not confuse this use of
1645
-
question mark with its use as a quantifier in its own right.
1652
+
various quantifiers is not otherwise changed, just the preferred
1653
+
number of matches. Do not confuse this use of
1654
+
question mark with its use as a quantifier in its own right.
1646
1655
Because it has two uses, it can sometimes appear doubled, as
1647
1656
in
1648
1657

1649
1658
<literal>\d??\d</literal>
1650
1659

1651
-
which matches one digit by preference, but can match two if
1660
+
which matches one digit by preference, but can match two if
1652
1661
that is the only way the rest of the pattern matches.
1653
1662
</para>
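<para>
 The effect of greedy and lazy quantifiers on the C comment example can be
 reproduced with the sketch below. It assumes Python's
 <literal>re</literal> module, which treats these quantifiers the same way;
 the variable names are illustrative.
</para>

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the end and backs off only as far as the last */,
# so a single match swallows the whole string.
greedy = re.findall(r"/\*.*\*/", text)
print(len(greedy))  # 1

# Lazy: .*? stops at the first */ it can, giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", text)
print(lazy)  # ['/* first comment */', '/* second comment */']

# A doubled question mark: an optional digit that prefers to be absent.
one_digit = re.match(r"\d??\d", "12").group()
print(one_digit)  # 1

# When the whole subject must match, the optional digit is used after all.
two_digits = re.fullmatch(r"\d??\d", "12").group()
print(two_digits)  # 12
```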
1654
1663
<para>
1655
1664
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>
1656
-
option is set (an option which is not
1657
-
available in Perl) then the quantifiers are not greedy by
1665
+
option is set (an option which is not
1666
+
available in Perl) then the quantifiers are not greedy by
1658
1667
default, but individual ones can be made greedy by following
1659
-
them with a question mark. In other words, it inverts the
1668
+
them with a question mark. In other words, it inverts the
1660
1669
default behaviour.
1661
1670
</para>
1662
1671
<para>
...
...
@@ -1668,41 +1677,41 @@
1668
1677
</para>
1669
1678
<para>
1670
1679
When a parenthesized subpattern is quantified with a minimum
1671
-
repeat count that is greater than 1 or with a limited maximum,
1672
-
more store is required for the compiled pattern, in
1680
+
repeat count that is greater than 1 or with a limited maximum,
1681
+
more memory is required for the compiled pattern, in
1673
1682
proportion to the size of the minimum or maximum.
1674
1683
</para>
1675
1684
<para>
1676
-
If a pattern starts with .* or .{0,} and the <link
1685
+
If a pattern starts with .* or .{0,} and the <link
1677
1686
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1678
1687
option (equivalent to Perl's /s) is set, thus allowing the .
1679
-
to match newlines, then the pattern is implicitly anchored,
1688
+
to match newlines, then the pattern is implicitly anchored,
1680
1689
because whatever follows will be tried against every character
1681
-
position in the subject string, so there is no point in
1682
-
retrying the overall match at any position after the first.
1690
+
position in the subject string, so there is no point in
1691
+
retrying the overall match at any position after the first.
1683
1692
PCRE treats such a pattern as though it were preceded by \A.
1684
-
In cases where it is known that the subject string contains
1693
+
In cases where it is known that the subject string contains
1685
1694
no newlines, it is worth setting <link
1686
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1695
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1687
1696
pattern begins with .* in order to
1688
1697
obtain this optimization, or
1689
1698
alternatively using ^ to indicate anchoring explicitly.
1690
1699
</para>
1691
1700
<para>
1692
-
When a capturing subpattern is repeated, the value captured
1701
+
When a capturing subpattern is repeated, the value captured
1693
1702
is the substring that matched the final iteration. For example, after
1694
1703

1695
1704
<literal>(tweedle[dume]{3}\s*)+</literal>
1696
1705

1697
-
has matched "tweedledum tweedledee" the value of the captured
1698
-
substring is "tweedledee". However, if there are
1699
-
nested capturing subpatterns, the corresponding captured
1700
-
values may have been set in previous iterations. For example,
1706
+
has matched "tweedledum tweedledee" the value of the captured
1707
+
substring is "tweedledee". However, if there are
1708
+
nested capturing subpatterns, the corresponding captured
1709
+
values may have been set in previous iterations. For example,
1701
1710
after
1702
1711

1703
1712
<literal>/(a|(b))+/</literal>
1704
1713

1705
-
matches "aba" the value of the second captured substring is
1714
+
matches "aba" the value of the second captured substring is
1706
1715
"b".
1707
1716
</para>
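<para>
 The behaviour of repeated capturing subpatterns can be observed with the
 following sketch. It assumes Python's <literal>re</literal> module, which
 matches the description above for these two patterns; the variable names
 are illustrative.
</para>

```python
import re

# The outer group is repeated twice; it keeps the final iteration only.
tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(tweedle.group(1))  # tweedledee

# The nested group 2 last took part in the second iteration ("b"),
# and that value survives the third iteration, which matches "a".
nested = re.fullmatch(r"(a|(b))+", "aba")
print(nested.groups())  # ('a', 'b')
```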
1708
1717
</section>
...
...
@@ -1710,78 +1719,78 @@
1710
1719
<section xml:id="regexp.reference.back-references">
1711
1720
<title>Back references</title>
1712
1721
<para>
1713
-
Outside a character class, a backslash followed by a digit
1714
-
greater than 0 (and possibly further digits) is a back
1715
-
reference to a capturing subpattern earlier (i.e. to its
1716
-
left) in the pattern, provided there have been that many
1722
+
Outside a character class, a backslash followed by a digit
1723
+
greater than 0 (and possibly further digits) is a back
1724
+
reference to a capturing subpattern earlier (i.e. to its
1725
+
left) in the pattern, provided there have been that many
1717
1726
previous capturing left parentheses.
1718
1727
</para>
1719
1728
<para>
1720
-
However, if the decimal number following the backslash is
1721
-
less than 10, it is always taken as a back reference, and
1722
-
causes an error only if there are not that many capturing
1723
-
left parentheses in the entire pattern. In other words, the
1724
-
parentheses that are referenced need not be to the left of
1725
-
the reference for numbers less than 10.
1729
+
However, if the decimal number following the backslash is
1730
+
less than 10, it is always taken as a back reference, and
1731
+
causes an error only if there are not that many capturing
1732
+
left parentheses in the entire pattern. In other words, the
1733
+
parentheses that are referenced need not be to the left of
1734
+
the reference for numbers less than 10.
1726
1735
A "forward back reference" can make sense when a repetition
1727
1736
is involved and the subpattern to the right has participated
1728
1737
in an earlier iteration. See the section
1729
-
entitled "Backslash" above for further details of the handling
1738
+
on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling
1730
1739
of digits following a backslash.
1731
1740
</para>
1732
1741
<para>
1733
-
A back reference matches whatever actually matched the capturing
1742
+
A back reference matches whatever actually matched the capturing
1734
1743
subpattern in the current subject string, rather than
1735
1744
anything matching the subpattern itself. So the pattern
1736
1745

1737
1746
<literal>(sens|respons)e and \1ibility</literal>
1738
1747

1739
-
matches "sense and sensibility" and "response and responsibility",
1740
-
but not "sense and responsibility". If case-sensitive (caseful)
1748
+
matches "sense and sensibility" and "response and responsibility",
1749
+
but not "sense and responsibility". If case-sensitive (caseful)
1741
1750
matching is in force at the time of the back reference, then
1742
1751
the case of letters is relevant. For example,
1743
1752

1744
1753
<literal>((?i)rah)\s+\1</literal>
1745
1754

1746
-
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1747
-
though the original capturing subpattern is matched
1755
+
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1756
+
though the original capturing subpattern is matched
1748
1757
case-insensitively (caselessly).
1749
1758
</para>
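<para>
 A short check of the first example above, again as a sketch that assumes
 Python's <literal>re</literal> module in place of PCRE. The list of
 subjects is illustrative.
</para>

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

# \1 must repeat the text that group 1 actually matched, not merely
# something that the subpattern could match.
results = [pattern.fullmatch(s) is not None for s in subjects]
print(results)  # [True, True, False]
```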
1750
1759
<para>
1751
-
There may be more than one back reference to the same subpattern.
1752
-
If a subpattern has not actually been used in a
1753
-
particular match, then any back references to it always
1760
+
There may be more than one back reference to the same subpattern.
1761
+
If a subpattern has not actually been used in a
1762
+
particular match, then any back references to it always
1754
1763
fail. For example, the pattern
1755
1764

1756
1765
<literal>(a|(bc))\2</literal>
1757
1766

1758
-
always fails if it starts to match "a" rather than "bc".
1759
-
Because there may be up to 99 back references, all digits
1760
-
following the backslash are taken as part of a potential
1767
+
always fails if it starts to match "a" rather than "bc".
1768
+
Because there may be up to 99 back references, all digits
1769
+
following the backslash are taken as part of a potential
1761
1770
back reference number. If the pattern continues with a digit
1762
1771
character, then some delimiter must be used to terminate the
1763
1772
back reference. If the <link
1764
-
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1765
-
is set, this can be whitespace. Otherwise an empty comment can be used.
1773
+
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1774
+
is set, this can be whitespace. Otherwise an empty comment can be used.
1766
1775
</para>
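<para>
 The failure of a back reference to an unused subpattern can be seen in the
 sketch below. It assumes Python's <literal>re</literal> module, which
 treats a reference to a group that took no part in the match as a failure
 for this pattern; the subject strings are illustrative.
</para>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 matched "bc", so \2 can repeat it.
used = pattern.fullmatch("bcbc") is not None
print(used)  # True

# The first branch matched "a", group 2 took no part, so \2 fails.
unused = pattern.fullmatch("aa") is not None
print(unused)  # False
```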
1767
1776
<para>
1768
1777
A back reference that occurs inside the parentheses to which
1769
-
it refers fails when the subpattern is first used, so, for
1770
-
example, (a\1) never matches. However, such references can
1778
+
it refers fails when the subpattern is first used, so, for
1779
+
example, (a\1) never matches. However, such references can
1771
1780
be useful inside repeated subpatterns. For example, the pattern
1772
1781

1773
1782
<literal>(a|b\1)+</literal>
1774
1783

1775
-
matches any number of "a"s and also "aba", "ababba" etc. At
1784
+
matches any number of "a"s and also "aba", "ababba" etc. At
1776
1785
each iteration of the subpattern, the back reference matches
1777
-
the character string corresponding to the previous iteration.
1786
+
the character string corresponding to the previous iteration.
1778
1787
In order for this to work, the pattern must be such
1779
-
that the first iteration does not need to match the back
1780
-
reference. This can be done using alternation, as in the
1788
+
that the first iteration does not need to match the back
1789
+
reference. This can be done using alternation, as in the
1781
1790
example above, or by a quantifier with a minimum of zero.
1782
1791
</para>
1783
1792
<para>
1784
-
As of PHP 5.2.2, the <literal>\g</literal> escape sequence can be
1793
+
The <literal>\g</literal> escape sequence can be
1785
1794
used for absolute and relative referencing of subpatterns.
1786
1795
This escape sequence must be followed by an unsigned number or a negative
1787
1796
number, optionally enclosed in braces. The sequences <literal>\1</literal>,
...
...
@@ -1802,29 +1811,28 @@
1802
1811
</para>
1803
1812
<para>
1804
1813
Back references to the named subpatterns can be achieved by
1805
-
<literal>(?P=name)</literal> or, since PHP 5.2.2, also by
1806
-
<literal>\k&lt;name&gt;</literal> or <literal>\k'name'</literal>.
1807
-
Additionally PHP 5.2.4 added support for <literal>\k{name}</literal>
1808
-
and <literal>\g{name}</literal>, and PHP 5.2.7 for
1809
-
<literal>\g&lt;name&gt;</literal> and <literal>\g'name'</literal>.
1814
+
<literal>(?P=name)</literal>,
1815
+
<literal>\k&lt;name&gt;</literal>, <literal>\k'name'</literal>,
1816
+
<literal>\k{name}</literal>, <literal>\g{name}</literal>,
1817
+
<literal>\g&lt;name&gt;</literal> or <literal>\g'name'</literal>.
1810
1818
</para>
1811
1819
</section>
1812
1820

1813
1821
<section xml:id="regexp.reference.assertions">
1814
1822
<title>Assertions</title>
1815
1823
<para>
1816
-
An assertion is a test on the characters following or
1817
-
preceding the current matching point that does not actually
1818
-
consume any characters. The simple assertions coded as \b,
1819
-
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1820
-
assertions are coded as subpatterns. There are two
1821
-
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1824
+
An assertion is a test on the characters following or
1825
+
preceding the current matching point that does not actually
1826
+
consume any characters. The simple assertions coded as \b,
1827
+
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1828
+
assertions are coded as subpatterns. There are two
1829
+
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1822
1830
subject string, and those that <emphasis>look behind</emphasis> it.
1823
1831
</para>
1824
1832
<para>
1825
1833
An assertion subpattern is matched in the normal way, except
1826
-
that it does not cause the current matching position to be
1827
-
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1834
+
that it does not cause the current matching position to be
1835
+
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1828
1836
assertions and (?! for negative assertions. For example,
1829
1837

1830
1838
<literal>\w+(?=;)</literal>
...
...
@@ -1834,27 +1842,27 @@
1834
1842

1835
1843
<literal>foo(?!bar)</literal>
1836
1844

1837
-
matches any occurrence of "foo" that is not followed by
1845
+
matches any occurrence of "foo" that is not followed by
1838
1846
"bar". Note that the apparently similar pattern
1839
1847

1840
1848
<literal>(?!foo)bar</literal>
1841
1849

1842
-
does not find an occurrence of "bar" that is preceded by
1850
+
does not find an occurrence of "bar" that is preceded by
1843
1851
something other than "foo"; it finds any occurrence of "bar"
1844
-
whatsoever, because the assertion (?!foo) is always &true;
1845
-
when the next three characters are "bar". A lookbehind
1852
+
whatsoever, because the assertion (?!foo) is always &true;
1853
+
when the next three characters are "bar". A lookbehind
1846
1854
assertion is needed to achieve this effect.
1847
1855
</para>
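<para>
 The two negative lookahead patterns above behave quite differently, as the
 following sketch shows. It assumes Python's <literal>re</literal> module,
 whose lookahead syntax is the same; the subject strings are illustrative.
</para>

```python
import re

# foo(?!bar): only the second "foo" qualifies, because the first one
# is followed by "bar".
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
print(not_followed)  # [7]

# (?!foo)bar: the assertion is tested where "bar" begins, and the text
# at that point is "bar", not "foo", so it always succeeds.
bar_at = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar")]
print(bar_at)  # [3]
```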
1848
1856
<para>
1849
-
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1857
+
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1850
1858
and (?&lt;! for negative assertions. For example,
1851
1859

1852
1860
<literal>(?&lt;!foo)bar</literal>
1853
1861

1854
-
does find an occurrence of "bar" that is not preceded by
1862
+
does find an occurrence of "bar" that is not preceded by
1855
1863
"foo". The contents of a lookbehind assertion are restricted
1856
-
such that all the strings it matches must have a fixed
1857
-
length. However, if there are several alternatives, they do
1864
+
such that all the strings it matches must have a fixed
1865
+
length. However, if there are several alternatives, they do
1858
1866
not all have to have the same fixed length. Thus
1859
1867

1860
1868
<literal>(?&lt;=bullock|donkey)</literal>
...
...
@@ -1863,51 +1871,51 @@
1863
1871

1864
1872
<literal>(?&lt;!dogs?|cats?)</literal>
1865
1873

1866
-
causes an error at compile time. Branches that match different
1874
+
causes an error at compile time. Branches that match different
1867
1875
length strings are permitted only at the top level of
1868
-
a lookbehind assertion. This is an extension compared with
1869
-
Perl 5.005, which requires all branches to match the same
1876
+
a lookbehind assertion. This is an extension compared with
1877
+
Perl 5.005, which requires all branches to match the same
1870
1878
length of string. An assertion such as
1871
1879

1872
1880
<literal>(?&lt;=ab(c|de))</literal>
1873
1881

1874
-
is not permitted, because its single top-level branch can
1882
+
is not permitted, because its single top-level branch can
1875
1883
match two different lengths, but it is acceptable if rewritten
1876
1884
to use two top-level branches:
1877
1885

1878
1886
<literal>(?&lt;=abc|abde)</literal>
1879
1887

1880
-
The implementation of lookbehind assertions is, for each
1881
-
alternative, to temporarily move the current position back
1882
-
by the fixed width and then try to match. If there are
1883
-
insufficient characters before the current position, the
1884
-
match is deemed to fail. Lookbehinds in conjunction with
1885
-
once-only subpatterns can be particularly useful for matching
1886
-
at the ends of strings; an example is given at the end
1888
+
The implementation of lookbehind assertions is, for each
1889
+
alternative, to temporarily move the current position back
1890
+
by the fixed width and then try to match. If there are
1891
+
insufficient characters before the current position, the
1892
+
match is deemed to fail. Lookbehinds in conjunction with
1893
+
once-only subpatterns can be particularly useful for matching
1894
+
at the ends of strings; an example is given at the end
1887
1895
of the section on once-only subpatterns.
1888
1896
</para>
1889
1897
<para>
1890
-
Several assertions (of any sort) may occur in succession.
1898
+
Several assertions (of any sort) may occur in succession.
1891
1899
For example,
1892
1900

1893
1901
<literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
1894
1902

1895
-
matches "foo" preceded by three digits that are not "999".
1896
-
Notice that each of the assertions is applied independently
1897
-
at the same point in the subject string. First there is a
1898
-
check that the previous three characters are all digits,
1903
+
matches "foo" preceded by three digits that are not "999".
1904
+
Notice that each of the assertions is applied independently
1905
+
at the same point in the subject string. First there is a
1906
+
check that the previous three characters are all digits,
1899
1907
then there is a check that the same three characters are not
1900
-
"999". This pattern does not match "foo" preceded by six
1908
+
"999". This pattern does not match "foo" preceded by six
1901
1909
characters, the first three of which are digits and the last three
1902
-
of which are not "999". For example, it doesn't match
1910
+
of which are not "999". For example, it doesn't match
1903
1911
"123abcfoo". A pattern to do that is
1904
1912

1905
1913
<literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
1906
1914
</para>
1907
1915
<para>
1908
-
This time the first assertion looks at the preceding six
1909
-
characters, checking that the first three are digits, and
1910
-
then the second assertion checks that the preceding three
1916
+
This time the first assertion looks at the preceding six
1917
+
characters, checking that the first three are digits, and
1918
+
then the second assertion checks that the preceding three
1911
1919
characters are not "999".
1912
1920
</para>
1913
1921
<para>
...
...
@@ -1915,26 +1923,26 @@
1915
1923

1916
1924
<literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
1917
1925

1918
-
matches an occurrence of "baz" that is preceded by "bar"
1926
+
matches an occurrence of "baz" that is preceded by "bar"
1919
1927
which in turn is not preceded by "foo", while
1920
1928

1921
1929
<literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>
1922
1930

1923
-
is another pattern which matches "foo" preceded by three
1931
+
is another pattern which matches "foo" preceded by three
1924
1932
digits and any three characters that are not "999".
1925
1933
</para>
1926
1934
<para>
1927
1935
Assertion subpatterns are not capturing subpatterns, and may
1928
-
not be repeated, because it makes no sense to assert the
1929
-
same thing several times. If any kind of assertion contains
1930
-
capturing subpatterns within it, these are counted for the
1936
+
not be repeated, because it makes no sense to assert the
1937
+
same thing several times. If any kind of assertion contains
1938
+
capturing subpatterns within it, these are counted for the
1931
1939
purposes of numbering the capturing subpatterns in the whole
1932
-
pattern. However, substring capturing is carried out only
1933
-
for positive assertions, because it does not make sense for
1940
+
pattern. However, substring capturing is carried out only
1941
+
for positive assertions, because it does not make sense for
1934
1942
negative assertions.
1935
1943
</para>
1936
1944
<para>
1937
-
Assertions count towards the maximum of 200 parenthesized
1945
+
Assertions count towards the maximum of 200 parenthesized
1938
1946
subpatterns.
1939
1947
</para>
1940
1948
</section>
...
...
@@ -1942,17 +1950,17 @@
1942
1950
<section xml:id="regexp.reference.onlyonce">
1943
1951
<title>Once-only subpatterns</title>
1944
1952
<para>
1945
-
With both maximizing and minimizing repetition, failure of
1946
-
what follows normally causes the repeated item to be
1953
+
With both maximizing and minimizing repetition, failure of
1954
+
what follows normally causes the repeated item to be
1947
1955
re-evaluated to see if a different number of repeats allows the
1948
-
rest of the pattern to match. Sometimes it is useful to
1949
-
prevent this, either to change the nature of the match, or
1950
-
to cause it fail earlier than it otherwise might, when the
1951
-
author of the pattern knows there is no point in carrying
1956
+
rest of the pattern to match. Sometimes it is useful to
1957
+
prevent this, either to change the nature of the match, or
1958
+
to cause it to fail earlier than it otherwise might, when the
1959
+
author of the pattern knows there is no point in carrying
1952
1960
on.
1953
1961
</para>
1954
1962
<para>
1955
-
Consider, for example, the pattern \d+foo when applied to
1963
+
Consider, for example, the pattern \d+foo when applied to
1956
1964
the subject line
1957
1965

1958
1966
<literal>123456bar</literal>
...
...
@@ -1960,108 +1968,108 @@
1960
1968
<para>
1961
1969
After matching all 6 digits and then failing to match "foo",
1962
1970
the normal action of the matcher is to try again with only 5
1963
-
digits matching the \d+ item, and then with 4, and so on,
1971
+
digits matching the \d+ item, and then with 4, and so on,
1964
1972
before ultimately failing. Once-only subpatterns provide the
1965
-
means for specifying that once a portion of the pattern has
1966
-
matched, it is not to be re-evaluated in this way, so the
1967
-
matcher would give up immediately on failing to match "foo"
1968
-
the first time. The notation is another kind of special
1973
+
means for specifying that once a portion of the pattern has
1974
+
matched, it is not to be re-evaluated in this way, so the
1975
+
matcher would give up immediately on failing to match "foo"
1976
+
the first time. The notation is another kind of special
1969
1977
parenthesis, starting with (?&gt; as in this example:
1970
1978

1971
1979
<literal>(?&gt;\d+)foo</literal>
1972
1980
</para>
1973
1981
<para>
1974
-
This kind of parenthesis "locks up" the part of the pattern
1975
-
it contains once it has matched, and a failure further into
1976
-
the pattern is prevented from backtracking into it.
1977
-
Backtracking past it to previous items, however, works as normal.
1982
+
This kind of parenthesis "locks up" the part of the pattern
1983
+
it contains once it has matched, and a failure further into
1984
+
the pattern is prevented from backtracking into it.
1985
+
Backtracking past it to previous items, however, works as normal.
1978
1986
</para>
1979
1987
<para>
1980
1988
An alternative description is that a subpattern of this type
1981
-
matches the string of characters that an identical standalone
1989
+
matches the string of characters that an identical standalone
1982
1990
pattern would match, if anchored at the current point
1983
1991
in the subject string.
1984
1992
</para>
1985
1993
<para>
1986
-
Once-only subpatterns are not capturing subpatterns. Simple
1987
-
cases such as the above example can be thought of as a maximizing
1988
-
repeat that must swallow everything it can. So,
1994
+
Once-only subpatterns are not capturing subpatterns. Simple
1995
+
cases such as the above example can be thought of as a maximizing
1996
+
repeat that must swallow everything it can. So,
1989
1997
while both \d+ and \d+? are prepared to adjust the number of
1990
-
digits they match in order to make the rest of the pattern
1998
+
digits they match in order to make the rest of the pattern
1991
1999
match, (?&gt;\d+) can only match an entire sequence of digits.
1992
2000
</para>
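<para>
 The difference can be observed with <function>preg_match</function>. The
 following is a minimal sketch, not part of the original reference: the
 pattern <literal>\d+5</literal> and the subject <literal>12345</literal>
 are illustrative choices.
</para>
<para>
 <example>
  <title>Once-only subpattern compared with an ordinary repeat (illustrative)</title>
  <programlisting role="php">
<![CDATA[
<?php

// Illustrative subject, not taken from the text above.
$subject = '12345';

// \d+ first takes all five digits, then gives one back so that "5" can match.
$ordinary = preg_match('/\d+5/', $subject);

// (?>\d+) keeps every digit it took, so no "5" is left to match.
$onceOnly = preg_match('/(?>\d+)5/', $subject);

var_dump($ordinary); // int(1)
var_dump($onceOnly); // int(0)
]]>
  </programlisting>
 </example>
</para>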
1993
2001
<para>
1994
-
This construction can of course contain arbitrarily complicated
2002
+
This construction can of course contain arbitrarily complicated
1995
2003
subpatterns, and it can be nested.
1996
2004
</para>
1997
2005
<para>
1998
2006
Once-only subpatterns can be used in conjunction with
1999
-
lookbehind assertions to specify efficient matching at the end
2007
+
lookbehind assertions to specify efficient matching at the end
2000
2008
of the subject string. Consider a simple pattern such as
2001
2009

2002
2010
<literal>abcd$</literal>
2003
2011

2004
-
when applied to a long string which does not match. Because
2005
-
matching proceeds from left to right, PCRE will look for
2012
+
when applied to a long string which does not match. Because
2013
+
matching proceeds from left to right, PCRE will look for
2006
2014
each "a" in the subject and then see if what follows matches
2007
2015
the rest of the pattern. If the pattern is specified as
2008
2016

2009
2017
<literal>^.*abcd$</literal>
2010
2018

2011
-
then the initial .* matches the entire string at first, but
2012
-
when this fails (because there is no following "a"), it
2019
+
then the initial .* matches the entire string at first, but
2020
+
when this fails (because there is no following "a"), it
2013
2021
backtracks to match all but the last character, then all but
2014
-
the last two characters, and so on. Once again the search
2015
-
for "a" covers the entire string, from right to left, so we
2022
+
the last two characters, and so on. Once again the search
2023
+
for "a" covers the entire string, from right to left, so we
2016
2024
are no better off. However, if the pattern is written as
2017
2025

2018
2026
<literal>^(?>.*)(?&lt;=abcd)</literal>
2019
2027

2020
-
then there can be no backtracking for the .* item; it can
2021
-
match only the entire string. The subsequent lookbehind
2028
+
then there can be no backtracking for the .* item; it can
2029
+
match only the entire string. The subsequent lookbehind
2022
2030
assertion does a single test on the last four characters. If
2023
-
it fails, the match fails immediately. For long strings,
2031
+
it fails, the match fails immediately. For long strings,
2024
2032
this approach makes a significant difference to the processing time.
2025
2033
</para>
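<para>
 As a rough sketch of the lookbehind form, assuming short illustrative
 subjects (the strings below are not from the original text), the pattern
 succeeds only when the subject ends in "abcd":
</para>
<para>
 <example>
  <title>Testing the end of a subject with a once-only subpattern (illustrative)</title>
  <programlisting role="php">
<![CDATA[
<?php

// (?>.*) consumes the whole line and never gives anything back;
// (?<=abcd) then checks the four characters just before the end.
$pattern = '/^(?>.*)(?<=abcd)/';

$endsWithAbcd = preg_match($pattern, 'xyzabcd');
$startsWithAbcd = preg_match($pattern, 'abcdxyz');

var_dump($endsWithAbcd);   // int(1)
var_dump($startsWithAbcd); // int(0)
]]>
  </programlisting>
 </example>
</para>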
2026
2034
<para>
2027
2035
When a pattern contains an unlimited repeat inside a subpattern
2028
2036
that can itself be repeated an unlimited number of
2029
-
times, the use of a once-only subpattern is the only way to
2030
-
avoid some failing matches taking a very long time indeed.
2037
+
times, the use of a once-only subpattern is the only way to
2038
+
avoid some failing matches taking a very long time indeed.
2031
2039
The pattern
2032
2040

2033
2041
<literal>(\D+|&lt;\d+>)*[!?]</literal>
2034
2042

2035
-
matches an unlimited number of substrings that either consist
2036
-
of non-digits, or digits enclosed in &lt;>, followed by
2043
+
matches an unlimited number of substrings that either consist
2044
+
of non-digits, or digits enclosed in &lt;>, followed by
2037
2045
either ! or ?. When it matches, it runs quickly. However, if
2038
2046
it is applied to
2039
2047

2040
2048
<literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
2041
2049

2042
-
it takes a long time before reporting failure. This is
2050
+
it takes a long time before reporting failure. This is
2043
2051
because the string can be divided between the two repeats in
2044
2052
a large number of ways, and all have to be tried. (The example
2045
-
used [!?] rather than a single character at the end,
2046
-
because both PCRE and Perl have an optimization that allows
2047
-
for fast failure when a single character is used. They
2048
-
remember the last single character that is required for a
2049
-
match, and fail early if it is not present in the string.)
2053
+
used [!?] rather than a single character at the end,
2054
+
because both PCRE and Perl have an optimization that allows
2055
+
for fast failure when a single character is used. They
2056
+
remember the last single character that is required for a
2057
+
match, and fail early if it is not present in the string.)
2050
2058
If the pattern is changed to
2051
2059

2052
2060
<literal>((?>\D+)|&lt;\d+>)*[!?]</literal>
2053
2061

2054
-
sequences of non-digits cannot be broken, and failure happens quickly.
2062
+
sequences of non-digits cannot be broken, and failure happens quickly.
2055
2063
</para>
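<para>
 The once-only version can be tried as follows. This is only a sketch: the
 subject is built with <function>str_repeat</function> to stand in for the
 line of 52 "a" characters shown above, and timing is deliberately not
 measured.
</para>
<para>
 <example>
  <title>Fast failure with a once-only subpattern (illustrative)</title>
  <programlisting role="php">
<![CDATA[
<?php

// 52 "a" characters and no "!" or "?", so the pattern cannot match.
$subject = str_repeat('a', 52);

// (?>\D+) takes each run of non-digits as one unbreakable unit, so the
// matcher does not retry every way of splitting the run between repeats.
$result = preg_match('/((?>\D+)|<\d+>)*[!?]/', $subject);

var_dump($result); // int(0)
]]>
  </programlisting>
 </example>
</para>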
2056
2064
</section>
2057
2065

2058
2066
<section xml:id="regexp.reference.conditional">
2059
2067
<title>Conditional subpatterns</title>
2060
2068
<para>
2061
-
It is possible to cause the matching process to obey a subpattern
2062
-
conditionally or to choose between two alternative
2063
-
subpatterns, depending on the result of an assertion, or
2064
-
whether a previous capturing subpattern matched or not. The
2069
+
It is possible to cause the matching process to obey a subpattern
2070
+
conditionally or to choose between two alternative
2071
+
subpatterns, depending on the result of an assertion, or
2072
+
whether a previous capturing subpattern matched or not. The
2065
2073
two possible forms of conditional subpattern are
2066
2074
</para>
2067
2075

...
...
@@ -2075,39 +2083,39 @@
2075
2083
</informalexample>
2076
2084
<para>
2077
2085
If the condition is satisfied, the yes-pattern is used; otherwise
2078
-
the no-pattern (if present) is used. If there are
2086
+
the no-pattern (if present) is used. If there are
2079
2087
more than two alternatives in the subpattern, a compile-time
2080
2088
error occurs.
2081
2089
</para>
2082
2090
<para>
2083
-
There are two kinds of condition. If the text between the
2084
-
parentheses consists of a sequence of digits, then the
2085
-
condition is satisfied if the capturing subpattern of that
2086
-
number has previously matched. Consider the following pattern,
2087
-
which contains non-significant white space to make it
2088
-
more readable (assume the <link
2091
+
There are two kinds of condition. If the text between the
2092
+
parentheses consists of a sequence of digits, then the
2093
+
condition is satisfied if the capturing subpattern of that
2094
+
number has previously matched. Consider the following pattern,
2095
+
which contains non-significant white space to make it
2096
+
more readable (assume the <link
2089
2097
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2090
-
option) and to divide it into three parts for ease of discussion:
2098
+
option) and to divide it into three parts for ease of discussion:
2091
2099
</para>
2092
2100
<informalexample>
2093
2101
<programlisting>
2094
2102
<![CDATA[
2095
-
( \( )? [^()]+ (?(1) \) )
2103
+
( \( )? [^()]+ (?(1) \) )
2096
2104
]]>
2097
2105
</programlisting>
2098
2106
</informalexample>
2099
2107
<para>
2100
-
The first part matches an optional opening parenthesis, and
2101
-
if that character is present, sets it as the first captured
2102
-
substring. The second part matches one or more characters
2103
-
that are not parentheses. The third part is a conditional
2104
-
subpattern that tests whether the first set of parentheses
2105
-
matched or not. If they did, that is, if subject started
2106
-
with an opening parenthesis, the condition is &true;, and so
2107
-
the yes-pattern is executed and a closing parenthesis is
2108
-
required. Otherwise, since no-pattern is not present, the
2109
-
subpattern matches nothing. In other words, this pattern
2110
-
matches a sequence of non-parentheses, optionally enclosed
2108
+
The first part matches an optional opening parenthesis, and
2109
+
if that character is present, sets it as the first captured
2110
+
substring. The second part matches one or more characters
2111
+
that are not parentheses. The third part is a conditional
2112
+
subpattern that tests whether the first set of parentheses
2113
+
matched or not. If they did, that is, if the subject started
2114
+
with an opening parenthesis, the condition is &true;, and so
2115
+
the yes-pattern is executed and a closing parenthesis is
2116
+
required. Otherwise, since no-pattern is not present, the
2117
+
subpattern matches nothing. In other words, this pattern
2118
+
matches a sequence of non-parentheses, optionally enclosed
2111
2119
in parentheses.
2112
2120
</para>
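<para>
 A minimal sketch of this pattern in use follows. The
 <literal>^</literal> and <literal>$</literal> anchors and the sample
 subjects are additions made for illustration, so that the whole subject
 has to match; the <literal>x</literal> modifier enables PCRE_EXTENDED.
</para>
<para>
 <example>
  <title>Conditional subpattern testing a capturing group (illustrative)</title>
  <programlisting role="php">
<![CDATA[
<?php

// Anchors added for illustration; the body is the pattern discussed above.
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

$results = [];
foreach (['(abc)', 'abc', '(abc', 'abc)'] as $subject) {
    // 1 when the subject matches, 0 when it does not.
    $results[$subject] = preg_match($pattern, $subject);
}

// '(abc)' => 1 and 'abc' => 1: parentheses are both present or both absent.
// '(abc' => 0 and 'abc)' => 0: the parentheses are unbalanced.
print_r($results);
]]>
  </programlisting>
 </example>
</para>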
2113
2121
<para>
...
...
@@ -2116,10 +2124,10 @@
2116
2124
level", the condition is false.
2117
2125
</para>
2118
2126
<para>
2119
-
If the condition is not a sequence of digits or (R), it must be an
2120
-
assertion. This may be a positive or negative lookahead or
2121
-
lookbehind assertion. Consider this pattern, again containing
2122
-
non-significant white space, and with the two alternatives on
2127
+
If the condition is not a sequence of digits or (R), it must be an
2128
+
assertion. This may be a positive or negative lookahead or
2129
+
lookbehind assertion. Consider this pattern, again containing
2130
+
non-significant white space, and with the two alternatives on
2123
2131
the second line:
2124
2132
</para>
2125
2133

...
...
@@ -2127,18 +2135,18 @@
2127
2135
<programlisting>
2128
2136
<![CDATA[
2129
2137
(?(?=[^a-z]*[a-z])
2130
-
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2138
+
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2131
2139
]]>
2132
2140
</programlisting>
2133
2141
</informalexample>
2134
2142
<para>
2135
2143
The condition is a positive lookahead assertion that matches
2136
2144
an optional sequence of non-letters followed by a letter. In
2137
-
other words, it tests for the presence of at least one
2138
-
letter in the subject. If a letter is found, the subject is
2139
-
matched against the first alternative; otherwise it is
2140
-
matched against the second. This pattern matches strings in
2141
-
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2145
+
other words, it tests for the presence of at least one
2146
+
letter in the subject. If a letter is found, the subject is
2147
+
matched against the first alternative; otherwise it is
2148
+
matched against the second. This pattern matches strings in
2149
+
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2142
2150
letters and dd are digits.
2143
2151
</para>
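<para>
 The behaviour can be sketched as below. The anchors and the sample
 subjects are assumptions added for illustration. Note that once the
 lookahead has found a letter, only the first alternative is tried.
</para>
<para>
 <example>
  <title>Conditional subpattern testing an assertion (illustrative)</title>
  <programlisting role="php">
<![CDATA[
<?php

// Anchors added for illustration; the x modifier enables PCRE_EXTENDED.
$pattern = '/^(?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

$withLetters = preg_match($pattern, '12-abc-34');
$digitsOnly = preg_match($pattern, '12-34-56');
$tooFewLetters = preg_match($pattern, '12-ab-34');

var_dump($withLetters);   // int(1), first alternative
var_dump($digitsOnly);    // int(1), second alternative
var_dump($tooFewLetters); // int(0), a letter is present but dd-aaa-dd fails
]]>
  </programlisting>
 </example>
</para>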
2144
2152
</section>
...
...
@@ -2146,31 +2154,66 @@
2146
2154
<section xml:id="regexp.reference.comments">
2147
2155
<title>Comments</title>
2148
2156
<para>
2149
-
The sequence (?# marks the start of a comment which
2150
-
continues up to the next closing parenthesis. Nested
2157
+
The sequence (?# marks the start of a comment which
2158
+
continues up to the next closing parenthesis. Nested
2151
2159
parentheses are not permitted. The characters that make up a
2152
2160
comment play no part in the pattern matching at all.
2153
2161
</para>
2154
2162
<para>
2155
2163
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2156
-
option is set, an unescaped # character outside a character class
2164
+
option is set, an unescaped # character outside a character class
2157
2165
introduces a comment that continues up to the next newline character
2158
2166
in the pattern.
2159
2167
</para>
2168
+
<para>
2169
+
<example>
2170
+
<title>Usage of comments in PCRE pattern</title>
2171
+
<programlisting role="php">
2172
+
<![CDATA[
2173
+
<?php
2174
+

2175
+
$subject = 'test';
2176
+

2177
+
/* (?# can be used to add comments without enabling PCRE_EXTENDED */
2178
+
$match = preg_match('/te(?# this is a comment)st/', $subject);
2179
+
var_dump($match);
2180
+

2181
+
/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */
2182
+
$match = preg_match('/te #~~~~
2183
+
st/', $subject);
2184
+
var_dump($match);
2185
+

2186
+
/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything
2187
+
that follows an unescaped # on the same line are ignored */
2188
+
$match = preg_match('/te #~~~~
2189
+
st/x', $subject);
2190
+
var_dump($match);
2191
+
]]>
2192
+
</programlisting>
2193
+
&example.outputs;
2194
+
<screen>
2195
+
<![CDATA[
2196
+
int(1)
2197
+
int(0)
2198
+
int(1)
2199
+
]]>
2200
+
</screen>
2201
+
</example>
2202
+
</para>
2160
2203
</section>
2161
2204

2162
2205
<section xml:id="regexp.reference.recursive">
2163
2206
<title>Recursive patterns</title>
2164
2207
<para>
2165
-
Consider the problem of matching a string in parentheses,
2166
-
allowing for unlimited nested parentheses. Without the use
2167
-
of recursion, the best that can be done is to use a pattern
2168
-
that matches up to some fixed depth of nesting. It is not
2169
-
possible to handle an arbitrary nesting depth. Perl 5.6 has
2170
-
provided an experimental facility that allows regular
2171
-
expressions to recurse (among other things). The special
2172
-
item (?R) is provided for the specific case of recursion.
2173
-
This PCRE pattern solves the parentheses problem (assume
2208
+
Consider the problem of matching a string in parentheses,
2209
+
allowing for unlimited nested parentheses. Without the use
2210
+
of recursion, the best that can be done is to use a pattern
2211
+
that matches up to some fixed depth of nesting. It is not
2212
+
possible to handle an arbitrary nesting depth. Perl 5.6 has
2213
+
provided an experimental facility that allows regular
2214
+
expressions to recurse (among other things). The special
2215
+
item (?R) is provided for the specific case of recursion.
2216
+
This PCRE pattern solves the parentheses problem (assume
2174
2217
the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
2175
2218
option is set so that white space is
2176
2219
ignored):
...
...
@@ -2179,45 +2222,45 @@
2179
2222
</para>
2180
2223
<para>
2181
2224
First it matches an opening parenthesis. Then it matches any
2182
-
number of substrings which can either be a sequence of
2183
-
non-parentheses, or a recursive match of the pattern itself
2225
+
number of substrings which can either be a sequence of
2226
+
non-parentheses, or a recursive match of the pattern itself
2184
2227
(i.e. a correctly parenthesized substring). Finally there is
2185
2228
a closing parenthesis.
2186
2229
</para>
2187
2230
<para>
2188
-
This particular example pattern contains nested unlimited
2231
+
This particular example pattern contains nested unlimited
2189
2232
repeats, and so the use of a once-only subpattern for matching
2190
-
strings of non-parentheses is important when applying
2191
-
the pattern to strings that do not match. For example, when
2233
+
strings of non-parentheses is important when applying
2234
+
the pattern to strings that do not match. For example, when
2192
2235
it is applied to
2193
2236

2194
2237
<literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
2195
2238

2196
-
it yields "no match" quickly. However, if a once-only subpattern
2197
-
is not used, the match runs for a very long time
2198
-
indeed because there are so many different ways the + and *
2199
-
repeats can carve up the subject, and all have to be tested
2239
+
it yields "no match" quickly. However, if a once-only subpattern
2240
+
is not used, the match runs for a very long time
2241
+
indeed because there are so many different ways the + and *
2242
+
repeats can carve up the subject, and all have to be tested
2200
2243
before failure can be reported.
2201
2244
</para>
2202
2245
<para>
2203
-
The values set for any capturing subpatterns are those from
2246
+
The values set for any capturing subpatterns are those from
2204
2247
the outermost level of the recursion at which the subpattern
2205
2248
value is set. If the pattern above is matched against
2206
2249

2207
2250
<literal>(ab(cd)ef)</literal>
2208
2251

2209
-
the value for the capturing parentheses is "ef", which is
2210
-
the last value taken on at the top level. If additional
2252
+
the value for the capturing parentheses is "ef", which is
2253
+
the last value taken on at the top level. If additional
2211
2254
parentheses are added, giving
2212
2255

2213
2256
<literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>
2214
2257
then the string they capture
2215
2258
is "ab(cd)ef", the contents of the top level parentheses. If
2216
-
there are more than 15 capturing parentheses in a pattern,
2217
-
PCRE has to obtain extra memory to store data during a
2218
-
recursion, which it does by using pcre_malloc, freeing it
2219
-
via pcre_free afterwards. If no memory can be obtained, it
2220
-
saves data for the first 15 capturing parentheses only, as
2259
+
there are more than 15 capturing parentheses in a pattern,
2260
+
PCRE has to obtain extra memory to store data during a
2261
+
recursion, which it does by using pcre_malloc, freeing it
2262
+
via pcre_free afterwards. If no memory can be obtained, it
2263
+
saves data for the first 15 capturing parentheses only, as
2221
2264
there is no way to give an out-of-memory error from within a
2222
2265
recursion.
2223
2266
</para>
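<para>
 As a hedged sketch, the pattern with the additional parentheses can be
 run through <function>preg_match</function> like this; the variable names
 are illustrative, and only the whole match and the outer capturing group
 are inspected.
</para>
<para>
 <example>
  <title>Capturing the contents of the top level parentheses (illustrative)</title>
  <programlisting role="php">
<![CDATA[
<?php

// The x modifier enables PCRE_EXTENDED so that the spaces are ignored.
$pattern = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';

preg_match($pattern, '(ab(cd)ef)', $matches);

var_dump($matches[0]); // string(10) "(ab(cd)ef)"
var_dump($matches[1]); // string(8) "ab(cd)ef"
]]>
  </programlisting>
 </example>
</para>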
...
...
@@ -2256,75 +2299,75 @@
2256
2299
<title>Performance</title>
2257
2300
<para>
2258
2301
Certain items that may appear in patterns are more efficient
2259
-
than others. It is more efficient to use a character class
2302
+
than others. It is more efficient to use a character class
2260
2303
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
2261
-
In general, the simplest construction that provides the
2262
-
required behaviour is usually the most efficient. Jeffrey
2263
-
Friedl's book contains a lot of discussion about optimizing
2304
+
In general, the simplest construction that provides the
2305
+
required behaviour is usually the most efficient. Jeffrey
2306
+
Friedl's book contains a lot of discussion about optimizing
2264
2307
regular expressions for efficient performance.
2265
2308
</para>
2266
2309
<para>
2267
2310
When a pattern begins with .* and the <link
2268
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2269
-
set, the pattern is implicitly anchored by PCRE, since it
2311
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2312
+
set, the pattern is implicitly anchored by PCRE, since it
2270
2313
can match only at the start of a subject string. However, if
2271
2314
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
2272
2315
is not set, PCRE cannot make this optimization,
2273
-
because the . metacharacter does not then match a newline,
2316
+
because the . metacharacter does not then match a newline,
2274
2317
and if the subject string contains newlines, the pattern may
2275
-
match from the character immediately following one of them
2318
+
match from the character immediately following one of them
2276
2319
instead of from the very start. For example, the pattern
2277
2320

2278
2321
<literal>(.*) second</literal>
2279
2322

2280
2323
matches the subject "first\nand second" (where \n stands for
2281
2324
a newline character) with the first captured substring being
2282
-
"and". In order to do this, PCRE has to retry the match
2325
+
"and". In order to do this, PCRE has to retry the match
2283
2326
starting after every newline in the subject.
2284
2327
</para>
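<para>
 The effect of PCRE_DOTALL on this pattern can be sketched as follows,
 using the <literal>s</literal> modifier to set the option. The variable
 names are illustrative.
</para>
<para>
 <example>
  <title>A leading .* with and without PCRE_DOTALL (illustrative)</title>
  <programlisting role="php">
<![CDATA[
<?php

$subject = "first\nand second";

// Without PCRE_DOTALL the dot stops at the newline, so the match has to
// start after it.
preg_match('/(.*) second/', $subject, $withoutDotall);

// With PCRE_DOTALL the dot also matches the newline, so the match starts
// at the very beginning of the subject.
preg_match('/(.*) second/s', $subject, $withDotall);

var_dump($withoutDotall[1]); // string(3) "and"
var_dump($withDotall[1]);    // the 9 characters "first", a newline, "and"
]]>
  </programlisting>
 </example>
</para>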
2285
2328
<para>
2286
2329
If you are using such a pattern with subject strings that do
2287
-
not contain newlines, the best performance is obtained by
2330
+
not contain newlines, the best performance is obtained by
2288
2331
setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,
2289
-
or starting the pattern with ^.* to
2290
-
indicate explicit anchoring. That saves PCRE from having to
2332
+
or starting the pattern with ^.* to
2333
+
indicate explicit anchoring. That saves PCRE from having to
2291
2334
scan along the subject looking for a newline to restart at.
2292
2335
</para>
2293
2336
<para>
2294
-
Beware of patterns that contain nested indefinite repeats.
2295
-
These can take a long time to run when applied to a string
2337
+
Beware of patterns that contain nested indefinite repeats.
2338
+
These can take a long time to run when applied to a string
2296
2339
that does not match. Consider the pattern fragment
2297
2340

2298
2341
<literal>(a+)*</literal>
2299
2342
</para>
2300
2343
<para>
2301
-
This can match "aaaa" in 16 different ways, and this number
2302
-
increases very rapidly as the string gets longer. (The *
2303
-
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2304
-
those cases other than 0 or 4, the + repeats can match different
2344
+
This can match "aaaa" in 16 different ways, and this number
2345
+
increases very rapidly as the string gets longer. (The *
2346
+
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2347
+
those cases other than 0 or 4, the + repeats can match different
2305
2348
numbers of times.) When the remainder of the pattern is such
2306
-
that the entire match is going to fail, PCRE has in principle
2307
-
to try every possible variation, and this can take an
2349
+
that the entire match is going to fail, PCRE has in principle
2350
+
to try every possible variation, and this can take an
2308
2351
extremely long time.
2309
2352
</para>
2310
2353
<para>
2311
-
An optimization catches some of the more simple cases such
2354
+
An optimization catches some of the more simple cases such
2312
2355
as
2313
2356

2314
2357
<literal>(a+)*b</literal>
2315
2358

2316
-
where a literal character follows. Before embarking on the
2359
+
where a literal character follows. Before embarking on the
2317
2360
standard matching procedure, PCRE checks that there is a "b"
2318
-
later in the subject string, and if there is not, it fails
2319
-
the match immediately. However, when there is no following
2320
-
literal this optimization cannot be used. You can see the
2361
+
later in the subject string, and if there is not, it fails
2362
+
the match immediately. However, when there is no following
2363
+
literal this optimization cannot be used. You can see the
2321
2364
difference by comparing the behaviour of
2322
2365

2323
2366
<literal>(a+)*\d</literal>
2324
2367

2325
-
with the pattern above. The pattern above, (a+)*b, fails almost
2326
-
instantly when applied to a whole line of "a" characters,
2327
-
whereas (a+)*\d takes an appreciable time with strings
2368
+
with the pattern above. The pattern above, (a+)*b, fails almost
2369
+
instantly when applied to a whole line of "a" characters,
2370
+
whereas (a+)*\d takes an appreciable time with strings
2328
2371
longer than about 20 characters.
2329
2372
</para>
2330
2373
</section>
2331
2374