reference/pcre/pattern.syntax.xml
bb4abab22bf0204b4dba0140ac5fc9daa6888e0f
...
...
@@ -1,28 +1,28 @@
1
1
<?xml version="1.0" encoding="utf-8"?>
2
2
<!-- $Revision$ -->
3
3
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
4
-
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook">
4
+
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
5
5
<title>Pattern Syntax</title>
6
6
<titleabbrev>PCRE regex syntax</titleabbrev>
7
7

8
8
<section xml:id="regexp.introduction">
9
9
<title>Introduction</title>
10
10
<para>
11
-
The syntax and semantics of the regular expressions
12
-
supported by PCRE are described below. Regular expressions are
13
-
also described in the Perl documentation and in a number of
14
-
other books, some of which have copious examples. Jeffrey
15
-
Friedl's "Mastering Regular Expressions", published by
16
-
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
11
+
The syntax and semantics of the regular expressions
12
+
supported by PCRE are described in this section. Regular expressions are
13
+
also described in the Perl documentation and in a number of
14
+
other books, some of which have copious examples. Jeffrey
15
+
Friedl's "Mastering Regular Expressions", published by
16
+
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
17
17
The description here is intended as reference documentation.
18
18
</para>
19
19
<para>
20
-
A regular expression is a pattern that is matched against a
20
+
A regular expression is a pattern that is matched against a
21
21
subject string from left to right. Most characters stand for
22
22
themselves in a pattern, and match the corresponding
23
23
characters in the subject. As a trivial example, the pattern
24
24
<literal>The quick brown fox</literal>
25
-
matches a portion of a subject string that is identical to
25
+
matches a portion of a subject string that is identical to
26
26
itself.
27
27
</para>
28
28
</section>
...
...
@@ -32,6 +32,7 @@
32
32
When using the PCRE functions, it is required that the pattern is enclosed
33
33
by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,
34
34
non-backslash, non-whitespace character.
35
+
Leading whitespace before a valid delimiter is silently ignored.
35
36
</para>
36
37
<para>
37
38
Often used delimiters are forward slashes (<literal>/</literal>), hash
...
...
@@ -48,6 +49,26 @@
48
49
</programlisting>
49
50
</informalexample>
50
51
</para>
52
+
<para>
53
+
It is also possible to use
54
+
bracket style delimiters where the opening and closing brackets are the
55
+
starting and ending delimiter, respectively. <literal>()</literal>,
56
+
<literal>{}</literal>, <literal>[]</literal> and <literal>&lt;&gt;</literal>
57
+
are all valid bracket style delimiter pairs.
58
+
<informalexample>
59
+
<programlisting>
60
+
<![CDATA[
61
+
(this [is] a (pattern))
62
+
{this [is] a (pattern)}
63
+
[this [is] a (pattern)]
64
+
<this [is] a (pattern)>
65
+
]]>
66
+
</programlisting>
67
+
</informalexample>
68
+
Bracket style delimiters do not need to be escaped when they are used as
69
+
meta-characters within the pattern, but as with other delimiters they must be
70
+
escaped when they are used as literal characters.
71
+
</para>
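<para>
As a minimal sketch of the bracket style delimiters described above, the
following passes one pattern to <function>preg_match</function> four times,
each time wrapped in a different delimiter pair. The pattern, the subject
string and the variable names are assumptions made up for this illustration.
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Hypothetical subject string, chosen only for illustration.
$subject = 'order 42-abc';

// The same pattern, (\d+)-[a-z]+, wrapped in each bracket style delimiter pair.
// The inner ( ) and [ ] act as meta-characters and need no escaping.
$patterns = [
    '((\d+)-[a-z]+)',
    '{(\d+)-[a-z]+}',
    '[(\d+)-[a-z]+]',
    '<(\d+)-[a-z]+>',
];

foreach ($patterns as $pattern) {
    // Each call is expected to return 1 and to capture "42" in $matches[1].
    var_dump(preg_match($pattern, $subject, $matches), $matches[1]);
}
?>
]]>
</programlisting>
</informalexample>
</para>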
51
72
<para>
52
73
If the delimiter needs to be matched inside the pattern it must be
53
74
escaped using a backslash. If the delimiter appears often inside the
...
...
@@ -65,18 +86,6 @@
65
86
for injection into a pattern and its optional second parameter may be used
66
87
to specify the delimiter to be escaped.
67
88
</para>
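<para>
The following sketch shows one way the second parameter of
<function>preg_quote</function> might be used. The input string and variable
names are hypothetical.
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Hypothetical text that should be matched literally.
$needle = 'a/b.c';

// Passing '/' as the second argument also escapes the delimiter used below.
$quoted = preg_quote($needle, '/');
var_dump($quoted); // string(7) "a\/b\.c"

var_dump(preg_match('/^' . $quoted . '$/', 'a/b.c')); // int(1)
var_dump(preg_match('/^' . $quoted . '$/', 'a/bxc')); // int(0): the dot is literal
?>
]]>
</programlisting>
</informalexample>
</para>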
68
-
<para>
69
-
In addition to the aforementioned delimiters, it is also possible to use
70
-
bracket style delimiters where the opening and closing brackets are the
71
-
starting and ending delimiter, respectively.
72
-
<informalexample>
73
-
<programlisting>
74
-
<![CDATA[
75
-
{this is a pattern}
76
-
]]>
77
-
</programlisting>
78
-
</informalexample>
79
-
</para>
80
89
<para>
81
90
You may add <link linkend="reference.pcre.pattern.modifiers">pattern
82
91
modifiers</link> after the ending delimiter. The following is an example
...
...
@@ -93,103 +102,100 @@
93
102
<section xml:id="regexp.reference.meta">
94
103
<title>Meta-characters</title>
95
104
<para>
96
-
The power of regular expressions comes from the
105
+
The power of regular expressions comes from the
97
106
ability to include alternatives and repetitions in the
98
-
pattern. These are encoded in the pattern by the use of
99
-
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
107
+
pattern. These are encoded in the pattern by the use of
108
+
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
100
109
are interpreted in some special way.
101
110
</para>
102
111
<para>
103
-
There are two different sets of meta-characters: those that
104
-
are recognized anywhere in the pattern except within square
112
+
There are two different sets of meta-characters: those that
113
+
are recognized anywhere in the pattern except within square
105
114
brackets, and those that are recognized in square brackets.
106
115
Outside square brackets, the meta-characters are as follows:
107
-
<variablelist>
108
-
<varlistentry>
109
-
<term><emphasis>\</emphasis></term>
110
-
<listitem><simpara>general escape character with several uses</simpara></listitem>
111
-
</varlistentry>
112
-
<varlistentry>
113
-
<term><emphasis>^</emphasis></term>
114
-
<listitem><simpara>assert start of subject (or line, in multiline mode)</simpara></listitem>
115
-
</varlistentry>
116
-
<varlistentry>
117
-
<term><emphasis>$</emphasis></term>
118
-
<listitem><simpara>assert end of subject (or line, in multiline mode)</simpara></listitem>
119
-
</varlistentry>
120
-
<varlistentry>
121
-
<term><emphasis>.</emphasis></term>
122
-
<listitem><simpara>match any character except newline (by default)</simpara></listitem>
123
-
</varlistentry>
124
-
<varlistentry>
125
-
<term><emphasis>[</emphasis></term>
126
-
<listitem><simpara>start character class definition</simpara></listitem>
127
-
</varlistentry>
128
-
<varlistentry>
129
-
<term><emphasis>]</emphasis></term>
130
-
<listitem><simpara>end character class definition</simpara></listitem>
131
-
</varlistentry>
132
-
<varlistentry>
133
-
<term><emphasis>|</emphasis></term>
134
-
<listitem><simpara>start of alternative branch</simpara></listitem>
135
-
</varlistentry>
136
-
<varlistentry>
137
-
<term><emphasis>(</emphasis></term>
138
-
<listitem><simpara>start subpattern</simpara></listitem>
139
-
</varlistentry>
140
-
<varlistentry>
141
-
<term><emphasis>)</emphasis></term>
142
-
<listitem><simpara>end subpattern</simpara></listitem>
143
-
</varlistentry>
144
-
<varlistentry>
145
-
<term><emphasis>?</emphasis></term>
146
-
<listitem>
147
-
<simpara>
148
-
extends the meaning of (, also 0 or 1 quantifier, also makes greedy
149
-
quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)
150
-
</simpara>
151
-
</listitem>
152
-
</varlistentry>
153
-
<varlistentry>
154
-
<term><emphasis>*</emphasis></term>
155
-
<listitem><simpara>0 or more quantifier</simpara></listitem>
156
-
</varlistentry>
157
-
<varlistentry>
158
-
<term><emphasis>+</emphasis></term>
159
-
<listitem><simpara>1 or more quantifier</simpara></listitem>
160
-
</varlistentry>
161
-
<varlistentry>
162
-
<term><emphasis>{</emphasis></term>
163
-
<listitem><simpara>start min/max quantifier</simpara></listitem>
164
-
</varlistentry>
165
-
<varlistentry>
166
-
<term><emphasis>}</emphasis></term>
167
-
<listitem><simpara>end min/max quantifier</simpara></listitem>
168
-
</varlistentry>
169
-
</variablelist>
116
+

117
+
<table>
118
+
<title>Meta-characters outside square brackets</title>
119
+
<tgroup cols="2">
120
+
<thead>
121
+
<row>
122
+
<entry>Meta-character</entry><entry>Description</entry>
123
+
</row>
124
+
</thead>
125
+
<tbody>
126
+
<row>
127
+
<entry>\</entry><entry>general escape character with several uses</entry>
128
+
</row>
129
+
<row>
130
+
<entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
131
+
</row>
132
+
<row>
133
+
<entry>$</entry><entry>assert end of subject or before a terminating newline (or
134
+
end of line, in multiline mode)</entry>
135
+
</row>
136
+
<row>
137
+
<entry>.</entry><entry>match any character except newline (by default)</entry>
138
+
</row>
139
+
<row>
140
+
<entry>[</entry><entry>start character class definition</entry>
141
+
</row>
142
+
<row>
143
+
<entry>]</entry><entry>end character class definition</entry>
144
+
</row>
145
+
<row>
146
+
<entry>|</entry><entry>start of alternative branch</entry>
147
+
</row>
148
+
<row>
149
+
<entry>(</entry><entry>start subpattern</entry>
150
+
</row>
151
+
<row>
152
+
<entry>)</entry><entry>end subpattern</entry>
153
+
</row>
154
+
<row>
155
+
<entry>?</entry><entry>extends the meaning of (, also 0 or 1 quantifier, also makes greedy
156
+
quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)</entry>
157
+
</row>
158
+
<row>
159
+
<entry>*</entry><entry>0 or more quantifier</entry>
160
+
</row>
161
+
<row>
162
+
<entry>+</entry><entry>1 or more quantifier</entry>
163
+
</row>
164
+
<row>
165
+
<entry>{</entry><entry>start min/max quantifier</entry>
166
+
</row>
167
+
<row>
168
+
<entry>}</entry><entry>end min/max quantifier</entry>
169
+
</row>
170
+
</tbody>
171
+
</tgroup>
172
+
</table>
170
173

171
174
Part of a pattern that is in square brackets is called a
172
-
"character class". In a character class the only
175
+
<link linkend="regexp.reference.character-classes">character class</link>. In a character class the only
173
176
meta-characters are:
174
177

175
-
<variablelist>
176
-
<varlistentry>
177
-
<term><emphasis>\</emphasis></term>
178
-
<listitem><simpara>general escape character</simpara></listitem>
179
-
</varlistentry>
180
-
<varlistentry>
181
-
<term><emphasis>^</emphasis></term>
182
-
<listitem><simpara>negate the class, but only if the first character</simpara></listitem>
183
-
</varlistentry>
184
-
<varlistentry>
185
-
<term><emphasis>-</emphasis></term>
186
-
<listitem><simpara>indicates character range</simpara></listitem>
187
-
</varlistentry>
188
-
<varlistentry>
189
-
<term><emphasis>]</emphasis></term>
190
-
<listitem><simpara>terminates the character class</simpara></listitem>
191
-
</varlistentry>
192
-
</variablelist>
178
+
<table>
179
+
<title>Meta-characters inside square brackets (<emphasis>character classes</emphasis>)</title>
180
+
<tgroup cols="2">
181
+
<thead>
182
+
<row>
183
+
<entry>Meta-character</entry><entry>Description</entry>
184
+
</row>
185
+
</thead>
186
+
<tbody>
187
+
<row>
188
+
<entry>\</entry><entry>general escape character</entry>
189
+
</row>
190
+
<row>
191
+
<entry>^</entry><entry>negate the class, but only if it is the first character</entry>
192
+
</row>
193
+
<row>
194
+
<entry>-</entry><entry>indicates character range</entry>
195
+
</row>
196
+
</tbody>
197
+
</tgroup>
198
+
</table>
193
199

194
200
The following sections describe the use of each of the
195
201
meta-characters.
...
...
@@ -199,9 +205,9 @@
199
205
<section xml:id="regexp.reference.escape">
200
206
<title>Escape sequences</title>
201
207
<para>
202
-
The backslash character has several uses. Firstly, if it is
208
+
The backslash character has several uses. Firstly, if it is
203
209
followed by a non-alphanumeric character, it takes away any
204
-
special meaning that character may have. This use of
210
+
special meaning that character may have. This use of
205
211
backslash as an escape character applies both inside and
206
212
outside character classes.
207
213
</para>
...
...
@@ -210,7 +216,7 @@
210
216
"\*" in the pattern. This applies whether or not the
211
217
following character would otherwise be interpreted as a
212
218
meta-character, so it is always safe to precede a non-alphanumeric
213
-
with "\" to specify that it stands for itself. In
219
+
with "\" to specify that it stands for itself. In
214
220
particular, if you want to match a backslash, you write "\\".
215
221
</para>
216
222
<note>
...
...
@@ -232,10 +238,10 @@
232
238
<para>
233
239
A second use of backslash provides a way of encoding
234
240
non-printing characters in patterns in a visible manner. There
235
-
is no restriction on the appearance of non-printing characters,
241
+
is no restriction on the appearance of non-printing characters,
236
242
apart from the binary zero that terminates a pattern,
237
243
but when a pattern is being prepared by text editing, it is
238
-
usually easier to use one of the following escape sequences
244
+
usually easier to use one of the following escape sequences
239
245
than the binary character it represents:
240
246
</para>
241
247
<para>
...
...
@@ -296,6 +302,12 @@
296
302
<simpara>carriage return (hex 0D)</simpara>
297
303
</listitem>
298
304
</varlistentry>
305
+
<varlistentry>
306
+
<term><emphasis>\R</emphasis></term>
307
+
<listitem>
308
+
<simpara>line break: matches \n, \r and \r\n</simpara>
309
+
</listitem>
310
+
</varlistentry>
299
311
<varlistentry>
300
312
<term><emphasis>\t</emphasis></term>
301
313
<listitem>
...
...
@@ -320,9 +332,9 @@
320
332
</para>
321
333
<para>
322
334
The precise effect of "<literal>\cx</literal>" is as follows:
323
-
if "<literal>x</literal>" is a lower case letter, it is converted
335
+
if "<literal>x</literal>" is a lower case letter, it is converted
324
336
to upper case. Then bit 6 of the character (hex 40) is inverted.
325
-
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
+
Thus "<literal>\cz</literal>" becomes hex 1A, but
326
338
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
327
339
becomes hex 7B.
328
340
</para>
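<para>
The arithmetic described above can be checked with a short sketch. It only
illustrates the bit inversion; the subject strings are chosen for the example.
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// \cz: "z" is converted to "Z" (hex 5A), then bit 6 (hex 40) is inverted.
var_dump(dechex(ord('Z') ^ 0x40));       // string(2) "1a"
var_dump(preg_match('/^\cz$/', "\x1A")); // int(1)

// \c; : ";" is hex 3B, and inverting bit 6 gives hex 7B, the "{" character.
var_dump(dechex(ord(';') ^ 0x40));       // string(2) "7b"
var_dump(preg_match('/^\c;$/', '{'));    // int(1)
?>
]]>
</programlisting>
</informalexample>
</para>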
...
...
@@ -338,7 +350,7 @@
338
350
</para>
339
351
<para>
340
352
After "<literal>\0</literal>" up to two further octal digits are read.
341
-
In both cases, if there are fewer than two digits, just those that
353
+
In both cases, if there are fewer than two digits, just those that
342
354
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
343
355
specifies two binary zeros followed by a BEL character. Make sure you
344
356
supply two digits after the initial zero if the character
...
...
@@ -347,20 +359,20 @@
347
359
<para>
348
360
The handling of a backslash followed by a digit other than 0
349
361
is complicated. Outside a character class, PCRE reads it
350
-
and any following digits as a decimal number. If the number
351
-
is less than 10, or if there have been at least that many
352
-
previous capturing left parentheses in the expression, the
353
-
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
354
-
of how this works is given later, following the discussion
362
+
and any following digits as a decimal number. If the number
363
+
is less than 10, or if there have been at least that many
364
+
previous capturing left parentheses in the expression, the
365
+
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
366
+
of how this works is given later, following the discussion
355
367
of parenthesized subpatterns.
356
368
</para>
357
369
<para>
358
-
Inside a character class, or if the decimal number is
370
+
Inside a character class, or if the decimal number is
359
371
greater than 9 and there have not been that many capturing
360
372
subpatterns, PCRE re-reads up to three octal digits following
361
373
the backslash, and generates a single byte from the
362
374
least significant 8 bits of the value. Any subsequent digits
363
-
stand for themselves. For example:
375
+
stand for themselves. For example:
364
376
</para>
365
377
<para>
366
378
<variablelist>
...
...
@@ -428,7 +440,7 @@
428
440
digits are ever read.
429
441
</para>
430
442
<para>
431
-
All the sequences that define a single byte value can be
443
+
All the sequences that define a single byte value can be
432
444
used both inside and outside character classes. In addition,
433
445
inside a character class, the sequence "<literal>\b</literal>"
434
446
is interpreted as the backspace character (hex 08). Outside a character
...
...
@@ -450,11 +462,11 @@
450
462
</varlistentry>
451
463
<varlistentry>
452
464
<term><emphasis>\h</emphasis></term>
453
-
<listitem><simpara>any horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
465
+
<listitem><simpara>any horizontal whitespace character</simpara></listitem>
454
466
</varlistentry>
455
467
<varlistentry>
456
468
<term><emphasis>\H</emphasis></term>
457
-
<listitem><simpara>any character that is not a horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
469
+
<listitem><simpara>any character that is not a horizontal whitespace character</simpara></listitem>
458
470
</varlistentry>
459
471
<varlistentry>
460
472
<term><emphasis>\s</emphasis></term>
...
...
@@ -466,11 +478,11 @@
466
478
</varlistentry>
467
479
<varlistentry>
468
480
<term><emphasis>\v</emphasis></term>
469
-
<listitem><simpara>any vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
481
+
<listitem><simpara>any vertical whitespace character</simpara></listitem>
470
482
</varlistentry>
471
483
<varlistentry>
472
484
<term><emphasis>\V</emphasis></term>
473
-
<listitem><simpara>any character that is not a vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
485
+
<listitem><simpara>any character that is not a vertical whitespace character</simpara></listitem>
474
486
</varlistentry>
475
487
<varlistentry>
476
488
<term><emphasis>\w</emphasis></term>
...
...
@@ -487,9 +499,15 @@
487
499
characters into two disjoint sets. Any given character
488
500
matches one, and only one, of each pair.
489
501
</para>
502
+
<para>
503
+
The "whitespace" characters are HT (9), LF (10), FF (12), CR (13),
504
+
and space (32). However, if locale-specific matching is happening,
505
+
characters with code points in the range 128-255 may also be considered
506
+
as whitespace characters, for instance, NBSP (hex A0).
507
+
</para>
490
508
<para>
491
509
A "word" character is any letter or digit or the underscore
492
-
character, that is, any character which can be part of a
510
+
character, that is, any character which can be part of a
493
511
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
494
512
controlled by PCRE's character tables, and may vary if locale-specific
495
513
matching is taking place. For example, in the "fr" (French) locale, some
...
...
@@ -498,15 +516,15 @@
498
516
</para>
499
517
<para>
500
518
These character type sequences can appear both inside and
501
-
outside character classes. They each match one character of
502
-
the appropriate type. If the current matching point is at
519
+
outside character classes. They each match one character of
520
+
the appropriate type. If the current matching point is at
503
521
the end of the subject string, all of them fail, since there
504
522
is no character to match.
505
523
</para>
506
524
<para>
507
-
The fourth use of backslash is for certain simple
525
+
The fourth use of backslash is for certain simple
508
526
assertions. An assertion specifies a condition that has to be met
509
-
at a particular point in a match, without consuming any
527
+
at a particular point in a match, without consuming any
510
528
characters from the subject string. The use of subpatterns
511
529
for more complicated assertions is described below. The
512
530
backslashed assertions are
...
...
@@ -545,7 +563,7 @@
545
563
</variablelist>
546
564
</para>
547
565
<para>
548
-
These assertions may not appear in character classes (but
566
+
These assertions may not appear in character classes (but
549
567
note that "<literal>\b</literal>" has a different meaning, namely the backspace
550
568
character, inside a character class).
551
569
</para>
...
...
@@ -553,20 +571,20 @@
553
571
A word boundary is a position in the subject string where
554
572
the current character and the previous character do not both
555
573
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
556
-
<literal>\w</literal> and the other matches
574
+
<literal>\w</literal> and the other matches
557
575
<literal>\W</literal>), or the start or end of the string if the first
558
576
or last character matches <literal>\w</literal>, respectively.
559
577
</para>
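<para>
As an illustration of the word boundary definition above, the following
sketch lists where <literal>\b</literal> allows a match. The subject string
is a made-up example.
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Hypothetical subject: "cat" occurs at offsets 0, 7 (inside "concat") and 11.
$subject = 'cat concat cat.';

// \b requires a \w / \W transition (or the start or end of the string) on both
// sides, so the "cat" inside "concat" is skipped.
$count = preg_match_all('/\bcat\b/', $subject, $matches, PREG_OFFSET_CAPTURE);

var_dump($count);                          // int(2)
var_dump(array_column($matches[0], 1));    // offsets 0 and 11
?>
]]>
</programlisting>
</informalexample>
</para>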
560
578
<para>
561
579
The <literal>\A</literal>, <literal>\Z</literal>, and
562
-
<literal>\z</literal> assertions differ from the traditional
563
-
circumflex and dollar (described below) in that they only
564
-
ever match at the very start and end of the subject string,
565
-
whatever options are set. They are not affected by the
580
+
<literal>\z</literal> assertions differ from the traditional
581
+
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)
582
+
in that they only ever match at the very start and end of the subject string,
583
+
whatever options are set. They are not affected by the
566
584
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
567
585
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
568
-
options. The difference between <literal>\Z</literal> and
569
-
<literal>\z</literal> is that <literal>\Z</literal> matches before a
586
+
options. The difference between <literal>\Z</literal> and
587
+
<literal>\z</literal> is that <literal>\Z</literal> matches before a
570
588
newline that is the last character of the string as well as at the end of
571
589
the string, whereas <literal>\z</literal> matches only at the end.
572
590
</para>
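<para>
The difference between <literal>\Z</literal> and <literal>\z</literal> can be
sketched as follows. The subject string is an assumption chosen so that it
ends in a single newline, and <literal>D</literal> is the modifier letter for
PCRE_DOLLAR_ENDONLY.
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
$subject = "foo\n"; // subject ending in a single newline

var_dump(preg_match('/foo\Z/', $subject));  // int(1): \Z also matches before a final newline
var_dump(preg_match('/foo\z/', $subject));  // int(0): \z matches only at the very end

// The D modifier changes the behaviour of $, but not of \Z.
var_dump(preg_match('/foo$/D', $subject));  // int(0)
var_dump(preg_match('/foo\Z/D', $subject)); // int(1)
?>
]]>
</programlisting>
</informalexample>
</para>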
...
...
@@ -583,12 +601,16 @@
583
601
regexp metacharacters in the pattern. For example:
584
602
<literal>\w+\Q.$.\E$</literal> will match one or more word characters,
585
603
followed by literals <literal>.$.</literal> and anchored at the end of
586
-
the string.
604
+
the string. Note that this does not change the behavior of
605
+
delimiters; for instance the pattern <literal>#\Q#\E#$</literal>
606
+
is not valid, because the second <literal>#</literal> marks the end
607
+
of the pattern, and the <literal>\E#</literal> is interpreted as invalid
608
+
modifiers.
587
609
</para>
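<para>
The <literal>\Q</literal> and <literal>\E</literal> behavior described above
can be tried with a sketch like the following. The subject strings are
hypothetical, and the <literal>@</literal> operator is used only to suppress
the warning raised for the invalid pattern.
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// Single quotes keep PHP from interpolating the "$" characters.
$pattern = '/\w+\Q.$.\E$/';

var_dump(preg_match($pattern, 'price.$.')); // int(1): ".$." is matched literally
var_dump(preg_match($pattern, 'price.x.')); // int(0)

// \Q...\E does not protect the delimiter: the second "#" ends the pattern,
// so preg_match() raises a warning and returns false.
var_dump(@preg_match('#\Q#\E#$', '#'));     // bool(false)

// Choosing a delimiter that does not occur in the pattern avoids the problem.
var_dump(preg_match('/\Q#\E$/', '#'));      // int(1)
?>
]]>
</programlisting>
</informalexample>
</para>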
588
610

589
611
<para>
590
-
<literal>\K</literal> can be used to reset the match start since
591
-
PHP 5.2.4. For example, the pattern <literal>foo\Kbar</literal> matches
612
+
<literal>\K</literal> can be used to reset the match start.
613
+
For example, the pattern <literal>foo\Kbar</literal> matches
592
614
"foobar", but reports that it has matched "bar". The use of
593
615
<literal>\K</literal> does not interfere with the setting of captured
594
616
substrings. For example, when the pattern <literal>(foo)\Kbar</literal>
...
...
@@ -818,7 +840,7 @@
818
840
<row rowsep="1">
819
841
<entry><literal>So</literal></entry>
820
842
<entry>Other symbol</entry>
821
-
<entry></entry>
843
+
<entry>Includes emojis</entry>
822
844
</row>
823
845
<row>
824
846
<entry><literal>Z</literal></entry>
...
...
@@ -844,7 +866,7 @@
844
866
</tgroup>
845
867
</table>
846
868
<para>
847
-
Extended properties such as "Greek" or "InMusicalSymbols" are not
869
+
Extended properties such as <literal>InMusicalSymbols</literal> are not
848
870
supported by PCRE.
849
871
</para>
850
872
<para>
...
...
@@ -852,15 +874,193 @@
852
874
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
853
875
</para>
854
876
<para>
855
-
The <literal>\X</literal> escape matches any number of Unicode characters
856
-
that form an extended Unicode sequence. <literal>\X</literal> is equivalent
857
-
to <literal>(?>\PM\pM*)</literal>.
877
+
Sets of Unicode characters are defined as belonging to certain scripts. A
878
+
character from one of these sets can be matched using a script name. For
879
+
example:
858
880
</para>
881
+
<itemizedlist>
882
+
<listitem>
883
+
<simpara><literal>\p{Greek}</literal></simpara>
884
+
</listitem>
885
+
<listitem>
886
+
<simpara><literal>\P{Han}</literal></simpara>
887
+
</listitem>
888
+
</itemizedlist>
859
889
<para>
860
-
That is, it matches a character without the "mark" property, followed
861
-
by zero or more characters with the "mark" property, and treats the
862
-
sequence as an atomic group (see below). Characters with the "mark"
863
-
property are typically accents that affect the preceding character.
890
+
Those that are not part of an identified script are lumped together as
891
+
<literal>Common</literal>. The current list of scripts is:
892
+
</para>
893
+
<table>
894
+
<title>Supported scripts</title>
895
+
<tgroup cols="5">
896
+
<tbody>
897
+
<row>
898
+
<entry><literal>Arabic</literal></entry>
899
+
<entry><literal>Armenian</literal></entry>
900
+
<entry><literal>Avestan</literal></entry>
901
+
<entry><literal>Balinese</literal></entry>
902
+
<entry><literal>Bamum</literal></entry>
903
+
</row>
904
+
<row>
905
+
<entry><literal>Batak</literal></entry>
906
+
<entry><literal>Bengali</literal></entry>
907
+
<entry><literal>Bopomofo</literal></entry>
908
+
<entry><literal>Brahmi</literal></entry>
909
+
<entry><literal>Braille</literal></entry>
910
+
</row>
911
+
<row>
912
+
<entry><literal>Buginese</literal></entry>
913
+
<entry><literal>Buhid</literal></entry>
914
+
<entry><literal>Canadian_Aboriginal</literal></entry>
915
+
<entry><literal>Carian</literal></entry>
916
+
<entry><literal>Chakma</literal></entry>
917
+
</row>
918
+
<row>
919
+
<entry><literal>Cham</literal></entry>
920
+
<entry><literal>Cherokee</literal></entry>
921
+
<entry><literal>Common</literal></entry>
922
+
<entry><literal>Coptic</literal></entry>
923
+
<entry><literal>Cuneiform</literal></entry>
924
+
</row>
925
+
<row>
926
+
<entry><literal>Cypriot</literal></entry>
927
+
<entry><literal>Cyrillic</literal></entry>
928
+
<entry><literal>Deseret</literal></entry>
929
+
<entry><literal>Devanagari</literal></entry>
930
+
<entry><literal>Egyptian_Hieroglyphs</literal></entry>
931
+
</row>
932
+
<row>
933
+
<entry><literal>Ethiopic</literal></entry>
934
+
<entry><literal>Georgian</literal></entry>
935
+
<entry><literal>Glagolitic</literal></entry>
936
+
<entry><literal>Gothic</literal></entry>
937
+
<entry><literal>Greek</literal></entry>
938
+
</row>
939
+
<row>
940
+
<entry><literal>Gujarati</literal></entry>
941
+
<entry><literal>Gurmukhi</literal></entry>
942
+
<entry><literal>Han</literal></entry>
943
+
<entry><literal>Hangul</literal></entry>
944
+
<entry><literal>Hanunoo</literal></entry>
945
+
</row>
946
+
<row>
947
+
<entry><literal>Hebrew</literal></entry>
948
+
<entry><literal>Hiragana</literal></entry>
949
+
<entry><literal>Imperial_Aramaic</literal></entry>
950
+
<entry><literal>Inherited</literal></entry>
951
+
<entry><literal>Inscriptional_Pahlavi</literal></entry>
952
+
</row>
953
+
<row>
954
+
<entry><literal>Inscriptional_Parthian</literal></entry>
955
+
<entry><literal>Javanese</literal></entry>
956
+
<entry><literal>Kaithi</literal></entry>
957
+
<entry><literal>Kannada</literal></entry>
958
+
<entry><literal>Katakana</literal></entry>
959
+
</row>
960
+
<row>
961
+
<entry><literal>Kayah_Li</literal></entry>
962
+
<entry><literal>Kharoshthi</literal></entry>
963
+
<entry><literal>Khmer</literal></entry>
964
+
<entry><literal>Lao</literal></entry>
965
+
<entry><literal>Latin</literal></entry>
966
+
</row>
967
+
<row>
968
+
<entry><literal>Lepcha</literal></entry>
969
+
<entry><literal>Limbu</literal></entry>
970
+
<entry><literal>Linear_B</literal></entry>
971
+
<entry><literal>Lisu</literal></entry>
972
+
<entry><literal>Lycian</literal></entry>
973
+
</row>
974
+
<row>
975
+
<entry><literal>Lydian</literal></entry>
976
+
<entry><literal>Malayalam</literal></entry>
977
+
<entry><literal>Mandaic</literal></entry>
978
+
<entry><literal>Meetei_Mayek</literal></entry>
979
+
<entry><literal>Meroitic_Cursive</literal></entry>
980
+
</row>
981
+
<row>
982
+
<entry><literal>Meroitic_Hieroglyphs</literal></entry>
983
+
<entry><literal>Miao</literal></entry>
984
+
<entry><literal>Mongolian</literal></entry>
985
+
<entry><literal>Myanmar</literal></entry>
986
+
<entry><literal>New_Tai_Lue</literal></entry>
987
+
</row>
988
+
<row>
989
+
<entry><literal>Nko</literal></entry>
990
+
<entry><literal>Ogham</literal></entry>
991
+
<entry><literal>Old_Italic</literal></entry>
992
+
<entry><literal>Old_Persian</literal></entry>
993
+
<entry><literal>Old_South_Arabian</literal></entry>
994
+
</row>
995
+
<row>
996
+
<entry><literal>Old_Turkic</literal></entry>
997
+
<entry><literal>Ol_Chiki</literal></entry>
998
+
<entry><literal>Oriya</literal></entry>
999
+
<entry><literal>Osmanya</literal></entry>
1000
+
<entry><literal>Phags_Pa</literal></entry>
1001
+
</row>
1002
+
<row>
1003
+
<entry><literal>Phoenician</literal></entry>
1004
+
<entry><literal>Rejang</literal></entry>
1005
+
<entry><literal>Runic</literal></entry>
1006
+
<entry><literal>Samaritan</literal></entry>
1007
+
<entry><literal>Saurashtra</literal></entry>
1008
+
</row>
1009
+
<row>
1010
+
<entry><literal>Sharada</literal></entry>
1011
+
<entry><literal>Shavian</literal></entry>
1012
+
<entry><literal>Sinhala</literal></entry>
1013
+
<entry><literal>Sora_Sompeng</literal></entry>
1014
+
<entry><literal>Sundanese</literal></entry>
1015
+
</row>
1016
+
<row>
1017
+
<entry><literal>Syloti_Nagri</literal></entry>
1018
+
<entry><literal>Syriac</literal></entry>
1019
+
<entry><literal>Tagalog</literal></entry>
1020
+
<entry><literal>Tagbanwa</literal></entry>
1021
+
<entry><literal>Tai_Le</literal></entry>
1022
+
</row>
1023
+
<row>
1024
+
<entry><literal>Tai_Tham</literal></entry>
1025
+
<entry><literal>Tai_Viet</literal></entry>
1026
+
<entry><literal>Takri</literal></entry>
1027
+
<entry><literal>Tamil</literal></entry>
1028
+
<entry><literal>Telugu</literal></entry>
1029
+
</row>
1030
+
<row>
1031
+
<entry><literal>Thaana</literal></entry>
1032
+
<entry><literal>Thai</literal></entry>
1033
+
<entry><literal>Tibetan</literal></entry>
1034
+
<entry><literal>Tifinagh</literal></entry>
1035
+
<entry><literal>Ugaritic</literal></entry>
1036
+
</row>
1037
+
<row>
1038
+
<entry><literal>Vai</literal></entry>
1039
+
<entry><literal>Yi</literal></entry>
1040
+
<entry />
1041
+
<entry />
1042
+
<entry />
1043
+
<entry />
1044
+
</row>
1045
+
</tbody>
1046
+
</tgroup>
1047
+
</table>
1048
+
<para>
1049
+
The <literal>\X</literal> escape matches a Unicode extended grapheme
1050
+
cluster. An extended grapheme cluster is one or more Unicode characters
1051
+
that combine to form a single glyph. In effect, this can be thought of as
1052
+
the Unicode equivalent of <literal>.</literal> as it will match one
1053
+
composed character, regardless of how many individual characters are
1054
+
actually used to render it.
1055
+
</para>
1056
+
<para>
1057
+
In versions of PCRE older than 8.32 (which corresponds to PHP versions
1058
+
before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>
1059
+
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1060
+
character without the "mark" property, followed by zero or more characters
1061
+
with the "mark" property, and treats the sequence as an atomic group (see
1062
+
below). Characters with the "mark" property are typically accents that
1063
+
affect the preceding character.
864
1064
</para>
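<para>
A minimal sketch of <literal>\X</literal>, assuming PHP 7.0 or later for the
<literal>\u{...}</literal> string escape and a PCRE build with Unicode
support. The subject string is made up for the example, and the
<literal>u</literal> modifier makes PCRE treat it as UTF-8.
<informalexample>
<programlisting role="php">
<![CDATA[
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT, then a plain "a".
$subject = "e\u{0301}a";

// \X consumes the base letter together with its combining mark.
var_dump(preg_match_all('/\X/u', $subject)); // int(2)

// A dot matches one code point at a time.
var_dump(preg_match_all('/./u', $subject));  // int(3)
?>
]]>
</programlisting>
</informalexample>
</para>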
865
1065
<para>
866
1066
Matching characters by Unicode property is not fast, because PCRE has
...
...
@@ -876,8 +1076,8 @@
876
1076
<para>
877
1077
Outside a character class, in the default matching mode, the
878
1078
circumflex character (<literal>^</literal>) is an assertion which
879
-
is true only if the current matching point is at the start of
880
-
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1079
+
is true only if the current matching point is at the start of
1080
+
the subject string. Inside a character class, circumflex (<literal>^</literal>)
881
1081
has an entirely different meaning (see below).
882
1082
</para>
883
1083
<para>
...
...
@@ -892,12 +1092,12 @@
892
1092
</para>
893
1093
<para>
894
1094
A dollar character (<literal>$</literal>) is an assertion which is
895
-
&true; only if the current matching point is at the end of the subject
896
-
string, or immediately before a newline character that is the last
1095
+
&true; only if the current matching point is at the end of the subject
1096
+
string, or immediately before a newline character that is the last
897
1097
character in the string (by default). Dollar (<literal>$</literal>)
898
-
need not be the last character of the pattern if a number of
899
-
alternatives are involved, but it should be the last item in any branch
900
-
in which it appears. Dollar has no special meaning in a
1098
+
need not be the last character of the pattern if a number of
1099
+
alternatives are involved, but it should be the last item in any branch
1100
+
in which it appears. Dollar has no special meaning in a
901
1101
character class.
902
1102
</para>
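<para>
 The default behaviour of dollar can be checked with a short sketch such as
 the one below, which contrasts it with <literal>\z</literal>. The subject
 strings and variable names are made up for illustration.
</para>

```php
<?php
$withNewline = "quick brown fox\n";

// By default, $ matches at the very end or just before a final newline.
$dollar = preg_match('/fox$/', $withNewline);

// \z matches only at the very end of the subject.
$strictEnd = preg_match('/fox\z/', $withNewline);
$strictEndNoNewline = preg_match('/fox\z/', 'quick brown fox');

var_dump($dollar);             // int(1)
var_dump($strictEnd);          // int(0)
var_dump($strictEndNoNewline); // int(1)
```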
903
1103
<para>
...
...
@@ -923,9 +1123,9 @@
923
1123
set.
924
1124
</para>
925
1125
<para>
926
-
Note that the sequences \A, \Z, and \z can be used to match
927
-
the start and end of the subject in both modes, and if all
928
-
branches of a pattern start with \A is it always anchored,
1126
+
Note that the sequences \A, \Z, and \z can be used to match
1127
+
the start and end of the subject in both modes, and if all
1128
+
branches of a pattern start with \A it is always anchored,
929
1129
whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
930
1130
is set or not.
931
1131
</para>
...
...
@@ -934,14 +1134,14 @@
934
1134
<section xml:id="regexp.reference.dot">
935
1135
<title>Dot</title>
936
1136
<para>
937
-
Outside a character class, a dot in the pattern matches any
938
-
one character in the subject, including a non-printing
939
-
character, but not (by default) newline. If the
1137
+
Outside a character class, a dot in the pattern matches any
1138
+
one character in the subject, including a non-printing
1139
+
character, but not (by default) newline. If the
940
1140
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
941
-
option is set, then dots match newlines as well. The
1141
+
option is set, then dots match newlines as well. The
942
1142
handling of dot is entirely independent of the handling of
943
-
circumflex and dollar, the only relationship being that they
944
-
both involve newline characters. Dot has no special meaning
1143
+
circumflex and dollar, the only relationship being that they
1144
+
both involve newline characters. Dot has no special meaning
945
1145
in a character class.
946
1146
</para>
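<para>
 A small sketch of the rules above, using the <literal>s</literal> modifier
 to set PCRE_DOTALL. The variable names are illustrative only.
</para>

```php
<?php
// By default a dot does not match a newline.
$default = preg_match('/a.b/', "a\nb");

// With the s modifier (PCRE_DOTALL) it does.
$dotAll = preg_match('/a.b/s', "a\nb");

// Inside a character class a dot is an ordinary character.
$classMatchesX   = preg_match('/^[.]$/', 'x');
$classMatchesDot = preg_match('/^[.]$/', '.');

var_dump($default);         // int(0)
var_dump($dotAll);          // int(1)
var_dump($classMatchesX);   // int(0)
var_dump($classMatchesDot); // int(1)
```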
947
1147
<para>
...
...
@@ -955,29 +1155,29 @@
955
1155
<title>Character classes</title>
956
1156
<para>
957
1157
An opening square bracket introduces a character class,
958
-
terminated by a closing square bracket. A closing square
959
-
bracket on its own is not special. If a closing square
960
-
bracket is required as a member of the class, it should be
1158
+
terminated by a closing square bracket. A closing square
1159
+
bracket on its own is not special. If a closing square
1160
+
bracket is required as a member of the class, it should be
961
1161
the first data character in the class (after an initial
962
1162
circumflex, if present) or escaped with a backslash.
963
1163
</para>
964
1164
<para>
965
1165
A character class matches a single character in the subject;
966
-
the character must be in the set of characters defined by
1166
+
the character must be in the set of characters defined by
967
1167
the class, unless the first character in the class is a
968
-
circumflex, in which case the subject character must not be in
969
-
the set defined by the class. If a circumflex is actually
970
-
required as a member of the class, ensure it is not the
1168
+
circumflex, in which case the subject character must not be in
1169
+
the set defined by the class. If a circumflex is actually
1170
+
required as a member of the class, ensure it is not the
971
1171
first character, or escape it with a backslash.
972
1172
</para>
973
1173
<para>
974
-
For example, the character class [aeiou] matches any lower
1174
+
For example, the character class [aeiou] matches any lower
975
1175
case vowel, while [^aeiou] matches any character that is not
976
-
a lower case vowel. Note that a circumflex is just a
977
-
convenient notation for specifying the characters which are in
978
-
the class by enumerating those that are not. It is not an
979
-
assertion: it still consumes a character from the subject
980
-
string, and fails if the current pointer is at the end of
1176
+
a lower case vowel. Note that a circumflex is just a
1177
+
convenient notation for specifying the characters which are in
1178
+
the class by enumerating those that are not. It is not an
1179
+
assertion: it still consumes a character from the subject
1180
+
string, and fails if the current pointer is at the end of
981
1181
the string.
982
1182
</para>
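<para>
 The point that a negated class still consumes a character can be seen in a
 sketch like the following. The subjects are arbitrary examples.
</para>

```php
<?php
$vowel    = preg_match('/^[aeiou]$/', 'e');
$notVowel = preg_match('/^[^aeiou]$/', 'x');

// [^aeiou] needs a character to consume, so it fails when the
// current position is already at the end of the subject.
$atEnd    = preg_match('/x[^aeiou]/', 'x');
$followed = preg_match('/x[^aeiou]/', 'xy');

var_dump($vowel);    // int(1)
var_dump($notVowel); // int(1)
var_dump($atEnd);    // int(0)
var_dump($followed); // int(1)
```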
983
1183
<para>
...
...
@@ -989,61 +1189,62 @@
989
1189
</para>
990
1190
<para>
991
1191
The newline character is never treated in any special way in
992
-
character classes, whatever the setting of the <link
1192
+
character classes, whatever the setting of the <link
993
1193
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
994
1194
or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
995
1195
options is. A class such as [^a] will always match a newline.
996
1196
</para>
997
1197
<para>
998
-
The minus (hyphen) character can be used to specify a range
999
-
of characters in a character class. For example, [d-m]
1000
-
matches any letter between d and m, inclusive. If a minus
1001
-
character is required in a class, it must be escaped with a
1198
+
The minus (hyphen) character can be used to specify a range
1199
+
of characters in a character class. For example, [d-m]
1200
+
matches any letter between d and m, inclusive. If a minus
1201
+
character is required in a class, it must be escaped with a
1002
1202
backslash or appear in a position where it cannot be
1003
1203
interpreted as indicating a range, typically as the first or last
1004
1204
character in the class.
1005
1205
</para>
1006
1206
<para>
1007
-
It is not possible to have the literal character "]" as the
1008
-
end character of a range. A pattern such as [W-]46] is
1207
+
It is not possible to have the literal character "]" as the
1208
+
end character of a range. A pattern such as [W-]46] is
1009
1209
interpreted as a class of two characters ("W" and "-")
1010
1210
followed by a literal string "46]", so it would match "W46]" or
1011
-
"-46]". However, if the "]" is escaped with a backslash it
1012
-
is interpreted as the end of range, so [W-\]46] is
1013
-
interpreted as a single class containing a range followed by two
1211
+
"-46]". However, if the "]" is escaped with a backslash it
1212
+
is interpreted as the end of range, so [W-\]46] is
1213
+
interpreted as a single class containing a range followed by two
1014
1214
separate characters. The octal or hexadecimal representation
1015
1215
of "]" can also be used to end a range.
1016
1216
</para>
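<para>
 The two interpretations described above can be compared directly. This is
 only a sketch; the anchors <literal>^</literal> and <literal>$</literal>
 are added so that each test covers the whole subject.
</para>

```php
<?php
// [W-] is a class of "W" and "-", followed by the literal text "46]".
$literalTailW      = preg_match('/^[W-]46]$/', 'W46]');
$literalTailHyphen = preg_match('/^[W-]46]$/', '-46]');

// With the "]" escaped, W-\] is a range and "4" and "6" are extra members,
// so the whole class matches exactly one character.
$inRange   = preg_match('/^[W-\]46]$/', 'Z');    // Z lies between W and ]
$extraChar = preg_match('/^[W-\]46]$/', '4');
$tooLong   = preg_match('/^[W-\]46]$/', 'W46]');

var_dump($literalTailW, $literalTailHyphen); // int(1) int(1)
var_dump($inRange, $extraChar, $tooLong);    // int(1) int(1) int(0)
```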
1017
1217
<para>
1018
1218
Ranges operate in ASCII collating sequence. They can also be
1019
-
used for characters specified numerically, for example
1020
-
[\000-\037]. If a range that includes letters is used when
1021
-
case-insensitive (caseless) matching is set, it matches the
1022
-
letters in either case. For example, [W-c] is equivalent to
1219
+
used for characters specified numerically, for example
1220
+
[\000-\037]. If a range that includes letters is used when
1221
+
case-insensitive (caseless) matching is set, it matches the
1222
+
letters in either case. For example, [W-c] is equivalent to
1023
1223
[][\^_`wxyzabc], matched case-insensitively, and if character
1024
1224
tables for the "fr" locale are in use, [\xc8-\xcb] matches
1025
1225
accented E characters in both cases.
1026
1226
</para>
1027
1227
<para>
1028
-
The character types \d, \D, \s, \S, \w, and \W may also
1029
-
appear in a character class, and add the characters that
1228
+
The character types \d, \D, \s, \S, \w, and \W may also
1229
+
appear in a character class, and add the characters that
1030
1230
they match to the class. For example, [\dABCDEF] matches any
1031
-
hexadecimal digit. A circumflex can conveniently be used
1032
-
with the upper case character types to specify a more
1231
+
hexadecimal digit. A circumflex can conveniently be used
1232
+
with the upper case character types to specify a more
1033
1233
restricted set of characters than the matching lower case type.
1034
-
For example, the class [^\W_] matches any letter or digit,
1234
+
For example, the class [^\W_] matches any letter or digit,
1035
1235
but not underscore.
1036
1236
</para>
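<para>
 Both classes mentioned above can be tried out with a sketch along these
 lines. The subjects are arbitrary.
</para>

```php
<?php
// \d inside a class adds the digits to the listed letters.
$hex    = preg_match('/^[\dABCDEF]+$/', '7F3A');
$notHex = preg_match('/^[\dABCDEF]+$/', '7G');

// [^\W_] is "a word character, but not underscore".
$alnum          = preg_match('/^[^\W_]+$/', 'abc123');
$withUnderscore = preg_match('/^[^\W_]+$/', 'abc_123');

var_dump($hex, $notHex);           // int(1) int(0)
var_dump($alnum, $withUnderscore); // int(1) int(0)
```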
1037
1237
<para>
1038
-
All non-alphanumeric characters other than \, -, ^ (at the
1039
-
start) and the terminating ] are non-special in character
1238
+
All non-alphanumeric characters other than \, -, ^ (at the
1239
+
start) and the terminating ] are non-special in character
1040
1240
classes, but it does no harm if they are escaped. The pattern
1041
1241
terminator is always special and must be escaped when used
1042
1242
within an expression.
1043
1243
</para>
1044
1244
<para>
1045
1245
Perl supports the POSIX notation for character classes. This uses names
1046
-
enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also
1246
+
enclosed by <literal>[:</literal> and <literal>:]</literal> within
1247
+
the enclosing square brackets. PCRE also
1047
1248
supports this notation. For example, <literal>[01[:alpha:]%]</literal>
1048
1249
matches "0", "1", any alphabetic character, or "%". The supported class
1049
1250
names are:
...
...
@@ -1082,22 +1283,32 @@
1082
1283
<para>
1083
1284
In UTF-8 mode, characters with values greater than 128 do not match any
1084
1285
of the POSIX character classes.
1286
+
As of libpcre 8.10 some character classes are changed to use
1287
+
Unicode character properties, in which case the mentioned restriction does
1288
+
not apply. Refer to the <link xlink:href="&url.pcre.man;">PCRE(3) manual</link>
1289
+
for details.
1290
+
</para>
1291
+
<para>
1292
+
Unicode character properties can appear inside a character class. They
1293
+
cannot be part of a range. The minus (hyphen) character after a Unicode
1294
+
character class will match literally. Trying to end a range with a Unicode
1295
+
character property will result in a warning.
1085
1296
</para>
1086
1297
</section>
1087
1298

1088
1299
<section xml:id="regexp.reference.alternation">
1089
1300
<title>Alternation</title>
1090
1301
<para>
1091
-
Vertical bar characters are used to separate alternative
1302
+
Vertical bar characters are used to separate alternative
1092
1303
patterns. For example, the pattern
1093
1304
<literal>gilbert|sullivan</literal>
1094
1305
matches either "gilbert" or "sullivan". Any number of alternatives
1095
-
may appear, and an empty alternative is permitted
1096
-
(matching the empty string). The matching process tries
1097
-
each alternative in turn, from left to right, and the first
1098
-
one that succeeds is used. If the alternatives are within a
1099
-
subpattern (defined below), "succeeds" means matching the
1100
-
rest of the main pattern as well as the alternative in the
1306
+
may appear, and an empty alternative is permitted
1307
+
(matching the empty string). The matching process tries
1308
+
each alternative in turn, from left to right, and the first
1309
+
one that succeeds is used. If the alternatives are within a
1310
+
subpattern (defined below), "succeeds" means matching the
1311
+
rest of the main pattern as well as the alternative in the
1101
1312
subpattern.
1102
1313
</para>
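<para>
 The left-to-right order of alternatives, and the meaning of "succeeds"
 inside a subpattern, can be illustrated with a sketch like this one. The
 patterns other than <literal>gilbert|sullivan</literal> are made up for
 the example.
</para>

```php
<?php
preg_match('/gilbert|sullivan/', 'sullivan and gilbert', $names);

// The first alternative that succeeds is used, even if a later one
// would match more text.
preg_match('/a|ab/', 'ab', $first);

// Inside a subpattern the rest of the pattern must match as well, so
// "a" is abandoned (no "c" follows it) and "ab" is used instead.
preg_match('/(a|ab)c/', 'abc', $withRest);

var_dump($names[0]);    // string(8) "sullivan"
var_dump($first[0]);    // string(1) "a"
var_dump($withRest[1]); // string(2) "ab"
```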
1103
1314
</section>
...
...
@@ -1112,7 +1323,7 @@
1112
1323
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,
1113
1324
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1114
1325
and PCRE_DUPNAMES can be changed from within the pattern by
1115
-
a sequence of Perl option letters enclosed between "(?" and
1326
+
a sequence of Perl option letters enclosed between "(?" and
1116
1327
")". The option letters are:
1117
1328

1118
1329
<table>
...
...
@@ -1141,7 +1352,8 @@
1141
1352
</row>
1142
1353
<row>
1143
1354
<entry><literal>X</literal></entry>
1144
-
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link></entry>
1355
+
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>
1356
+
(no longer supported as of PHP 7.3.0)</entry>
1145
1357
</row>
1146
1358
<row>
1147
1359
<entry><literal>J</literal></entry>
...
...
@@ -1152,16 +1364,16 @@
1152
1364
</table>
1153
1365
</para>
1154
1366
<para>
1155
-
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1367
+
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1156
1368
also possible to unset these options by preceding the letter
1157
-
with a hyphen, and a combined setting and unsetting such as
1158
-
(?im-sx), which sets <link
1369
+
with a hyphen, and a combined setting and unsetting such as
1370
+
(?im-sx), which sets <link
1159
1371
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and
1160
1372
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1161
1373
while unsetting <link
1162
1374
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and
1163
1375
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,
1164
-
is also permitted. If a letter appears both before and after the
1376
+
is also permitted. If a letter appears both before and after the
1165
1377
hyphen, the option is unset.
1166
1378
</para>
1167
1379
<para>
...
...
@@ -1171,14 +1383,14 @@
1171
1383
and "abC".
1172
1384
</para>
1173
1385
<para>
1174
-
If an option change occurs inside a subpattern, the effect
1175
-
is different. This is a change of behaviour in Perl 5.005.
1176
-
An option change inside a subpattern affects only that part
1386
+
If an option change occurs inside a subpattern, the effect
1387
+
is different. This is a change of behaviour in Perl 5.005.
1388
+
An option change inside a subpattern affects only that part
1177
1389
of the subpattern that follows it, so
1178
1390

1179
1391
<literal>(a(?i)b)c</literal>
1180
1392

1181
-
matches abc and aBc and no other strings (assuming <link
1393
+
matches "abc" and "aBc" and no other strings (assuming <link
1182
1394
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not
1183
1395
used). By this means, options can be made to have different settings in
1184
1396
different parts of the pattern. Any changes made in one alternative do
...
...
@@ -1187,18 +1399,18 @@
1187
1399

1188
1400
<literal>(a(?i)b|c)</literal>
1189
1401

1190
-
matches "ab", "aB", "c", and "C", even though when matching
1402
+
matches "ab", "aB", "c", and "C", even though when matching
1191
1403
"C" the first branch is abandoned before the option setting.
1192
-
This is because the effects of option settings happen at
1193
-
compile time. There would be some very weird behaviour otherwise.
1404
+
This is because the effects of option settings happen at
1405
+
compile time. There would be some very weird behaviour otherwise.
1194
1406
</para>
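<para>
 The scoping rules for option changes inside a subpattern can be checked
 with a sketch such as the following. Anchors are added here so that each
 subject has to match as a whole.
</para>

```php
<?php
// (?i) applies only to the part of the subpattern that follows it.
$scoped = '/^(a(?i)b)c$/';
$lower  = preg_match($scoped, 'abc');
$upperB = preg_match($scoped, 'aBc');
$upperC = preg_match($scoped, 'abC'); // "c" is outside the subpattern
$upperA = preg_match($scoped, 'Abc'); // "a" precedes the option change

// An option set in one alternative carries into later alternatives
// of the same subpattern.
$carried = preg_match('/^(a(?i)b|c)$/', 'C');

var_dump($lower, $upperB, $upperC, $upperA); // int(1) int(1) int(0) int(0)
var_dump($carried);                          // int(1)
```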
1195
1407
<para>
1196
1408
The PCRE-specific options <link
1197
-
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1198
-
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1409
+
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1410
+
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1199
1411
be changed in the same way as the Perl-compatible options by
1200
-
using the characters U and X respectively. The (?X) flag
1201
-
setting is special in that it must always occur earlier in
1412
+
using the characters U and X respectively. The (?X) flag
1413
+
setting is special in that it must always occur earlier in
1202
1414
the pattern than any of the additional features it turns on,
1203
1415
even when it is at top level. It is best put at the start.
1204
1416
</para>
...
...
@@ -1207,8 +1419,8 @@
1207
1419
<section xml:id="regexp.reference.subpatterns">
1208
1420
<title>Subpatterns</title>
1209
1421
<para>
1210
-
Subpatterns are delimited by parentheses (round brackets),
1211
-
which can be nested. Marking part of a pattern as a subpattern
1422
+
Subpatterns are delimited by parentheses (round brackets),
1423
+
which can be nested. Marking part of a pattern as a subpattern
1212
1424
does two things:
1213
1425
</para>
1214
1426
<orderedlist>
...
...
@@ -1237,30 +1449,30 @@
1237
1449

1238
1450
<literal>the ((red|white) (king|queen))</literal>
1239
1451

1240
-
the captured substrings are "red king", "red", and "king",
1452
+
the captured substrings are "red king", "red", and "king",
1241
1453
and are numbered 1, 2, and 3.
1242
1454
</para>
1243
1455
<para>
1244
-
The fact that plain parentheses fulfill two functions is not
1245
-
always helpful. There are often times when a grouping subpattern
1246
-
is required without a capturing requirement. If an
1456
+
The fact that plain parentheses fulfill two functions is not
1457
+
always helpful. There are often times when a grouping subpattern
1458
+
is required without a capturing requirement. If an
1247
1459
opening parenthesis is followed by "?:", the subpattern does
1248
-
not do any capturing, and is not counted when computing the
1460
+
not do any capturing, and is not counted when computing the
1249
1461
number of any subsequent capturing subpatterns. For example,
1250
-
if the string "the white queen" is matched against the
1462
+
if the string "the white queen" is matched against the
1251
1463
pattern
1252
1464

1253
1465
<literal>the ((?:red|white) (king|queen))</literal>
1254
1466

1255
-
the captured substrings are "white queen" and "queen", and
1256
-
are numbered 1 and 2. The maximum number of captured substrings
1257
-
is 99, and the maximum number of all subpatterns,
1258
-
both capturing and non-capturing, is 200.
1467
+
the captured substrings are "white queen" and "queen", and
1468
+
are numbered 1 and 2. The maximum number of captured substrings
1469
+
is 65535. It may not be possible to compile such large patterns,
1470
+
however, depending on the configuration options of libpcre.
1259
1471
</para>
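<para>
 The effect of <literal>?:</literal> on numbering is easiest to see in the
 matches array. A minimal sketch using the two patterns above:
</para>

```php
<?php
// Plain parentheses both group and capture.
preg_match('/the ((red|white) (king|queen))/', 'the red king', $capturing);

// (?: ... ) groups without capturing and is skipped when numbering.
preg_match('/the ((?:red|white) (king|queen))/', 'the white queen', $mixed);

print_r($capturing); // "the red king", "red king", "red", "king"
print_r($mixed);     // "the white queen", "white queen", "queen"
```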
1260
1472
<para>
1261
-
As a convenient shorthand, if any option settings are
1262
-
required at the start of a non-capturing subpattern, the
1263
-
option letters may appear between the "?" and the ":". Thus
1473
+
As a convenient shorthand, if any option settings are
1474
+
required at the start of a non-capturing subpattern, the
1475
+
option letters may appear between the "?" and the ":". Thus
1264
1476
the two patterns
1265
1477
</para>
1266
1478

...
...
@@ -1274,10 +1486,10 @@
1274
1486
</informalexample>
1275
1487

1276
1488
<para>
1277
-
match exactly the same set of strings. Because alternative
1278
-
branches are tried from left to right, and options are not
1279
-
reset until the end of the subpattern is reached, an option
1280
-
setting in one branch does affect subsequent branches, so
1489
+
match exactly the same set of strings. Because alternative
1490
+
branches are tried from left to right, and options are not
1491
+
reset until the end of the subpattern is reached, an option
1492
+
setting in one branch does affect subsequent branches, so
1281
1493
the above patterns match "SUNDAY" as well as "Saturday".
1282
1494
</para>
1283
1495

...
...
@@ -1285,7 +1497,7 @@
1285
1497
It is possible to name a subpattern using the syntax
1286
1498
<literal>(?P&lt;name&gt;pattern)</literal>. This subpattern will then
1287
1499
be indexed in the matches array by its normal numeric position and
1288
-
also by name. PHP 5.2.2 introduced two alternative syntaxes
1500
+
also by name. There are two alternative syntaxes
1289
1501
<literal>(?&lt;name&gt;pattern)</literal> and <literal>(?'name'pattern)</literal>.
1290
1502
</para>
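<para>
 As an illustration, the three spellings can be run against the same
 subject. The group name <literal>year</literal> and the subject are
 invented for this sketch.
</para>

```php
<?php
$patterns = [
    '/(?P<year>\d{4})/',
    '/(?<year>\d{4})/',
    "/(?'year'\\d{4})/", // double quotes so the single quotes need no escaping
];

$results = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, 'released in 2024', $matches);
    $results[] = $matches;
}

// Each result holds the capture under its name and its numeric position.
print_r($results[0]); // [0] => 2024, [year] => 2024, [1] => 2024
```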
1291
1503

...
...
@@ -1306,9 +1518,10 @@
1306
1518

1307
1519
<para>
1308
1520
Here <literal>Sun</literal> is stored in backreference 2, while
1309
-
backreference 1 is empty. Matching yields <literal>Sat</literal> in
1310
-
backreference 1 while backreference 2 does not exist. Changing the pattern
1311
-
to use the <literal>(?|</literal> fixes this problem:
1521
+
backreference 1 is empty. Matching <literal>Saturday</literal> yields
1522
+
<literal>Sat</literal> in backreference 1 while backreference 2 does
1523
+
not exist. Changing the pattern to use the <literal>(?|</literal> fixes
1524
+
this problem:
1312
1525
</para>
1313
1526

1314
1527
<informalexample>
...
...
@@ -1334,45 +1547,56 @@
1334
1547
<listitem><simpara>the . metacharacter</simpara></listitem>
1335
1548
<listitem><simpara>a character class</simpara></listitem>
1336
1549
<listitem><simpara>a back reference (see next section)</simpara></listitem>
1337
-
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1550
+
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1338
1551
see below)</simpara></listitem>
1339
1552
</itemizedlist>
1340
1553
</para>
1341
1554
<para>
1342
-
The general repetition quantifier specifies a minimum and
1343
-
maximum number of permitted matches, by giving the two
1344
-
numbers in curly brackets (braces), separated by a comma.
1345
-
The numbers must be less than 65536, and the first must be
1555
+
The general repetition quantifier specifies a minimum and
1556
+
maximum number of permitted matches, by giving the two
1557
+
numbers in curly brackets (braces), separated by a comma.
1558
+
The numbers must be less than 65536, and the first must be
1346
1559
less than or equal to the second. For example:
1347
1560

1348
1561
<literal>z{2,4}</literal>
1349
1562

1350
-
matches "zz", "zzz", or "zzzz". A closing brace on its own
1563
+
matches "zz", "zzz", or "zzzz". A closing brace on its own
1351
1564
is not a special character. If the second number is omitted,
1352
-
but the comma is present, there is no upper limit; if the
1565
+
but the comma is present, there is no upper limit; if the
1353
1566
second number and the comma are both omitted, the quantifier
1354
1567
specifies an exact number of required matches. Thus
1355
1568

1356
1569
<literal>[aeiou]{3,}</literal>
1357
1570

1358
-
matches at least 3 successive vowels, but may match many
1571
+
matches at least 3 successive vowels, but may match many
1359
1572
more, while
1360
1573

1361
1574
<literal>\d{8}</literal>
1362
1575

1363
-
matches exactly 8 digits. An opening curly bracket that
1364
-
appears in a position where a quantifier is not allowed, or
1365
-
one that does not match the syntax of a quantifier, is taken
1366
-
as a literal character. For example, {,6} is not a quantifier,
1367
-
but a literal string of four characters.
1576
+
matches exactly 8 digits.
1577
+

1368
1578
</para>
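<para>
 The three quantifier forms above can be exercised with a sketch like the
 following. The subjects are arbitrary examples.
</para>

```php
<?php
// {2,4}: between two and four repetitions.
$range = [];
foreach (['z', 'zz', 'zzzz', 'zzzzz'] as $subject) {
    $range[$subject] = preg_match('/^z{2,4}$/', $subject);
}

// {3,}: at least three, with no upper limit.
preg_match('/[aeiou]{3,}/', 'queueing', $vowels);

// {8}: exactly eight.
$eight = preg_match('/^\d{8}$/', '20240131');
$seven = preg_match('/^\d{8}$/', '2024013');

print_r($range);       // z => 0, zz => 1, zzzz => 1, zzzzz => 0
var_dump($vowels[0]);  // string(5) "ueuei"
var_dump($eight, $seven); // int(1) int(0)
```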
1579
+
<simpara>
1580
+
Prior to PHP 8.4.0, an opening curly bracket that
1581
+
appears in a position where a quantifier is not allowed, or
1582
+
one that does not match the syntax of a quantifier, is taken
1583
+
as a literal character. For example, <literal>{,6}</literal>
1584
+
is not a quantifier, but a literal string of four characters.
1585
+

1586
+
As of PHP 8.4.0, the PCRE extension is bundled with PCRE2 version 10.44,
1587
+
which accepts patterns such as <literal>\d{,8}</literal> and interprets
1588
+
them as <literal>\d{0,8}</literal>.
1589
+

1590
+
Further, as of PHP 8.4.0, space characters are allowed inside the braces of a quantifier, as in
1591
+
<literal>\d{0 , 8}</literal> and <literal>\d{ 0 , 8 }</literal>.
1592
+
</simpara>
1369
1593
<para>
1370
-
The quantifier {0} is permitted, causing the expression to
1371
-
behave as if the previous item and the quantifier were not
1594
+
The quantifier {0} is permitted, causing the expression to
1595
+
behave as if the previous item and the quantifier were not
1372
1596
present.
1373
1597
</para>
1374
1598
<para>
1375
-
For convenience (and historical compatibility) the three
1599
+
For convenience (and historical compatibility) the three
1376
1600
most common quantifiers have single-character abbreviations:
1377
1601

1378
1602
<table>
...
...
@@ -1396,63 +1620,63 @@
1396
1620
</table>
1397
1621
</para>
1398
1622
<para>
1399
-
It is possible to construct infinite loops by following a
1400
-
subpattern that can match no characters with a quantifier
1623
+
It is possible to construct infinite loops by following a
1624
+
subpattern that can match no characters with a quantifier
1401
1625
that has no upper limit, for example:
1402
1626

1403
1627
<literal>(a?)*</literal>
1404
1628
</para>
1405
1629
<para>
1406
-
Earlier versions of Perl and PCRE used to give an error at
1407
-
compile time for such patterns. However, because there are
1408
-
cases where this can be useful, such patterns are now
1409
-
accepted, but if any repetition of the subpattern does in
1630
+
Earlier versions of Perl and PCRE used to give an error at
1631
+
compile time for such patterns. However, because there are
1632
+
cases where this can be useful, such patterns are now
1633
+
accepted, but if any repetition of the subpattern does in
1410
1634
fact match no characters, the loop is forcibly broken.
1411
1635
</para>
1412
1636
<para>
1413
-
By default, the quantifiers are "greedy", that is, they
1414
-
match as much as possible (up to the maximum number of permitted
1415
-
times), without causing the rest of the pattern to
1637
+
By default, the quantifiers are "greedy", that is, they
1638
+
match as much as possible (up to the maximum number of permitted
1639
+
times), without causing the rest of the pattern to
1416
1640
fail. The classic example of where this gives problems is in
1417
1641
trying to match comments in C programs. These appear between
1418
-
the sequences /* and */ and within the sequence, individual
1419
-
* and / characters may appear. An attempt to match C comments
1642
+
the sequences /* and */ and within the sequence, individual
1643
+
* and / characters may appear. An attempt to match C comments
1420
1644
by applying the pattern
1421
1645

1422
1646
<literal>/\*.*\*/</literal>
1423
1647

1424
1648
to the string
1425
1649

1426
-
<literal>/* first comment */ not comment /* second comment */</literal>
1650
+
<literal>/* first comment */ not comment /* second comment */</literal>
1427
1651

1428
-
fails, because it matches the entire string due to the
1429
-
greediness of the .* item.
1652
+
fails, because it matches the entire string due to the
1653
+
greediness of the .* item.
1430
1654
</para>
1431
1655
<para>
1432
-
However, if a quantifier is followed by a question mark,
1656
+
However, if a quantifier is followed by a question mark,
1433
1657
then it becomes lazy, and instead matches the minimum
1434
1658
number of times possible, so the pattern
1435
1659

1436
1660
<literal>/\*.*?\*/</literal>
1437
1661

1438
1662
does the right thing with the C comments. The meaning of the
1439
-
various quantifiers is not otherwise changed, just the preferred
1440
-
number of matches. Do not confuse this use of
1441
-
question mark with its use as a quantifier in its own right.
1663
+
various quantifiers is not otherwise changed, just the preferred
1664
+
number of matches. Do not confuse this use of
1665
+
question mark with its use as a quantifier in its own right.
1442
1666
Because it has two uses, it can sometimes appear doubled, as
1443
1667
in
1444
1668

1445
1669
<literal>\d??\d</literal>
1446
1670

1447
-
which matches one digit by preference, but can match two if
1671
+
which matches one digit by preference, but can match two if
1448
1672
that is the only way the rest of the pattern matches.
1449
1673
</para>
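<para>
 The C comment example can be reproduced with a short sketch. The tilde is
 used as the pattern delimiter here only to avoid escaping the slashes; the
 variable names are illustrative.
</para>

```php
<?php
$subject = '/* first comment */ not comment /* second comment */';

// Greedy: .* runs to the last "*/" in the subject.
preg_match('~/\*.*\*/~', $subject, $greedy);

// Lazy: .*? stops at the first "*/".
preg_match('~/\*.*?\*/~', $subject, $lazy);

// \d??\d prefers one digit, but takes two if the rest requires it.
preg_match('/\d??\d/', '12', $oneDigit);
preg_match('/^\d??\d$/', '12', $twoDigits);

var_dump($greedy[0] === $subject); // bool(true)
var_dump($lazy[0]);                // string(19) "/* first comment */"
var_dump($oneDigit[0]);            // string(1) "1"
var_dump($twoDigits[0]);           // string(2) "12"
```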
1450
1674
<para>
1451
1675
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>
1452
-
option is set (an option which is not
1453
-
available in Perl) then the quantifiers are not greedy by
1676
+
option is set (an option which is not
1677
+
available in Perl) then the quantifiers are not greedy by
1454
1678
default, but individual ones can be made greedy by following
1455
-
them with a question mark. In other words, it inverts the
1679
+
them with a question mark. In other words, it inverts the
1456
1680
default behaviour.
1457
1681
</para>
1458
1682
<para>
...
...
@@ -1464,41 +1688,41 @@
1464
1688
</para>
1465
1689
<para>
1466
1690
When a parenthesized subpattern is quantified with a minimum
1467
-
repeat count that is greater than 1 or with a limited maximum,
1468
-
more store is required for the compiled pattern, in
1691
+
repeat count that is greater than 1 or with a limited maximum,
1692
+
more store is required for the compiled pattern, in
1469
1693
proportion to the size of the minimum or maximum.
1470
1694
</para>
1471
1695
<para>
1472
-
If a pattern starts with .* or .{0,} and the <link
1696
+
If a pattern starts with .* or .{0,} and the <link
1473
1697
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1474
1698
option (equivalent to Perl's /s) is set, thus allowing the .
1475
-
to match newlines, then the pattern is implicitly anchored,
1699
+
to match newlines, then the pattern is implicitly anchored,
1476
1700
because whatever follows will be tried against every character
1477
-
position in the subject string, so there is no point in
1478
-
retrying the overall match at any position after the first.
1701
+
position in the subject string, so there is no point in
1702
+
retrying the overall match at any position after the first.
1479
1703
PCRE treats such a pattern as though it were preceded by \A.
1480
-
In cases where it is known that the subject string contains
1704
+
In cases where it is known that the subject string contains
1481
1705
no newlines, it is worth setting <link
1482
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1706
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1483
1707
pattern begins with .* in order to
1484
1708
obtain this optimization, or
1485
1709
alternatively using ^ to indicate anchoring explicitly.
1486
1710
</para>
1487
1711
<para>
1488
-
When a capturing subpattern is repeated, the value captured
1712
+
When a capturing subpattern is repeated, the value captured
1489
1713
is the substring that matched the final iteration. For example, after
1490
1714

1491
1715
<literal>(tweedle[dume]{3}\s*)+</literal>
1492
1716

1493
-
has matched "tweedledum tweedledee" the value of the captured
1494
-
substring is "tweedledee". However, if there are
1495
-
nested capturing subpatterns, the corresponding captured
1496
-
values may have been set in previous iterations. For example,
1717
+
has matched "tweedledum tweedledee" the value of the captured
1718
+
substring is "tweedledee". However, if there are
1719
+
nested capturing subpatterns, the corresponding captured
1720
+
values may have been set in previous iterations. For example,
1497
1721
after
1498
1722

1499
1723
<literal>/(a|(b))+/</literal>
1500
1724

1501
-
matches "aba" the value of the second captured substring is
1725
+
matches "aba" the value of the second captured substring is
1502
1726
"b".
1503
1727
</para>
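<para>
 Both examples above can be checked by inspecting the matches array. A
 minimal sketch:
</para>

```php
<?php
// Only the final iteration of a repeated capturing subpattern is kept.
preg_match('/(tweedle[dume]{3}\s*)+/', 'tweedledum tweedledee', $tweedle);

// The inner group keeps the value from an earlier iteration, because the
// last iteration matched "a" through the first alternative.
preg_match('/(a|(b))+/', 'aba', $nested);

var_dump($tweedle[1]); // string(10) "tweedledee"
var_dump($nested[1]);  // string(1) "a"
var_dump($nested[2]);  // string(1) "b"
```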
1504
1728
</section>
...
...
@@ -1506,78 +1730,78 @@
1506
1730
<section xml:id="regexp.reference.back-references">
1507
1731
<title>Back references</title>
1508
1732
<para>
1509
-
Outside a character class, a backslash followed by a digit
1510
-
greater than 0 (and possibly further digits) is a back
1511
-
reference to a capturing subpattern earlier (i.e. to its
1512
-
left) in the pattern, provided there have been that many
1733
+
Outside a character class, a backslash followed by a digit
1734
+
greater than 0 (and possibly further digits) is a back
1735
+
reference to a capturing subpattern earlier (i.e. to its
1736
+
left) in the pattern, provided there have been that many
1513
1737
previous capturing left parentheses.
1514
1738
</para>
1515
1739
<para>
1516
-
However, if the decimal number following the backslash is
1517
-
less than 10, it is always taken as a back reference, and
1518
-
causes an error only if there are not that many capturing
1519
-
left parentheses in the entire pattern. In other words, the
1520
-
parentheses that are referenced need not be to the left of
1521
-
the reference for numbers less than 10.
1740
+
However, if the decimal number following the backslash is
1741
+
less than 10, it is always taken as a back reference, and
1742
+
causes an error only if there are not that many capturing
1743
+
left parentheses in the entire pattern. In other words, the
1744
+
parentheses that are referenced need not be to the left of
1745
+
the reference for numbers less than 10.
1522
1746
A "forward back reference" can make sense when a repetition
1523
1747
is involved and the subpattern to the right has participated
1524
1748
in an earlier iteration. See the section
1525
-
entitled "Backslash" above for further details of the handling
1749
+
on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling
1526
1750
of digits following a backslash.
1527
1751
</para>
1528
1752
<para>
1529
-
A back reference matches whatever actually matched the capturing
1753
+
A back reference matches whatever actually matched the capturing
1530
1754
subpattern in the current subject string, rather than
1531
1755
anything matching the subpattern itself. So the pattern
1532
1756

1533
1757
<literal>(sens|respons)e and \1ibility</literal>
1534
1758

1535
-
matches "sense and sensibility" and "response and responsibility",
1536
-
but not "sense and responsibility". If case-sensitive (caseful)
1759
+
matches "sense and sensibility" and "response and responsibility",
1760
+
but not "sense and responsibility". If case-sensitive (caseful)
1537
1761
matching is in force at the time of the back reference, then
1538
1762
the case of letters is relevant. For example,
1539
1763

1540
1764
<literal>((?i)rah)\s+\1</literal>
1541
1765

1542
-
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1543
-
though the original capturing subpattern is matched
1766
+
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1767
+
though the original capturing subpattern is matched
1544
1768
case-insensitively (caselessly).
1545
1769
</para>
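<para>
 As a rough illustration of the behaviour described above, the sketch below
 runs both patterns through <function>preg_match</function>. The array keys
 and variable names are assumptions made for this example.
</para>

```php
<?php

// \1 must repeat the text that group 1 actually matched.
$titles = '/(sens|respons)e and \1ibility/';
$titleResults = [
    'sense and sensibility'       => preg_match($titles, 'sense and sensibility'),
    'response and responsibility' => preg_match($titles, 'response and responsibility'),
    'sense and responsibility'    => preg_match($titles, 'sense and responsibility'),
];
var_dump($titleResults); // values 1, 1, 0

// (?i) applies only inside the group; the back reference itself is caseful.
$cheer = '/((?i)rah)\s+\1/';
$cheerResults = [
    'rah rah' => preg_match($cheer, 'rah rah'),
    'RAH RAH' => preg_match($cheer, 'RAH RAH'),
    'RAH rah' => preg_match($cheer, 'RAH rah'),
];
var_dump($cheerResults); // values 1, 1, 0
```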
1546
1770
<para>
1547
-
There may be more than one back reference to the same subpattern.
1548
-
If a subpattern has not actually been used in a
1549
-
particular match, then any back references to it always
1771
+
There may be more than one back reference to the same subpattern.
1772
+
If a subpattern has not actually been used in a
1773
+
particular match, then any back references to it always
1550
1774
fail. For example, the pattern
1551
1775

1552
1776
<literal>(a|(bc))\2</literal>
1553
1777

1554
-
always fails if it starts to match "a" rather than "bc".
1555
-
Because there may be up to 99 back references, all digits
1556
-
following the backslash are taken as part of a potential
1778
+
always fails if it starts to match "a" rather than "bc".
1779
+
Because there may be up to 99 back references, all digits
1780
+
following the backslash are taken as part of a potential
1557
1781
back reference number. If the pattern continues with a digit
1558
1782
character, then some delimiter must be used to terminate the
1559
1783
back reference. If the <link
1560
-
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1561
-
is set, this can be whitespace. Otherwise an empty comment can be used.
1784
+
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1785
+
is set, this can be whitespace. Otherwise an empty comment can be used.
1562
1786
</para>
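<para>
 The sketch below illustrates both points from the paragraph above: a back
 reference to a group that did not take part in the match fails, and a
 delimiter is needed when a literal digit follows a back reference. The
 subject strings are made up for this example.
</para>

```php
<?php

// Group 2 is unset whenever the "a" branch is taken, so \2 cannot match.
$unset = '/(a|(bc))\2/';
$failsOnA  = preg_match($unset, 'aa');   // 0
$worksOnBc = preg_match($unset, 'bcbc'); // 1

// An empty comment separates the back reference \1 from a literal "1".
$withComment = preg_match('/(a)\1(?#)1/', 'aa1'); // 1

// With the x modifier (PCRE_EXTENDED), whitespace does the same job.
$withSpace = preg_match('/(a)\1 1/x', 'aa1'); // 1

var_dump($failsOnA, $worksOnBc, $withComment, $withSpace);
```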
1563
1787
<para>
1564
1788
A back reference that occurs inside the parentheses to which
1565
-
it refers fails when the subpattern is first used, so, for
1566
-
example, (a\1) never matches. However, such references can
1789
+
it refers fails when the subpattern is first used, so, for
1790
+
example, (a\1) never matches. However, such references can
1567
1791
be useful inside repeated subpatterns. For example, the pattern
1568
1792

1569
1793
<literal>(a|b\1)+</literal>
1570
1794

1571
-
matches any number of "a"s and also "aba", "ababba" etc. At
1795
+
matches any number of "a"s and also "aba", "ababba" etc. At
1572
1796
each iteration of the subpattern, the back reference matches
1573
-
the character string corresponding to the previous iteration.
1797
+
the character string corresponding to the previous iteration.
1574
1798
In order for this to work, the pattern must be such
1575
-
that the first iteration does not need to match the back
1576
-
reference. This can be done using alternation, as in the
1799
+
that the first iteration does not need to match the back
1800
+
reference. This can be done using alternation, as in the
1577
1801
example above, or by a quantifier with a minimum of zero.
1578
1802
</para>
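<para>
 A small sketch of the self-referencing pattern above. The anchors
 <literal>^</literal> and <literal>$</literal> are added here only so that
 the whole subject has to match; they are not part of the original pattern.
</para>

```php
<?php

$selfRef = '/^(a|b\1)+$/';

$selfRefResults = [];
foreach (['aaa', 'aba', 'ababba', 'abab'] as $subject) {
    // On each iteration, \1 holds the text matched by the previous iteration.
    $selfRefResults[$subject] = preg_match($selfRef, $subject);
}
var_dump($selfRefResults); // aaa => 1, aba => 1, ababba => 1, abab => 0

// Without an alternative, \1 is still unset on first use, so this never matches.
$neverMatches = preg_match('/(a\1)/', 'aaaa');
var_dump($neverMatches); // int(0)
```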
1579
1803
<para>
1580
-
As of PHP 5.2.2, the <literal>\g</literal> escape sequence can be
1804
+
The <literal>\g</literal> escape sequence can be
1581
1805
used for absolute and relative referencing of subpatterns.
1582
1806
This escape sequence must be followed by an unsigned number or a negative
1583
1807
number, optionally enclosed in braces. The sequences <literal>\1</literal>,
...
...
@@ -1598,28 +1822,28 @@
1598
1822
</para>
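<para>
 As an illustration of absolute and relative references with
 <literal>\g</literal>, the sketch below uses two made-up groups. A negative
 number counts backwards from the position of the reference.
</para>

```php
<?php

// Absolute references: \g2 and \g{1} name the group by its number.
$absolutePlain  = preg_match('/(foo)(bar)\g2/', 'foobarbar');   // 1
$absoluteBraced = preg_match('/(foo)(bar)\g{1}/', 'foobarfoo'); // 1

// Relative references: \g{-1} is the closest preceding group, \g{-2} the one before it.
$relativeLast   = preg_match('/(foo)(bar)\g{-1}/', 'foobarbar'); // 1
$relativeSecond = preg_match('/(foo)(bar)\g{-2}/', 'foobarfoo'); // 1

var_dump($absolutePlain, $absoluteBraced, $relativeLast, $relativeSecond);
```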
1599
1823
<para>
1600
1824
Back references to the named subpatterns can be achieved by
1601
-
<literal>(?P=name)</literal> or, since PHP 5.2.2, also by
1602
-
<literal>\k&lt;name&gt;</literal> or <literal>\k'name'</literal>.
1603
-
Additionally PHP 5.2.4 added support for <literal>\k{name}</literal>
1604
-
and <literal>\g{name}</literal>.
1825
+
<literal>(?P=name)</literal>,
1826
+
<literal>\k&lt;name&gt;</literal>, <literal>\k'name'</literal>,
1827
+
<literal>\k{name}</literal>, <literal>\g{name}</literal>,
1828
+
<literal>\g&lt;name&gt;</literal> or <literal>\g'name'</literal>.
1605
1829
</para>
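<para>
 The following sketch tries several of the named back reference forms listed
 above against the same subject. The group name <literal>word</literal> and
 the subject strings are assumptions chosen for the example; the forms that
 use single quotes are left out only to keep the PHP string quoting simple.
</para>

```php
<?php

$forms = ['(?P=word)', '\k<word>', '\k{word}', '\g{word}'];

$namedResults = [];
foreach ($forms as $reference) {
    // Each pattern looks for the same word twice, separated by whitespace.
    $pattern = '/(?P<word>\w+)\s+' . $reference . '/';
    $namedResults[$reference] = preg_match($pattern, 'hello hello');
}
var_dump($namedResults); // every form yields 1

// The captured text is available under the group name as well as by number.
preg_match('/(?P<word>\w+)\s+(?P=word)/', 'hello hello', $named);
var_dump($named['word']); // string(5) "hello"
```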
1606
1830
</section>
1607
1831

1608
1832
<section xml:id="regexp.reference.assertions">
1609
1833
<title>Assertions</title>
1610
1834
<para>
1611
-
An assertion is a test on the characters following or
1612
-
preceding the current matching point that does not actually
1613
-
consume any characters. The simple assertions coded as \b,
1614
-
\B, \A, \Z, \z, ^ and $ are described above. More complicated
1615
-
assertions are coded as subpatterns. There are two
1616
-
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1835
+
An assertion is a test on the characters following or
1836
+
preceding the current matching point that does not actually
1837
+
consume any characters. The simple assertions coded as \b,
1838
+
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1839
+
assertions are coded as subpatterns. There are two
1840
+
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1617
1841
subject string, and those that <emphasis>look behind</emphasis> it.
1618
1842
</para>
1619
1843
<para>
1620
1844
An assertion subpattern is matched in the normal way, except
1621
-
that it does not cause the current matching position to be
1622
-
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1845
+
that it does not cause the current matching position to be
1846
+
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1623
1847
assertions and (?! for negative assertions. For example,
1624
1848

1625
1849
<literal>\w+(?=;)</literal>
...
...
@@ -1629,27 +1853,27 @@
1629
1853

1630
1854
<literal>foo(?!bar)</literal>
1631
1855

1632
-
matches any occurrence of "foo" that is not followed by
1856
+
matches any occurrence of "foo" that is not followed by
1633
1857
"bar". Note that the apparently similar pattern
1634
1858

1635
1859
<literal>(?!foo)bar</literal>
1636
1860

1637
-
does not find an occurrence of "bar" that is preceded by
1861
+
does not find an occurrence of "bar" that is preceded by
1638
1862
something other than "foo"; it finds any occurrence of "bar"
1639
-
whatsoever, because the assertion (?!foo) is always &true;
1640
-
when the next three characters are "bar". A lookbehind
1863
+
whatsoever, because the assertion (?!foo) is always &true;
1864
+
when the next three characters are "bar". A lookbehind
1641
1865
assertion is needed to achieve this effect.
1642
1866
</para>
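<para>
 The difference between the lookahead patterns above can be seen in a short
 sketch. It uses <constant>PREG_OFFSET_CAPTURE</constant> to show where the
 match was found; the subject strings are illustrative.
</para>

```php
<?php

// Positive lookahead: the semicolon is tested but not included in the match.
preg_match('/\w+(?=;)/', 'alpha;', $word);
var_dump($word[0]); // string(5) "alpha"

// Negative lookahead: "foo" only when "bar" does not follow.
$fooBeforeBar = preg_match('/foo(?!bar)/', 'foobar'); // 0
$fooBeforeBaz = preg_match('/foo(?!bar)/', 'foobaz'); // 1

// (?!foo)bar still finds "bar" after "foo": at offset 3 the next
// three characters are "bar", so the assertion (?!foo) is true there.
preg_match('/(?!foo)bar/', 'foobar', $bar, PREG_OFFSET_CAPTURE);
var_dump($bar[0][1]); // int(3)
```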
1643
1867
<para>
1644
-
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1868
+
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1645
1869
and (?&lt;! for negative assertions. For example,
1646
1870

1647
1871
<literal>(?&lt;!foo)bar</literal>
1648
1872

1649
-
does find an occurrence of "bar" that is not preceded by
1873
+
does find an occurrence of "bar" that is not preceded by
1650
1874
"foo". The contents of a lookbehind assertion are restricted
1651
-
such that all the strings it matches must have a fixed
1652
-
length. However, if there are several alternatives, they do
1875
+
such that all the strings it matches must have a fixed
1876
+
length. However, if there are several alternatives, they do
1653
1877
not all have to have the same fixed length. Thus
1654
1878

1655
1879
<literal>(?&lt;=bullock|donkey)</literal>
...
...
@@ -1658,51 +1882,51 @@
1658
1882

1659
1883
<literal>(?&lt;!dogs?|cats?)</literal>
1660
1884

1661
-
causes an error at compile time. Branches that match different
1885
+
causes an error at compile time. Branches that match different
1662
1886
length strings are permitted only at the top level of
1663
-
a lookbehind assertion. This is an extension compared with
1664
-
Perl 5.005, which requires all branches to match the same
1887
+
a lookbehind assertion. This is an extension compared with
1888
+
Perl 5.005, which requires all branches to match the same
1665
1889
length of string. An assertion such as
1666
1890

1667
1891
<literal>(?&lt;=ab(c|de))</literal>
1668
1892

1669
-
is not permitted, because its single top-level branch can
1893
+
is not permitted, because its single top-level branch can
1670
1894
match two different lengths, but it is acceptable if rewritten
1671
1895
to use two top-level branches:
1672
1896

1673
1897
<literal>(?&lt;=abc|abde)</literal>
1674
1898

1675
-
The implementation of lookbehind assertions is, for each
1676
-
alternative, to temporarily move the current position back
1677
-
by the fixed width and then try to match. If there are
1678
-
insufficient characters before the current position, the
1679
-
match is deemed to fail. Lookbehinds in conjunction with
1680
-
once-only subpatterns can be particularly useful for matching
1681
-
at the ends of strings; an example is given at the end
1899
+
The implementation of lookbehind assertions is, for each
1900
+
alternative, to temporarily move the current position back
1901
+
by the fixed width and then try to match. If there are
1902
+
insufficient characters before the current position, the
1903
+
match is deemed to fail. Lookbehinds in conjunction with
1904
+
once-only subpatterns can be particularly useful for matching
1905
+
at the ends of strings; an example is given at the end
1682
1906
of the section on once-only subpatterns.
1683
1907
</para>
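<para>
 A brief sketch of the lookbehind forms described above. Only patterns whose
 top-level alternatives each have a fixed length are used; the text that
 follows each assertion is added here for illustration.
</para>

```php
<?php

// Negative lookbehind: "bar" only when "foo" does not precede it.
$afterFoo = preg_match('/(?<!foo)bar/', 'foobar'); // 0
$afterBaz = preg_match('/(?<!foo)bar/', 'bazbar'); // 1

// Top-level alternatives may have different fixed lengths (7 and 6 here).
$cart = preg_match('/(?<=bullock|donkey) cart/', 'donkey cart'); // 1

// Rewritten with two top-level branches, as suggested in the text.
$rewritten = preg_match('/(?<=abc|abde)f/', 'abdef'); // 1

var_dump($afterFoo, $afterBaz, $cart, $rewritten);
```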
1684
1908
<para>
1685
-
Several assertions (of any sort) may occur in succession.
1909
+
Several assertions (of any sort) may occur in succession.
1686
1910
For example,
1687
1911

1688
1912
<literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
1689
1913

1690
-
matches "foo" preceded by three digits that are not "999".
1691
-
Notice that each of the assertions is applied independently
1692
-
at the same point in the subject string. First there is a
1693
-
check that the previous three characters are all digits,
1914
+
matches "foo" preceded by three digits that are not "999".
1915
+
Notice that each of the assertions is applied independently
1916
+
at the same point in the subject string. First there is a
1917
+
check that the previous three characters are all digits,
1694
1918
then there is a check that the same three characters are not
1695
-
"999". This pattern does not match "foo" preceded by six
1919
+
"999". This pattern does not match "foo" preceded by six
1696
1920
characters, the first three of which are digits and the last three
1697
-
of which are not "999". For example, it doesn't match
1921
+
of which are not "999". For example, it doesn't match
1698
1922
"123abcfoo". A pattern to do that is
1699
1923

1700
1924
<literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
1701
1925
</para>
1702
1926
<para>
1703
-
This time the first assertion looks at the preceding six
1704
-
characters, checking that the first three are digits, and
1705
-
then the second assertion checks that the preceding three
1927
+
This time the first assertion looks at the preceding six
1928
+
characters, checking that the first three are digits, and
1929
+
then the second assertion checks that the preceding three
1706
1930
characters are not "999".
1707
1931
</para>
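<para>
 The two patterns with successive assertions can be compared in a short
 sketch; the subjects are taken from, or modelled on, the text above.
</para>

```php
<?php

// Both assertions look at the same three characters before "foo".
$sameThree = '/(?<=\d{3})(?<!999)foo/';
$sameThreeResults = [
    '123foo'    => preg_match($sameThree, '123foo'),
    '999foo'    => preg_match($sameThree, '999foo'),
    '123abcfoo' => preg_match($sameThree, '123abcfoo'),
];
var_dump($sameThreeResults); // values 1, 0, 0

// The first assertion now covers six characters, the second only the last three.
$sixBack = '/(?<=\d{3}...)(?<!999)foo/';
$sixBackResults = [
    '123abcfoo' => preg_match($sixBack, '123abcfoo'),
    '123999foo' => preg_match($sixBack, '123999foo'),
];
var_dump($sixBackResults); // values 1, 0
```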
1708
1932
<para>
...
...
@@ -1710,26 +1934,26 @@
1710
1934

1711
1935
<literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
1712
1936

1713
-
matches an occurrence of "baz" that is preceded by "bar"
1937
+
matches an occurrence of "baz" that is preceded by "bar"
1714
1938
which in turn is not preceded by "foo", while
1715
1939

1716
1940
<literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>
1717
1941

1718
-
is another pattern which matches "foo" preceded by three
1942
+
is another pattern which matches "foo" preceded by three
1719
1943
digits and any three characters that are not "999".
1720
1944
</para>
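<para>
 Nested assertions can be tried out in the same way. This is a sketch with
 made-up subject strings, not an exhaustive test.
</para>

```php
<?php

// "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
$nested = '/(?<=(?<!foo)bar)baz/';
$nestedAfterFoo = preg_match($nested, 'foobarbaz'); // 0
$nestedAfterXyz = preg_match($nested, 'xyzbarbaz'); // 1

// The inner (?<!999) is checked after the three arbitrary characters.
$inner = '/(?<=\d{3}...(?<!999))foo/';
$innerAbc = preg_match($inner, '123abcfoo'); // 1
$inner999 = preg_match($inner, '123999foo'); // 0

var_dump($nestedAfterFoo, $nestedAfterXyz, $innerAbc, $inner999);
```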
1721
1945
<para>
1722
1946
Assertion subpatterns are not capturing subpatterns, and may
1723
-
not be repeated, because it makes no sense to assert the
1724
-
same thing several times. If any kind of assertion contains
1725
-
capturing subpatterns within it, these are counted for the
1947
+
not be repeated, because it makes no sense to assert the
1948
+
same thing several times. If any kind of assertion contains
1949
+
capturing subpatterns within it, these are counted for the
1726
1950
purposes of numbering the capturing subpatterns in the whole
1727
-
pattern. However, substring capturing is carried out only
1728
-
for positive assertions, because it does not make sense for
1951
+
pattern. However, substring capturing is carried out only
1952
+
for positive assertions, because it does not make sense for
1729
1953
negative assertions.
1730
1954
</para>
1731
1955
<para>
1732
-
Assertions count towards the maximum of 200 parenthesized
1956
+
Assertions count towards the maximum of 200 parenthesized
1733
1957
subpatterns.
1734
1958
</para>
1735
1959
</section>
...
...
@@ -1737,17 +1961,17 @@
1737
1961
<section xml:id="regexp.reference.onlyonce">
1738
1962
<title>Once-only subpatterns</title>
1739
1963
<para>
1740
-
With both maximizing and minimizing repetition, failure of
1741
-
what follows normally causes the repeated item to be
1964
+
With both maximizing and minimizing repetition, failure of
1965
+
what follows normally causes the repeated item to be
1742
1966
re-evaluated to see if a different number of repeats allows the
1743
-
rest of the pattern to match. Sometimes it is useful to
1744
-
prevent this, either to change the nature of the match, or
1745
-
to cause it fail earlier than it otherwise might, when the
1746
-
author of the pattern knows there is no point in carrying
1967
+
rest of the pattern to match. Sometimes it is useful to
1968
+
prevent this, either to change the nature of the match, or
1969
+
to cause it to fail earlier than it otherwise might, when the
1970
+
author of the pattern knows there is no point in carrying
1747
1971
on.
1748
1972
</para>
1749
1973
<para>
1750
-
Consider, for example, the pattern \d+foo when applied to
1974
+
Consider, for example, the pattern \d+foo when applied to
1751
1975
the subject line
1752
1976

1753
1977
<literal>123456bar</literal>
...
...
@@ -1755,108 +1979,108 @@
1755
1979
<para>
1756
1980
After matching all 6 digits and then failing to match "foo",
1757
1981
the normal action of the matcher is to try again with only 5
1758
-
digits matching the \d+ item, and then with 4, and so on,
1982
+
digits matching the \d+ item, and then with 4, and so on,
1759
1983
before ultimately failing. Once-only subpatterns provide the
1760
-
means for specifying that once a portion of the pattern has
1761
-
matched, it is not to be re-evaluated in this way, so the
1762
-
matcher would give up immediately on failing to match "foo"
1763
-
the first time. The notation is another kind of special
1984
+
means for specifying that once a portion of the pattern has
1985
+
matched, it is not to be re-evaluated in this way, so the
1986
+
matcher would give up immediately on failing to match "foo"
1987
+
the first time. The notation is another kind of special
1764
1988
parenthesis, starting with (?&gt; as in this example:
1765
1989

1766
1990
<literal>(?&gt;\d+)bar</literal>
1767
1991
</para>
1768
1992
<para>
1769
-
This kind of parenthesis "locks up" the part of the pattern
1770
-
it contains once it has matched, and a failure further into
1771
-
the pattern is prevented from backtracking into it.
1772
-
Backtracking past it to previous items, however, works as normal.
1993
+
This kind of parenthesis "locks up" the part of the pattern
1994
+
it contains once it has matched, and a failure further into
1995
+
the pattern is prevented from backtracking into it.
1996
+
Backtracking past it to previous items, however, works as normal.
1773
1997
</para>
1774
1998
<para>
1775
1999
An alternative description is that a subpattern of this type
1776
-
matches the string of characters that an identical standalone
2000
+
matches the string of characters that an identical standalone
1777
2001
pattern would match, if anchored at the current point
1778
2002
in the subject string.
1779
2003
</para>
1780
2004
<para>
1781
-
Once-only subpatterns are not capturing subpatterns. Simple
1782
-
cases such as the above example can be thought of as a maximizing
1783
-
repeat that must swallow everything it can. So,
2005
+
Once-only subpatterns are not capturing subpatterns. Simple
2006
+
cases such as the above example can be thought of as a maximizing
2007
+
repeat that must swallow everything it can. So,
1784
2008
while both \d+ and \d+? are prepared to adjust the number of
1785
-
digits they match in order to make the rest of the pattern
2009
+
digits they match in order to make the rest of the pattern
1786
2010
match, (?&gt;\d+) can only match an entire sequence of digits.
1787
2011
</para>
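<para>
 The effect of "locking up" a repeat can be made visible with a pattern that
 only matches if the repeat gives back a character. The pattern ending in
 <literal>5</literal> is an assumption introduced for this sketch.
</para>

```php
<?php

// \d+ first takes "12345", then gives back the final "5" so the literal 5 can match.
$greedy = preg_match('/\d+5/', '12345'); // 1

// (?>\d+) keeps every digit it matched, so nothing is left for the literal 5.
$atomic = preg_match('/(?>\d+)5/', '12345'); // 0

// When no backtracking is required, the once-only form matches as usual.
preg_match('/(?>\d+)bar/', '123456bar', $whole);

var_dump($greedy, $atomic, $whole[0]); // int(1), int(0), string(9) "123456bar"
```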
1788
2012
<para>
1789
-
This construction can of course contain arbitrarily complicated
2013
+
This construction can of course contain arbitrarily complicated
1790
2014
subpatterns, and it can be nested.
1791
2015
</para>
1792
2016
<para>
1793
2017
Once-only subpatterns can be used in conjunction with
1794
-
lookbehind assertions to specify efficient matching at the end
2018
+
lookbehind assertions to specify efficient matching at the end
1795
2019
of the subject string. Consider a simple pattern such as
1796
2020

1797
2021
<literal>abcd$</literal>
1798
2022

1799
-
when applied to a long string which does not match. Because
1800
-
matching proceeds from left to right, PCRE will look for
2023
+
when applied to a long string which does not match. Because
2024
+
matching proceeds from left to right, PCRE will look for
1801
2025
each "a" in the subject and then see if what follows matches
1802
2026
the rest of the pattern. If the pattern is specified as
1803
2027

1804
2028
<literal>^.*abcd$</literal>
1805
2029

1806
-
then the initial .* matches the entire string at first, but
1807
-
when this fails (because there is no following "a"), it
2030
+
then the initial .* matches the entire string at first, but
2031
+
when this fails (because there is no following "a"), it
1808
2032
backtracks to match all but the last character, then all but
1809
-
the last two characters, and so on. Once again the search
1810
-
for "a" covers the entire string, from right to left, so we
2033
+
the last two characters, and so on. Once again the search
2034
+
for "a" covers the entire string, from right to left, so we
1811
2035
are no better off. However, if the pattern is written as
1812
2036

1813
2037
<literal>^(?>.*)(?&lt;=abcd)</literal>
1814
2038

1815
-
then there can be no backtracking for the .* item; it can
1816
-
match only the entire string. The subsequent lookbehind
2039
+
then there can be no backtracking for the .* item; it can
2040
+
match only the entire string. The subsequent lookbehind
1817
2041
assertion does a single test on the last four characters. If
1818
-
it fails, the match fails immediately. For long strings,
2042
+
it fails, the match fails immediately. For long strings,
1819
2043
this approach makes a significant difference to the processing time.
1820
2044
</para>
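<para>
 A sketch of the end-of-string technique described above, assuming a
 single-line subject (without PCRE_DOTALL, <literal>.</literal> does not
 match a newline). The subject strings are illustrative.
</para>

```php
<?php

// .* is consumed once and never given back; the lookbehind then
// performs a single test on the last four characters.
$endsWithAbcd = '/^(?>.*)(?<=abcd)/';

$shortMatch = preg_match($endsWithAbcd, 'xyzabcd');                       // 1
$longMiss   = preg_match($endsWithAbcd, str_repeat('x', 10000) . 'abcx'); // 0

var_dump($shortMatch, $longMiss);
```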
1821
2045
<para>
1822
2046
When a pattern contains an unlimited repeat inside a subpattern
1823
2047
that can itself be repeated an unlimited number of
1824
-
times, the use of a once-only subpattern is the only way to
1825
-
avoid some failing matches taking a very long time indeed.
2048
+
times, the use of a once-only subpattern is the only way to
2049
+
avoid some failing matches taking a very long time indeed.
1826
2050
The pattern
1827
2051

1828
2052
<literal>(\D+|&lt;\d+>)*[!?]</literal>
1829
2053

1830
-
matches an unlimited number of substrings that either consist
1831
-
of non-digits, or digits enclosed in &lt;>, followed by
2054
+
matches an unlimited number of substrings that either consist
2055
+
of non-digits, or digits enclosed in &lt;>, followed by
1832
2056
either ! or ?. When it matches, it runs quickly. However, if
1833
2057
it is applied to
1834
2058

1835
2059
<literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
1836
2060

1837
-
it takes a long time before reporting failure. This is
2061
+
it takes a long time before reporting failure. This is
1838
2062
because the string can be divided between the two repeats in
1839
2063
a large number of ways, and all have to be tried. (The example
1840
-
used [!?] rather than a single character at the end,
1841
-
because both PCRE and Perl have an optimization that allows
1842
-
for fast failure when a single character is used. They
1843
-
remember the last single character that is required for a
1844
-
match, and fail early if it is not present in the string.)
2064
+
used [!?] rather than a single character at the end,
2065
+
because both PCRE and Perl have an optimization that allows
2066
+
for fast failure when a single character is used. They
2067
+
remember the last single character that is required for a
2068
+
match, and fail early if it is not present in the string.)
1845
2069
If the pattern is changed to
1846
2070

1847
2071
<literal>((?>\D+)|&lt;\d+>)*[!?]</literal>
1848
2072

1849
-
sequences of non-digits cannot be broken, and failure happens quickly.
2073
+
sequences of non-digits cannot be broken, and failure happens quickly.
1850
2074
</para>
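<para>
 The once-only version of the pattern can be tried as sketched below. Only
 the rewritten pattern is run here; the unprotected form is left out on
 purpose, because its behaviour on the failing subject depends on the
 configured backtracking limits.
</para>

```php
<?php

$protected = '/((?>\D+)|<\d+>)*[!?]/';

// A subject that matches: digits in angle brackets followed by "!".
preg_match($protected, '<123>!', $ok);
var_dump($ok[0]); // string(6) "<123>!"

// A run of letters with no "!" or "?": failure is reported without
// trying every way of splitting the run between the two repeats.
$miss = preg_match($protected, str_repeat('a', 52));
var_dump($miss); // int(0)
```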
1851
2075
</section>
1852
2076

1853
2077
<section xml:id="regexp.reference.conditional">
1854
2078
<title>Conditional subpatterns</title>
1855
2079
<para>
1856
-
It is possible to cause the matching process to obey a subpattern
1857
-
conditionally or to choose between two alternative
1858
-
subpatterns, depending on the result of an assertion, or
1859
-
whether a previous capturing subpattern matched or not. The
2080
+
It is possible to cause the matching process to obey a subpattern
2081
+
conditionally or to choose between two alternative
2082
+
subpatterns, depending on the result of an assertion, or
2083
+
whether a previous capturing subpattern matched or not. The
1860
2084
two possible forms of conditional subpattern are
1861
2085
</para>
1862
2086

...
...
@@ -1870,34 +2094,39 @@
1870
2094
</informalexample>
1871
2095
<para>
1872
2096
If the condition is satisfied, the yes-pattern is used; otherwise
1873
-
the no-pattern (if present) is used. If there are
2097
+
the no-pattern (if present) is used. If there are
1874
2098
more than two alternatives in the subpattern, a compile-time
1875
2099
error occurs.
1876
2100
</para>
1877
2101
<para>
1878
-
There are two kinds of condition. If the text between the
1879
-
parentheses consists of a sequence of digits, then the
1880
-
condition is satisfied if the capturing subpattern of that
1881
-
number has previously matched. Consider the following pattern,
1882
-
which contains non-significant white space to make it
1883
-
more readable (assume the <link
2102
+
There are two kinds of condition. If the text between the
2103
+
parentheses consists of a sequence of digits, then the
2104
+
condition is satisfied if the capturing subpattern of that
2105
+
number has previously matched. Consider the following pattern,
2106
+
which contains non-significant white space to make it
2107
+
more readable (assume the <link
1884
2108
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1885
-
option) and to divide it into three parts for ease of discussion:
1886
-
1887
-
<literal>( \( )? [^()]+ (?(1) \) )</literal>
1888
-
</para>
1889
-
<para>
1890
-
The first part matches an optional opening parenthesis, and
1891
-
if that character is present, sets it as the first captured
1892
-
substring. The second part matches one or more characters
1893
-
that are not parentheses. The third part is a conditional
1894
-
subpattern that tests whether the first set of parentheses
1895
-
matched or not. If they did, that is, if subject started
1896
-
with an opening parenthesis, the condition is &true;, and so
1897
-
the yes-pattern is executed and a closing parenthesis is
1898
-
required. Otherwise, since no-pattern is not present, the
1899
-
subpattern matches nothing. In other words, this pattern
1900
-
matches a sequence of non-parentheses, optionally enclosed
2109
+
option) and to divide it into three parts for ease of discussion:
2110
+
</para>
2111
+
<informalexample>
2112
+
<programlisting>
2113
+
<![CDATA[
2114
+
( \( )? [^()]+ (?(1) \) )
2115
+
]]>
2116
+
</programlisting>
2117
+
</informalexample>
2118
+
<para>
2119
+
The first part matches an optional opening parenthesis, and
2120
+
if that character is present, sets it as the first captured
2121
+
substring. The second part matches one or more characters
2122
+
that are not parentheses. The third part is a conditional
2123
+
subpattern that tests whether the first set of parentheses
2124
+
matched or not. If they did, that is, if the subject started
2125
+
with an opening parenthesis, the condition is &true;, and so
2126
+
the yes-pattern is executed and a closing parenthesis is
2127
+
required. Otherwise, since no-pattern is not present, the
2128
+
subpattern matches nothing. In other words, this pattern
2129
+
matches a sequence of non-parentheses, optionally enclosed
1901
2130
in parentheses.
1902
2131
</para>
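<para>
 The conditional pattern above can be exercised with a sketch like this one.
 The anchors <literal>^</literal> and <literal>$</literal> are added here so
 that the whole subject has to match; the x modifier enables PCRE_EXTENDED.
</para>

```php
<?php

$optionalParens = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

$parenResults = [];
foreach (['abc', '(abc)', '(abc', 'abc)'] as $subject) {
    // A closing parenthesis is required only if group 1 matched an opening one.
    $parenResults[$subject] = preg_match($optionalParens, $subject);
}
var_dump($parenResults); // abc => 1, (abc) => 1, (abc => 0, abc) => 0
```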
1903
2132
<para>
...
...
@@ -1906,10 +2135,10 @@
1906
2135
level", the condition is false.
1907
2136
</para>
1908
2137
<para>
1909
-
If the condition is not a sequence of digits or (R), it must be an
1910
-
assertion. This may be a positive or negative lookahead or
1911
-
lookbehind assertion. Consider this pattern, again containing
1912
-
non-significant white space, and with the two alternatives on
2138
+
If the condition is not a sequence of digits or (R), it must be an
2139
+
assertion. This may be a positive or negative lookahead or
2140
+
lookbehind assertion. Consider this pattern, again containing
2141
+
non-significant white space, and with the two alternatives on
1913
2142
the second line:
1914
2143
</para>
1915
2144

...
...
@@ -1917,18 +2146,18 @@
1917
2146
<programlisting>
1918
2147
<![CDATA[
1919
2148
(?(?=[^a-z]*[a-z])
1920
-
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2149
+
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1921
2150
]]>
1922
2151
</programlisting>
1923
2152
</informalexample>
1924
2153
<para>
1925
2154
The condition is a positive lookahead assertion that matches
1926
2155
an optional sequence of non-letters followed by a letter. In
1927
-
other words, it tests for the presence of at least one
1928
-
letter in the subject. If a letter is found, the subject is
1929
-
matched against the first alternative; otherwise it is
1930
-
matched against the second. This pattern matches strings in
1931
-
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2156
+
other words, it tests for the presence of at least one
2157
+
letter in the subject. If a letter is found, the subject is
2158
+
matched against the first alternative; otherwise it is
2159
+
matched against the second. This pattern matches strings in
2160
+
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1932
2161
letters and dd are digits.
1933
2162
</para>
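<para>
 A sketch of the assertion-based conditional, again with anchors added for
 illustration and the x modifier set so that the white space is ignored.
</para>

```php
<?php

$date = '/^ (?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) $/x';

$dateResults = [];
foreach (['12-abc-34', '12-34-56', '12-ab-34'] as $subject) {
    // A letter anywhere ahead selects the first alternative, otherwise the second.
    $dateResults[$subject] = preg_match($date, $subject);
}
var_dump($dateResults); // 12-abc-34 => 1, 12-34-56 => 1, 12-ab-34 => 0
```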
1934
2163
</section>
...
...
@@ -1936,31 +2165,66 @@
1936
2165
<section xml:id="regexp.reference.comments">
1937
2166
<title>Comments</title>
1938
2167
<para>
1939
-
The sequence (?# marks the start of a comment which
1940
-
continues up to the next closing parenthesis. Nested
2168
+
The sequence (?# marks the start of a comment which
2169
+
continues up to the next closing parenthesis. Nested
1941
2170
parentheses are not permitted. The characters that make up a
1942
2171
comment play no part in the pattern matching at all.
1943
2172
</para>
1944
2173
<para>
1945
2174
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1946
-
option is set, an unescaped # character outside a character class
2175
+
option is set, an unescaped # character outside a character class
1947
2176
introduces a comment that continues up to the next newline character
1948
2177
in the pattern.
1949
2178
</para>
2179
+
<para>
2180
+
<example>
2181
+
<title>Usage of comments in PCRE pattern</title>
2182
+
<programlisting role="php">
2183
+
<![CDATA[
2184
+
<?php
2185
+

2186
+
$subject = 'test';
2187
+

2188
+
/* (?# can be used to add comments without enabling PCRE_EXTENDED */
2189
+
$match = preg_match('/te(?# this is a comment)st/', $subject);
2190
+
var_dump($match);
2191
+

2192
+
/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */
2193
+
$match = preg_match('/te #~~~~
2194
+
st/', $subject);
2195
+
var_dump($match);
2196
+

2197
+
/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything
2198
+
that follows an unescaped # on the same line is ignored */
2199
+
$match = preg_match('/te #~~~~
2200
+
st/x', $subject);
2201
+
var_dump($match);
2202
+
]]>
2203
+
</programlisting>
2204
+
&example.outputs;
2205
+
<screen>
2206
+
<![CDATA[
2207
+
int(1)
2208
+
int(0)
2209
+
int(1)
2210
+
]]>
2211
+
</screen>
2212
+
</example>
2213
+
</para>
1950
2214
</section>
1951
2215

1952
2216
<section xml:id="regexp.reference.recursive">
1953
2217
<title>Recursive patterns</title>
1954
2218
<para>
1955
-
Consider the problem of matching a string in parentheses,
1956
-
allowing for unlimited nested parentheses. Without the use
1957
-
of recursion, the best that can be done is to use a pattern
1958
-
that matches up to some fixed depth of nesting. It is not
1959
-
possible to handle an arbitrary nesting depth. Perl 5.6 has
1960
-
provided an experimental facility that allows regular
1961
-
expressions to recurse (among other things). The special
1962
-
item (?R) is provided for the specific case of recursion.
1963
-
This PCRE pattern solves the parentheses problem (assume
2219
+
Consider the problem of matching a string in parentheses,
2220
+
allowing for unlimited nested parentheses. Without the use
2221
+
of recursion, the best that can be done is to use a pattern
2222
+
that matches up to some fixed depth of nesting. It is not
2223
+
possible to handle an arbitrary nesting depth. Perl 5.6 has
2224
+
provided an experimental facility that allows regular
2225
+
expressions to recurse (among other things). The special
2226
+
item (?R) is provided for the specific case of recursion.
2227
+
This PCRE pattern solves the parentheses problem (assume
1964
2228
the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1965
2229
option is set so that white space is
1966
2230
ignored):
...
...
@@ -1969,45 +2233,45 @@
1969
2233
</para>
1970
2234
<para>
1971
2235
First it matches an opening parenthesis. Then it matches any
1972
-
number of substrings which can either be a sequence of
1973
-
non-parentheses, or a recursive match of the pattern itself
2236
+
number of substrings which can either be a sequence of
2237
+
non-parentheses, or a recursive match of the pattern itself
1974
2238
(i.e. a correctly parenthesized substring). Finally there is
1975
2239
a closing parenthesis.
1976
2240
</para>
1977
2241
<para>
1978
-
This particular example pattern contains nested unlimited
2242
+
This particular example pattern contains nested unlimited
1979
2243
repeats, and so the use of a once-only subpattern for matching
1980
-
strings of non-parentheses is important when applying
1981
-
the pattern to strings that do not match. For example, when
2244
+
strings of non-parentheses is important when applying
2245
+
the pattern to strings that do not match. For example, when
1982
2246
it is applied to
1983
2247

1984
2248
<literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
1985
2249

1986
-
it yields "no match" quickly. However, if a once-only subpattern
1987
-
is not used, the match runs for a very long time
1988
-
indeed because there are so many different ways the + and *
1989
-
repeats can carve up the subject, and all have to be tested
2250
+
it yields "no match" quickly. However, if a once-only subpattern
2251
+
is not used, the match runs for a very long time
2252
+
indeed because there are so many different ways the + and *
2253
+
repeats can carve up the subject, and all have to be tested
1990
2254
before failure can be reported.
1991
2255
</para>
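<para>
 The effect of the once-only subpattern can be sketched in PHP as follows.
 This is an illustration under stated assumptions rather than a definitive
 test: because <function>preg_match</function> searches the whole subject,
 the sketch anchors the pattern with <literal>^</literal> and
 <literal>$</literal>, wraps it in capturing group 1, and recurses with
 <literal>(?1)</literal> instead of <literal>(?R)</literal> so that the
 anchors are not part of the recursion. The variable names and the length
 of the run of "a" characters are illustrative only.
 <example>
  <title>Failing quickly with a once-only subpattern (illustrative sketch)</title>
  <programlisting role="php">
<![CDATA[
<?php

// Anchored variant (an assumption of this sketch): group 1 is the
// parenthesized string, and (?1) recurses into group 1 only.
$pattern = '/^ ( \( ( (?>[^()]+) | (?1) )* \) ) $/x';

// Unbalanced subject: the first opening parenthesis is never closed.
$unbalanced = '(' . str_repeat('a', 53) . '()';
$balanced   = '(ab(cd)ef)';

// The once-only subpattern (?>[^()]+) never gives characters back,
// so the failure is reported after only a few attempts.
$unbalancedResult = preg_match($pattern, $unbalanced);
$balancedResult   = preg_match($pattern, $balanced);

var_dump($unbalancedResult); // int(0)
var_dump($balancedResult);   // int(1)
]]>
  </programlisting>
 </example>
</para>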
1992
2256
<para>
1993
-
The values set for any capturing subpatterns are those from
2257
+
The values set for any capturing subpatterns are those from
1994
2258
the outermost level of the recursion at which the subpattern
1995
2259
value is set. If the pattern above is matched against
1996
2260

1997
2261
<literal>(ab(cd)ef)</literal>
1998
2262

1999
-
the value for the capturing parentheses is "ef", which is
2000
-
the last value taken on at the top level. If additional
2263
+
the value for the capturing parentheses is "ef", which is
2264
+
the last value taken on at the top level. If additional
2001
2265
parentheses are added, giving
2002
2266

2003
2267
<literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>
2004
2268
then the string they capture
2005
2269
is "ab(cd)ef", the contents of the top level parentheses. If
2006
-
there are more than 15 capturing parentheses in a pattern,
2007
-
PCRE has to obtain extra memory to store data during a
2008
-
recursion, which it does by using pcre_malloc, freeing it
2009
-
via pcre_free afterwards. If no memory can be obtained, it
2010
-
saves data for the first 15 capturing parentheses only, as
2270
+
there are more than 15 capturing parentheses in a pattern,
2271
+
PCRE has to obtain extra memory to store data during a
2272
+
recursion, which it does by using pcre_malloc, freeing it
2273
+
via pcre_free afterwards. If no memory can be obtained, it
2274
+
saves data for the first 15 capturing parentheses only, as
2011
2275
there is no way to give an out-of-memory error from within a
2012
2276
recursion.
2013
2277
</para>
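<para>
 The captured values described above can be inspected with a short PHP
 sketch. It uses the pattern with the additional parentheses exactly as
 given, together with the <literal>x</literal> modifier so that white space
 in the pattern is ignored; the variable names are illustrative only.
 <example>
  <title>Captured values in a recursive pattern (illustrative sketch)</title>
  <programlisting role="php">
<![CDATA[
<?php

// Pattern with the additional capturing parentheses, as given above.
$pattern = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';

preg_match($pattern, '(ab(cd)ef)', $matches);

var_dump($matches[0]); // the whole match
var_dump($matches[1]); // contents of the top level parentheses
var_dump($matches[2]); // last value taken on at the top level
]]>
  </programlisting>
  &example.outputs;
  <screen>
<![CDATA[
string(10) "(ab(cd)ef)"
string(8) "ab(cd)ef"
string(2) "ef"
]]>
  </screen>
 </example>
</para>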
...
...
@@ -2016,7 +2280,7 @@
2016
2280
<literal>(?1)</literal>, <literal>(?2)</literal> and so on
2017
2281
can be used for recursive subpatterns too. It is also possible to use named
2018
2282
subpatterns: <literal>(?P&gt;name)</literal> or
2019
-
<literal>(?P&amp;name)</literal>.
2283
+
<literal>(?&amp;name)</literal>.
2020
2284
</para>
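<para>
 As a minimal sketch, the two named forms can be compared in PHP. The group
 name <literal>paren</literal>, the anchors, and the variable names are
 assumptions made for this illustration and do not come from the text above.
 <example>
  <title>Recursing into a named subpattern (illustrative sketch)</title>
  <programlisting role="php">
<![CDATA[
<?php

// Hypothetical group name "paren". The x modifier makes the white space
// inside the patterns insignificant.
$byAmpersand = '/^ (?<paren> \( (?: (?>[^()]+) | (?&paren) )* \) ) $/x';
$byP         = '/^ (?<paren> \( (?: (?>[^()]+) | (?P>paren) )* \) ) $/x';

$balanced   = '(a(b)c)';
$unbalanced = '(a(b)c';

// Both spellings refer to the same subpattern, so they agree on each subject.
$balancedResults   = [preg_match($byAmpersand, $balanced), preg_match($byP, $balanced)];
$unbalancedResults = [preg_match($byAmpersand, $unbalanced), preg_match($byP, $unbalanced)];

var_dump($balancedResults === [1, 1]);   // bool(true)
var_dump($unbalancedResults === [0, 0]); // bool(true)
]]>
  </programlisting>
 </example>
</para>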
2021
2285
<para>
2022
2286
If the syntax for a recursive subpattern reference (either by number or
...
...
@@ -2046,75 +2310,75 @@
2046
2310
<title>Performance</title>
2047
2311
<para>
2048
2312
Certain items that may appear in patterns are more efficient
2049
-
than others. It is more efficient to use a character class
2313
+
than others. It is more efficient to use a character class
2050
2314
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
2051
-
In general, the simplest construction that provides the
2052
-
required behaviour is usually the most efficient. Jeffrey
2053
-
Friedl's book contains a lot of discussion about optimizing
2315
+
In general, the simplest construction that provides the
2316
+
required behaviour is usually the most efficient. Jeffrey
2317
+
Friedl's book contains a lot of discussion about optimizing
2054
2318
regular expressions for efficient performance.
2055
2319
</para>
2056
2320
<para>
2057
2321
When a pattern begins with .* and the <link
2058
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2059
-
set, the pattern is implicitly anchored by PCRE, since it
2322
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2323
+
set, the pattern is implicitly anchored by PCRE, since it
2060
2324
can match only at the start of a subject string. However, if
2061
2325
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
2062
2326
is not set, PCRE cannot make this optimization,
2063
-
because the . metacharacter does not then match a newline,
2327
+
because the . metacharacter does not then match a newline,
2064
2328
and if the subject string contains newlines, the pattern may
2065
-
match from the character immediately following one of them
2329
+
match from the character immediately following one of them
2066
2330
instead of from the very start. For example, the pattern
2067
2331

2068
2332
<literal>(.*) second</literal>
2069
2333

2070
2334
matches the subject "first\nand second" (where \n stands for
2071
2335
a newline character) with the first captured substring being
2072
-
"and". In order to do this, PCRE has to retry the match
2336
+
"and". In order to do this, PCRE has to retry the match
2073
2337
starting after every newline in the subject.
2074
2338
</para>
2075
2339
<para>
2076
2340
If you are using such a pattern with subject strings that do
2077
-
not contain newlines, the best performance is obtained by
2341
+
not contain newlines, the best performance is obtained by
2078
2342
setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,
2079
-
or starting the pattern with ^.* to
2080
-
indicate explicit anchoring. That saves PCRE from having to
2343
+
or starting the pattern with ^.* to
2344
+
indicate explicit anchoring. That saves PCRE from having to
2081
2345
scan along the subject looking for a newline to restart at.
2082
2346
</para>
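<para>
 The behaviour described above can be observed with a short PHP sketch. The
 <literal>s</literal> modifier corresponds to PCRE_DOTALL; the second subject
 string and the variable names are assumptions made for this illustration.
 <example>
  <title>Leading .* with and without PCRE_DOTALL (illustrative sketch)</title>
  <programlisting role="php">
<![CDATA[
<?php

$subject = "first\nand second";

// Without PCRE_DOTALL, . does not match the newline, so the match
// can only succeed when it starts after the newline.
preg_match('/(.*) second/', $subject, $withoutDotall);

// With PCRE_DOTALL, .* can run from the very start of the subject.
preg_match('/(.*) second/s', $subject, $withDotall);

// For a subject without newlines, explicit anchoring with ^.* captures
// the same text as the PCRE_DOTALL form.
$plain = 'first and second';
preg_match('/^(.*) second/', $plain, $anchored);
preg_match('/(.*) second/s', $plain, $plainDotall);

var_dump($withoutDotall[1] === 'and');           // bool(true)
var_dump($withDotall[1] === "first\nand");       // bool(true)
var_dump($anchored[1] === $plainDotall[1]);      // bool(true)
]]>
  </programlisting>
 </example>
</para>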
2083
2347
<para>
2084
-
Beware of patterns that contain nested indefinite repeats.
2085
-
These can take a long time to run when applied to a string
2348
+
Beware of patterns that contain nested indefinite repeats.
2349
+
These can take a long time to run when applied to a string
2086
2350
that does not match. Consider the pattern fragment
2087
2351

2088
2352
<literal>(a+)*</literal>
2089
2353
</para>
2090
2354
<para>
2091
-
This can match "aaaa" in 33 different ways, and this number
2092
-
increases very rapidly as the string gets longer. (The *
2093
-
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2094
-
those cases other than 0, the + repeats can match different
2355
+
This can match "aaaa" in 33 different ways, and this number
2356
+
increases very rapidly as the string gets longer. (The *
2357
+
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2358
+
those cases other than 0, the + repeats can match different
2095
2359
numbers of times.) When the remainder of the pattern is such
2096
-
that the entire match is going to fail, PCRE has in principle
2097
-
to try every possible variation, and this can take an
2360
+
that the entire match is going to fail, PCRE has in principle
2361
+
to try every possible variation, and this can take an
2098
2362
extremely long time.
2099
2363
</para>
2100
2364
<para>
2101
-
An optimization catches some of the more simple cases such
2365
+
An optimization catches some of the simpler cases, such
2102
2366
as
2103
2367

2104
2368
<literal>(a+)*b</literal>
2105
2369

2106
-
where a literal character follows. Before embarking on the
2370
+
where a literal character follows. Before embarking on the
2107
2371
standard matching procedure, PCRE checks that there is a "b"
2108
-
later in the subject string, and if there is not, it fails
2109
-
the match immediately. However, when there is no following
2110
-
literal this optimization cannot be used. You can see the
2372
+
later in the subject string, and if there is not, it fails
2373
+
the match immediately. However, when there is no following
2374
+
literal this optimization cannot be used. You can see the
2111
2375
difference by comparing the behaviour of
2112
2376

2113
2377
<literal>(a+)*\d</literal>
2114
2378

2115
-
with the pattern above. The former gives a failure almost
2116
-
instantly when applied to a whole line of "a" characters,
2117
-
whereas the latter takes an appreciable time with strings
2379
+
with the pattern above. The pattern ending in <literal>b</literal> fails almost
2380
+
instantly when applied to a whole line of "a" characters,
2381
+
whereas the <literal>\d</literal> version takes an appreciable time with strings
2118
2382
longer than about 20 characters.
2119
2383
</para>
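<para>
 One way to avoid the problem, sketched below, is to make the inner repeat a
 once-only subpattern so that a run of "a" characters cannot be carved up in
 different ways. This rewrite is a suggestion made for illustration, not
 something PCRE does automatically, and the group no longer captures. The
 subject length and the variable names are assumptions. The original
 <literal>(a+)*\d</literal> form is deliberately not run here, because its
 outcome on a long failing subject depends on the configured backtracking
 limits.
 <example>
  <title>Avoiding nested indefinite repeats (illustrative sketch)</title>
  <programlisting role="php">
<![CDATA[
<?php

$subject = str_repeat('a', 30);

// Once-only variant of (a+)*\d : after a+ has consumed a run of "a"
// characters it cannot be re-split, so the failure is found quickly.
$noDigit = preg_match('/(?>a+)*\d/', $subject);

// When a digit follows, the once-only variant still matches.
$withDigit = preg_match('/(?>a+)*\d/', $subject . '7');

var_dump($noDigit);   // int(0)
var_dump($withDigit); // int(1)
]]>
  </programlisting>
 </example>
</para>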
2120
2384
</section>
2121
2385