reference/pcre/pattern.syntax.xml
77fe733a1ba9c961424adcb7c9af00c1f5443a77
...
...
@@ -1,28 +1,28 @@
1
1
<?xml version="1.0" encoding="utf-8"?>
2
2
<!-- $Revision$ -->
3
3
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
4
-
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook">
4
+
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
5
5
<title>Pattern Syntax</title>
6
6
<titleabbrev>PCRE regex syntax</titleabbrev>
7
7

8
8
<section xml:id="regexp.introduction">
9
9
<title>Introduction</title>
10
10
<para>
11
-
The syntax and semantics of the regular expressions
12
-
supported by PCRE are described below. Regular expressions are
13
-
also described in the Perl documentation and in a number of
14
-
other books, some of which have copious examples. Jeffrey
15
-
Friedl's "Mastering Regular Expressions", published by
16
-
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
11
+
The syntax and semantics of the regular expressions
12
+
supported by PCRE are described below. Regular expressions are
13
+
also described in the Perl documentation and in a number of
14
+
other books, some of which have copious examples. Jeffrey
15
+
Friedl's "Mastering Regular Expressions", published by
16
+
O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
17
17
The description here is intended as reference documentation.
18
18
</para>
19
19
<para>
20
-
A regular expression is a pattern that is matched against a
20
+
A regular expression is a pattern that is matched against a
21
21
subject string from left to right. Most characters stand for
22
22
themselves in a pattern, and match the corresponding
23
23
characters in the subject. As a trivial example, the pattern
24
24
<literal>The quick brown fox</literal>
25
-
matches a portion of a subject string that is identical to
25
+
matches a portion of a subject string that is identical to
26
26
itself.
27
27
</para>
28
28
</section>
...
...
@@ -32,6 +32,7 @@
32
32
When using the PCRE functions, it is required that the pattern is enclosed
33
33
by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,
34
34
non-backslash, non-whitespace character.
35
+
Leading whitespace before a valid delimiter is silently ignored.
35
36
</para>
36
37
<para>
37
38
Often used delimiters are forward slashes (<literal>/</literal>), hash
...
...
@@ -49,6 +50,26 @@
49
50
</informalexample>
50
51
</para>
51
52
<para>
53
+
It is also possible to use
54
+
bracket style delimiters where the opening and closing brackets are the
55
+
starting and ending delimiter, respectively. <literal>()</literal>,
56
+
<literal>{}</literal>, <literal>[]</literal> and <literal>&lt;&gt;</literal>
57
+
are all valid bracket style delimiter pairs.
58
+
<informalexample>
59
+
<programlisting>
60
+
<![CDATA[
61
+
(this [is] a (pattern))
62
+
{this [is] a (pattern)}
63
+
[this [is] a (pattern)]
64
+
<this [is] a (pattern)>
65
+
]]>
66
+
</programlisting>
67
+
</informalexample>
68
+
Bracket style delimiters do not need to be escaped when they are used as meta
69
+
characters within the pattern, but as with other delimiters they must be
70
+
escaped when they are used as literal characters.
71
+
</para>
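As a rough illustration of the bracket style delimiters described above, the following sketch uses `preg_match` with brace and parenthesis delimiters. The subject strings and variable names are made up for the example.

```php
<?php
// Brace delimiters: the {2} quantifiers inside the pattern are balanced
// meta-characters, so they do not need to be escaped.
$braces = preg_match('{^(\d{2})-(\d{2})$}', '12-34', $m);

// Parenthesis delimiters: parentheses meant as literal characters
// must be escaped with a backslash.
$parens = preg_match('(\(a\))', 'x(a)y');

var_dump($braces); // int(1)
var_dump($m[1]);   // string(2) "12"
var_dump($parens); // int(1)
```

If a pattern contains many literal brackets, a non-bracket delimiter such as `/` or `#` is usually easier to read.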
72
+
<para>
52
73
If the delimiter needs to be matched inside the pattern it must be
53
74
escaped using a backslash. If the delimiter appears often inside the
54
75
pattern, it is a good idea to choose another delimiter in order to increase
...
...
@@ -66,18 +87,6 @@
66
87
to specify the delimiter to be escaped.
67
88
</para>
68
89
<para>
69
-
In addition to the aforementioned delimiters, it is also possible to use
70
-
bracket style delimiters where the opening and closing brackets are the
71
-
starting and ending delimiter, respectively.
72
-
<informalexample>
73
-
<programlisting>
74
-
<![CDATA[
75
-
{this is a pattern}
76
-
]]>
77
-
</programlisting>
78
-
</informalexample>
79
-
</para>
80
-
<para>
81
90
You may add <link linkend="reference.pcre.pattern.modifiers">pattern
82
91
modifiers</link> after the ending delimiter. The following is an example
83
92
of case-insensitive matching:
...
...
@@ -93,103 +102,100 @@
93
102
<section xml:id="regexp.reference.meta">
94
103
<title>Meta-characters</title>
95
104
<para>
96
-
The power of regular expressions comes from the
105
+
The power of regular expressions comes from the
97
106
ability to include alternatives and repetitions in the
98
-
pattern. These are encoded in the pattern by the use of
99
-
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
107
+
pattern. These are encoded in the pattern by the use of
108
+
<emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
100
109
are interpreted in some special way.
101
110
</para>
102
111
<para>
103
-
There are two different sets of meta-characters: those that
104
-
are recognized anywhere in the pattern except within square
112
+
There are two different sets of meta-characters: those that
113
+
are recognized anywhere in the pattern except within square
105
114
brackets, and those that are recognized in square brackets.
106
115
Outside square brackets, the meta-characters are as follows:
107
-
<variablelist>
108
-
<varlistentry>
109
-
<term><emphasis>\</emphasis></term>
110
-
<listitem><simpara>general escape character with several uses</simpara></listitem>
111
-
</varlistentry>
112
-
<varlistentry>
113
-
<term><emphasis>^</emphasis></term>
114
-
<listitem><simpara>assert start of subject (or line, in multiline mode)</simpara></listitem>
115
-
</varlistentry>
116
-
<varlistentry>
117
-
<term><emphasis>$</emphasis></term>
118
-
<listitem><simpara>assert end of subject (or line, in multiline mode)</simpara></listitem>
119
-
</varlistentry>
120
-
<varlistentry>
121
-
<term><emphasis>.</emphasis></term>
122
-
<listitem><simpara>match any character except newline (by default)</simpara></listitem>
123
-
</varlistentry>
124
-
<varlistentry>
125
-
<term><emphasis>[</emphasis></term>
126
-
<listitem><simpara>start character class definition</simpara></listitem>
127
-
</varlistentry>
128
-
<varlistentry>
129
-
<term><emphasis>]</emphasis></term>
130
-
<listitem><simpara>end character class definition</simpara></listitem>
131
-
</varlistentry>
132
-
<varlistentry>
133
-
<term><emphasis>|</emphasis></term>
134
-
<listitem><simpara>start of alternative branch</simpara></listitem>
135
-
</varlistentry>
136
-
<varlistentry>
137
-
<term><emphasis>(</emphasis></term>
138
-
<listitem><simpara>start subpattern</simpara></listitem>
139
-
</varlistentry>
140
-
<varlistentry>
141
-
<term><emphasis>)</emphasis></term>
142
-
<listitem><simpara>end subpattern</simpara></listitem>
143
-
</varlistentry>
144
-
<varlistentry>
145
-
<term><emphasis>?</emphasis></term>
146
-
<listitem>
147
-
<simpara>
148
-
extends the meaning of (, also 0 or 1 quantifier, also makes greedy
149
-
quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)
150
-
</simpara>
151
-
</listitem>
152
-
</varlistentry>
153
-
<varlistentry>
154
-
<term><emphasis>*</emphasis></term>
155
-
<listitem><simpara>0 or more quantifier</simpara></listitem>
156
-
</varlistentry>
157
-
<varlistentry>
158
-
<term><emphasis>+</emphasis></term>
159
-
<listitem><simpara>1 or more quantifier</simpara></listitem>
160
-
</varlistentry>
161
-
<varlistentry>
162
-
<term><emphasis>{</emphasis></term>
163
-
<listitem><simpara>start min/max quantifier</simpara></listitem>
164
-
</varlistentry>
165
-
<varlistentry>
166
-
<term><emphasis>}</emphasis></term>
167
-
<listitem><simpara>end min/max quantifier</simpara></listitem>
168
-
</varlistentry>
169
-
</variablelist>
116
+

117
+
<table>
118
+
<title>Meta-characters outside square brackets</title>
119
+
<tgroup cols="2">
120
+
<thead>
121
+
<row>
122
+
<entry>Meta-character</entry><entry>Description</entry>
123
+
</row>
124
+
</thead>
125
+
<tbody>
126
+
<row>
127
+
<entry>\</entry><entry>general escape character with several uses</entry>
128
+
</row>
129
+
<row>
130
+
<entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
131
+
</row>
132
+
<row>
133
+
<entry>$</entry><entry>assert end of subject or before a terminating newline (or
134
+
end of line, in multiline mode)</entry>
135
+
</row>
136
+
<row>
137
+
<entry>.</entry><entry>match any character except newline (by default)</entry>
138
+
</row>
139
+
<row>
140
+
<entry>[</entry><entry>start character class definition</entry>
141
+
</row>
142
+
<row>
143
+
<entry>]</entry><entry>end character class definition</entry>
144
+
</row>
145
+
<row>
146
+
<entry>|</entry><entry>start of alternative branch</entry>
147
+
</row>
148
+
<row>
149
+
<entry>(</entry><entry>start subpattern</entry>
150
+
</row>
151
+
<row>
152
+
<entry>)</entry><entry>end subpattern</entry>
153
+
</row>
154
+
<row>
155
+
<entry>?</entry><entry>extends the meaning of (, also 0 or 1 quantifier, also makes greedy
156
+
quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)</entry>
157
+
</row>
158
+
<row>
159
+
<entry>*</entry><entry>0 or more quantifier</entry>
160
+
</row>
161
+
<row>
162
+
<entry>+</entry><entry>1 or more quantifier</entry>
163
+
</row>
164
+
<row>
165
+
<entry>{</entry><entry>start min/max quantifier</entry>
166
+
</row>
167
+
<row>
168
+
<entry>}</entry><entry>end min/max quantifier</entry>
169
+
</row>
170
+
</tbody>
171
+
</tgroup>
172
+
</table>
170
173

171
174
Part of a pattern that is in square brackets is called a
172
-
"character class". In a character class the only
175
+
<link linkend="regexp.reference.character-classes">character class</link>. In a character class the only
173
176
meta-characters are:
174
177

175
-
<variablelist>
176
-
<varlistentry>
177
-
<term><emphasis>\</emphasis></term>
178
-
<listitem><simpara>general escape character</simpara></listitem>
179
-
</varlistentry>
180
-
<varlistentry>
181
-
<term><emphasis>^</emphasis></term>
182
-
<listitem><simpara>negate the class, but only if the first character</simpara></listitem>
183
-
</varlistentry>
184
-
<varlistentry>
185
-
<term><emphasis>-</emphasis></term>
186
-
<listitem><simpara>indicates character range</simpara></listitem>
187
-
</varlistentry>
188
-
<varlistentry>
189
-
<term><emphasis>]</emphasis></term>
190
-
<listitem><simpara>terminates the character class</simpara></listitem>
191
-
</varlistentry>
192
-
</variablelist>
178
+
<table>
179
+
<title>Meta-characters inside square brackets (<emphasis>character classes</emphasis>)</title>
180
+
<tgroup cols="2">
181
+
<thead>
182
+
<row>
183
+
<entry>Meta-character</entry><entry>Description</entry>
184
+
</row>
185
+
</thead>
186
+
<tbody>
187
+
<row>
188
+
<entry>\</entry><entry>general escape character</entry>
189
+
</row>
190
+
<row>
191
+
<entry>^</entry><entry>negate the class, but only if the first character</entry>
192
+
</row>
193
+
<row>
194
+
<entry>-</entry><entry>indicates character range</entry>
195
+
</row>
196
+
</tbody>
197
+
</tgroup>
198
+
</table>
193
199

194
200
The following sections describe the use of each of the
195
201
meta-characters.
...
...
@@ -199,9 +205,9 @@
199
205
<section xml:id="regexp.reference.escape">
200
206
<title>Escape sequences</title>
201
207
<para>
202
-
The backslash character has several uses. Firstly, if it is
208
+
The backslash character has several uses. Firstly, if it is
203
209
followed by a non-alphanumeric character, it takes away any
204
-
special meaning that character may have. This use of
210
+
special meaning that character may have. This use of
205
211
backslash as an escape character applies both inside and
206
212
outside character classes.
207
213
</para>
...
...
@@ -210,7 +216,7 @@
210
216
"\*" in the pattern. This applies whether or not the
211
217
following character would otherwise be interpreted as a
212
218
meta-character, so it is always safe to precede a non-alphanumeric
213
-
with "\" to specify that it stands for itself. In
219
+
with "\" to specify that it stands for itself. In
214
220
particular, if you want to match a backslash, you write "\\".
215
221
</para>
216
222
<note>
...
...
@@ -232,10 +238,10 @@
232
238
<para>
233
239
A second use of backslash provides a way of encoding
234
240
non-printing characters in patterns in a visible manner. There
235
-
is no restriction on the appearance of non-printing characters,
241
+
is no restriction on the appearance of non-printing characters,
236
242
apart from the binary zero that terminates a pattern,
237
243
but when a pattern is being prepared by text editing, it is
238
-
usually easier to use one of the following escape sequences
244
+
usually easier to use one of the following escape sequences
239
245
than the binary character it represents:
240
246
</para>
241
247
<para>
...
...
@@ -297,6 +303,12 @@
297
303
</listitem>
298
304
</varlistentry>
299
305
<varlistentry>
306
+
<term><emphasis>\R</emphasis></term>
307
+
<listitem>
308
+
<simpara>line break: matches \n, \r and \r\n</simpara>
309
+
</listitem>
310
+
</varlistentry>
311
+
<varlistentry>
300
312
<term><emphasis>\t</emphasis></term>
301
313
<listitem>
302
314
<simpara>tab (hex 09)</simpara>
...
...
@@ -320,9 +332,9 @@
320
332
</para>
321
333
<para>
322
334
The precise effect of "<literal>\cx</literal>" is as follows:
323
-
if "<literal>x</literal>" is a lower case letter, it is converted
335
+
if "<literal>x</literal>" is a lower case letter, it is converted
324
336
to upper case. Then bit 6 of the character (hex 40) is inverted.
325
-
Thus "<literal>\cz</literal>" becomes hex 1A, but
337
+
Thus "<literal>\cz</literal>" becomes hex 1A, but
326
338
"<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
327
339
becomes hex 7B.
328
340
</para>
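The arithmetic behind `\cx` can be reproduced by hand. The helper below, `controlEscapeCode`, is a hypothetical function written only for this illustration; it is not part of PHP or PCRE. It upper-cases the character and flips bit 6 (hex 40), as the paragraph above describes.

```php
<?php
// Hypothetical helper: computes the code that "\cx" stands for.
function controlEscapeCode(string $x): int
{
    // Lower case ASCII letters are converted to upper case first;
    // other characters are left unchanged by strtoupper().
    $code = ord(strtoupper($x));
    // Then bit 6 (hex 40) is inverted.
    return $code ^ 0x40;
}

printf("%02X\n", controlEscapeCode('z')); // 1A
printf("%02X\n", controlEscapeCode('{')); // 3B
printf("%02X\n", controlEscapeCode(';')); // 7B

// Cross-check one case against the regex engine itself.
var_dump(preg_match('/\cz/', "\x1A")); // int(1)
```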
...
...
@@ -338,7 +350,7 @@
338
350
</para>
339
351
<para>
340
352
After "<literal>\0</literal>" up to two further octal digits are read.
341
-
In both cases, if there are fewer than two digits, just those that
353
+
In both cases, if there are fewer than two digits, just those that
342
354
are present are used. Thus the sequence "<literal>\0\x\07</literal>"
343
355
specifies two binary zeros followed by a BEL character. Make sure you
344
356
supply two digits after the initial zero if the character
...
...
@@ -347,20 +359,20 @@
347
359
<para>
348
360
The handling of a backslash followed by a digit other than 0
349
361
is complicated. Outside a character class, PCRE reads it
350
-
and any following digits as a decimal number. If the number
351
-
is less than 10, or if there have been at least that many
352
-
previous capturing left parentheses in the expression, the
353
-
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
354
-
of how this works is given later, following the discussion
362
+
and any following digits as a decimal number. If the number
363
+
is less than 10, or if there have been at least that many
364
+
previous capturing left parentheses in the expression, the
365
+
entire sequence is taken as a <emphasis>back reference</emphasis>. A description
366
+
of how this works is given later, following the discussion
355
367
of parenthesized subpatterns.
356
368
</para>
357
369
<para>
358
-
Inside a character class, or if the decimal number is
370
+
Inside a character class, or if the decimal number is
359
371
greater than 9 and there have not been that many capturing
360
372
subpatterns, PCRE re-reads up to three octal digits following
361
373
the backslash, and generates a single byte from the
362
374
least significant 8 bits of the value. Any subsequent digits
363
-
stand for themselves. For example:
375
+
stand for themselves. For example:
364
376
</para>
365
377
<para>
366
378
<variablelist>
...
...
@@ -428,7 +440,7 @@
428
440
digits are ever read.
429
441
</para>
430
442
<para>
431
-
All the sequences that define a single byte value can be
443
+
All the sequences that define a single byte value can be
432
444
used both inside and outside character classes. In addition,
433
445
inside a character class, the sequence "<literal>\b</literal>"
434
446
is interpreted as the backspace character (hex 08). Outside a character
...
...
@@ -450,11 +462,11 @@
450
462
</varlistentry>
451
463
<varlistentry>
452
464
<term><emphasis>\h</emphasis></term>
453
-
<listitem><simpara>any horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
465
+
<listitem><simpara>any horizontal whitespace character</simpara></listitem>
454
466
</varlistentry>
455
467
<varlistentry>
456
468
<term><emphasis>\H</emphasis></term>
457
-
<listitem><simpara>any character that is not a horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
469
+
<listitem><simpara>any character that is not a horizontal whitespace character</simpara></listitem>
458
470
</varlistentry>
459
471
<varlistentry>
460
472
<term><emphasis>\s</emphasis></term>
...
...
@@ -466,11 +478,11 @@
466
478
</varlistentry>
467
479
<varlistentry>
468
480
<term><emphasis>\v</emphasis></term>
469
-
<listitem><simpara>any vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
481
+
<listitem><simpara>any vertical whitespace character</simpara></listitem>
470
482
</varlistentry>
471
483
<varlistentry>
472
484
<term><emphasis>\V</emphasis></term>
473
-
<listitem><simpara>any character that is not a vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
485
+
<listitem><simpara>any character that is not a vertical whitespace character</simpara></listitem>
474
486
</varlistentry>
475
487
<varlistentry>
476
488
<term><emphasis>\w</emphasis></term>
...
...
@@ -488,8 +500,14 @@
488
500
matches one, and only one, of each pair.
489
501
</para>
490
502
<para>
503
+
The "whitespace" characters are HT (9), LF (10), FF (12), CR (13),
504
+
and space (32). However, if locale-specific matching is happening,
505
+
characters with code points in the range 128-255 may also be considered
506
+
as whitespace characters, for instance, NBSP (hex A0).
507
+
</para>
508
+
<para>
491
509
A "word" character is any letter or digit or the underscore
492
-
character, that is, any character which can be part of a
510
+
character, that is, any character which can be part of a
493
511
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
494
512
controlled by PCRE's character tables, and may vary if locale-specific
495
513
matching is taking place. For example, in the "fr" (French) locale, some
...
...
@@ -498,15 +516,15 @@
498
516
</para>
499
517
<para>
500
518
These character type sequences can appear both inside and
501
-
outside character classes. They each match one character of
502
-
the appropriate type. If the current matching point is at
519
+
outside character classes. They each match one character of
520
+
the appropriate type. If the current matching point is at
503
521
the end of the subject string, all of them fail, since there
504
522
is no character to match.
505
523
</para>
506
524
<para>
507
-
The fourth use of backslash is for certain simple
525
+
The fourth use of backslash is for certain simple
508
526
assertions. An assertion specifies a condition that has to be met
509
-
at a particular point in a match, without consuming any
527
+
at a particular point in a match, without consuming any
510
528
characters from the subject string. The use of subpatterns
511
529
for more complicated assertions is described below. The
512
530
backslashed assertions are
...
...
@@ -545,7 +563,7 @@
545
563
</variablelist>
546
564
</para>
547
565
<para>
548
-
These assertions may not appear in character classes (but
566
+
These assertions may not appear in character classes (but
549
567
note that "<literal>\b</literal>" has a different meaning, namely the backspace
550
568
character, inside a character class).
551
569
</para>
...
...
@@ -553,20 +571,20 @@
553
571
A word boundary is a position in the subject string where
554
572
the current character and the previous character do not both
555
573
match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
556
-
<literal>\w</literal> and the other matches
574
+
<literal>\w</literal> and the other matches
557
575
<literal>\W</literal>), or the start or end of the string if the first
558
576
or last character matches <literal>\w</literal>, respectively.
559
577
</para>
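A small sketch of the word boundary rule, using an invented subject string: `\bcat\b` matches the stand-alone word but not the "cat" embedded in "concat", because there both neighbouring characters match `\w`.

```php
<?php
$subject = 'cat concat cat.';

// PREG_OFFSET_CAPTURE records the byte offset of each match.
$count = preg_match_all('/\bcat\b/', $subject, $matches, PREG_OFFSET_CAPTURE);

var_dump($count);             // int(2)
var_dump($matches[0][0][1]);  // int(0)  - start of string, followed by a space
var_dump($matches[0][1][1]);  // int(11) - between a space and "."
```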
560
578
<para>
561
579
The <literal>\A</literal>, <literal>\Z</literal>, and
562
-
<literal>\z</literal> assertions differ from the traditional
563
-
circumflex and dollar (described below) in that they only
564
-
ever match at the very start and end of the subject string,
565
-
whatever options are set. They are not affected by the
580
+
<literal>\z</literal> assertions differ from the traditional
581
+
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)
582
+
in that they only ever match at the very start and end of the subject string,
583
+
whatever options are set. They are not affected by the
566
584
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
567
585
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
568
-
options. The difference between <literal>\Z</literal> and
569
-
<literal>\z</literal> is that <literal>\Z</literal> matches before a
586
+
options. The difference between <literal>\Z</literal> and
587
+
<literal>\z</literal> is that <literal>\Z</literal> matches before a
570
588
newline that is the last character of the string as well as at the end of
571
589
the string, whereas <literal>\z</literal> matches only at the end.
572
590
</para>
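The differences between `$`, `\Z`, `\z` and `\A` can be seen with a subject that ends in a newline. This is a minimal sketch; the subject strings are chosen only for illustration, and `D` and `m` are the PCRE_DOLLAR_ENDONLY and PCRE_MULTILINE modifiers.

```php
<?php
$subject = "foo\n";

$dollar     = preg_match('/foo$/',  $subject); // 1: $ also matches before a final newline
$dollarOnly = preg_match('/foo$/D', $subject); // 0: D restricts $ to the very end
$upperZ     = preg_match('/foo\Z/', $subject); // 1: \Z behaves like the default $
$lowerZ     = preg_match('/foo\z/', $subject); // 0: \z matches only at the very end

// \A ignores multiline mode, whereas ^ matches at every line start.
$caretLines = preg_match_all('/^\w+/m',  "one\ntwo"); // 2
$startOnly  = preg_match_all('/\A\w+/m', "one\ntwo"); // 1

var_dump($dollar, $dollarOnly, $upperZ, $lowerZ, $caretLines, $startOnly);
```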
...
...
@@ -583,12 +601,16 @@
583
601
regexp metacharacters in the pattern. For example:
584
602
<literal>\w+\Q.$.\E$</literal> will match one or more word characters,
585
603
followed by literals <literal>.$.</literal> and anchored at the end of
586
-
the string.
604
+
the string. Note that this does not change the behavior of
605
+
delimiters; for instance the pattern <literal>#\Q#\E#$</literal>
606
+
is not valid, because the second <literal>#</literal> marks the end
607
+
of the pattern, and the <literal>\E#</literal> is interpreted as invalid
608
+
modifiers.
587
609
</para>
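The following sketch exercises the `\Q...\E` example from the paragraph above and the delimiter caveat. Using `preg_quote` with its delimiter argument is one possible workaround shown here as a suggestion; the subject strings are invented.

```php
<?php
// Between \Q and \E the characters ".$." are taken literally.
$match   = preg_match('/\w+\Q.$.\E$/', 'abc.$.'); // 1
$noMatch = preg_match('/\w+\Q.$.\E$/', 'abcx$y'); // 0

// \Q does not protect the delimiter: the second "#" ends the pattern,
// so compilation fails and preg_match() returns false (warning suppressed).
$invalid = @preg_match('#\Q#\E#$', '#');

// One way around this: escape the delimiter explicitly.
$pattern = '#' . preg_quote('#', '#') . '$#';      // yields #\#$#
$fixed   = preg_match($pattern, 'a#');             // 1

var_dump($match, $noMatch, $invalid, $fixed);
```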
588
610

589
611
<para>
590
-
<literal>\K</literal> can be used to reset the match start since
591
-
PHP 5.2.4. For example, the pattern <literal>foo\Kbar</literal> matches
612
+
<literal>\K</literal> can be used to reset the match start.
613
+
For example, the pattern <literal>foo\Kbar</literal> matches
592
614
"foobar", but reports that it has matched "bar". The use of
593
615
<literal>\K</literal> does not interfere with the setting of captured
594
616
substrings. For example, when the pattern <literal>(foo)\Kbar</literal>
...
...
@@ -844,7 +866,7 @@
844
866
</tgroup>
845
867
</table>
846
868
<para>
847
-
Extended properties such as "Greek" or "InMusicalSymbols" are not
869
+
Extended properties such as <literal>InMusicalSymbols</literal> are not
848
870
supported by PCRE.
849
871
</para>
850
872
<para>
...
...
@@ -852,15 +874,193 @@
852
874
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
853
875
</para>
854
876
<para>
855
-
The <literal>\X</literal> escape matches any number of Unicode characters
856
-
that form an extended Unicode sequence. <literal>\X</literal> is equivalent
857
-
to <literal>(?>\PM\pM*)</literal>.
877
+
Sets of Unicode characters are defined as belonging to certain scripts. A
878
+
character from one of these sets can be matched using a script name. For
879
+
example:
880
+
</para>
881
+
<itemizedlist>
882
+
<listitem>
883
+
<simpara><literal>\p{Greek}</literal></simpara>
884
+
</listitem>
885
+
<listitem>
886
+
<simpara><literal>\P{Han}</literal></simpara>
887
+
</listitem>
888
+
</itemizedlist>
889
+
<para>
890
+
Those that are not part of an identified script are lumped together as
891
+
<literal>Common</literal>. The current list of scripts is:
892
+
</para>
893
+
<table>
894
+
<title>Supported scripts</title>
895
+
<tgroup cols="5">
896
+
<tbody>
897
+
<row>
898
+
<entry><literal>Arabic</literal></entry>
899
+
<entry><literal>Armenian</literal></entry>
900
+
<entry><literal>Avestan</literal></entry>
901
+
<entry><literal>Balinese</literal></entry>
902
+
<entry><literal>Bamum</literal></entry>
903
+
</row>
904
+
<row>
905
+
<entry><literal>Batak</literal></entry>
906
+
<entry><literal>Bengali</literal></entry>
907
+
<entry><literal>Bopomofo</literal></entry>
908
+
<entry><literal>Brahmi</literal></entry>
909
+
<entry><literal>Braille</literal></entry>
910
+
</row>
911
+
<row>
912
+
<entry><literal>Buginese</literal></entry>
913
+
<entry><literal>Buhid</literal></entry>
914
+
<entry><literal>Canadian_Aboriginal</literal></entry>
915
+
<entry><literal>Carian</literal></entry>
916
+
<entry><literal>Chakma</literal></entry>
917
+
</row>
918
+
<row>
919
+
<entry><literal>Cham</literal></entry>
920
+
<entry><literal>Cherokee</literal></entry>
921
+
<entry><literal>Common</literal></entry>
922
+
<entry><literal>Coptic</literal></entry>
923
+
<entry><literal>Cuneiform</literal></entry>
924
+
</row>
925
+
<row>
926
+
<entry><literal>Cypriot</literal></entry>
927
+
<entry><literal>Cyrillic</literal></entry>
928
+
<entry><literal>Deseret</literal></entry>
929
+
<entry><literal>Devanagari</literal></entry>
930
+
<entry><literal>Egyptian_Hieroglyphs</literal></entry>
931
+
</row>
932
+
<row>
933
+
<entry><literal>Ethiopic</literal></entry>
934
+
<entry><literal>Georgian</literal></entry>
935
+
<entry><literal>Glagolitic</literal></entry>
936
+
<entry><literal>Gothic</literal></entry>
937
+
<entry><literal>Greek</literal></entry>
938
+
</row>
939
+
<row>
940
+
<entry><literal>Gujarati</literal></entry>
941
+
<entry><literal>Gurmukhi</literal></entry>
942
+
<entry><literal>Han</literal></entry>
943
+
<entry><literal>Hangul</literal></entry>
944
+
<entry><literal>Hanunoo</literal></entry>
945
+
</row>
946
+
<row>
947
+
<entry><literal>Hebrew</literal></entry>
948
+
<entry><literal>Hiragana</literal></entry>
949
+
<entry><literal>Imperial_Aramaic</literal></entry>
950
+
<entry><literal>Inherited</literal></entry>
951
+
<entry><literal>Inscriptional_Pahlavi</literal></entry>
952
+
</row>
953
+
<row>
954
+
<entry><literal>Inscriptional_Parthian</literal></entry>
955
+
<entry><literal>Javanese</literal></entry>
956
+
<entry><literal>Kaithi</literal></entry>
957
+
<entry><literal>Kannada</literal></entry>
958
+
<entry><literal>Katakana</literal></entry>
959
+
</row>
960
+
<row>
961
+
<entry><literal>Kayah_Li</literal></entry>
962
+
<entry><literal>Kharoshthi</literal></entry>
963
+
<entry><literal>Khmer</literal></entry>
964
+
<entry><literal>Lao</literal></entry>
965
+
<entry><literal>Latin</literal></entry>
966
+
</row>
967
+
<row>
968
+
<entry><literal>Lepcha</literal></entry>
969
+
<entry><literal>Limbu</literal></entry>
970
+
<entry><literal>Linear_B</literal></entry>
971
+
<entry><literal>Lisu</literal></entry>
972
+
<entry><literal>Lycian</literal></entry>
973
+
</row>
974
+
<row>
975
+
<entry><literal>Lydian</literal></entry>
976
+
<entry><literal>Malayalam</literal></entry>
977
+
<entry><literal>Mandaic</literal></entry>
978
+
<entry><literal>Meetei_Mayek</literal></entry>
979
+
<entry><literal>Meroitic_Cursive</literal></entry>
980
+
</row>
981
+
<row>
982
+
<entry><literal>Meroitic_Hieroglyphs</literal></entry>
983
+
<entry><literal>Miao</literal></entry>
984
+
<entry><literal>Mongolian</literal></entry>
985
+
<entry><literal>Myanmar</literal></entry>
986
+
<entry><literal>New_Tai_Lue</literal></entry>
987
+
</row>
988
+
<row>
989
+
<entry><literal>Nko</literal></entry>
990
+
<entry><literal>Ogham</literal></entry>
991
+
<entry><literal>Old_Italic</literal></entry>
992
+
<entry><literal>Old_Persian</literal></entry>
993
+
<entry><literal>Old_South_Arabian</literal></entry>
994
+
</row>
995
+
<row>
996
+
<entry><literal>Old_Turkic</literal></entry>
997
+
<entry><literal>Ol_Chiki</literal></entry>
998
+
<entry><literal>Oriya</literal></entry>
999
+
<entry><literal>Osmanya</literal></entry>
1000
+
<entry><literal>Phags_Pa</literal></entry>
1001
+
</row>
1002
+
<row>
1003
+
<entry><literal>Phoenician</literal></entry>
1004
+
<entry><literal>Rejang</literal></entry>
1005
+
<entry><literal>Runic</literal></entry>
1006
+
<entry><literal>Samaritan</literal></entry>
1007
+
<entry><literal>Saurashtra</literal></entry>
1008
+
</row>
1009
+
<row>
1010
+
<entry><literal>Sharada</literal></entry>
1011
+
<entry><literal>Shavian</literal></entry>
1012
+
<entry><literal>Sinhala</literal></entry>
1013
+
<entry><literal>Sora_Sompeng</literal></entry>
1014
+
<entry><literal>Sundanese</literal></entry>
1015
+
</row>
1016
+
<row>
1017
+
<entry><literal>Syloti_Nagri</literal></entry>
1018
+
<entry><literal>Syriac</literal></entry>
1019
+
<entry><literal>Tagalog</literal></entry>
1020
+
<entry><literal>Tagbanwa</literal></entry>
1021
+
<entry><literal>Tai_Le</literal></entry>
1022
+
</row>
1023
+
<row>
1024
+
<entry><literal>Tai_Tham</literal></entry>
1025
+
<entry><literal>Tai_Viet</literal></entry>
1026
+
<entry><literal>Takri</literal></entry>
1027
+
<entry><literal>Tamil</literal></entry>
1028
+
<entry><literal>Telugu</literal></entry>
1029
+
</row>
1030
+
<row>
1031
+
<entry><literal>Thaana</literal></entry>
1032
+
<entry><literal>Thai</literal></entry>
1033
+
<entry><literal>Tibetan</literal></entry>
1034
+
<entry><literal>Tifinagh</literal></entry>
1035
+
<entry><literal>Ugaritic</literal></entry>
1036
+
</row>
1037
+
<row>
1038
+
<entry><literal>Vai</literal></entry>
1039
+
<entry><literal>Yi</literal></entry>
1040
+
<entry />
1041
+
<entry />
1042
+
<entry />
1043
+
<entry />
1044
+
</row>
1045
+
</tbody>
1046
+
</tgroup>
1047
+
</table>
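Script names from the table are used with `\p{...}` and `\P{...}`. The sketch below assumes UTF-8 input and the `u` modifier; the code points are written as `\u{...}` escapes so the example does not depend on the file encoding.

```php
<?php
$greek = "\u{03B1}\u{03B2}\u{03B3}"; // Greek small letters alpha, beta, gamma
$han   = "\u{6F22}";                 // a CJK ideograph

$allGreek   = preg_match('/^\p{Greek}+$/u', $greek); // 1
$latinGreek = preg_match('/^\p{Greek}+$/u', 'abc');  // 0
$hasHan     = preg_match('/\p{Han}/u', $han);        // 1
$noHan      = preg_match('/^\P{Han}+$/u', 'abc');    // 1: no character is Han

var_dump($allGreek, $latinGreek, $hasHan, $noHan);
```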
1048
+
<para>
1049
+
The <literal>\X</literal> escape matches a Unicode extended grapheme
1050
+
cluster. An extended grapheme cluster is one or more Unicode characters
1051
+
that combine to form a single glyph. In effect, this can be thought of as
1052
+
the Unicode equivalent of <literal>.</literal> as it will match one
1053
+
composed character, regardless of how many individual characters are
1054
+
actually used to render it.
858
1055
</para>
859
1056
<para>
860
-
That is, it matches a character without the "mark" property, followed
861
-
by zero or more characters with the "mark" property, and treats the
862
-
sequence as an atomic group (see below). Characters with the "mark"
863
-
property are typically accents that affect the preceding character.
1057
+
In versions of PCRE older than 8.32 (which corresponds to PHP versions
1058
+
before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>
1059
+
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1060
+
character without the "mark" property, followed by zero or more characters
1061
+
with the "mark" property, and treats the sequence as an atomic group (see
1062
+
below). Characters with the "mark" property are typically accents that
1063
+
affect the preceding character.
864
1064
</para>
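To make the difference between `\X` and `.` concrete, the sketch below counts matches in a string that contains a base letter followed by a combining accent. It assumes UTF-8 input and the `u` modifier; the subject string is invented for the example.

```php
<?php
// "e" + COMBINING ACUTE ACCENT (U+0301), followed by "a".
$subject = "e\u{0301}a";

$clusters   = preg_match_all('/\X/u', $subject, $x); // grapheme clusters
$codePoints = preg_match_all('/./u',  $subject, $d); // individual code points

var_dump($clusters);   // int(2): the accented "e" counts once
var_dump($codePoints); // int(3): "e", the accent, and "a"
```

For this input the older `(?>\PM\pM*)` definition gives the same count, since the accent has the "mark" property.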
865
1065
<para>
866
1066
Matching characters by Unicode property is not fast, because PCRE has
...
...
@@ -876,8 +1076,8 @@
876
1076
<para>
877
1077
Outside a character class, in the default matching mode, the
878
1078
circumflex character (<literal>^</literal>) is an assertion which
879
-
is true only if the current matching point is at the start of
880
-
the subject string. Inside a character class, circumflex (<literal>^</literal>)
1079
+
is true only if the current matching point is at the start of
1080
+
the subject string. Inside a character class, circumflex (<literal>^</literal>)
881
1081
has an entirely different meaning (see below).
882
1082
</para>
883
1083
<para>
...
...
@@ -892,12 +1092,12 @@
892
1092
</para>
893
1093
<para>
894
1094
A dollar character (<literal>$</literal>) is an assertion which is
895
-
&true; only if the current matching point is at the end of the subject
896
-
string, or immediately before a newline character that is the last
1095
+
&true; only if the current matching point is at the end of the subject
1096
+
string, or immediately before a newline character that is the last
897
1097
character in the string (by default). Dollar (<literal>$</literal>)
898
-
need not be the last character of the pattern if a number of
899
-
alternatives are involved, but it should be the last item in any branch
900
-
in which it appears. Dollar has no special meaning in a
1098
+
need not be the last character of the pattern if a number of
1099
+
alternatives are involved, but it should be the last item in any branch
1100
+
in which it appears. Dollar has no special meaning in a
901
1101
character class.
902
1102
</para>
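<para>
 The default behaviour of the dollar assertion can be sketched as follows.
 This is only an illustration: it uses Python's <literal>re</literal> module
 as a stand-in for PCRE, on the assumption that the two agree for this
 default-mode case, and the subject strings are made up for the example.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# "$" matches at the very end of the subject...
at_end = re.search(r"fox$", "quick brown fox")
# ...and also just before a newline that is the last character.
before_final_newline = re.search(r"fox$", "quick brown fox\n")
# A newline in the middle of the subject does not count in the default mode.
before_inner_newline = re.search(r"fox$", "fox\nhound")

print(bool(at_end), bool(before_final_newline), bool(before_inner_newline))
# prints: True True False
```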
903
1103
<para>
...
...
@@ -923,9 +1123,9 @@
923
1123
set.
924
1124
</para>
925
1125
<para>
926
-
Note that the sequences \A, \Z, and \z can be used to match
927
-
the start and end of the subject in both modes, and if all
928
-
branches of a pattern start with \A it is always anchored,
1126
+
Note that the sequences \A, \Z, and \z can be used to match
1127
+
the start and end of the subject in both modes, and if all
1128
+
branches of a pattern start with \A it is always anchored,
929
1129
whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
930
1130
is set or not.
931
1131
</para>
...
...
@@ -934,14 +1134,14 @@
934
1134
<section xml:id="regexp.reference.dot">
935
1135
<title>Dot</title>
936
1136
<para>
937
-
Outside a character class, a dot in the pattern matches any
938
-
one character in the subject, including a non-printing
939
-
character, but not (by default) newline. If the
1137
+
Outside a character class, a dot in the pattern matches any
1138
+
one character in the subject, including a non-printing
1139
+
character, but not (by default) newline. If the
940
1140
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
941
-
option is set, then dots match newlines as well. The
1141
+
option is set, then dots match newlines as well. The
942
1142
handling of dot is entirely independent of the handling of
943
-
circumflex and dollar, the only relationship being that they
944
-
both involve newline characters. Dot has no special meaning
1143
+
circumflex and dollar, the only relationship being that they
1144
+
both involve newline characters. Dot has no special meaning
945
1145
in a character class.
946
1146
</para>
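<para>
 A minimal sketch of the dot behaviour described above, using Python's
 <literal>re</literal> module as a stand-in for PCRE (an assumption made for
 illustration only). The flag <literal>re.DOTALL</literal> plays the role of
 PCRE_DOTALL here.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
subject = "a\nc"

# By default the dot does not match a newline.
default_match = re.search(r"a.c", subject)
# With the dot-all option set, the dot matches "\n" as well.
dotall_match = re.search(r"a.c", subject, re.DOTALL)
# Inside a character class the dot is an ordinary character.
dot_in_class_vs_x = re.fullmatch(r"[.]", "x")
dot_in_class_vs_dot = re.fullmatch(r"[.]", ".")
```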
947
1147
<para>
...
...
@@ -955,29 +1155,29 @@
955
1155
<title>Character classes</title>
956
1156
<para>
957
1157
An opening square bracket introduces a character class,
958
-
terminated by a closing square bracket. A closing square
959
-
bracket on its own is not special. If a closing square
960
-
bracket is required as a member of the class, it should be
1158
+
terminated by a closing square bracket. A closing square
1159
+
bracket on its own is not special. If a closing square
1160
+
bracket is required as a member of the class, it should be
961
1161
the first data character in the class (after an initial
962
1162
circumflex, if present) or escaped with a backslash.
963
1163
</para>
964
1164
<para>
965
1165
A character class matches a single character in the subject;
966
-
the character must be in the set of characters defined by
1166
+
the character must be in the set of characters defined by
967
1167
the class, unless the first character in the class is a
968
-
circumflex, in which case the subject character must not be in
969
-
the set defined by the class. If a circumflex is actually
970
-
required as a member of the class, ensure it is not the
1168
+
circumflex, in which case the subject character must not be in
1169
+
the set defined by the class. If a circumflex is actually
1170
+
required as a member of the class, ensure it is not the
971
1171
first character, or escape it with a backslash.
972
1172
</para>
973
1173
<para>
974
-
For example, the character class [aeiou] matches any lower
1174
+
For example, the character class [aeiou] matches any lower
975
1175
case vowel, while [^aeiou] matches any character that is not
976
-
a lower case vowel. Note that a circumflex is just a
977
-
convenient notation for specifying the characters which are in
978
-
the class by enumerating those that are not. It is not an
979
-
assertion: it still consumes a character from the subject
980
-
string, and fails if the current pointer is at the end of
1176
+
a lower case vowel. Note that a circumflex is just a
1177
+
convenient notation for specifying the characters which are in
1178
+
the class by enumerating those that are not. It is not an
1179
+
assertion: it still consumes a character from the subject
1180
+
string, and fails if the current pointer is at the end of
981
1181
the string.
982
1182
</para>
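<para>
 The point that a negated class still consumes a character can be checked
 with a short sketch. Python's <literal>re</literal> module is used as a
 stand-in for PCRE here (an assumption for illustration), and the subject
 strings are hypothetical.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# [aeiou] matches a lower case vowel; [^aeiou] matches any other character.
vowel = re.fullmatch(r"[aeiou]", "e")
non_vowel = re.fullmatch(r"[^aeiou]", "z")
vowel_rejected = re.fullmatch(r"[^aeiou]", "e")

# A negated class is not an assertion: it must consume a character,
# so it cannot match once the end of the subject has been reached.
at_end = re.fullmatch(r"x[^aeiou]", "x")
```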
983
1183
<para>
...
...
@@ -989,61 +1189,62 @@
989
1189
</para>
990
1190
<para>
991
1191
The newline character is never treated in any special way in
992
-
character classes, whatever the setting of the <link
1192
+
character classes, whatever the setting of the <link
993
1193
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
994
1194
or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
995
1195
options is. A class such as [^a] will always match a newline.
996
1196
</para>
997
1197
<para>
998
-
The minus (hyphen) character can be used to specify a range
999
-
of characters in a character class. For example, [d-m]
1000
-
matches any letter between d and m, inclusive. If a minus
1001
-
character is required in a class, it must be escaped with a
1198
+
The minus (hyphen) character can be used to specify a range
1199
+
of characters in a character class. For example, [d-m]
1200
+
matches any letter between d and m, inclusive. If a minus
1201
+
character is required in a class, it must be escaped with a
1002
1202
backslash or appear in a position where it cannot be
1003
1203
interpreted as indicating a range, typically as the first or last
1004
1204
character in the class.
1005
1205
</para>
1006
1206
<para>
1007
-
It is not possible to have the literal character "]" as the
1008
-
end character of a range. A pattern such as [W-]46] is
1207
+
It is not possible to have the literal character "]" as the
1208
+
end character of a range. A pattern such as [W-]46] is
1009
1209
interpreted as a class of two characters ("W" and "-")
1010
1210
followed by a literal string "46]", so it would match "W46]" or
1011
-
"-46]". However, if the "]" is escaped with a backslash it
1012
-
is interpreted as the end of range, so [W-\]46] is
1013
-
interpreted as a single class containing a range followed by two
1211
+
"-46]". However, if the "]" is escaped with a backslash it
1212
+
is interpreted as the end of range, so [W-\]46] is
1213
+
interpreted as a single class containing a range followed by two
1014
1214
separate characters. The octal or hexadecimal representation
1015
1215
of "]" can also be used to end a range.
1016
1216
</para>
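<para>
 The two readings of the closing square bracket can be compared in a short
 sketch. It assumes Python's <literal>re</literal> module as a stand-in for
 PCRE; for these two patterns the behaviour described above carries over.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# [W-]46] is the two-character class [W-] followed by the literal text "46]".
two_char_class = re.compile(r"[W-]46]")
# With the "]" escaped, W-\] is a range, and "4" and "6" are further members.
range_class = re.compile(r"[W-\]46]")

print(bool(two_char_class.fullmatch("-46]")), bool(range_class.fullmatch("X")))
# prints: True True
```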
1017
1217
<para>
1018
1218
Ranges operate in ASCII collating sequence. They can also be
1019
-
used for characters specified numerically, for example
1020
-
[\000-\037]. If a range that includes letters is used when
1021
-
case-insensitive (caseless) matching is set, it matches the
1022
-
letters in either case. For example, [W-c] is equivalent to
1219
+
used for characters specified numerically, for example
1220
+
[\000-\037]. If a range that includes letters is used when
1221
+
case-insensitive (caseless) matching is set, it matches the
1222
+
letters in either case. For example, [W-c] is equivalent to
1023
1223
[][\^_`wxyzabc], matched case-insensitively, and if character
1024
1224
tables for the "fr" locale are in use, [\xc8-\xcb] matches
1025
1225
accented E characters in both cases.
1026
1226
</para>
1027
1227
<para>
1028
-
The character types \d, \D, \s, \S, \w, and \W may also
1029
-
appear in a character class, and add the characters that
1228
+
The character types \d, \D, \s, \S, \w, and \W may also
1229
+
appear in a character class, and add the characters that
1030
1230
they match to the class. For example, [\dABCDEF] matches any
1031
-
hexadecimal digit. A circumflex can conveniently be used
1032
-
with the upper case character types to specify a more
1231
+
hexadecimal digit. A circumflex can conveniently be used
1232
+
with the upper case character types to specify a more
1033
1233
restricted set of characters than the matching lower case type.
1034
-
For example, the class [^\W_] matches any letter or digit,
1234
+
For example, the class [^\W_] matches any letter or digit,
1035
1235
but not underscore.
1036
1236
</para>
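<para>
 As a sketch of character types inside a class, the two classes mentioned
 above can be tried out with Python's <literal>re</literal> module standing
 in for PCRE (an assumption for illustration). The
 <literal>re.ASCII</literal> flag is specific to Python and is used so that
 \d and \W cover ASCII characters only.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# [\dABCDEF] adds the digits matched by \d to the listed letters.
hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
# [^\W_] reads as "a word character, but not underscore".
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)
```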
1037
1237
<para>
1038
-
All non-alphanumeric characters other than \, -, ^ (at the
1039
-
start) and the terminating ] are non-special in character
1238
+
All non-alphanumeric characters other than \, -, ^ (at the
1239
+
start) and the terminating ] are non-special in character
1040
1240
classes, but it does no harm if they are escaped. The pattern
1041
1241
terminator is always special and must be escaped when used
1042
1242
within an expression.
1043
1243
</para>
1044
1244
<para>
1045
1245
Perl supports the POSIX notation for character classes. This uses names
1046
-
enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also
1246
+
enclosed by <literal>[:</literal> and <literal>:]</literal> within
1247
+
the enclosing square brackets. PCRE also
1047
1248
supports this notation. For example, <literal>[01[:alpha:]%]</literal>
1048
1249
matches "0", "1", any alphabetic character, or "%". The supported class
1049
1250
names are:
...
...
@@ -1082,22 +1283,32 @@
1082
1283
<para>
1083
1284
In UTF-8 mode, characters with values greater than 128 do not match any
1084
1285
of the POSIX character classes.
1286
+
As of libpcre 8.10 some character classes are changed to use
1287
+
Unicode character properties, in which case the mentioned restriction does
1288
+
not apply. Refer to the <link xlink:href="&url.pcre.man;">PCRE(3) manual</link>
1289
+
for details.
1290
+
</para>
1291
+
<para>
1292
+
Unicode character properties can appear inside a character class. They
1293
+
cannot be part of a range. The minus (hyphen) character after a Unicode
1294
+
character class will match literally. Trying to end a range with a Unicode
1295
+
character property will result in a warning.
1085
1296
</para>
1086
1297
</section>
1087
1298

1088
1299
<section xml:id="regexp.reference.alternation">
1089
1300
<title>Alternation</title>
1090
1301
<para>
1091
-
Vertical bar characters are used to separate alternative
1302
+
Vertical bar characters are used to separate alternative
1092
1303
patterns. For example, the pattern
1093
1304
<literal>gilbert|sullivan</literal>
1094
1305
matches either "gilbert" or "sullivan". Any number of alternatives
1095
-
may appear, and an empty alternative is permitted
1096
-
(matching the empty string). The matching process tries
1097
-
each alternative in turn, from left to right, and the first
1098
-
one that succeeds is used. If the alternatives are within a
1099
-
subpattern (defined below), "succeeds" means matching the
1100
-
rest of the main pattern as well as the alternative in the
1306
+
may appear, and an empty alternative is permitted
1307
+
(matching the empty string). The matching process tries
1308
+
each alternative in turn, from left to right, and the first
1309
+
one that succeeds is used. If the alternatives are within a
1310
+
subpattern (defined below), "succeeds" means matching the
1311
+
rest of the main pattern as well as the alternative in the
1101
1312
subpattern.
1102
1313
</para>
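<para>
 The left-to-right rule for alternatives can be sketched as follows, using
 Python's <literal>re</literal> module as a stand-in for PCRE (an assumption
 for illustration). The patterns other than
 <literal>gilbert|sullivan</literal> are hypothetical.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
either = re.compile(r"gilbert|sullivan")

# Alternatives are tried from left to right and the first one that succeeds
# is used, even when a later alternative would match more text.
first_wins = re.match(r"a|ab", "ab").group()

# An empty alternative is permitted and matches the empty string.
empty_ok = re.fullmatch(r"(cat|)s", "s")
```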
1103
1314
</section>
...
...
@@ -1112,7 +1323,7 @@
1112
1323
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,
1113
1324
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1114
1325
and PCRE_DUPNAMES can be changed from within the pattern by
1115
-
a sequence of Perl option letters enclosed between "(?" and
1326
+
a sequence of Perl option letters enclosed between "(?" and
1116
1327
")". The option letters are:
1117
1328

1118
1329
<table>
...
...
@@ -1141,7 +1352,8 @@
1141
1352
</row>
1142
1353
<row>
1143
1354
<entry><literal>X</literal></entry>
1144
-
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link></entry>
1355
+
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>
1356
+
(no longer supported as of PHP 7.3.0)</entry>
1145
1357
</row>
1146
1358
<row>
1147
1359
<entry><literal>J</literal></entry>
...
...
@@ -1152,16 +1364,16 @@
1152
1364
</table>
1153
1365
</para>
1154
1366
<para>
1155
-
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1367
+
For example, (?im) sets case-insensitive (caseless), multiline matching. It is
1156
1368
also possible to unset these options by preceding the letter
1157
-
with a hyphen, and a combined setting and unsetting such as
1158
-
(?im-sx), which sets <link
1369
+
with a hyphen, and a combined setting and unsetting such as
1370
+
(?im-sx), which sets <link
1159
1371
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and
1160
1372
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
1161
1373
while unsetting <link
1162
1374
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and
1163
1375
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,
1164
-
is also permitted. If a letter appears both before and after the
1376
+
is also permitted. If a letter appears both before and after the
1165
1377
hyphen, the option is unset.
1166
1378
</para>
1167
1379
<para>
...
...
@@ -1171,14 +1383,14 @@
1171
1383
and "abC".
1172
1384
</para>
1173
1385
<para>
1174
-
If an option change occurs inside a subpattern, the effect
1175
-
is different. This is a change of behaviour in Perl 5.005.
1176
-
An option change inside a subpattern affects only that part
1386
+
If an option change occurs inside a subpattern, the effect
1387
+
is different. This is a change of behaviour in Perl 5.005.
1388
+
An option change inside a subpattern affects only that part
1177
1389
of the subpattern that follows it, so
1178
1390

1179
1391
<literal>(a(?i)b)c</literal>
1180
1392

1181
-
matches abc and aBc and no other strings (assuming <link
1393
+
matches "abc" and "aBc" and no other strings (assuming <link
1182
1394
linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not
1183
1395
used). By this means, options can be made to have different settings in
1184
1396
different parts of the pattern. Any changes made in one alternative do
...
...
@@ -1187,18 +1399,18 @@
1187
1399

1188
1400
<literal>(a(?i)b|c)</literal>
1189
1401

1190
-
matches "ab", "aB", "c", and "C", even though when matching
1402
+
matches "ab", "aB", "c", and "C", even though when matching
1191
1403
"C" the first branch is abandoned before the option setting.
1192
-
This is because the effects of option settings happen at
1193
-
compile time. There would be some very weird behaviour otherwise.
1404
+
This is because the effects of option settings happen at
1405
+
compile time. There would be some very weird behaviour otherwise.
1194
1406
</para>
1195
1407
<para>
1196
1408
The PCRE-specific options <link
1197
-
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1198
-
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1409
+
linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
1410
+
<link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
1199
1411
be changed in the same way as the Perl-compatible options by
1200
-
using the characters U and X respectively. The (?X) flag
1201
-
setting is special in that it must always occur earlier in
1412
+
using the characters U and X respectively. The (?X) flag
1413
+
setting is special in that it must always occur earlier in
1202
1414
the pattern than any of the additional features it turns on,
1203
1415
even when it is at top level. It is best put at the start.
1204
1416
</para>
...
...
@@ -1207,8 +1419,8 @@
1207
1419
<section xml:id="regexp.reference.subpatterns">
1208
1420
<title>Subpatterns</title>
1209
1421
<para>
1210
-
Subpatterns are delimited by parentheses (round brackets),
1211
-
which can be nested. Marking part of a pattern as a subpattern
1422
+
Subpatterns are delimited by parentheses (round brackets),
1423
+
which can be nested. Marking part of a pattern as a subpattern
1212
1424
does two things:
1213
1425
</para>
1214
1426
<orderedlist>
...
...
@@ -1237,30 +1449,30 @@
1237
1449

1238
1450
<literal>the ((red|white) (king|queen))</literal>
1239
1451

1240
-
the captured substrings are "red king", "red", and "king",
1452
+
the captured substrings are "red king", "red", and "king",
1241
1453
and are numbered 1, 2, and 3.
1242
1454
</para>
1243
1455
<para>
1244
-
The fact that plain parentheses fulfill two functions is not
1245
-
always helpful. There are often times when a grouping subpattern
1246
-
is required without a capturing requirement. If an
1456
+
The fact that plain parentheses fulfill two functions is not
1457
+
always helpful. There are often times when a grouping subpattern
1458
+
is required without a capturing requirement. If an
1247
1459
opening parenthesis is followed by "?:", the subpattern does
1248
-
not do any capturing, and is not counted when computing the
1460
+
not do any capturing, and is not counted when computing the
1249
1461
number of any subsequent capturing subpatterns. For example,
1250
-
if the string "the white queen" is matched against the
1462
+
if the string "the white queen" is matched against the
1251
1463
pattern
1252
1464

1253
1465
<literal>the ((?:red|white) (king|queen))</literal>
1254
1466

1255
-
the captured substrings are "white queen" and "queen", and
1256
-
are numbered 1 and 2. The maximum number of captured substrings
1257
-
is 99, and the maximum number of all subpatterns,
1258
-
both capturing and non-capturing, is 200.
1467
+
the captured substrings are "white queen" and "queen", and
1468
+
are numbered 1 and 2. The maximum number of captured substrings
1469
+
is 65535. It may not be possible to compile such large patterns,
1470
+
however, depending on the configuration options of libpcre.
1259
1471
</para>
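<para>
 The numbering of capturing and non-capturing subpatterns can be checked
 with a short sketch. It uses Python's <literal>re</literal> module as a
 stand-in for PCRE (an assumption for illustration) and matches the subject
 "the white queen" against both patterns.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
subject = "the white queen"

# Plain parentheses group and capture; groups are numbered by the position
# of their opening parenthesis.
capturing = re.search(r"the ((red|white) (king|queen))", subject).groups()
# "?:" keeps the grouping but drops the capture, so later groups renumber.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", subject).groups()

print(capturing)      # ('white queen', 'white', 'queen')
print(non_capturing)  # ('white queen', 'queen')
```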
1260
1472
<para>
1261
-
As a convenient shorthand, if any option settings are
1262
-
required at the start of a non-capturing subpattern, the
1263
-
option letters may appear between the "?" and the ":". Thus
1473
+
As a convenient shorthand, if any option settings are
1474
+
required at the start of a non-capturing subpattern, the
1475
+
option letters may appear between the "?" and the ":". Thus
1264
1476
the two patterns
1265
1477
</para>
1266
1478

...
...
@@ -1274,10 +1486,10 @@
1274
1486
</informalexample>
1275
1487

1276
1488
<para>
1277
-
match exactly the same set of strings. Because alternative
1278
-
branches are tried from left to right, and options are not
1279
-
reset until the end of the subpattern is reached, an option
1280
-
setting in one branch does affect subsequent branches, so
1489
+
match exactly the same set of strings. Because alternative
1490
+
branches are tried from left to right, and options are not
1491
+
reset until the end of the subpattern is reached, an option
1492
+
setting in one branch does affect subsequent branches, so
1281
1493
the above patterns match "SUNDAY" as well as "Saturday".
1282
1494
</para>
1283
1495

...
...
@@ -1285,7 +1497,7 @@
1285
1497
It is possible to name a subpattern using the syntax
1286
1498
<literal>(?P&lt;name&gt;pattern)</literal>. This subpattern will then
1287
1499
be indexed in the matches array by its normal numeric position and
1288
-
also by name. PHP 5.2.2 introduced two alternative syntaxes
1500
+
also by name. There are two alternative syntaxes
1289
1501
<literal>(?&lt;name&gt;pattern)</literal> and <literal>(?'name'pattern)</literal>.
1290
1502
</para>
1291
1503

...
...
@@ -1306,9 +1518,10 @@
1306
1518

1307
1519
<para>
1308
1520
Here <literal>Sun</literal> is stored in backreference 2, while
1309
-
backreference 1 is empty. Matching yields <literal>Sat</literal> in
1310
-
backreference 1 while backreference 2 does not exist. Changing the pattern
1311
-
to use the <literal>(?|</literal> fixes this problem:
1521
+
backreference 1 is empty. Matching <literal>Saturday</literal> yields
1522
+
<literal>Sat</literal> in backreference 1 while backreference 2 does
1523
+
not exist. Changing the pattern to use the <literal>(?|</literal> fixes
1524
+
this problem:
1312
1525
</para>
1313
1526

1314
1527
<informalexample>
...
...
@@ -1334,45 +1547,45 @@
1334
1547
<listitem><simpara>the . metacharacter</simpara></listitem>
1335
1548
<listitem><simpara>a character class</simpara></listitem>
1336
1549
<listitem><simpara>a back reference (see next section)</simpara></listitem>
1337
-
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1550
+
<listitem><simpara>a parenthesized subpattern (unless it is an assertion -
1338
1551
see below)</simpara></listitem>
1339
1552
</itemizedlist>
1340
1553
</para>
1341
1554
<para>
1342
-
The general repetition quantifier specifies a minimum and
1343
-
maximum number of permitted matches, by giving the two
1344
-
numbers in curly brackets (braces), separated by a comma.
1345
-
The numbers must be less than 65536, and the first must be
1555
+
The general repetition quantifier specifies a minimum and
1556
+
maximum number of permitted matches, by giving the two
1557
+
numbers in curly brackets (braces), separated by a comma.
1558
+
The numbers must be less than 65536, and the first must be
1346
1559
less than or equal to the second. For example:
1347
1560

1348
1561
<literal>z{2,4}</literal>
1349
1562

1350
-
matches "zz", "zzz", or "zzzz". A closing brace on its own
1563
+
matches "zz", "zzz", or "zzzz". A closing brace on its own
1351
1564
is not a special character. If the second number is omitted,
1352
-
but the comma is present, there is no upper limit; if the
1565
+
but the comma is present, there is no upper limit; if the
1353
1566
second number and the comma are both omitted, the quantifier
1354
1567
specifies an exact number of required matches. Thus
1355
1568

1356
1569
<literal>[aeiou]{3,}</literal>
1357
1570

1358
-
matches at least 3 successive vowels, but may match many
1571
+
matches at least 3 successive vowels, but may match many
1359
1572
more, while
1360
1573

1361
1574
<literal>\d{8}</literal>
1362
1575

1363
-
matches exactly 8 digits. An opening curly bracket that
1364
-
appears in a position where a quantifier is not allowed, or
1576
+
matches exactly 8 digits. An opening curly bracket that
1577
+
appears in a position where a quantifier is not allowed, or
1365
1578
one that does not match the syntax of a quantifier, is taken
1366
-
as a literal character. For example, {,6} is not a quantifier,
1579
+
as a literal character. For example, {,6} is not a quantifier,
1367
1580
but a literal string of four characters.
1368
1581
</para>
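<para>
 A sketch of the three quantifier forms above, using Python's
 <literal>re</literal> module as a stand-in for PCRE (an assumption for
 illustration). The <literal>{,6}</literal> case is left out on purpose:
 Python treats it as a quantifier with a minimum of zero, so it does not
 reproduce the behaviour described here.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# {2,4}: between two and four repetitions.
two_to_four = re.compile(r"z{2,4}")
# {3,}: at least three repetitions, with no upper limit.
three_or_more = re.compile(r"[aeiou]{3,}")
# {8}: exactly eight repetitions.
exactly_eight = re.compile(r"\d{8}")
```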
1369
1582
<para>
1370
-
The quantifier {0} is permitted, causing the expression to
1371
-
behave as if the previous item and the quantifier were not
1583
+
The quantifier {0} is permitted, causing the expression to
1584
+
behave as if the previous item and the quantifier were not
1372
1585
present.
1373
1586
</para>
1374
1587
<para>
1375
-
For convenience (and historical compatibility) the three
1588
+
For convenience (and historical compatibility) the three
1376
1589
most common quantifiers have single-character abbreviations:
1377
1590

1378
1591
<table>
...
...
@@ -1396,63 +1609,63 @@
1396
1609
</table>
1397
1610
</para>
1398
1611
<para>
1399
-
It is possible to construct infinite loops by following a
1400
-
subpattern that can match no characters with a quantifier
1612
+
It is possible to construct infinite loops by following a
1613
+
subpattern that can match no characters with a quantifier
1401
1614
that has no upper limit, for example:
1402
1615

1403
1616
<literal>(a?)*</literal>
1404
1617
</para>
1405
1618
<para>
1406
-
Earlier versions of Perl and PCRE used to give an error at
1407
-
compile time for such patterns. However, because there are
1408
-
cases where this can be useful, such patterns are now
1409
-
accepted, but if any repetition of the subpattern does in
1619
+
Earlier versions of Perl and PCRE used to give an error at
1620
+
compile time for such patterns. However, because there are
1621
+
cases where this can be useful, such patterns are now
1622
+
accepted, but if any repetition of the subpattern does in
1410
1623
fact match no characters, the loop is forcibly broken.
1411
1624
</para>
1412
1625
<para>
1413
-
By default, the quantifiers are "greedy", that is, they
1414
-
match as much as possible (up to the maximum number of permitted
1415
-
times), without causing the rest of the pattern to
1626
+
By default, the quantifiers are "greedy", that is, they
1627
+
match as much as possible (up to the maximum number of permitted
1628
+
times), without causing the rest of the pattern to
1416
1629
fail. The classic example of where this gives problems is in
1417
1630
trying to match comments in C programs. These appear between
1418
-
the sequences /* and */ and within the sequence, individual
1419
-
* and / characters may appear. An attempt to match C comments
1631
+
the sequences /* and */ and within the sequence, individual
1632
+
* and / characters may appear. An attempt to match C comments
1420
1633
by applying the pattern
1421
1634

1422
1635
<literal>/\*.*\*/</literal>
1423
1636

1424
1637
to the string
1425
1638

1426
-
<literal>/* first comment */ not comment /* second comment */</literal>
1639
+
<literal>/* first comment */ not comment /* second comment */</literal>
1427
1640

1428
-
fails, because it matches the entire string due to the
1429
-
greediness of the .* item.
1641
+
fails, because it matches the entire string due to the
1642
+
greediness of the .* item.
1430
1643
</para>
1431
1644
<para>
1432
-
However, if a quantifier is followed by a question mark,
1645
+
However, if a quantifier is followed by a question mark,
1433
1646
then it becomes lazy, and instead matches the minimum
1434
1647
number of times possible, so the pattern
1435
1648

1436
1649
<literal>/\*.*?\*/</literal>
1437
1650

1438
1651
does the right thing with the C comments. The meaning of the
1439
-
various quantifiers is not otherwise changed, just the preferred
1440
-
number of matches. Do not confuse this use of
1441
-
question mark with its use as a quantifier in its own right.
1652
+
various quantifiers is not otherwise changed, just the preferred
1653
+
number of matches. Do not confuse this use of
1654
+
question mark with its use as a quantifier in its own right.
1442
1655
Because it has two uses, it can sometimes appear doubled, as
1443
1656
in
1444
1657

1445
1658
<literal>\d??\d</literal>
1446
1659

1447
-
which matches one digit by preference, but can match two if
1660
+
which matches one digit by preference, but can match two if
1448
1661
that is the only way the rest of the pattern matches.
1449
1662
</para>
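<para>
 The difference between greedy and lazy quantifiers can be sketched with
 the C comment example. Python's <literal>re</literal> module is used as a
 stand-in for PCRE (an assumption for illustration), and the variable names
 are hypothetical.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the whole subject is one match.
greedy = re.findall(r"/\*.*\*/", subject)
# Lazy .*? stops at the first "*/", so each comment is matched separately.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```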
1450
1663
<para>
1451
1664
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>
1452
-
option is set (an option which is not
1453
-
available in Perl) then the quantifiers are not greedy by
1665
+
option is set (an option which is not
1666
+
available in Perl) then the quantifiers are not greedy by
1454
1667
default, but individual ones can be made greedy by following
1455
-
them with a question mark. In other words, it inverts the
1668
+
them with a question mark. In other words, it inverts the
1456
1669
default behaviour.
1457
1670
</para>
1458
1671
<para>
...
...
@@ -1464,41 +1677,41 @@
1464
1677
</para>
1465
1678
<para>
1466
1679
When a parenthesized subpattern is quantified with a minimum
1467
-
repeat count that is greater than 1 or with a limited maximum,
1468
-
more store is required for the compiled pattern, in
1680
+
repeat count that is greater than 1 or with a limited maximum,
1681
+
more store is required for the compiled pattern, in
1469
1682
proportion to the size of the minimum or maximum.
1470
1683
</para>
1471
1684
<para>
1472
-
If a pattern starts with .* or .{0,} and the <link
1685
+
If a pattern starts with .* or .{0,} and the <link
1473
1686
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
1474
1687
option (equivalent to Perl's /s) is set, thus allowing the .
1475
-
to match newlines, then the pattern is implicitly anchored,
1688
+
to match newlines, then the pattern is implicitly anchored,
1476
1689
because whatever follows will be tried against every character
1477
-
position in the subject string, so there is no point in
1478
-
retrying the overall match at any position after the first.
1690
+
position in the subject string, so there is no point in
1691
+
retrying the overall match at any position after the first.
1479
1692
PCRE treats such a pattern as though it were preceded by \A.
1480
-
In cases where it is known that the subject string contains
1693
+
In cases where it is known that the subject string contains
1481
1694
no newlines, it is worth setting <link
1482
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1695
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the
1483
1696
pattern begins with .* in order to
1484
1697
obtain this optimization, or
1485
1698
alternatively using ^ to indicate anchoring explicitly.
1486
1699
</para>
1487
1700
<para>
1488
-
When a capturing subpattern is repeated, the value captured
1701
+
When a capturing subpattern is repeated, the value captured
1489
1702
is the substring that matched the final iteration. For example, after
1490
1703

1491
1704
<literal>(tweedle[dume]{3}\s*)+</literal>
1492
1705

1493
-
has matched "tweedledum tweedledee" the value of the captured
1494
-
substring is "tweedledee". However, if there are
1495
-
nested capturing subpatterns, the corresponding captured
1496
-
values may have been set in previous iterations. For example,
1706
+
has matched "tweedledum tweedledee" the value of the captured
1707
+
substring is "tweedledee". However, if there are
1708
+
nested capturing subpatterns, the corresponding captured
1709
+
values may have been set in previous iterations. For example,
1497
1710
after
1498
1711

1499
1712
<literal>/(a|(b))+/</literal>
1500
1713

1501
-
matches "aba" the value of the second captured substring is
1714
+
matches "aba" the value of the second captured substring is
1502
1715
"b".
1503
1716
</para>
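<para>
 What a repeated capturing subpattern holds after a match can be inspected
 with a short sketch. It assumes Python's <literal>re</literal> module as a
 stand-in for PCRE; for these two patterns it shows the behaviour described
 above.
</para>

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# Only the substring matched by the final iteration is kept.
last_iteration = re.match(
    r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee"
).group(1)

# The inner group (b) takes no part in the final iteration,
# yet it keeps the value that an earlier iteration set.
m = re.match(r"(a|(b))+", "aba")
outer, inner = m.group(1), m.group(2)
```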
1504
1717
</section>
...
...
@@ -1506,78 +1719,78 @@
1506
1719
<section xml:id="regexp.reference.back-references">
1507
1720
<title>Back references</title>
1508
1721
<para>
1509
-
Outside a character class, a backslash followed by a digit
1510
-
greater than 0 (and possibly further digits) is a back
1511
-
reference to a capturing subpattern earlier (i.e. to its
1512
-
left) in the pattern, provided there have been that many
1722
+
Outside a character class, a backslash followed by a digit
1723
+
greater than 0 (and possibly further digits) is a back
1724
+
reference to a capturing subpattern earlier (i.e. to its
1725
+
left) in the pattern, provided there have been that many
1513
1726
previous capturing left parentheses.
1514
1727
</para>
1515
1728
<para>
1516
-
However, if the decimal number following the backslash is
1517
-
less than 10, it is always taken as a back reference, and
1518
-
causes an error only if there are not that many capturing
1519
-
left parentheses in the entire pattern. In other words, the
1520
-
parentheses that are referenced need not be to the left of
1521
-
the reference for numbers less than 10.
1729
+
However, if the decimal number following the backslash is
1730
+
less than 10, it is always taken as a back reference, and
1731
+
causes an error only if there are not that many capturing
1732
+
left parentheses in the entire pattern. In other words, the
1733
+
parentheses that are referenced need not be to the left of
1734
+
the reference for numbers less than 10.
1522
1735
A "forward back reference" can make sense when a repetition
1523
1736
is involved and the subpattern to the right has participated
1524
1737
in an earlier iteration. See the section
1525
-
entitled "Backslash" above for further details of the handling
1738
+
on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling
1526
1739
of digits following a backslash.
1527
1740
</para>
1528
1741
<para>
1529
-
A back reference matches whatever actually matched the capturing
1742
+
A back reference matches whatever actually matched the capturing
1530
1743
subpattern in the current subject string, rather than
1531
1744
anything matching the subpattern itself. So the pattern
1532
1745

1533
1746
<literal>(sens|respons)e and \1ibility</literal>
1534
1747

1535
-
matches "sense and sensibility" and "response and responsibility",
1536
-
but not "sense and responsibility". If case-sensitive (caseful)
1748
+
matches "sense and sensibility" and "response and responsibility",
1749
+
but not "sense and responsibility". If case-sensitive (caseful)
1537
1750
matching is in force at the time of the back reference, then
1538
1751
the case of letters is relevant. For example,
1539
1752

1540
1753
<literal>((?i)rah)\s+\1</literal>
1541
1754

1542
-
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1543
-
though the original capturing subpattern is matched
1755
+
matches "rah rah" and "RAH RAH", but not "RAH rah", even
1756
+
though the original capturing subpattern is matched
1544
1757
case-insensitively (caselessly).
1545
1758
</para>
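<para>
 The two patterns above can be tried directly with
 <function>preg_match</function>. The following is a minimal sketch that
 reuses only the patterns and subjects quoted in the text; the variable names
 are illustrative.
</para>

```php
<?php
// Patterns and subjects are the ones quoted in the text above.
$pattern = '/(sens|respons)e and \1ibility/';
$subjects = [
    'sense and sensibility',
    'response and responsibility',
    'sense and responsibility',
];

$results = [];
foreach ($subjects as $subject) {
    // preg_match() returns 1 on a match and 0 when there is no match.
    $results[$subject] = preg_match($pattern, $subject);
}

// (?i) applies only inside group 1; the back reference itself is caseful.
$rah = '/((?i)rah)\s+\1/';
$rahResults = [];
foreach (['rah rah', 'RAH RAH', 'RAH rah'] as $subject) {
    $rahResults[$subject] = preg_match($rah, $subject);
}

var_dump($results, $rahResults);
```

<para>
 Only "sense and responsibility" and "RAH rah" are reported as non-matching,
 because the back reference must repeat the exact text that the group captured.
</para>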
1546
1759
<para>
1547
-
There may be more than one back reference to the same subpattern.
1548
-
If a subpattern has not actually been used in a
1549
-
particular match, then any back references to it always
1760
+
There may be more than one back reference to the same subpattern.
1761
+
If a subpattern has not actually been used in a
1762
+
particular match, then any back references to it always
1550
1763
fail. For example, the pattern
1551
1764

1552
1765
<literal>(a|(bc))\2</literal>
1553
1766

1554
-
always fails if it starts to match "a" rather than "bc".
1555
-
Because there may be up to 99 back references, all digits
1556
-
following the backslash are taken as part of a potential
1767
+
always fails if it starts to match "a" rather than "bc".
1768
+
Because there may be up to 99 back references, all digits
1769
+
following the backslash are taken as part of a potential
1557
1770
back reference number. If the pattern continues with a digit
1558
1771
character, then some delimiter must be used to terminate the
1559
1772
back reference. If the <link
1560
-
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1561
-
is set, this can be whitespace. Otherwise an empty comment can be used.
1773
+
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option
1774
+
is set, this can be whitespace. Otherwise an empty comment can be used.
1562
1775
</para>
1563
1776
<para>
1564
1777
A back reference that occurs inside the parentheses to which
1565
-
it refers fails when the subpattern is first used, so, for
1566
-
example, (a\1) never matches. However, such references can
1778
+
it refers fails when the subpattern is first used, so, for
1779
+
example, (a\1) never matches. However, such references can
1567
1780
be useful inside repeated subpatterns. For example, the pattern
1568
1781

1569
1782
<literal>(a|b\1)+</literal>
1570
1783

1571
-
matches any number of "a"s and also "aba", "ababba" etc. At
1784
+
matches any number of "a"s and also "aba", "ababba" etc. At
1572
1785
each iteration of the subpattern, the back reference matches
1573
-
the character string corresponding to the previous iteration.
1786
+
the character string corresponding to the previous iteration.
1574
1787
In order for this to work, the pattern must be such
1575
-
that the first iteration does not need to match the back
1576
-
reference. This can be done using alternation, as in the
1788
+
that the first iteration does not need to match the back
1789
+
reference. This can be done using alternation, as in the
1577
1790
example above, or by a quantifier with a minimum of zero.
1578
1791
</para>
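<para>
 As an illustration of a back reference inside a repeated subpattern, the
 sketch below anchors the pattern with <literal>^</literal> and
 <literal>$</literal> (an addition made here so that the whole subject has to
 match) and checks the subjects mentioned above.
</para>

```php
<?php
// ^ and $ are added for this sketch so the whole subject must match.
$pattern = '/^(a|b\1)+$/';

$matches = [];
$aba    = preg_match($pattern, 'aba', $matches); // "a", then "b" followed by \1 = "a"
$ababba = preg_match($pattern, 'ababba');        // iterations: "a", "ba", "bba"
$ab     = preg_match($pattern, 'ab');            // "b" is not followed by the previous iteration

// A reference inside its own group, with no alternative, can never match.
$never = preg_match('/^(a\1)$/', 'aa');

var_dump($aba, $matches[1], $ababba, $ab, $never);
```

<para>
 After "aba" has matched, the captured substring is the text of the last
 iteration, "ba".
</para>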
1579
1792
<para>
1580
-
As of PHP 5.2.2, the <literal>\g</literal> escape sequence can be
1793
+
The <literal>\g</literal> escape sequence can be
1581
1794
used for absolute and relative referencing of subpatterns.
1582
1795
This escape sequence must be followed by an unsigned number or a negative
1583
1796
number, optionally enclosed in braces. The sequences <literal>\1</literal>,
...
...
@@ -1598,28 +1811,28 @@
1598
1811
</para>
1599
1812
<para>
1600
1813
Back references to the named subpatterns can be achieved by
1601
-
<literal>(?P=name)</literal> or, since PHP 5.2.2, also by
1602
-
<literal>\k&lt;name&gt;</literal> or <literal>\k'name'</literal>.
1603
-
Additionally PHP 5.2.4 added support for <literal>\k{name}</literal>
1604
-
and <literal>\g{name}</literal>.
1814
+
<literal>(?P=name)</literal>,
1815
+
<literal>\k&lt;name&gt;</literal>, <literal>\k'name'</literal>,
1816
+
<literal>\k{name}</literal>, <literal>\g{name}</literal>,
1817
+
<literal>\g&lt;name&gt;</literal> or <literal>\g'name'</literal>.
1605
1818
</para>
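<para>
 A short sketch of some of the named back reference spellings follows. The
 group name <literal>word</literal> and the subjects are hypothetical, chosen
 only for illustration; each pattern looks for a word that is immediately
 repeated.
</para>

```php
<?php
// "word" is a hypothetical group name used only in this sketch.
$patterns = [
    '(?P=name)' => '/(?P<word>\w+)\s+(?P=word)/',
    '\k<name>'  => '/(?P<word>\w+)\s+\k<word>/',
    '\k{name}'  => '/(?P<word>\w+)\s+\k{word}/',
    '\g{name}'  => '/(?P<word>\w+)\s+\g{word}/',
];

$doubled  = [];
$distinct = [];
foreach ($patterns as $syntax => $pattern) {
    $doubled[$syntax]  = preg_match($pattern, 'hello hello'); // repeated word
    $distinct[$syntax] = preg_match($pattern, 'hello world'); // no repetition
}

var_dump($doubled, $distinct);
```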
1606
1819
</section>
1607
1820

1608
1821
<section xml:id="regexp.reference.assertions">
1609
1822
<title>Assertions</title>
1610
1823
<para>
1611
-
An assertion is a test on the characters following or
1612
-
preceding the current matching point that does not actually
1613
-
consume any characters. The simple assertions coded as \b,
1614
-
\B, \A, \Z, \z, ^ and $ are described above. More complicated
1615
-
assertions are coded as subpatterns. There are two
1616
-
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1824
+
An assertion is a test on the characters following or
1825
+
preceding the current matching point that does not actually
1826
+
consume any characters. The simple assertions coded as \b,
1827
+
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated
1828
+
assertions are coded as subpatterns. There are two
1829
+
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1617
1830
subject string, and those that <emphasis>look behind</emphasis> it.
1618
1831
</para>
1619
1832
<para>
1620
1833
An assertion subpattern is matched in the normal way, except
1621
-
that it does not cause the current matching position to be
1622
-
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1834
+
that it does not cause the current matching position to be
1835
+
changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive
1623
1836
assertions and (?! for negative assertions. For example,
1624
1837

1625
1838
<literal>\w+(?=;)</literal>
...
...
@@ -1629,27 +1842,27 @@
1629
1842

1630
1843
<literal>foo(?!bar)</literal>
1631
1844

1632
-
matches any occurrence of "foo" that is not followed by
1845
+
matches any occurrence of "foo" that is not followed by
1633
1846
"bar". Note that the apparently similar pattern
1634
1847

1635
1848
<literal>(?!foo)bar</literal>
1636
1849

1637
-
does not find an occurrence of "bar" that is preceded by
1850
+
does not find an occurrence of "bar" that is preceded by
1638
1851
something other than "foo"; it finds any occurrence of "bar"
1639
-
whatsoever, because the assertion (?!foo) is always &true;
1640
-
when the next three characters are "bar". A lookbehind
1852
+
whatsoever, because the assertion (?!foo) is always &true;
1853
+
when the next three characters are "bar". A lookbehind
1641
1854
assertion is needed to achieve this effect.
1642
1855
</para>
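<para>
 The difference between the two lookahead patterns can be checked with a few
 calls to <function>preg_match</function>. This is a minimal sketch; the
 subject strings are hypothetical.
</para>

```php
<?php
// foo(?!bar): "foo" only when "bar" does not follow it.
$fooBeforeBar = preg_match('/foo(?!bar)/', 'foobar'); // "foo" is followed by "bar"
$fooBeforeBaz = preg_match('/foo(?!bar)/', 'foobaz');

// (?!foo)bar: the assertion looks ahead from the point where "bar" starts,
// so it says nothing about the text that precedes "bar".
$barAfterFoo = preg_match('/(?!foo)bar/', 'foobar');

// \w+(?=;) matches a word followed by a semicolon without consuming the semicolon.
preg_match('/\w+(?=;)/', 'alpha; beta', $m);

var_dump($fooBeforeBar, $fooBeforeBaz, $barAfterFoo, $m[0]);
```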
1643
1856
<para>
1644
-
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1857
+
<emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions
1645
1858
and (?&lt;! for negative assertions. For example,
1646
1859

1647
1860
<literal>(?&lt;!foo)bar</literal>
1648
1861

1649
-
does find an occurrence of "bar" that is not preceded by
1862
+
does find an occurrence of "bar" that is not preceded by
1650
1863
"foo". The contents of a lookbehind assertion are restricted
1651
-
such that all the strings it matches must have a fixed
1652
-
length. However, if there are several alternatives, they do
1864
+
such that all the strings it matches must have a fixed
1865
+
length. However, if there are several alternatives, they do
1653
1866
not all have to have the same fixed length. Thus
1654
1867

1655
1868
<literal>(?&lt;=bullock|donkey)</literal>
...
...
@@ -1658,51 +1871,51 @@
1658
1871

1659
1872
<literal>(?&lt;!dogs?|cats?)</literal>
1660
1873

1661
-
causes an error at compile time. Branches that match different
1874
+
causes an error at compile time. Branches that match different
1662
1875
length strings are permitted only at the top level of
1663
-
a lookbehind assertion. This is an extension compared with
1664
-
Perl 5.005, which requires all branches to match the same
1876
+
a lookbehind assertion. This is an extension compared with
1877
+
Perl 5.005, which requires all branches to match the same
1665
1878
length of string. An assertion such as
1666
1879

1667
1880
<literal>(?&lt;=ab(c|de))</literal>
1668
1881

1669
-
is not permitted, because its single top-level branch can
1882
+
is not permitted, because its single top-level branch can
1670
1883
match two different lengths, but it is acceptable if rewritten
1671
1884
to use two top-level branches:
1672
1885

1673
1886
<literal>(?&lt;=abc|abde)</literal>
1674
1887

1675
-
The implementation of lookbehind assertions is, for each
1676
-
alternative, to temporarily move the current position back
1677
-
by the fixed width and then try to match. If there are
1678
-
insufficient characters before the current position, the
1679
-
match is deemed to fail. Lookbehinds in conjunction with
1680
-
once-only subpatterns can be particularly useful for matching
1681
-
at the ends of strings; an example is given at the end
1888
+
The implementation of lookbehind assertions is, for each
1889
+
alternative, to temporarily move the current position back
1890
+
by the fixed width and then try to match. If there are
1891
+
insufficient characters before the current position, the
1892
+
match is deemed to fail. Lookbehinds in conjunction with
1893
+
once-only subpatterns can be particularly useful for matching
1894
+
at the ends of strings; an example is given at the end
1682
1895
of the section on once-only subpatterns.
1683
1896
</para>
1684
1897
<para>
1685
-
Several assertions (of any sort) may occur in succession.
1898
+
Several assertions (of any sort) may occur in succession.
1686
1899
For example,
1687
1900

1688
1901
<literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>
1689
1902

1690
-
matches "foo" preceded by three digits that are not "999".
1691
-
Notice that each of the assertions is applied independently
1692
-
at the same point in the subject string. First there is a
1693
-
check that the previous three characters are all digits,
1903
+
matches "foo" preceded by three digits that are not "999".
1904
+
Notice that each of the assertions is applied independently
1905
+
at the same point in the subject string. First there is a
1906
+
check that the previous three characters are all digits,
1694
1907
then there is a check that the same three characters are not
1695
-
"999". This pattern does not match "foo" preceded by six
1908
+
"999". This pattern does not match "foo" preceded by six
1696
1909
characters, the first three of which are digits and the last three
1697
-
of which are not "999". For example, it doesn't match
1910
+
of which are not "999". For example, it doesn't match
1698
1911
"123abcfoo". A pattern to do that is
1699
1912

1700
1913
<literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>
1701
1914
</para>
1702
1915
<para>
1703
-
This time the first assertion looks at the preceding six
1704
-
characters, checking that the first three are digits, and
1705
-
then the second assertion checks that the preceding three
1916
+
This time the first assertion looks at the preceding six
1917
+
characters, checking that the first three are digits, and
1918
+
then the second assertion checks that the preceding three
1706
1919
characters are not "999".
1707
1920
</para>
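<para>
 The two stacked lookbehind patterns can be compared side by side. The sketch
 below uses the subject quoted in the text plus a few hypothetical ones.
</para>

```php
<?php
$threeDigits = '/(?<=\d{3})(?<!999)foo/';
$sixChars    = '/(?<=\d{3}...)(?<!999)foo/';

$results = [
    // Both assertions inspect the same three characters before "foo".
    'three:123foo'    => preg_match($threeDigits, '123foo'),
    'three:999foo'    => preg_match($threeDigits, '999foo'),
    'three:123abcfoo' => preg_match($threeDigits, '123abcfoo'),
    // The first assertion now inspects six characters, the second still three.
    'six:123abcfoo'   => preg_match($sixChars, '123abcfoo'),
    'six:123999foo'   => preg_match($sixChars, '123999foo'),
    // Fewer than six characters precede "foo", so the lookbehind fails.
    'six:123foo'      => preg_match($sixChars, '123foo'),
];

var_dump($results);
```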
1708
1921
<para>
...
...
@@ -1710,26 +1923,26 @@
1710
1923

1711
1924
<literal>(?&lt;=(?&lt;!foo)bar)baz</literal>
1712
1925

1713
-
matches an occurrence of "baz" that is preceded by "bar"
1926
+
matches an occurrence of "baz" that is preceded by "bar"
1714
1927
which in turn is not preceded by "foo", while
1715
1928

1716
1929
<literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>
1717
1930

1718
-
is another pattern which matches "foo" preceded by three
1931
+
is another pattern which matches "foo" preceded by three
1719
1932
digits and any three characters that are not "999".
1720
1933
</para>
1721
1934
<para>
1722
1935
Assertion subpatterns are not capturing subpatterns, and may
1723
-
not be repeated, because it makes no sense to assert the
1724
-
same thing several times. If any kind of assertion contains
1725
-
capturing subpatterns within it, these are counted for the
1936
+
not be repeated, because it makes no sense to assert the
1937
+
same thing several times. If any kind of assertion contains
1938
+
capturing subpatterns within it, these are counted for the
1726
1939
purposes of numbering the capturing subpatterns in the whole
1727
-
pattern. However, substring capturing is carried out only
1728
-
for positive assertions, because it does not make sense for
1940
+
pattern. However, substring capturing is carried out only
1941
+
for positive assertions, because it does not make sense for
1729
1942
negative assertions.
1730
1943
</para>
1731
1944
<para>
1732
-
Assertions count towards the maximum of 200 parenthesized
1945
+
Assertions count towards the maximum of 200 parenthesized
1733
1946
subpatterns.
1734
1947
</para>
1735
1948
</section>
...
...
@@ -1737,17 +1950,17 @@
1737
1950
<section xml:id="regexp.reference.onlyonce">
1738
1951
<title>Once-only subpatterns</title>
1739
1952
<para>
1740
-
With both maximizing and minimizing repetition, failure of
1741
-
what follows normally causes the repeated item to be
1953
+
With both maximizing and minimizing repetition, failure of
1954
+
what follows normally causes the repeated item to be
1742
1955
re-evaluated to see if a different number of repeats allows the
1743
-
rest of the pattern to match. Sometimes it is useful to
1744
-
prevent this, either to change the nature of the match, or
1745
-
to cause it fail earlier than it otherwise might, when the
1746
-
author of the pattern knows there is no point in carrying
1956
+
rest of the pattern to match. Sometimes it is useful to
1957
+
prevent this, either to change the nature of the match, or
1958
+
to cause it to fail earlier than it otherwise might, when the
1959
+
author of the pattern knows there is no point in carrying
1747
1960
on.
1748
1961
</para>
1749
1962
<para>
1750
-
Consider, for example, the pattern \d+foo when applied to
1963
+
Consider, for example, the pattern \d+foo when applied to
1751
1964
the subject line
1752
1965

1753
1966
<literal>123456bar</literal>
...
...
@@ -1755,108 +1968,108 @@
1755
1968
<para>
1756
1969
After matching all 6 digits and then failing to match "foo",
1757
1970
the normal action of the matcher is to try again with only 5
1758
-
digits matching the \d+ item, and then with 4, and so on,
1971
+
digits matching the \d+ item, and then with 4, and so on,
1759
1972
before ultimately failing. Once-only subpatterns provide the
1760
-
means for specifying that once a portion of the pattern has
1761
-
matched, it is not to be re-evaluated in this way, so the
1762
-
matcher would give up immediately on failing to match "foo"
1763
-
the first time. The notation is another kind of special
1973
+
means for specifying that once a portion of the pattern has
1974
+
matched, it is not to be re-evaluated in this way, so the
1975
+
matcher would give up immediately on failing to match "foo"
1976
+
the first time. The notation is another kind of special
1764
1977
parenthesis, starting with (?&gt; as in this example:
1765
1978

1766
1979
<literal>(?&gt;\d+)foo</literal>
1767
1980
</para>
1768
1981
<para>
1769
-
This kind of parenthesis "locks up" the part of the pattern
1770
-
it contains once it has matched, and a failure further into
1771
-
the pattern is prevented from backtracking into it.
1772
-
Backtracking past it to previous items, however, works as normal.
1982
+
This kind of parenthesis "locks up" the part of the pattern
1983
+
it contains once it has matched, and a failure further into
1984
+
the pattern is prevented from backtracking into it.
1985
+
Backtracking past it to previous items, however, works as normal.
1773
1986
</para>
1774
1987
<para>
1775
1988
An alternative description is that a subpattern of this type
1776
-
matches the string of characters that an identical standalone
1989
+
matches the string of characters that an identical standalone
1777
1990
pattern would match, if anchored at the current point
1778
1991
in the subject string.
1779
1992
</para>
1780
1993
<para>
1781
-
Once-only subpatterns are not capturing subpatterns. Simple
1782
-
cases such as the above example can be thought of as a maximizing
1783
-
repeat that must swallow everything it can. So,
1994
+
Once-only subpatterns are not capturing subpatterns. Simple
1995
+
cases such as the above example can be thought of as a maximizing
1996
+
repeat that must swallow everything it can. So,
1784
1997
while both \d+ and \d+? are prepared to adjust the number of
1785
-
digits they match in order to make the rest of the pattern
1998
+
digits they match in order to make the rest of the pattern
1786
1999
match, (?&gt;\d+) can only match an entire sequence of digits.
1787
2000
</para>
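<para>
 The effect of locking up a repeat can be seen with a subject in which the
 ordinary repeat has to give back a character. This is a sketch with a
 hypothetical subject and patterns, not an example taken from the text.
</para>

```php
<?php
// Hypothetical subject: the only "5" is the last digit.
$subject = '12345';

// \d+ first takes all five digits, then gives back the final "5".
$greedy = preg_match('/\d+5/', $subject, $m);

// (?>\d+) also takes all five digits, but nothing can be given back,
// so the literal 5 that follows has nothing left to match.
$atomic = preg_match('/(?>\d+)5/', $subject);

var_dump($greedy, $m[0], $atomic);
```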
1788
2001
<para>
1789
-
This construction can of course contain arbitrarily complicated
2002
+
This construction can of course contain arbitrarily complicated
1790
2003
subpatterns, and it can be nested.
1791
2004
</para>
1792
2005
<para>
1793
2006
Once-only subpatterns can be used in conjunction with
1794
-
lookbehind assertions to specify efficient matching at the end
2007
+
lookbehind assertions to specify efficient matching at the end
1795
2008
of the subject string. Consider a simple pattern such as
1796
2009

1797
2010
<literal>abcd$</literal>
1798
2011

1799
-
when applied to a long string which does not match. Because
1800
-
matching proceeds from left to right, PCRE will look for
2012
+
when applied to a long string which does not match. Because
2013
+
matching proceeds from left to right, PCRE will look for
1801
2014
each "a" in the subject and then see if what follows matches
1802
2015
the rest of the pattern. If the pattern is specified as
1803
2016

1804
2017
<literal>^.*abcd$</literal>
1805
2018

1806
-
then the initial .* matches the entire string at first, but
1807
-
when this fails (because there is no following "a"), it
2019
+
then the initial .* matches the entire string at first, but
2020
+
when this fails (because there is no following "a"), it
1808
2021
backtracks to match all but the last character, then all but
1809
-
the last two characters, and so on. Once again the search
1810
-
for "a" covers the entire string, from right to left, so we
2022
+
the last two characters, and so on. Once again the search
2023
+
for "a" covers the entire string, from right to left, so we
1811
2024
are no better off. However, if the pattern is written as
1812
2025

1813
2026
<literal>^(?>.*)(?&lt;=abcd)</literal>
1814
2027

1815
-
then there can be no backtracking for the .* item; it can
1816
-
match only the entire string. The subsequent lookbehind
2028
+
then there can be no backtracking for the .* item; it can
2029
+
match only the entire string. The subsequent lookbehind
1817
2030
assertion does a single test on the last four characters. If
1818
-
it fails, the match fails immediately. For long strings,
2031
+
it fails, the match fails immediately. For long strings,
1819
2032
this approach makes a significant difference to the processing time.
1820
2033
</para>
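<para>
 A minimal sketch of the end-of-string technique follows, using short
 hypothetical subjects. It shows only which subjects match; the difference in
 processing time is noticeable only on long strings.
</para>

```php
<?php
$pattern = '/^(?>.*)(?<=abcd)/';

// .* swallows the whole subject, then the lookbehind checks the last four characters.
$endsWithAbcd = preg_match($pattern, 'xyzabcd');

// "abcd" occurs, but not at the end, so the single lookbehind test fails.
$abcdElsewhere = preg_match($pattern, 'abcdxyz');

var_dump($endsWithAbcd, $abcdElsewhere);
```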
1821
2034
<para>
1822
2035
When a pattern contains an unlimited repeat inside a subpattern
1823
2036
that can itself be repeated an unlimited number of
1824
-
times, the use of a once-only subpattern is the only way to
1825
-
avoid some failing matches taking a very long time indeed.
2037
+
times, the use of a once-only subpattern is the only way to
2038
+
avoid some failing matches taking a very long time indeed.
1826
2039
The pattern
1827
2040

1828
2041
<literal>(\D+|&lt;\d+>)*[!?]</literal>
1829
2042

1830
-
matches an unlimited number of substrings that either consist
1831
-
of non-digits, or digits enclosed in &lt;>, followed by
2043
+
matches an unlimited number of substrings that either consist
2044
+
of non-digits, or digits enclosed in &lt;>, followed by
1832
2045
either ! or ?. When it matches, it runs quickly. However, if
1833
2046
it is applied to
1834
2047

1835
2048
<literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>
1836
2049

1837
-
it takes a long time before reporting failure. This is
2050
+
it takes a long time before reporting failure. This is
1838
2051
because the string can be divided between the two repeats in
1839
2052
a large number of ways, and all have to be tried. (The example
1840
-
used [!?] rather than a single character at the end,
1841
-
because both PCRE and Perl have an optimization that allows
1842
-
for fast failure when a single character is used. They
1843
-
remember the last single character that is required for a
1844
-
match, and fail early if it is not present in the string.)
2053
+
used [!?] rather than a single character at the end,
2054
+
because both PCRE and Perl have an optimization that allows
2055
+
for fast failure when a single character is used. They
2056
+
remember the last single character that is required for a
2057
+
match, and fail early if it is not present in the string.)
1845
2058
If the pattern is changed to
1846
2059

1847
2060
<literal>((?>\D+)|&lt;\d+>)*[!?]</literal>
1848
2061

1849
-
sequences of non-digits cannot be broken, and failure happens quickly.
2062
+
sequences of non-digits cannot be broken, and failure happens quickly.
1850
2063
</para>
1851
2064
</section>
1852
2065

1853
2066
<section xml:id="regexp.reference.conditional">
1854
2067
<title>Conditional subpatterns</title>
1855
2068
<para>
1856
-
It is possible to cause the matching process to obey a subpattern
1857
-
conditionally or to choose between two alternative
1858
-
subpatterns, depending on the result of an assertion, or
1859
-
whether a previous capturing subpattern matched or not. The
2069
+
It is possible to cause the matching process to obey a subpattern
2070
+
conditionally or to choose between two alternative
2071
+
subpatterns, depending on the result of an assertion, or
2072
+
whether a previous capturing subpattern matched or not. The
1860
2073
two possible forms of conditional subpattern are
1861
2074
</para>
1862
2075

...
...
@@ -1870,34 +2083,39 @@
1870
2083
</informalexample>
1871
2084
<para>
1872
2085
If the condition is satisfied, the yes-pattern is used; otherwise
1873
-
the no-pattern (if present) is used. If there are
2086
+
the no-pattern (if present) is used. If there are
1874
2087
more than two alternatives in the subpattern, a compile-time
1875
2088
error occurs.
1876
2089
</para>
1877
2090
<para>
1878
-
There are two kinds of condition. If the text between the
1879
-
parentheses consists of a sequence of digits, then the
1880
-
condition is satisfied if the capturing subpattern of that
1881
-
number has previously matched. Consider the following pattern,
1882
-
which contains non-significant white space to make it
1883
-
more readable (assume the <link
2091
+
There are two kinds of condition. If the text between the
2092
+
parentheses consists of a sequence of digits, then the
2093
+
condition is satisfied if the capturing subpattern of that
2094
+
number has previously matched. Consider the following pattern,
2095
+
which contains non-significant white space to make it
2096
+
more readable (assume the <link
1884
2097
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1885
-
option) and to divide it into three parts for ease of discussion:
1886
-
1887
-
<literal>( \( )? [^()]+ (?(1) \) )</literal>
1888
-
</para>
1889
-
<para>
1890
-
The first part matches an optional opening parenthesis, and
1891
-
if that character is present, sets it as the first captured
1892
-
substring. The second part matches one or more characters
1893
-
that are not parentheses. The third part is a conditional
1894
-
subpattern that tests whether the first set of parentheses
1895
-
matched or not. If they did, that is, if subject started
1896
-
with an opening parenthesis, the condition is &true;, and so
1897
-
the yes-pattern is executed and a closing parenthesis is
1898
-
required. Otherwise, since no-pattern is not present, the
1899
-
subpattern matches nothing. In other words, this pattern
1900
-
matches a sequence of non-parentheses, optionally enclosed
2098
+
option) and to divide it into three parts for ease of discussion:
2099
+
</para>
2100
+
<informalexample>
2101
+
<programlisting>
2102
+
<![CDATA[
2103
+
( \( )? [^()]+ (?(1) \) )
2104
+
]]>
2105
+
</programlisting>
2106
+
</informalexample>
2107
+
<para>
2108
+
The first part matches an optional opening parenthesis, and
2109
+
if that character is present, sets it as the first captured
2110
+
substring. The second part matches one or more characters
2111
+
that are not parentheses. The third part is a conditional
2112
+
subpattern that tests whether the first set of parentheses
2113
+
matched or not. If they did, that is, if the subject started
2114
+
with an opening parenthesis, the condition is &true;, and so
2115
+
the yes-pattern is executed and a closing parenthesis is
2116
+
required. Otherwise, since no-pattern is not present, the
2117
+
subpattern matches nothing. In other words, this pattern
2118
+
matches a sequence of non-parentheses, optionally enclosed
1901
2119
in parentheses.
1902
2120
</para>
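<para>
 The conditional pattern above can be exercised as follows. In this sketch the
 <literal>x</literal> modifier stands for PCRE_EXTENDED, and the
 <literal>^</literal> and <literal>$</literal> anchors are an addition made
 here so that the whole subject is tested; the subjects are hypothetical.
</para>

```php
<?php
// The x modifier makes the white space in the pattern insignificant.
// ^ and $ are added for this sketch so the whole subject must match.
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

$results = [];
foreach (['abc', '(abc)', '(abc', 'abc)'] as $subject) {
    $results[$subject] = preg_match($pattern, $subject);
}

var_dump($results);
```

<para>
 Only the subjects whose parentheses are either both present or both absent
 are accepted.
</para>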
1903
2121
<para>
...
...
@@ -1906,10 +2124,10 @@
1906
2124
level", the condition is false.
1907
2125
</para>
1908
2126
<para>
1909
-
If the condition is not a sequence of digits or (R), it must be an
1910
-
assertion. This may be a positive or negative lookahead or
1911
-
lookbehind assertion. Consider this pattern, again containing
1912
-
non-significant white space, and with the two alternatives on
2127
+
If the condition is not a sequence of digits or (R), it must be an
2128
+
assertion. This may be a positive or negative lookahead or
2129
+
lookbehind assertion. Consider this pattern, again containing
2130
+
non-significant white space, and with the two alternatives on
1913
2131
the second line:
1914
2132
</para>
1915
2133

...
...
@@ -1917,18 +2135,18 @@
1917
2135
<programlisting>
1918
2136
<![CDATA[
1919
2137
(?(?=[^a-z]*[a-z])
1920
-
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2138
+
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1921
2139
]]>
1922
2140
</programlisting>
1923
2141
</informalexample>
1924
2142
<para>
1925
2143
The condition is a positive lookahead assertion that matches
1926
2144
an optional sequence of non-letters followed by a letter. In
1927
-
other words, it tests for the presence of at least one
1928
-
letter in the subject. If a letter is found, the subject is
1929
-
matched against the first alternative; otherwise it is
1930
-
matched against the second. This pattern matches strings in
1931
-
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2145
+
other words, it tests for the presence of at least one
2146
+
letter in the subject. If a letter is found, the subject is
2147
+
matched against the first alternative; otherwise it is
2148
+
matched against the second. This pattern matches strings in
2149
+
one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1932
2150
letters and dd are digits.
1933
2151
</para>
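<para>
 A sketch of the assertion-based condition follows, again with the
 <literal>x</literal> modifier and with <literal>^</literal> and
 <literal>$</literal> added so that the whole subject is tested. The subject
 strings are hypothetical.
</para>

```php
<?php
// The condition is checked once; only the chosen alternative is then tried.
$pattern = '/^(?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

$results = [];
foreach (['12-abc-34', '12-34-56', '12-ab-34', 'ab-12-34'] as $subject) {
    $results[$subject] = preg_match($pattern, $subject);
}

var_dump($results);
```

<para>
 The last two subjects contain a letter, so only the first alternative is
 tried for them, and it fails.
</para>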
1934
2152
</section>
...
...
@@ -1936,31 +2154,66 @@
1936
2154
<section xml:id="regexp.reference.comments">
1937
2155
<title>Comments</title>
1938
2156
<para>
1939
-
The sequence (?# marks the start of a comment which
1940
-
continues up to the next closing parenthesis. Nested
2157
+
The sequence (?# marks the start of a comment which
2158
+
continues up to the next closing parenthesis. Nested
1941
2159
parentheses are not permitted. The characters that make up a
1942
2160
comment play no part in the pattern matching at all.
1943
2161
</para>
1944
2162
<para>
1945
2163
If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1946
-
option is set, an unescaped # character outside a character class
2164
+
option is set, an unescaped # character outside a character class
1947
2165
introduces a comment that continues up to the next newline character
1948
2166
in the pattern.
1949
2167
</para>
2168
+
<para>
2169
+
<example>
2170
+
<title>Usage of comments in PCRE pattern</title>
2171
+
<programlisting role="php">
2172
+
<![CDATA[
2173
+
<?php
2174
+

2175
+
$subject = 'test';
2176
+

2177
+
/* (?# can be used to add comments without enabling PCRE_EXTENDED */
2178
+
$match = preg_match('/te(?# this is a comment)st/', $subject);
2179
+
var_dump($match);
2180
+

2181
+
/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */
2182
+
$match = preg_match('/te #~~~~
2183
+
st/', $subject);
2184
+
var_dump($match);
2185
+

2186
+
/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything
2187
+
that follows an unescaped # on the same line is ignored */
2188
+
$match = preg_match('/te #~~~~
2189
+
st/x', $subject);
2190
+
var_dump($match);
2191
+
]]>
2192
+
</programlisting>
2193
+
&example.outputs;
2194
+
<screen>
2195
+
<![CDATA[
2196
+
int(1)
2197
+
int(0)
2198
+
int(1)
2199
+
]]>
2200
+
</screen>
2201
+
</example>
2202
+
</para>
1950
2203
</section>
1951
2204

1952
2205
<section xml:id="regexp.reference.recursive">
1953
2206
<title>Recursive patterns</title>
1954
2207
<para>
1955
-
Consider the problem of matching a string in parentheses,
1956
-
allowing for unlimited nested parentheses. Without the use
1957
-
of recursion, the best that can be done is to use a pattern
1958
-
that matches up to some fixed depth of nesting. It is not
1959
-
possible to handle an arbitrary nesting depth. Perl 5.6 has
1960
-
provided an experimental facility that allows regular
1961
-
expressions to recurse (among other things). The special
1962
-
item (?R) is provided for the specific case of recursion.
1963
-
This PCRE pattern solves the parentheses problem (assume
2208
+
Consider the problem of matching a string in parentheses,
2209
+
allowing for unlimited nested parentheses. Without the use
2210
+
of recursion, the best that can be done is to use a pattern
2211
+
that matches up to some fixed depth of nesting. It is not
2212
+
possible to handle an arbitrary nesting depth. Perl 5.6 has
2213
+
provided an experimental facility that allows regular
2214
+
expressions to recurse (among other things). The special
2215
+
item (?R) is provided for the specific case of recursion.
2216
+
This PCRE pattern solves the parentheses problem (assume
1964
2217
the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1965
2218
option is set so that white space is
1966
2219
ignored):
...
...
@@ -1969,45 +2222,45 @@
1969
2222
</para>
1970
2223
<para>
1971
2224
First it matches an opening parenthesis. Then it matches any
1972
-
number of substrings which can either be a sequence of
1973
-
non-parentheses, or a recursive match of the pattern itself
2225
+
number of substrings which can either be a sequence of
2226
+
non-parentheses, or a recursive match of the pattern itself
1974
2227
(i.e. a correctly parenthesized substring). Finally there is
1975
2228
a closing parenthesis.
1976
2229
</para>
1977
2230
<para>
1978
-
This particular example pattern contains nested unlimited
2231
+
This particular example pattern contains nested unlimited
1979
2232
repeats, and so the use of a once-only subpattern for matching
1980
-
strings of non-parentheses is important when applying
1981
-
the pattern to strings that do not match. For example, when
2233
+
strings of non-parentheses is important when applying
2234
+
the pattern to strings that do not match. For example, when
1982
2235
it is applied to
1983
2236

1984
2237
<literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>
1985
2238

1986
-
it yields "no match" quickly. However, if a once-only subpattern
1987
-
is not used, the match runs for a very long time
1988
-
indeed because there are so many different ways the + and *
1989
-
repeats can carve up the subject, and all have to be tested
2239
+
it yields "no match" quickly. However, if a once-only subpattern
2240
+
is not used, the match runs for a very long time
2241
+
indeed because there are so many different ways the + and *
2242
+
repeats can carve up the subject, and all have to be tested
1990
2243
before failure can be reported.
1991
2244
</para>
1992
2245
<para>
1993
-
The values set for any capturing subpatterns are those from
2246
+
The values set for any capturing subpatterns are those from
1994
2247
the outermost level of the recursion at which the subpattern
1995
2248
value is set. If the pattern above is matched against
1996
2249

1997
2250
<literal>(ab(cd)ef)</literal>
1998
2251

1999
-
the value for the capturing parentheses is "ef", which is
2000
-
the last value taken on at the top level. If additional
2252
+
the value for the capturing parentheses is "ef", which is
2253
+
the last value taken on at the top level. If additional
2001
2254
parentheses are added, giving
2002
2255

2003
2256
<literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>
2004
2257
then the string they capture
2005
2258
is "ab(cd)ef", the contents of the top level parentheses. If
2006
-
there are more than 15 capturing parentheses in a pattern,
2007
-
PCRE has to obtain extra memory to store data during a
2008
-
recursion, which it does by using pcre_malloc, freeing it
2009
-
via pcre_free afterwards. If no memory can be obtained, it
2010
-
saves data for the first 15 capturing parentheses only, as
2259
+
there are more than 15 capturing parentheses in a pattern,
2260
+
PCRE has to obtain extra memory to store data during a
2261
+
recursion, which it does by using pcre_malloc, freeing it
2262
+
via pcre_free afterwards. If no memory can be obtained, it
2263
+
saves data for the first 15 capturing parentheses only, as
2011
2264
there is no way to give an out-of-memory error from within a
2012
2265
recursion.
2013
2266
</para>
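  <para>
   The capture behaviour described above can be checked with a minimal
   sketch such as the following. It assumes the pattern is passed to
   <function>preg_match</function> with the <literal>x</literal> modifier
   (PCRE_EXTENDED); the variable names are illustrative only.
   <example>
    <title>Captured values in a recursive pattern (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php
// The x modifier (PCRE_EXTENDED) makes PCRE ignore the white space
// inside the pattern.
$pattern = '/\( ( (?>[^()]+) | (?R) )* \)/x';

preg_match($pattern, '(ab(cd)ef)', $matches);
var_dump($matches[0]); // string(10) "(ab(cd)ef)", the whole match
var_dump($matches[1]); // string(2) "ef", the last value set at the top level

// The same pattern with an additional pair of capturing parentheses.
$wrapped = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';

preg_match($wrapped, '(ab(cd)ef)', $wrappedMatches);
var_dump($wrappedMatches[1]); // string(8) "ab(cd)ef", the top level contents
?>
]]>
    </programlisting>
   </example>
  </para>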
...
...
@@ -2016,7 +2269,7 @@
2016
2269
<literal>(?1)</literal>, <literal>(?2)</literal> and so on
2017
2270
can be used for recursive subpatterns too. It is also possible to use named
2018
2271
subpatterns: <literal>(?P&gt;name)</literal> or
2019
-
<literal>(?P&amp;name)</literal>.
2272
+
<literal>(?&amp;name)</literal>.
2020
2273
</para>
2021
2274
<para>
2022
2275
If the syntax for a recursive subpattern reference (either by number or
...
...
@@ -2046,75 +2299,75 @@
2046
2299
<title>Performance</title>
2047
2300
<para>
2048
2301
Certain items that may appear in patterns are more efficient
2049
-
than others. It is more efficient to use a character class
2302
+
than others. It is more efficient to use a character class
2050
2303
like [aeiou] than a set of alternatives such as (a|e|i|o|u).
2051
-
In general, the simplest construction that provides the
2052
-
required behaviour is usually the most efficient. Jeffrey
2053
-
Friedl's book contains a lot of discussion about optimizing
2304
+
In general, the simplest construction that provides the
2305
+
required behaviour is usually the most efficient. Jeffrey
2306
+
Friedl's book contains a lot of discussion about optimizing
2054
2307
regular expressions for efficient performance.
2055
2308
</para>
2056
2309
<para>
2057
2310
When a pattern begins with .* and the <link
2058
-
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2059
-
set, the pattern is implicitly anchored by PCRE, since it
2311
+
linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is
2312
+
set, the pattern is implicitly anchored by PCRE, since it
2060
2313
can match only at the start of a subject string. However, if
2061
2314
<link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
2062
2315
is not set, PCRE cannot make this optimization,
2063
-
because the . metacharacter does not then match a newline,
2316
+
because the . metacharacter does not then match a newline,
2064
2317
and if the subject string contains newlines, the pattern may
2065
-
match from the character immediately following one of them
2318
+
match from the character immediately following one of them
2066
2319
instead of from the very start. For example, the pattern
2067
2320

2068
2321
<literal>(.*) second</literal>
2069
2322

2070
2323
matches the subject "first\nand second" (where \n stands for
2071
2324
a newline character) with the first captured substring being
2072
-
"and". In order to do this, PCRE has to retry the match
2325
+
"and". In order to do this, PCRE has to retry the match
2073
2326
starting after every newline in the subject.
2074
2327
</para>
2075
2328
<para>
2076
2329
If you are using such a pattern with subject strings that do
2077
-
not contain newlines, the best performance is obtained by
2330
+
not contain newlines, the best performance is obtained by
2078
2331
setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,
2079
-
or starting the pattern with ^.* to
2080
-
indicate explicit anchoring. That saves PCRE from having to
2332
+
or starting the pattern with ^.* to
2333
+
indicate explicit anchoring. That saves PCRE from having to
2081
2334
scan along the subject looking for a newline to restart at.
2082
2335
</para>
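  <para>
   As a small sketch of the difference described above, the same pattern
   can be run with and without the <literal>s</literal> modifier
   (PCRE_DOTALL). The subject is the one used in the text; the variable
   names are illustrative only.
   <example>
    <title>Effect of PCRE_DOTALL on a leading .* (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php
$subject = "first\nand second";

// Without the s modifier, "." does not match a newline, so the match
// can only start after the newline.
preg_match('/(.*) second/', $subject, $matches);
var_dump($matches[1]); // string(3) "and"

// With the s modifier, "." also matches the newline, so the match
// starts at the very beginning of the subject.
preg_match('/(.*) second/s', $subject, $dotallMatches);
var_dump($dotallMatches[1] === "first\nand"); // bool(true)
?>
]]>
    </programlisting>
   </example>
  </para>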
2083
2336
<para>
2084
-
Beware of patterns that contain nested indefinite repeats.
2085
-
These can take a long time to run when applied to a string
2337
+
Beware of patterns that contain nested indefinite repeats.
2338
+
These can take a long time to run when applied to a string
2086
2339
that does not match. Consider the pattern fragment
2087
2340

2088
2341
<literal>(a+)*</literal>
2089
2342
</para>
2090
2343
<para>
2091
-
This can match "aaaa" in 33 different ways, and this number
2092
-
increases very rapidly as the string gets longer. (The *
2093
-
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2094
-
those cases other than 0, the + repeats can match different
2344
+
This can match "aaaa" in 33 different ways, and this number
2345
+
increases very rapidly as the string gets longer. (The *
2346
+
repeat can match 0, 1, 2, 3, or 4 times, and for each of
2347
+
those cases other than 0, the + repeats can match different
2095
2348
numbers of times.) When the remainder of the pattern is such
2096
-
that the entire match is going to fail, PCRE has in principle
2097
-
to try every possible variation, and this can take an
2349
+
that the entire match is going to fail, PCRE has in principle
2350
+
to try every possible variation, and this can take an
2098
2351
extremely long time.
2099
2352
</para>
2100
2353
<para>
2101
-
An optimization catches some of the more simple cases such
2354
+
An optimization catches some of the simpler cases such
2102
2355
as
2103
2356

2104
2357
<literal>(a+)*b</literal>
2105
2358

2106
-
where a literal character follows. Before embarking on the
2359
+
where a literal character follows. Before embarking on the
2107
2360
standard matching procedure, PCRE checks that there is a "b"
2108
-
later in the subject string, and if there is not, it fails
2109
-
the match immediately. However, when there is no following
2110
-
literal this optimization cannot be used. You can see the
2361
+
later in the subject string, and if there is not, it fails
2362
+
the match immediately. However, when there is no following
2363
+
literal this optimization cannot be used. You can see the
2111
2364
difference by comparing the behaviour of
2112
2365

2113
2366
<literal>(a+)*\d</literal>
2114
2367

2115
-
with the pattern above. The former gives a failure almost
2116
-
instantly when applied to a whole line of "a" characters,
2117
-
whereas the latter takes an appreciable time with strings
2368
+
with that of the pattern above. When applied to a whole line of "a"
2369
+
characters, <literal>(a+)*b</literal> fails almost instantly,
2370
+
whereas <literal>(a+)*\d</literal> takes an appreciable time with strings
2118
2371
longer than about 20 characters.
2119
2372
</para>
2120
2373
</section>
2121
2374