reference/pcre/pattern.syntax.xml
587830d5d261802148a160a59059dd8d76385fd2
...
...
@@ -1,7 +1,7 @@
1
1
<?xml version="1.0" encoding="utf-8"?>
2
2
<!-- $Revision$ -->
3
3
<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->
4
-
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook">
4
+
<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">
5
5
<title>Pattern Syntax</title>
6
6
<titleabbrev>PCRE regex syntax</titleabbrev>
7
7

...
...
@@ -32,6 +32,7 @@
32
32
When using the PCRE functions, it is required that the pattern is enclosed
33
33
by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,
34
34
non-backslash, non-whitespace character.
35
+
Leading whitespace before a valid delimiter is silently ignored.
35
36
</para>
36
37
<para>
37
38
Often used delimiters are forward slashes (<literal>/</literal>), hash
...
...
@@ -49,6 +50,26 @@
49
50
</informalexample>
50
51
</para>
51
52
<para>
53
+
It is also possible to use
54
+
bracket style delimiters where the opening and closing brackets are the
55
+
starting and ending delimiter, respectively. <literal>()</literal>,
56
+
<literal>{}</literal>, <literal>[]</literal> and <literal>&lt;&gt;</literal>
57
+
are all valid bracket style delimiter pairs.
58
+
<informalexample>
59
+
<programlisting>
60
+
<![CDATA[
61
+
(this [is] a (pattern))
62
+
{this [is] a (pattern)}
63
+
[this [is] a (pattern)]
64
+
<this [is] a (pattern)>
65
+
]]>
66
+
</programlisting>
67
+
</informalexample>
68
+
Bracket style delimiters do not need to be escaped when they are used as meta
69
+
characters within the pattern, but as with other delimiters they must be
70
+
escaped when they are used as literal characters.
71
+
</para>
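<para>
The following is a minimal sketch of bracket style delimiters in use. The
patterns and subjects are illustrative only; each call is expected to
report a match, because a bracket that is balanced inside the pattern is
treated as a meta-character rather than as the closing delimiter.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// Hypothetical sample patterns; each uses a different bracket pair as delimiter.
$checks = [
    ['{^\d{3}$}',     '123'], // braces act as a quantifier inside brace delimiters
    ['(^(foo|bar)$)', 'bar'], // parentheses act as a subpattern inside parenthesis delimiters
    ['[^[a-z]+$]',    'abc'], // square brackets act as a character class
    ['<^a+$>i',       'AAA'], // a modifier may follow the closing bracket
];

foreach ($checks as [$pattern, $subject]) {
    var_dump(preg_match($pattern, $subject));
}
]]>
</programlisting>
</informalexample>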
72
+
<para>
52
73
If the delimiter needs to be matched inside the pattern it must be
53
74
escaped using a backslash. If the delimiter appears often inside the
54
75
pattern, it is a good idea to choose another delimiter in order to increase
...
...
@@ -66,18 +87,6 @@
66
87
to specify the delimiter to be escaped.
67
88
</para>
68
89
<para>
69
-
In addition to the aforementioned delimiters, it is also possible to use
70
-
bracket style delimiters where the opening and closing brackets are the
71
-
starting and ending delimiter, respectively.
72
-
<informalexample>
73
-
<programlisting>
74
-
<![CDATA[
75
-
{this is a pattern}
76
-
]]>
77
-
</programlisting>
78
-
</informalexample>
79
-
</para>
80
-
<para>
81
90
You may add <link linkend="reference.pcre.pattern.modifiers">pattern
82
91
modifiers</link> after the ending delimiter. The following is an example
83
92
of case-insensitive matching:
...
...
@@ -104,92 +113,88 @@
104
113
are recognized anywhere in the pattern except within square
105
114
brackets, and those that are recognized in square brackets.
106
115
Outside square brackets, the meta-characters are as follows:
107
-
<variablelist>
108
-
<varlistentry>
109
-
<term><emphasis>\</emphasis></term>
110
-
<listitem><simpara>general escape character with several uses</simpara></listitem>
111
-
</varlistentry>
112
-
<varlistentry>
113
-
<term><emphasis>^</emphasis></term>
114
-
<listitem><simpara>assert start of subject (or line, in multiline mode)</simpara></listitem>
115
-
</varlistentry>
116
-
<varlistentry>
117
-
<term><emphasis>$</emphasis></term>
118
-
<listitem><simpara>assert end of subject (or line, in multiline mode)</simpara></listitem>
119
-
</varlistentry>
120
-
<varlistentry>
121
-
<term><emphasis>.</emphasis></term>
122
-
<listitem><simpara>match any character except newline (by default)</simpara></listitem>
123
-
</varlistentry>
124
-
<varlistentry>
125
-
<term><emphasis>[</emphasis></term>
126
-
<listitem><simpara>start character class definition</simpara></listitem>
127
-
</varlistentry>
128
-
<varlistentry>
129
-
<term><emphasis>]</emphasis></term>
130
-
<listitem><simpara>end character class definition</simpara></listitem>
131
-
</varlistentry>
132
-
<varlistentry>
133
-
<term><emphasis>|</emphasis></term>
134
-
<listitem><simpara>start of alternative branch</simpara></listitem>
135
-
</varlistentry>
136
-
<varlistentry>
137
-
<term><emphasis>(</emphasis></term>
138
-
<listitem><simpara>start subpattern</simpara></listitem>
139
-
</varlistentry>
140
-
<varlistentry>
141
-
<term><emphasis>)</emphasis></term>
142
-
<listitem><simpara>end subpattern</simpara></listitem>
143
-
</varlistentry>
144
-
<varlistentry>
145
-
<term><emphasis>?</emphasis></term>
146
-
<listitem>
147
-
<simpara>
148
-
extends the meaning of (, also 0 or 1 quantifier, also makes greedy
149
-
quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)
150
-
</simpara>
151
-
</listitem>
152
-
</varlistentry>
153
-
<varlistentry>
154
-
<term><emphasis>*</emphasis></term>
155
-
<listitem><simpara>0 or more quantifier</simpara></listitem>
156
-
</varlistentry>
157
-
<varlistentry>
158
-
<term><emphasis>+</emphasis></term>
159
-
<listitem><simpara>1 or more quantifier</simpara></listitem>
160
-
</varlistentry>
161
-
<varlistentry>
162
-
<term><emphasis>{</emphasis></term>
163
-
<listitem><simpara>start min/max quantifier</simpara></listitem>
164
-
</varlistentry>
165
-
<varlistentry>
166
-
<term><emphasis>}</emphasis></term>
167
-
<listitem><simpara>end min/max quantifier</simpara></listitem>
168
-
</varlistentry>
169
-
</variablelist>
116
+

117
+
<table>
118
+
<title>Meta-characters outside square brackets</title>
119
+
<tgroup cols="2">
120
+
<thead>
121
+
<row>
122
+
<entry>Meta-character</entry><entry>Description</entry>
123
+
</row>
124
+
</thead>
125
+
<tbody>
126
+
<row>
127
+
<entry>\</entry><entry>general escape character with several uses</entry>
128
+
</row>
129
+
<row>
130
+
<entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
131
+
</row>
132
+
<row>
133
+
<entry>$</entry><entry>assert end of subject or before a terminating newline (or end of line, in multiline mode)</entry>
134
+
</row>
135
+
<row>
136
+
<entry>.</entry><entry>match any character except newline (by default)</entry>
137
+
</row>
138
+
<row>
139
+
<entry>[</entry><entry>start character class definition</entry>
140
+
</row>
141
+
<row>
142
+
<entry>]</entry><entry>end character class definition</entry>
143
+
</row>
144
+
<row>
145
+
<entry>|</entry><entry>start of alternative branch</entry>
146
+
</row>
147
+
<row>
148
+
<entry>(</entry><entry>start subpattern</entry>
149
+
</row>
150
+
<row>
151
+
<entry>)</entry><entry>end subpattern</entry>
152
+
</row>
153
+
<row>
154
+
<entry>?</entry><entry>extends the meaning of (, also 0 or 1 quantifier, also makes greedy
155
+
quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)</entry>
156
+
</row>
157
+
<row>
158
+
<entry>*</entry><entry>0 or more quantifier</entry>
159
+
</row>
160
+
<row>
161
+
<entry>+</entry><entry>1 or more quantifier</entry>
162
+
</row>
163
+
<row>
164
+
<entry>{</entry><entry>start min/max quantifier</entry>
165
+
</row>
166
+
<row>
167
+
<entry>}</entry><entry>end min/max quantifier</entry>
168
+
</row>
169
+
</tbody>
170
+
</tgroup>
171
+
</table>
170
172

171
173
Part of a pattern that is in square brackets is called a
172
-
"character class". In a character class the only
174
+
<link linkend="regexp.reference.character-classes">character class</link>. In a character class the only
173
175
meta-characters are:
174
176

175
-
<variablelist>
176
-
<varlistentry>
177
-
<term><emphasis>\</emphasis></term>
178
-
<listitem><simpara>general escape character</simpara></listitem>
179
-
</varlistentry>
180
-
<varlistentry>
181
-
<term><emphasis>^</emphasis></term>
182
-
<listitem><simpara>negate the class, but only if the first character</simpara></listitem>
183
-
</varlistentry>
184
-
<varlistentry>
185
-
<term><emphasis>-</emphasis></term>
186
-
<listitem><simpara>indicates character range</simpara></listitem>
187
-
</varlistentry>
188
-
<varlistentry>
189
-
<term><emphasis>]</emphasis></term>
190
-
<listitem><simpara>terminates the character class</simpara></listitem>
191
-
</varlistentry>
192
-
</variablelist>
177
+
<table>
178
+
<title>Meta-characters inside square brackets (<emphasis>character classes</emphasis>)</title>
179
+
<tgroup cols="2">
180
+
<thead>
181
+
<row>
182
+
<entry>Meta-character</entry><entry>Description</entry>
183
+
</row>
184
+
</thead>
185
+
<tbody>
186
+
<row>
187
+
<entry>\</entry><entry>general escape character</entry>
188
+
</row>
189
+
<row>
190
+
<entry>^</entry><entry>negate the class, but only if the first character</entry>
191
+
</row>
192
+
<row>
193
+
<entry>-</entry><entry>indicates character range</entry>
194
+
</row>
195
+
</tbody>
196
+
</tgroup>
197
+
</table>
193
198

194
199
The following sections describe the use of each of the
195
200
meta-characters.
...
...
@@ -297,6 +302,12 @@
297
302
</listitem>
298
303
</varlistentry>
299
304
<varlistentry>
305
+
<term><emphasis>\R</emphasis></term>
306
+
<listitem>
307
+
<simpara>line break: matches \n, \r and \r\n</simpara>
308
+
</listitem>
309
+
</varlistentry>
310
+
<varlistentry>
300
311
<term><emphasis>\t</emphasis></term>
301
312
<listitem>
302
313
<simpara>tab (hex 09)</simpara>
...
...
@@ -450,11 +461,11 @@
450
461
</varlistentry>
451
462
<varlistentry>
452
463
<term><emphasis>\h</emphasis></term>
453
-
<listitem><simpara>any horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
464
+
<listitem><simpara>any horizontal whitespace character</simpara></listitem>
454
465
</varlistentry>
455
466
<varlistentry>
456
467
<term><emphasis>\H</emphasis></term>
457
-
<listitem><simpara>any character that is not a horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>
468
+
<listitem><simpara>any character that is not a horizontal whitespace character</simpara></listitem>
458
469
</varlistentry>
459
470
<varlistentry>
460
471
<term><emphasis>\s</emphasis></term>
...
...
@@ -466,11 +477,11 @@
466
477
</varlistentry>
467
478
<varlistentry>
468
479
<term><emphasis>\v</emphasis></term>
469
-
<listitem><simpara>any vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
480
+
<listitem><simpara>any vertical whitespace character</simpara></listitem>
470
481
</varlistentry>
471
482
<varlistentry>
472
483
<term><emphasis>\V</emphasis></term>
473
-
<listitem><simpara>any character that is not a vertical whitespace character (since PHP 5.2.4)</simpara></listitem>
484
+
<listitem><simpara>any character that is not a vertical whitespace character</simpara></listitem>
474
485
</varlistentry>
475
486
<varlistentry>
476
487
<term><emphasis>\w</emphasis></term>
...
...
@@ -488,6 +499,12 @@
488
499
matches one, and only one, of each pair.
489
500
</para>
490
501
<para>
502
+
The "whitespace" characters are HT (9), LF (10), FF (12), CR (13),
503
+
and space (32). However, if locale-specific matching is happening,
504
+
characters with code points in the range 128-255 may also be considered
505
+
as whitespace characters, for instance NBSP (160, hex A0).
506
+
</para>
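<para>
As a rough illustration, the sketch below counts the characters that
<literal>\s</literal> matches in a made-up subject string containing each
of the five characters listed above. It assumes the default "C" locale and
no UTF-8 mode.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// Tab, line feed, form feed, carriage return and space, separated by letters.
$subject = "a\tb\nc\fd\re f";

// Count the characters that \s matches; five are expected here.
var_dump(preg_match_all('/\s/', $subject));
]]>
</programlisting>
</informalexample>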
507
+
<para>
491
508
A "word" character is any letter or digit or the underscore
492
509
character, that is, any character which can be part of a
493
510
Perl "<emphasis>word</emphasis>". The definition of letters and digits is
...
...
@@ -560,7 +577,7 @@
560
577
<para>
561
578
The <literal>\A</literal>, <literal>\Z</literal>, and
562
579
<literal>\z</literal> assertions differ from the traditional
563
-
circumflex and dollar (described below) in that they only
580
+
circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>) in that they only
564
581
ever match at the very start and end of the subject string,
565
582
whatever options are set. They are not affected by the
566
583
<link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
...
...
@@ -583,12 +600,16 @@
583
600
regexp metacharacters in the pattern. For example:
584
601
<literal>\w+\Q.$.\E$</literal> will match one or more word characters,
585
602
followed by literals <literal>.$.</literal> and anchored at the end of
586
-
the string.
603
+
the string. Note that this does not change the behavior of
604
+
delimiters; for instance the pattern <literal>#\Q#\E#$</literal>
605
+
is not valid, because the second <literal>#</literal> marks the end
606
+
of the pattern, and the <literal>\E#</literal> is interpreted as invalid
607
+
modifiers.
587
608
</para>
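<para>
The interaction between <literal>\Q...\E</literal> and delimiters can be
sketched as follows. The subjects are illustrative, and the
<literal>@</literal> operator is used only to keep the expected warning
out of the output.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// \Q...\E quotes regex meta-characters, so ".$." is matched literally.
var_dump(preg_match('/\w+\Q.$.\E$/', 'abc.$.'));

// The delimiter is still recognised inside \Q...\E, so this pattern is invalid:
// preg_match() emits a warning (suppressed here) and returns false.
var_dump(@preg_match('#\Q#\E#$', 'a#'));

// Choosing a delimiter that does not occur in the quoted text avoids the problem.
var_dump(preg_match('~\w\Q#\E$~', 'a#'));
]]>
</programlisting>
</informalexample>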
588
609

589
610
<para>
590
-
<literal>\K</literal> can be used to reset the match start since
591
-
PHP 5.2.4. For example, the pattern <literal>foo\Kbar</literal> matches
611
+
<literal>\K</literal> can be used to reset the match start.
612
+
For example, the pattern <literal>foo\Kbar</literal> matches
592
613
"foobar", but reports that it has matched "bar". The use of
593
614
<literal>\K</literal> does not interfere with the setting of captured
594
615
substrings. For example, when the pattern <literal>(foo)\Kbar</literal>
...
...
@@ -844,7 +865,7 @@
844
865
</tgroup>
845
866
</table>
846
867
<para>
847
-
Extended properties such as "Greek" or "InMusicalSymbols" are not
868
+
Extended properties such as <literal>InMusicalSymbols</literal> are not
848
869
supported by PCRE.
849
870
</para>
850
871
<para>
...
...
@@ -852,15 +873,193 @@
852
873
For example, <literal>\p{Lu}</literal> always matches only upper case letters.
853
874
</para>
854
875
<para>
855
-
The <literal>\X</literal> escape matches any number of Unicode characters
856
-
that form an extended Unicode sequence. <literal>\X</literal> is equivalent
857
-
to <literal>(?>\PM\pM*)</literal>.
876
+
Sets of Unicode characters are defined as belonging to certain scripts. A
877
+
character from one of these sets can be matched using a script name. For
878
+
example:
858
879
</para>
880
+
<itemizedlist>
881
+
<listitem>
882
+
<simpara><literal>\p{Greek}</literal></simpara>
883
+
</listitem>
884
+
<listitem>
885
+
<simpara><literal>\P{Han}</literal></simpara>
886
+
</listitem>
887
+
</itemizedlist>
859
888
<para>
860
-
That is, it matches a character without the "mark" property, followed
861
-
by zero or more characters with the "mark" property, and treats the
862
-
sequence as an atomic group (see below). Characters with the "mark"
863
-
property are typically accents that affect the preceding character.
889
+
Those that are not part of an identified script are lumped together as
890
+
<literal>Common</literal>. The current list of scripts is:
891
+
</para>
892
+
<table>
893
+
<title>Supported scripts</title>
894
+
<tgroup cols="5">
895
+
<tbody>
896
+
<row>
897
+
<entry><literal>Arabic</literal></entry>
898
+
<entry><literal>Armenian</literal></entry>
899
+
<entry><literal>Avestan</literal></entry>
900
+
<entry><literal>Balinese</literal></entry>
901
+
<entry><literal>Bamum</literal></entry>
902
+
</row>
903
+
<row>
904
+
<entry><literal>Batak</literal></entry>
905
+
<entry><literal>Bengali</literal></entry>
906
+
<entry><literal>Bopomofo</literal></entry>
907
+
<entry><literal>Brahmi</literal></entry>
908
+
<entry><literal>Braille</literal></entry>
909
+
</row>
910
+
<row>
911
+
<entry><literal>Buginese</literal></entry>
912
+
<entry><literal>Buhid</literal></entry>
913
+
<entry><literal>Canadian_Aboriginal</literal></entry>
914
+
<entry><literal>Carian</literal></entry>
915
+
<entry><literal>Chakma</literal></entry>
916
+
</row>
917
+
<row>
918
+
<entry><literal>Cham</literal></entry>
919
+
<entry><literal>Cherokee</literal></entry>
920
+
<entry><literal>Common</literal></entry>
921
+
<entry><literal>Coptic</literal></entry>
922
+
<entry><literal>Cuneiform</literal></entry>
923
+
</row>
924
+
<row>
925
+
<entry><literal>Cypriot</literal></entry>
926
+
<entry><literal>Cyrillic</literal></entry>
927
+
<entry><literal>Deseret</literal></entry>
928
+
<entry><literal>Devanagari</literal></entry>
929
+
<entry><literal>Egyptian_Hieroglyphs</literal></entry>
930
+
</row>
931
+
<row>
932
+
<entry><literal>Ethiopic</literal></entry>
933
+
<entry><literal>Georgian</literal></entry>
934
+
<entry><literal>Glagolitic</literal></entry>
935
+
<entry><literal>Gothic</literal></entry>
936
+
<entry><literal>Greek</literal></entry>
937
+
</row>
938
+
<row>
939
+
<entry><literal>Gujarati</literal></entry>
940
+
<entry><literal>Gurmukhi</literal></entry>
941
+
<entry><literal>Han</literal></entry>
942
+
<entry><literal>Hangul</literal></entry>
943
+
<entry><literal>Hanunoo</literal></entry>
944
+
</row>
945
+
<row>
946
+
<entry><literal>Hebrew</literal></entry>
947
+
<entry><literal>Hiragana</literal></entry>
948
+
<entry><literal>Imperial_Aramaic</literal></entry>
949
+
<entry><literal>Inherited</literal></entry>
950
+
<entry><literal>Inscriptional_Pahlavi</literal></entry>
951
+
</row>
952
+
<row>
953
+
<entry><literal>Inscriptional_Parthian</literal></entry>
954
+
<entry><literal>Javanese</literal></entry>
955
+
<entry><literal>Kaithi</literal></entry>
956
+
<entry><literal>Kannada</literal></entry>
957
+
<entry><literal>Katakana</literal></entry>
958
+
</row>
959
+
<row>
960
+
<entry><literal>Kayah_Li</literal></entry>
961
+
<entry><literal>Kharoshthi</literal></entry>
962
+
<entry><literal>Khmer</literal></entry>
963
+
<entry><literal>Lao</literal></entry>
964
+
<entry><literal>Latin</literal></entry>
965
+
</row>
966
+
<row>
967
+
<entry><literal>Lepcha</literal></entry>
968
+
<entry><literal>Limbu</literal></entry>
969
+
<entry><literal>Linear_B</literal></entry>
970
+
<entry><literal>Lisu</literal></entry>
971
+
<entry><literal>Lycian</literal></entry>
972
+
</row>
973
+
<row>
974
+
<entry><literal>Lydian</literal></entry>
975
+
<entry><literal>Malayalam</literal></entry>
976
+
<entry><literal>Mandaic</literal></entry>
977
+
<entry><literal>Meetei_Mayek</literal></entry>
978
+
<entry><literal>Meroitic_Cursive</literal></entry>
979
+
</row>
980
+
<row>
981
+
<entry><literal>Meroitic_Hieroglyphs</literal></entry>
982
+
<entry><literal>Miao</literal></entry>
983
+
<entry><literal>Mongolian</literal></entry>
984
+
<entry><literal>Myanmar</literal></entry>
985
+
<entry><literal>New_Tai_Lue</literal></entry>
986
+
</row>
987
+
<row>
988
+
<entry><literal>Nko</literal></entry>
989
+
<entry><literal>Ogham</literal></entry>
990
+
<entry><literal>Old_Italic</literal></entry>
991
+
<entry><literal>Old_Persian</literal></entry>
992
+
<entry><literal>Old_South_Arabian</literal></entry>
993
+
</row>
994
+
<row>
995
+
<entry><literal>Old_Turkic</literal></entry>
996
+
<entry><literal>Ol_Chiki</literal></entry>
997
+
<entry><literal>Oriya</literal></entry>
998
+
<entry><literal>Osmanya</literal></entry>
999
+
<entry><literal>Phags_Pa</literal></entry>
1000
+
</row>
1001
+
<row>
1002
+
<entry><literal>Phoenician</literal></entry>
1003
+
<entry><literal>Rejang</literal></entry>
1004
+
<entry><literal>Runic</literal></entry>
1005
+
<entry><literal>Samaritan</literal></entry>
1006
+
<entry><literal>Saurashtra</literal></entry>
1007
+
</row>
1008
+
<row>
1009
+
<entry><literal>Sharada</literal></entry>
1010
+
<entry><literal>Shavian</literal></entry>
1011
+
<entry><literal>Sinhala</literal></entry>
1012
+
<entry><literal>Sora_Sompeng</literal></entry>
1013
+
<entry><literal>Sundanese</literal></entry>
1014
+
</row>
1015
+
<row>
1016
+
<entry><literal>Syloti_Nagri</literal></entry>
1017
+
<entry><literal>Syriac</literal></entry>
1018
+
<entry><literal>Tagalog</literal></entry>
1019
+
<entry><literal>Tagbanwa</literal></entry>
1020
+
<entry><literal>Tai_Le</literal></entry>
1021
+
</row>
1022
+
<row>
1023
+
<entry><literal>Tai_Tham</literal></entry>
1024
+
<entry><literal>Tai_Viet</literal></entry>
1025
+
<entry><literal>Takri</literal></entry>
1026
+
<entry><literal>Tamil</literal></entry>
1027
+
<entry><literal>Telugu</literal></entry>
1028
+
</row>
1029
+
<row>
1030
+
<entry><literal>Thaana</literal></entry>
1031
+
<entry><literal>Thai</literal></entry>
1032
+
<entry><literal>Tibetan</literal></entry>
1033
+
<entry><literal>Tifinagh</literal></entry>
1034
+
<entry><literal>Ugaritic</literal></entry>
1035
+
</row>
1036
+
<row>
1037
+
<entry><literal>Vai</literal></entry>
1038
+
<entry><literal>Yi</literal></entry>
1039
+
<entry />
1040
+
<entry />
1041
+
<entry />
1042
+
<entry />
1043
+
</row>
1044
+
</tbody>
1045
+
</tgroup>
1046
+
</table>
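<para>
A short sketch of matching by script name follows. It assumes a PCRE
library built with Unicode support and uses the <literal>u</literal>
modifier so that the subject is treated as UTF-8; the subject strings are
illustrative.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// Every character of the subject belongs to the Greek script.
var_dump(preg_match('/^\p{Greek}+$/u', 'αβγ'));

// No character of the subject belongs to the Han script.
var_dump(preg_match('/^\P{Han}+$/u', 'abc'));

// At least one character of the subject belongs to the Han script.
var_dump(preg_match('/\p{Han}/u', 'abc 漢字'));
]]>
</programlisting>
</informalexample>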
1047
+
<para>
1048
+
The <literal>\X</literal> escape matches a Unicode extended grapheme
1049
+
cluster. An extended grapheme cluster is one or more Unicode characters
1050
+
that combine to form a single glyph. In effect, this can be thought of as
1051
+
the Unicode equivalent of <literal>.</literal> as it will match one
1052
+
composed character, regardless of how many individual characters are
1053
+
actually used to render it.
1054
+
</para>
1055
+
<para>
1056
+
In versions of PCRE older than 8.32 (which corresponds to PHP versions
1057
+
before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>
1058
+
is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
1059
+
character without the "mark" property, followed by zero or more characters
1060
+
with the "mark" property, and treats the sequence as an atomic group (see
1061
+
below). Characters with the "mark" property are typically accents that
1062
+
affect the preceding character.
864
1063
</para>
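<para>
The difference between <literal>.</literal> and <literal>\X</literal> can
be illustrated with a base letter followed by a combining accent. This
sketch assumes PHP 7.0 or later for the <literal>\u{...}</literal> string
escape; the subject is made up for the example.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// "e" followed by U+0301 (combining acute accent), then "a":
// three code points, but two user-perceived characters.
$subject = "e\u{0301}a";

var_dump(preg_match_all('/./u', $subject));  // counts code points
var_dump(preg_match_all('/\X/u', $subject)); // counts grapheme clusters
]]>
</programlisting>
</informalexample>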
865
1064
<para>
866
1065
Matching characters by Unicode property is not fast, because PCRE has
...
...
@@ -1082,6 +1281,16 @@
1082
1281
<para>
1083
1282
In UTF-8 mode, characters with values greater than 128 do not match any
1084
1283
of the POSIX character classes.
1284
+
As of libpcre 8.10, some character classes have been changed to use
1285
+
Unicode character properties, in which case this restriction does
1286
+
not apply. Refer to the <link xlink:href="&url.pcre.man;">PCRE(3) manual</link>
1287
+
for details.
1288
+
</para>
1289
+
<para>
1290
+
Unicode character properties can appear inside a character class, but they
1291
+
cannot be part of a range. A minus (hyphen) character that follows a Unicode
1292
+
character property is matched literally. Trying to end a range with a Unicode
1293
+
character property will result in a warning.
1085
1294
</para>
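<para>
As a small sketch, the class below combines two Unicode properties with a
hyphen placed last, where it is taken literally. The pattern and subjects
are illustrative only.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// Upper case letters, decimal digits, or a literal hyphen.
$pattern = '/^[\p{Lu}\p{Nd}-]+$/u';

var_dump(preg_match($pattern, 'A-1')); // all characters are in the class
var_dump(preg_match($pattern, 'a-1')); // "a" is not an upper case letter
]]>
</programlisting>
</informalexample>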
1086
1295
</section>
1087
1296

...
...
@@ -1141,7 +1350,7 @@
1141
1350
</row>
1142
1351
<row>
1143
1352
<entry><literal>X</literal></entry>
1144
-
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link></entry>
1353
+
<entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> (no longer supported as of PHP 7.3.0)</entry>
1145
1354
</row>
1146
1355
<row>
1147
1356
<entry><literal>J</literal></entry>
...
...
@@ -1254,8 +1463,8 @@
1254
1463

1255
1464
the captured substrings are "white queen" and "queen", and
1256
1465
are numbered 1 and 2. The maximum number of captured substrings
1257
-
is 99, and the maximum number of all subpatterns,
1258
-
both capturing and non-capturing, is 200.
1466
+
is 65535. It may not be possible to compile such large patterns,
1467
+
however, depending on the configuration options of libpcre.
1259
1468
</para>
1260
1469
<para>
1261
1470
As a convenient shorthand, if any option settings are
...
...
@@ -1285,7 +1494,7 @@
1285
1494
It is possible to name a subpattern using the syntax
1286
1495
<literal>(?P&lt;name&gt;pattern)</literal>. This subpattern will then
1287
1496
be indexed in the matches array by its normal numeric position and
1288
-
also by name. PHP 5.2.2 introduced two alternative syntaxes
1497
+
also by name. There are two alternative syntaxes
1289
1498
<literal>(?&lt;name&gt;pattern)</literal> and <literal>(?'name'pattern)</literal>.
1290
1499
</para>
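<para>
The three spellings can be mixed in one pattern, as the following sketch
suggests. The subpattern names and the subject are hypothetical; each
captured value is available under both its name and its numeric position.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// One named subpattern per syntax; double quotes are used because the
// pattern itself contains single quotes.
$pattern = "/(?P<year>\\d{4})-(?<month>\\d{2})-(?'day'\\d{2})/";

preg_match($pattern, '2024-05-17', $matches);

var_dump($matches['year'], $matches[1]);
var_dump($matches['month'], $matches[2]);
var_dump($matches['day'], $matches[3]);
]]>
</programlisting>
</informalexample>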
1291
1500

...
...
@@ -1306,9 +1515,10 @@
1306
1515

1307
1516
<para>
1308
1517
Here <literal>Sun</literal> is stored in backreference 2, while
1309
-
backreference 1 is empty. Matching yields <literal>Sat</literal> in
1310
-
backreference 1 while backreference 2 does not exist. Changing the pattern
1311
-
to use the <literal>(?|</literal> fixes this problem:
1518
+
backreference 1 is empty. Matching <literal>Saturday</literal> yields
1519
+
<literal>Sat</literal> in backreference 1 while backreference 2 does
1520
+
not exist. Changing the pattern to use the <literal>(?|</literal> fixes
1521
+
this problem:
1312
1522
</para>
1313
1523

1314
1524
<informalexample>
...
...
@@ -1522,7 +1732,7 @@
1522
1732
A "forward back reference" can make sense when a repetition
1523
1733
is involved and the subpattern to the right has participated
1524
1734
in an earlier iteration. See the section
1525
-
entitled "Backslash" above for further details of the handling
1735
+
on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling
1526
1736
of digits following a backslash.
1527
1737
</para>
1528
1738
<para>
...
...
@@ -1577,7 +1787,7 @@
1577
1787
example above, or by a quantifier with a minimum of zero.
1578
1788
</para>
1579
1789
<para>
1580
-
As of PHP 5.2.2, the <literal>\g</literal> escape sequence can be
1790
+
The <literal>\g</literal> escape sequence can be
1581
1791
used for absolute and relative referencing of subpatterns.
1582
1792
This escape sequence must be followed by an unsigned number or a negative
1583
1793
number, optionally enclosed in braces. The sequences <literal>\1</literal>,
...
...
@@ -1598,10 +1808,10 @@
1598
1808
</para>
1599
1809
<para>
1600
1810
Back references to the named subpatterns can be achieved by
1601
-
<literal>(?P=name)</literal> or, since PHP 5.2.2, also by
1602
-
<literal>\k&lt;name&gt;</literal> or <literal>\k'name'</literal>.
1603
-
Additionally PHP 5.2.4 added support for <literal>\k{name}</literal>
1604
-
and <literal>\g{name}</literal>.
1811
+
<literal>(?P=name)</literal>,
1812
+
<literal>\k&lt;name&gt;</literal>, <literal>\k'name'</literal>,
1813
+
<literal>\k{name}</literal> or <literal>\g{name}</literal>. The similar-looking
1814
+
<literal>\g&lt;name&gt;</literal> and <literal>\g'name'</literal> are subroutine calls, not back references.
1605
1815
</para>
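<para>
The sketch below tries some of the spellings listed above against the same
subject. The subpattern name <literal>ch</literal> and the subject are
hypothetical; each pattern looks for a doubled lowercase letter.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// Different back reference spellings for the subpattern named "ch".
$patterns = [
    '/(?<ch>[a-z])(?P=ch)/',
    '/(?<ch>[a-z])\k<ch>/',
    "/(?<ch>[a-z])\\k'ch'/",
    '/(?<ch>[a-z])\k{ch}/',
    '/(?<ch>[a-z])\g{ch}/',
];

foreach ($patterns as $pattern) {
    preg_match($pattern, 'hello', $matches);
    var_dump($matches[0]); // the doubled letter "ll" in every case
}
]]>
</programlisting>
</informalexample>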
1606
1816
</section>
1607
1817

...
...
@@ -1611,7 +1821,7 @@
1611
1821
An assertion is a test on the characters following or
1612
1822
preceding the current matching point that does not actually
1613
1823
consume any characters. The simple assertions coded as \b,
1614
-
\B, \A, \Z, \z, ^ and $ are described above. More complicated
1824
+
\B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link> and <link linkend="regexp.reference.anchors">anchors</link>. More complicated
1615
1825
assertions are coded as subpatterns. There are two
1616
1826
kinds: those that <emphasis>look ahead</emphasis> of the current position in the
1617
1827
subject string, and those that <emphasis>look behind</emphasis> it.
...
...
@@ -1883,9 +2093,14 @@
1883
2093
more readable (assume the <link
1884
2094
linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
1885
2095
option) and to divide it into three parts for ease of discussion:
1886
-
1887
-
<literal>( \( )? [^()]+ (?(1) \) )</literal>
1888
2096
</para>
2097
+
<informalexample>
2098
+
<programlisting>
2099
+
<![CDATA[
2100
+
( \( )? [^()]+ (?(1) \) )
2101
+
]]>
2102
+
</programlisting>
2103
+
</informalexample>
1889
2104
<para>
1890
2105
The first part matches an optional opening parenthesis, and
1891
2106
if that character is present, sets it as the first captured
...
...
@@ -1947,6 +2162,41 @@
1947
2162
introduces a comment that continues up to the next newline character
1948
2163
in the pattern.
1949
2164
</para>
2165
+
<para>
2166
+
<example>
2167
+
<title>Usage of comments in PCRE pattern</title>
2168
+
<programlisting role="php">
2169
+
<![CDATA[
2170
+
<?php
2171
+

2172
+
$subject = 'test';
2173
+

2174
+
/* (?# can be used to add comments without enabling PCRE_EXTENDED */
2175
+
$match = preg_match('/te(?# this is a comment)st/', $subject);
2176
+
var_dump($match);
2177
+

2178
+
/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */
2179
+
$match = preg_match('/te #~~~~
2180
+
st/', $subject);
2181
+
var_dump($match);
2182
+

2183
+
/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything
2184
+
that follows an unescaped # on the same line is ignored */
2185
+
$match = preg_match('/te #~~~~
2186
+
st/x', $subject);
2187
+
var_dump($match);
2188
+
]]>
2189
+
</programlisting>
2190
+
&example.outputs;
2191
+
<screen>
2192
+
<![CDATA[
2193
+
int(1)
2194
+
int(0)
2195
+
int(1)
2196
+
]]>
2197
+
</screen>
2198
+
</example>
2199
+
</para>
1950
2200
</section>
1951
2201

1952
2202
<section xml:id="regexp.reference.recursive">
...
...
@@ -2016,7 +2266,7 @@
2016
2266
<literal>(?1)</literal>, <literal>(?2)</literal> and so on
2017
2267
can be used for recursive subpatterns too. It is also possible to use named
2018
2268
subpatterns: <literal>(?P&gt;name)</literal> or
2019
-
<literal>(?P&amp;name)</literal>.
2269
+
<literal>(?&amp;name)</literal>.
2020
2270
</para>
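<para>
A recursive call by name might look like the following sketch, which
checks for balanced parentheses. The subpattern name
<literal>paren</literal> and the subjects are hypothetical.
</para>
<informalexample>
<programlisting role="php">
<![CDATA[
<?php

// (?&paren) calls the named subpattern "paren" recursively, so nested
// parentheses are accepted to any depth.
$pattern = '/^(?<paren>\((?:[^()]++|(?&paren))*\))$/';

var_dump(preg_match($pattern, '(a(b)c)')); // balanced
var_dump(preg_match($pattern, '(a(b)c'));  // unbalanced
]]>
</programlisting>
</informalexample>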
2021
2271
<para>
2022
2272
If the syntax for a recursive subpattern reference (either by number or
2023
2273