PHP: Documentation Tools

reference/pcre/pattern.syntax.xml
bb4abab22bf0204b4dba0140ac5fc9daa6888e0f

...

@@ -8,21 +8,21 @@
  <section xml:id="regexp.introduction">
   <title>Introduction</title>
   <para>
-   The syntax and semantics of  the  regular  expressions
-   supported  by PCRE are described below. Regular expressions are
-   also described in the Perl documentation and in a number  of
-   other  books,  some  of which have copious examples. Jeffrey
-   Friedl's  "Mastering  Regular  Expressions",  published   by
-   O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.
+   The syntax and semantics of the regular expressions
+   supported by PCRE are described in this section. Regular expressions are
+   also described in the Perl documentation and in a number of
+   other books, some of which have copious examples. Jeffrey
+   Friedl's "Mastering Regular Expressions", published by
+   O'Reilly (ISBN 1-56592-257-3), covers them in great detail.
    The description here is intended as reference documentation.
   </para>
   <para>
-   A regular expression is a pattern that is matched against  a
+   A regular expression is a pattern that is matched against a
    subject string from left to right. Most characters stand for
    themselves in a pattern, and match the corresponding
    characters in the subject. As a trivial example, the pattern
    <literal>The quick brown fox</literal>
-   matches a portion of a subject string that is  identical  to
+   matches a portion of a subject string that is identical to
    itself.
   </para>
  </section>

...

@@ -102,15 +102,15 @@
  <section xml:id="regexp.reference.meta">
   <title>Meta-characters</title>
   <para>
-   The  power  of  regular  expressions comes from the
+   The power of regular expressions comes from the
    ability to include alternatives and repetitions in the
-   pattern.  These  are encoded in the pattern by the use of
-   <emphasis>meta-characters</emphasis>, which do not stand for  themselves  but  instead
+   pattern. These are encoded in the pattern by the use of
+   <emphasis>meta-characters</emphasis>, which do not stand for themselves but instead
    are interpreted in some special way.
   </para>
   <para>
-   There are two different sets of meta-characters: those  that
-   are  recognized anywhere in the pattern except within square
+   There are two different sets of meta-characters: those that
+   are recognized anywhere in the pattern except within square
    brackets, and those that are recognized in square brackets.
    Outside square brackets, the meta-characters are as follows:
 

...

@@ -130,7 +130,8 @@
        <entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>
       </row>
       <row>
-       <entry>$</entry><entry>assert end of subject or before a terminating newline (or end of line, in multiline mode)</entry>
+       <entry>$</entry><entry>assert end of subject or before a terminating newline (or
+        end of line, in multiline mode)</entry>
       </row>
       <row>
        <entry>.</entry><entry>match any character except newline (by default)</entry>

...

@@ -204,9 +205,9 @@
  <section xml:id="regexp.reference.escape">
   <title>Escape sequences</title>
   <para>
-   The backslash character has several uses. Firstly, if it  is
+   The backslash character has several uses. Firstly, if it is
    followed by a non-alphanumeric character, it takes away any
-   special  meaning that character may have. This use of
+   special meaning that character may have. This use of
    backslash as an escape character applies both inside and
    outside character classes.
   </para>

...
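As a minimal sketch of the escaping rule in the hunk above: the snippet below uses Python's `re` module as a stand-in engine (an assumption made for illustration; PHP code would go through the `preg_*` functions instead), since backslash-escaping of non-alphanumeric characters behaves the same way there. The subject strings are invented.

```python
import re

# An unescaped "*" is a quantifier; "\*" stands for a literal asterisk.
literal_star = re.search(r"2\*3", "x = 2*3")

# Escaping a non-alphanumeric character that has no special meaning is
# harmless: "\%" still matches a plain percent sign.
harmless = re.search(r"100\%", "100% done")

# The same escape also works inside a character class.
in_class = re.findall(r"[\*\+]", "a*b+c")

print(literal_star.group(), harmless.group(), in_class)
```

The raw-string prefix (`r"..."`) keeps Python itself from interpreting the backslashes, so the pattern reaches the regex engine unchanged.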

@@ -215,7 +216,7 @@
    "\*" in the pattern. This applies whether or not the
    following character would otherwise be interpreted as a
    meta-character, so it is always safe to precede a non-alphanumeric
-   with "\" to specify that it stands for itself.  In
+   with "\" to specify that it stands for itself. In
    particular, if you want to match a backslash, you write "\\".
   </para>
   <note>

...

@@ -237,10 +238,10 @@
   <para>
    A second use of backslash provides a way of encoding
    non-printing characters in patterns in a visible manner. There
-   is no restriction on the appearance of non-printing  characters,
+   is no restriction on the appearance of non-printing characters,
    apart from the binary zero that terminates a pattern,
    but when a pattern is being prepared by text editing, it is
-   usually  easier to use one of the following escape sequences
+   usually easier to use one of the following escape sequences
    than the binary character it represents:
   </para>
   <para>

...

@@ -331,9 +332,9 @@
   </para>
   <para>
    The precise effect of "<literal>\cx</literal>" is as follows:
-   if "<literal>x</literal>" is a lower case  letter, it is converted
+   if "<literal>x</literal>" is a lower case letter, it is converted
    to upper case. Then bit 6 of the character (hex 40) is inverted.
-   Thus "<literal>\cz</literal>" becomes  hex 1A, but
+   Thus "<literal>\cz</literal>" becomes hex 1A, but
    "<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"
    becomes hex 7B.
   </para>

...
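The `\cx` rule in the hunk above is plain bit arithmetic, so it can be checked without a regex engine. The following is a small sketch in Python; the helper name `control_char` is hypothetical and exists only for this illustration.

```python
def control_char(x: str) -> int:
    """Return the byte value that "\\cx" stands for, following the rule above."""
    # A lower case letter is converted to upper case first.
    if "a" <= x <= "z":
        x = x.upper()
    # Then bit 6 of the character code (hex 40) is inverted.
    return ord(x) ^ 0x40


for ch in ("z", "{", ";"):
    print(f"\\c{ch} -> hex {control_char(ch):02X}")
```

Running it reproduces the three values quoted in the text: hex 1A for `\cz`, hex 3B for `\c{`, and hex 7B for `\c;`.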

@@ -349,7 +350,7 @@
   </para>
   <para>
    After "<literal>\0</literal>" up to two further octal digits are read.
-   In  both cases,  if  there are fewer than two digits, just those that
+   In both cases, if there are fewer than two digits, just those that
    are present are used. Thus the sequence "<literal>\0\x\07</literal>"
    specifies two binary zeros followed by a BEL character. Make sure you
    supply two digits after the initial zero if the character

...

@@ -358,20 +359,20 @@
   <para>
    The handling of a backslash followed by a digit other than 0
    is complicated. Outside a character class, PCRE reads it
-   and any following digits as a decimal number. If the  number
-   is  less  than  10, or if there have been at least that many
-   previous capturing left parentheses in the  expression,  the
-   entire  sequence is taken as a <emphasis>back reference</emphasis>. A description
-   of how this works is given later, following  the  discussion
+   and any following digits as a decimal number. If the number
+   is less than 10, or if there have been at least that many
+   previous capturing left parentheses in the expression, the
+   entire sequence is taken as a <emphasis>back reference</emphasis>. A description
+   of how this works is given later, following the discussion
    of parenthesized subpatterns.
   </para>
   <para>
-   Inside a character  class,  or  if  the  decimal  number  is
+   Inside a character class, or if the decimal number is
    greater than 9 and there have not been that many capturing
    subpatterns, PCRE re-reads up to three octal digits following
    the backslash, and generates a single byte from the
    least significant 8 bits of the value. Any subsequent digits
-   stand for themselves.  For example:
+   stand for themselves. For example:
   </para>
   <para>
    <variablelist>

...

@@ -439,7 +440,7 @@
    digits are ever read.
   </para>
   <para>
-   All the sequences that define a single byte value can  be
+   All the sequences that define a single byte value can be
    used both inside and outside character classes. In addition,
    inside a character class, the sequence "<literal>\b</literal>"
    is interpreted as the backspace character (hex 08). Outside a character

...

@@ -506,7 +507,7 @@
   </para>
   <para>
    A "word" character is any letter or digit or the underscore
-   character,  that  is,  any  character which can be part of a
+   character, that is, any character which can be part of a
    Perl "<emphasis>word</emphasis>". The definition of letters and digits is
    controlled by PCRE's character tables, and may vary if locale-specific
    matching is taking place. For example, in the "fr" (French) locale, some

...

@@ -515,15 +516,15 @@
   </para>
   <para>
    These character type sequences can appear both inside and
-   outside  character classes. They each match one character of
-   the appropriate type. If the current matching  point is at
+   outside character classes. They each match one character of
+   the appropriate type. If the current matching point is at
    the end of the subject string, all of them fail, since there
    is no character to match.
   </para>
   <para>
-   The fourth use of backslash is  for  certain  simple
+   The fourth use of backslash is for certain simple
    assertions. An assertion specifies a condition that has to be met
-   at a particular point in  a match, without consuming any
+   at a particular point in a match, without consuming any
    characters from the subject string. The use of subpatterns
    for more complicated assertions is described below. The
    backslashed assertions are

...

@@ -562,7 +563,7 @@
    </variablelist>
   </para>
   <para>
-   These assertions may not appear in  character  classes  (but
+   These assertions may not appear in character classes (but
    note that "<literal>\b</literal>" has a different meaning, namely the backspace
    character, inside a character class).
   </para>

...

@@ -570,20 +571,20 @@
    A word boundary is a position in the subject string where
    the current character and the previous character do not both
    match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches
-   <literal>\w</literal> and  the  other  matches
+   <literal>\w</literal> and the other matches
    <literal>\W</literal>), or the start or end of the string if the first
    or last character matches <literal>\w</literal>, respectively.
   </para>
   <para>
    The <literal>\A</literal>, <literal>\Z</literal>, and
-   <literal>\z</literal> assertions differ  from  the  traditional
-   circumflex  and  dollar  (described in <link linkend="regexp.reference.anchors">anchors</link> ) in that they only
-   ever match at the very start and end of the subject  string,
-   whatever  options  are  set.  They  are  not affected by the
+   <literal>\z</literal> assertions differ from the traditional
+   circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link> )
+   in that they only ever match at the very start and end of the subject string,
+   whatever options are set. They are not affected by the
    <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or
    <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>
-   options. The  difference  between <literal>\Z</literal> and
-   <literal>\z</literal>  is that <literal>\Z</literal> matches before a
+   options. The difference between <literal>\Z</literal> and
+   <literal>\z</literal> is that <literal>\Z</literal> matches before a
    newline that is the last character of the string as well as at the end of
    the string, whereas <literal>\z</literal> matches only at the end.
   </para>

...
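The word-boundary definition in the hunk above can be made concrete by listing the positions at which `\b` is true. This is a sketch using Python's `re` module as a stand-in engine (an assumption for illustration only). Only `\b` is shown, because Python spells the end-of-subject assertions differently from PCRE, so `\Z` and `\z` are deliberately left out. The subject strings are invented.

```python
import re

subject = "hi there"

# \b consumes nothing, so every match is an empty string located at a
# boundary between a \w character and a \W character (or the string edge).
boundaries = [m.start() for m in re.finditer(r"\b", subject)]
print(boundaries)

# Inside a character class, \b means the backspace character (hex 08).
backspace = re.search(r"[\b]", "a\x08b")
print(repr(backspace.group()))
```

The reported positions are the start of "hi", the gap after it, the start of "there", and the end of the subject.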

@@ -600,7 +601,11 @@
    regexp metacharacters in the pattern. For example:
    <literal>\w+\Q.$.\E$</literal> will match one or more word characters,
    followed by literals <literal>.$.</literal> and anchored at the end of
-   the string.
+   the string. Note that this does not change the behavior of
+   delimiters; for instance the pattern <literal>#\Q#\E#$</literal>
+   is not valid, because the second <literal>#</literal> marks the end
+   of the pattern, and the <literal>\E#</literal> is interpreted as invalid
+   modifiers.
   </para>
 
   <para>

...

@@ -835,7 +840,7 @@
      <row rowsep="1">
       <entry><literal>So</literal></entry>
       <entry>Other symbol</entry>
-      <entry></entry>
+      <entry>Includes emojis</entry>
      </row>
      <row>
       <entry><literal>Z</literal></entry>

...

@@ -869,8 +874,8 @@
    For example, <literal>\p{Lu}</literal> always matches only upper case letters.
   </para>
   <para>
-   Sets of Unicode characters are defined as belonging to certain scripts.  A
-   character from one of these sets can be matched using a script name.  For
+   Sets of Unicode characters are defined as belonging to certain scripts. A
+   character from one of these sets can be matched using a script name. For
    example:
   </para>
   <itemizedlist>

...

@@ -882,7 +887,7 @@
    </listitem>
   </itemizedlist>
   <para>
-   Those that are not part of an identified script are lumped together  as
+   Those that are not part of an identified script are lumped together as
    <literal>Common</literal>. The current list of scripts is:
   </para>
   <table>

...

@@ -1051,7 +1056,7 @@
   <para>
    In versions of PCRE older than 8.32 (which corresponds to PHP versions
    before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>
-   is equivalent to <literal>(?>\PM\pM*)</literal>.  That is, it matches a
+   is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a
    character without the "mark" property, followed by zero or more characters
    with the "mark" property, and treats the sequence as an atomic group (see
    below). Characters with the "mark" property are typically accents that

...

@@ -1071,8 +1076,8 @@
   <para>
    Outside a character class, in the default matching mode, the
    circumflex character (<literal>^</literal>) is an assertion which
-   is true only if the current matching point is at the start  of
-   the  subject string. Inside a character class, circumflex (<literal>^</literal>)
+   is true only if the current matching point is at the start of
+   the subject string. Inside a character class, circumflex (<literal>^</literal>)
    has an entirely different meaning (see below).
   </para>
   <para>

...

@@ -1087,12 +1092,12 @@
   </para>
   <para>
    A dollar character (<literal>$</literal>) is an assertion which is
-   &true; only if the current  matching point is at the end of the subject
-   string, or immediately before a newline character that is  the  last
+   &true; only if the current matching point is at the end of the subject
+   string, or immediately before a newline character that is the last
    character in the string (by default). Dollar (<literal>$</literal>)
-   need not be the last character of the pattern if a  number  of
-   alternatives are  involved,  but it should be the last item in any branch
-   in which it appears. Dollar has no  special  meaning  in  a
+   need not be the last character of the pattern if a number of
+   alternatives are involved, but it should be the last item in any branch
+   in which it appears. Dollar has no special meaning in a
    character class.
   </para>
   <para>

...
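The default-mode behaviour of `$` described in the hunk above can be sketched as follows. Python's `re` module is used as a stand-in engine (an assumption for illustration); its default `$` has the same "end of subject, or just before a final newline" meaning. The subject strings are invented.

```python
import re

# By default "$" is true at the very end of the subject...
at_end = re.search(r"fox$", "quick fox")

# ...and also immediately before a newline that is the last character.
before_final_newline = re.search(r"fox$", "quick fox\n")

# A newline in the middle of the subject does not count in the default mode.
mid_subject = re.search(r"fox$", "quick fox\nlazy dog")

# With alternatives, "$" only needs to be the last item of its own branch.
branch = re.search(r"fox$|dog", "lazy dog barks")

print(at_end is not None, before_final_newline.end(), mid_subject, branch.group())
```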

@@ -1118,9 +1123,9 @@
    set.
   </para>
   <para>
-   Note that the sequences \A, \Z, and \z can be used to  match
-   the  start  and end of the subject in both modes, and if all
-   branches of a pattern start with \A is it  always  anchored,
+   Note that the sequences \A, \Z, and \z can be used to match
+   the start and end of the subject in both modes, and if all
+   branches of a pattern start with \A is it always anchored,
    whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
    is set or not.
   </para>

...

@@ -1129,14 +1134,14 @@
  <section xml:id="regexp.reference.dot">
   <title>Dot</title>
   <para>
-   Outside a character class, a dot in the pattern matches  any
-   one  character  in  the  subject,  including  a non-printing
-   character, but not (by default) newline.  If the
+   Outside a character class, a dot in the pattern matches any
+   one character in the subject, including a non-printing
+   character, but not (by default) newline. If the
    <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
-   option  is  set,  then dots match newlines as well. The
+   option is set, then dots match newlines as well. The
    handling of dot is entirely independent of the handling of
-   circumflex  and  dollar,  the only relationship being that they
-   both involve newline characters.  Dot has no special meaning
+   circumflex and dollar, the only relationship being that they
+   both involve newline characters. Dot has no special meaning
    in a character class.
   </para>
   <para>

...
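A short sketch of the dot rules from the hunk above, again with Python's `re` module standing in for PCRE (an assumption for illustration). Python's `re.DOTALL` flag is used here as the counterpart of the PCRE_DOTALL option; the subject strings are invented.

```python
import re

subject = "a\nb"

# By default the dot skips the newline.
default_dot = re.findall(r".", subject)

# With the dot-all option set, the newline is matched as well.
dotall_dot = re.findall(r".", subject, re.DOTALL)

# Non-printing characters such as BEL (hex 07) are matched by the dot.
non_printing = re.fullmatch(r".", "\x07")

# Inside a character class the dot is an ordinary character.
in_class = re.findall(r"[.]", "a.b")

print(default_dot, dotall_dot, non_printing is not None, in_class)
```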

@@ -1150,29 +1155,29 @@
   <title>Character classes</title>
   <para>
    An opening square bracket introduces a character class,
-   terminated  by  a  closing  square  bracket.  A  closing square
-   bracket on its own is  not  special.  If  a  closing  square
-   bracket  is  required as a member of the class, it should be
+   terminated by a closing square bracket. A closing square
+   bracket on its own is not special. If a closing square
+   bracket is required as a member of the class, it should be
    the first data character in the class (after an initial
    circumflex, if present) or escaped with a backslash.
   </para>
   <para>
    A character class matches a single character in the subject;
-   the  character  must  be in the set of characters defined by
+   the character must be in the set of characters defined by
    the class, unless the first character in the class is a
-   circumflex,  in which case the subject character must not be in
-   the set defined by the class. If a  circumflex  is  actually
-   required  as  a  member  of  the class, ensure it is not the
+   circumflex, in which case the subject character must not be in
+   the set defined by the class. If a circumflex is actually
+   required as a member of the class, ensure it is not the
    first character, or escape it with a backslash.
   </para>
   <para>
-   For example, the character class [aeiou] matches  any  lower
+   For example, the character class [aeiou] matches any lower
    case vowel, while [^aeiou] matches any character that is not
-   a lower case vowel. Note that a circumflex is  just  a
-   convenient  notation for specifying the characters which are in
-   the class by enumerating those that are not. It  is  not  an
-   assertion:  it  still  consumes a character from the subject
-   string, and fails if the current pointer is at  the  end  of
+   a lower case vowel. Note that a circumflex is just a
+   convenient notation for specifying the characters which are in
+   the class by enumerating those that are not. It is not an
+   assertion: it still consumes a character from the subject
+   string, and fails if the current pointer is at the end of
    the string.
   </para>
   <para>

...
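The character-class rules in the hunk above can be illustrated with a few one-liners. As before, Python's `re` module is a stand-in engine (an assumption for illustration) and the subject strings are invented.

```python
import re

# [aeiou] matches any lower case vowel; [^aeiou] matches anything else.
vowels = re.findall(r"[aeiou]", "regular")
non_vowels = re.findall(r"[^aeiou]", "regular")

# A negated class is not an assertion: it must consume a character,
# so it fails when the matching point is already at the end of the string.
needs_a_character = re.search(r"a[^b]", "a")
consumes_one = re.search(r"a[^b]", "ac")

# "]" is a class member when it is the first data character,
# and "^" is a class member when it is not the first character.
bracket_first = re.findall(r"[]x]", "a]x")
caret_later = re.findall(r"[a^]", "a^b")

print(vowels, non_vowels, needs_a_character, consumes_one.group())
print(bracket_first, caret_later)
```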

@@ -1184,61 +1189,62 @@
   </para>
   <para>
    The newline character is never treated in any special way in
-   character  classes,  whatever the setting of the <link
+   character classes, whatever the setting of the <link
    linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>
    or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
    options is. A class such as [^a] will always match a newline.
   </para>
   <para>
-   The minus (hyphen) character can be used to specify a  range
-   of  characters  in  a  character  class.  For example, [d-m]
-   matches any letter between d and m, inclusive.  If  a  minus
-   character  is required in a class, it must be escaped with a
+   The minus (hyphen) character can be used to specify a range
+   of characters in a character class. For example, [d-m]
+   matches any letter between d and m, inclusive. If a minus
+   character is required in a class, it must be escaped with a
    backslash or appear in a position where it cannot be
    interpreted as indicating a range, typically as the first or last
    character in the class.
   </para>
   <para>
-   It is not possible to have the literal character "]" as  the
-   end  character  of  a  range.  A  pattern such as [W-]46] is
+   It is not possible to have the literal character "]" as the
+   end character of a range. A pattern such as [W-]46] is
    interpreted as a class of two characters ("W" and "-")
    followed by a literal string "46]", so it would match "W46]" or
-   "-46]". However, if the "]" is escaped with a  backslash  it
-   is  interpreted  as  the end of range, so [W-\]46] is
-   interpreted as a single class containing a range followed by  two
+   "-46]". However, if the "]" is escaped with a backslash it
+   is interpreted as the end of range, so [W-\]46] is
+   interpreted as a single class containing a range followed by two
    separate characters. The octal or hexadecimal representation
    of "]" can also be used to end a range.
   </para>
   <para>
    Ranges operate in ASCII collating sequence. They can also be
-   used  for  characters  specified  numerically,  for  example
-   [\000-\037]. If a range that includes letters is  used  when
-   case-insensitive (caseless)  matching  is set, it matches the
-   letters in either case. For example, [W-c] is equivalent  to
+   used for characters specified numerically, for example
+   [\000-\037]. If a range that includes letters is used when
+   case-insensitive (caseless) matching is set, it matches the
+   letters in either case. For example, [W-c] is equivalent to
    [][\^_`wxyzabc], matched case-insensitively, and if character
    tables for the "fr" locale are in use, [\xc8-\xcb] matches
    accented E characters in both cases.
   </para>
   <para>
-   The character types \d, \D, \s, \S,  \w,  and  \W  may  also
-   appear  in  a  character  class, and add the characters that
+   The character types \d, \D, \s, \S, \w, and \W may also
+   appear in a character class, and add the characters that
    they match to the class. For example, [\dABCDEF] matches any
-   hexadecimal  digit.  A  circumflex  can conveniently be used
-   with the upper case character types to specify a  more
+   hexadecimal digit. A circumflex can conveniently be used
+   with the upper case character types to specify a more
    restricted set of characters than the matching lower case type.
-   For example, the class [^\W_] matches any letter  or  digit,
+   For example, the class [^\W_] matches any letter or digit,
    but not underscore.
   </para>
   <para>
-   All non-alphanumeric characters other than \,  -,  ^  (at  the
-   start)  and  the  terminating ] are non-special in character
+   All non-alphanumeric characters other than \, -, ^ (at the
+   start) and the terminating ] are non-special in character
    classes, but it does no harm if they are escaped. The pattern
    terminator is always special and must be escaped when used
    within an expression.
   </para>
   <para>
    Perl supports the POSIX notation for character classes. This uses names
-   enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also
+   enclosed by <literal>[:</literal> and <literal>:]</literal> within
+   the enclosing square brackets. PCRE also
    supports this notation. For example, <literal>[01[:alpha:]%]</literal>
    matches "0", "1", any alphabetic character, or "%". The supported class
    names are:

...
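The range rules in the hunk above, in particular the `[W-]46]` versus `[W-\]46]` distinction, can be checked with a short sketch. Python's `re` module is a stand-in engine here (an assumption for illustration); it does not support the POSIX `[:alpha:]` notation, so that part is not shown. `re.ASCII` is passed so that `\d` and `\W` are limited to ASCII. The subject strings are invented.

```python
import re

# "[W-]46]" is a two-character class ("W" and "-") followed by the
# literal string "46]".
two_char_class = [
    bool(re.fullmatch(r"[W-]46]", s)) for s in ("W46]", "-46]", "X46]")
]

# With the "]" escaped, "[W-\]46]" is one class: the range W..] plus "4" and "6".
single_class = re.findall(r"[W-\]46]", "W]X4a6")

# Character types may appear inside a class and add what they match to it.
hex_digits = re.fullmatch(r"[\dABCDEF]+", "7F3A", re.ASCII)

# "[^\W_]" is any letter or digit, but not the underscore.
alnum_only = re.findall(r"[^\W_]", "a_1", re.ASCII)

# A class such as "[^a]" matches a newline, whatever options are set.
newline_in_class = re.fullmatch(r"[^a]", "\n")

print(two_char_class, single_class, alnum_only)
```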

@@ -1293,16 +1299,16 @@
  <section xml:id="regexp.reference.alternation">
   <title>Alternation</title>
   <para>
-   Vertical bar characters are  used  to  separate  alternative
+   Vertical bar characters are used to separate alternative
    patterns. For example, the pattern
    <literal>gilbert|sullivan</literal>
    matches either "gilbert" or "sullivan". Any number of alternatives
-   may  appear,  and an empty alternative is permitted
-   (matching the empty string).   The  matching  process  tries
-   each  alternative in turn, from left to right, and the first
-   one that succeeds is used. If the alternatives are within  a
-   subpattern  (defined  below),  "succeeds" means matching the
-   rest of the main pattern as well as the alternative  in  the
+   may appear, and an empty alternative is permitted
+   (matching the empty string). The matching process tries
+   each alternative in turn, from left to right, and the first
+   one that succeeds is used. If the alternatives are within a
+   subpattern (defined below), "succeeds" means matching the
+   rest of the main pattern as well as the alternative in the
    subpattern.
   </para>
  </section>

...
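The left-to-right rule for alternatives in the hunk above is easiest to see when two alternatives overlap. The sketch below uses Python's `re` module as a stand-in engine (an assumption for illustration), which tries alternatives in the same order. Apart from `gilbert|sullivan`, the patterns and subjects are invented.

```python
import re

# Either alternative can match.
either = [m.group() for m in re.finditer(r"gilbert|sullivan", "gilbert and sullivan")]

# Alternatives are tried left to right and the first one that succeeds
# is used, even when a later alternative would match more text.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern: "a" alone cannot be followed by "c" here, so "ab" is used.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alt = re.fullmatch(r"colou?r(s|)", "color").group(1)

print(either, first_wins, with_rest, repr(empty_alt))
```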

@@ -1317,7 +1323,7 @@
    <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,
    <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>
    and PCRE_DUPNAMES can be changed from within the pattern by
-   a sequence of Perl option letters enclosed between "(?"  and
+   a sequence of Perl option letters enclosed between "(?" and
    ")". The option letters are:
 
    <table>

...

@@ -1346,7 +1352,8 @@
       </row>
       <row>
        <entry><literal>X</literal></entry>
-       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> (no longer supported as of PHP 7.3.0)</entry>
+       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>
+        (no longer supported as of PHP 7.3.0)</entry>
       </row>
       <row>
        <entry><literal>J</literal></entry>

...

@@ -1357,16 +1364,16 @@
    </table>
   </para>
   <para>
-   For example, (?im) sets case-insensitive (caseless), multiline matching. It  is
+   For example, (?im) sets case-insensitive (caseless), multiline matching. It is
    also possible to unset these options by preceding the letter
-   with a hyphen, and a combined setting and unsetting such  as
-   (?im-sx),  which sets <link
+   with a hyphen, and a combined setting and unsetting such as
+   (?im-sx), which sets <link
    linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and
    <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>
    while unsetting <link
    linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and
    <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,
-   is also  permitted. If  a  letter  appears both before and after the
+   is also permitted. If a letter appears both before and after the
    hyphen, the option is unset.
   </para>
   <para>

...

@@ -1376,14 +1383,14 @@
    and "abC".
   </para>
   <para>
-   If an option change occurs inside a subpattern,  the  effect
-   is  different.  This is a change of behaviour in Perl 5.005.
-   An option change inside a subpattern affects only that  part
+   If an option change occurs inside a subpattern, the effect
+   is different. This is a change of behaviour in Perl 5.005.
+   An option change inside a subpattern affects only that part
    of the subpattern that follows it, so
 
    <literal>(a(?i)b)c</literal>
 
-   matches  abc  and  aBc  and  no  other   strings   (assuming <link
+   matches "abc" and "aBc" and no other strings (assuming <link
    linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not
    used). By this means, options can be made to have different settings in
    different parts of the pattern. Any changes made in one alternative do

...

@@ -1392,18 +1399,18 @@
 
    <literal>(a(?i)b|c)</literal>
 
-   matches "ab", "aB", "c", and "C", even though when  matching
+   matches "ab", "aB", "c", and "C", even though when matching
    "C" the first branch is abandoned before the option setting.
-   This is because the effects of  option  settings  happen  at
-   compile  time. There would be some very weird behaviour otherwise.
+   This is because the effects of option settings happen at
+   compile time. There would be some very weird behaviour otherwise.
   </para>
   <para>
    The PCRE-specific options <link
-   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>  and
-   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>   can
+   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and
+   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can
    be changed in the same way as the Perl-compatible options by
-   using the characters U and X  respectively.  The  (?X)  flag
-   setting  is  special in that it must always occur earlier in
+   using the characters U and X respectively. The (?X) flag
+   setting is special in that it must always occur earlier in
    the pattern than any of the additional features it turns on,
    even when it is at top level. It is best put at the start.
   </para>

...

@@ -1412,8 +1419,8 @@

1412

1419

 <section xml:id="regexp.reference.subpatterns">

1413

1420

  <title>Subpatterns</title>

1414

1421

  <para>

1415

   Subpatterns are delimited by parentheses  (round  brackets),

1416

   which can be nested.  Marking part of a pattern as a subpattern

1422

   Subpatterns are delimited by parentheses (round brackets),

1423

   which can be nested. Marking part of a pattern as a subpattern

1417

1424

   does two things:

1418

1425

  </para>

1419

1426

  <orderedlist>

...

@@ -1442,30 +1449,30 @@

1442

1449

1443

1450

   <literal>the ((red|white) (king|queen))</literal>

1444

1451

1445

   the captured substrings are "red king", "red",  and  "king",

1452

   the captured substrings are "red king", "red", and "king",

1446

1453

   and are numbered 1, 2, and 3.

1447

1454

  </para>

1448

1455

  <para>

1449

   The fact that plain parentheses fulfill two functions is  not

1450

   always  helpful.  There are often times when a grouping subpattern

1451

   is required without a capturing requirement.  If  an

1456

   The fact that plain parentheses fulfill two functions is not

1457

   always helpful. There are often times when a grouping subpattern

1458

   is required without a capturing requirement. If an

1452

1459

   opening parenthesis is followed by "?:", the subpattern does

1453

   not do any capturing, and is not counted when computing  the

1460

   not do any capturing, and is not counted when computing the

1454

1461

   number of any subsequent capturing subpatterns. For example,

1455

   if the string "the  white  queen"  is  matched  against  the

1462

   if the string "the white queen" is matched against the

1456

1463

   pattern

1457

1464

1458

1465

   <literal>the ((?:red|white) (king|queen))</literal>

1459

1466

1460

   the captured substrings are "white queen" and  "queen",  and

1461

   are  numbered  1  and 2. The maximum number of captured substrings

1467

   the captured substrings are "white queen" and "queen", and

1468

   are numbered 1 and 2. The maximum number of captured substrings

1462

1469

   is 65535. It may not be possible to compile such large patterns,

1463

1470

   however, depending on the configuration options of libpcre.

1464

1471

  </para>
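  <para>
   As a rough illustration of how <literal>?:</literal> changes the numbering,
   the sketch below runs both patterns against the same subject with
   <function>preg_match</function>. The variable names are hypothetical.
  </para>

```php
<?php
$subject = 'the white queen';

// Every pair of plain parentheses captures and is numbered.
preg_match('/the ((red|white) (king|queen))/', $subject, $capturing);

// The (?:...) group still groups the alternatives but is not numbered,
// so "queen" moves from index 3 to index 2.
preg_match('/the ((?:red|white) (king|queen))/', $subject, $nonCapturing);
```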

1465

1472

  <para>

1466

   As a  convenient  shorthand,  if  any  option  settings  are

1467

   required  at  the  start  of a non-capturing subpattern, the

1468

   option letters may appear between the "?" and the ":".  Thus

1473

   As a convenient shorthand, if any option settings are

1474

   required at the start of a non-capturing subpattern, the

1475

   option letters may appear between the "?" and the ":". Thus

1469

1476

   the two patterns

1470

1477

  </para>

1471

1478

...

@@ -1479,10 +1486,10 @@

1479

1486

  </informalexample>

1480

1487

1481

1488

  <para>

1482

   match exactly the same set of strings.  Because  alternative

1483

   branches  are  tried from left to right, and options are not

1484

   reset until the end of the subpattern is reached, an  option

1485

   setting  in  one  branch does affect subsequent branches, so

1489

   match exactly the same set of strings. Because alternative

1490

   branches are tried from left to right, and options are not

1491

   reset until the end of the subpattern is reached, an option

1492

   setting in one branch does affect subsequent branches, so

1486

1493

   the above patterns match "SUNDAY" as well as "Saturday".

1487

1494

  </para>

1488

1495

...

@@ -1511,9 +1518,10 @@

1511

1518

1512

1519

  <para>

1513

1520

   Here <literal>Sun</literal> is stored in backreference 2, while

1514

   backreference 1 is empty. Matching yields <literal>Sat</literal> in

1515

   backreference 1 while backreference 2 does not exist. Changing the pattern

1516

   to use the <literal>(?|</literal> fixes this problem:

1521

   backreference 1 is empty. Matching <literal>Saturday</literal> yields

1522

   <literal>Sat</literal> in backreference 1 while backreference 2 does

1523

   not exist. Changing the pattern to use the <literal>(?|</literal> fixes

1524

   this problem:

1517

1525

  </para>

1518

1526

1519

1527

  <informalexample>

...

@@ -1539,45 +1547,56 @@

1539

1547

    <listitem><simpara>the . metacharacter</simpara></listitem>

1540

1548

    <listitem><simpara>a character class</simpara></listitem>

1541

1549

    <listitem><simpara>a back reference (see next section)</simpara></listitem>

1542

    <listitem><simpara>a parenthesized subpattern (unless it is  an  assertion  -

1550

    <listitem><simpara>a parenthesized subpattern (unless it is an assertion -

1543

1551

     see below)</simpara></listitem>

1544

1552

   </itemizedlist>

1545

1553

  </para>

1546

1554

  <para>

1547

   The general repetition quantifier specifies  a  minimum  and

1548

   maximum  number  of  permitted  matches,  by  giving the two

1549

   numbers in curly brackets (braces), separated  by  a  comma.

1550

   The  numbers  must be less than 65536, and the first must be

1555

   The general repetition quantifier specifies a minimum and

1556

   maximum number of permitted matches, by giving the two

1557

   numbers in curly brackets (braces), separated by a comma.

1558

   The numbers must be less than 65536, and the first must be

1551

1559

   less than or equal to the second. For example:

1552

1560

1553

1561

   <literal>z{2,4}</literal>

1554

1562

1555

   matches "zz", "zzz", or "zzzz". A closing brace on  its  own

1563

   matches "zz", "zzz", or "zzzz". A closing brace on its own

1556

1564

   is not a special character. If the second number is omitted,

1557

   but the comma is present, there is no upper  limit;  if  the

1565

   but the comma is present, there is no upper limit; if the

1558

1566

   second number and the comma are both omitted, the quantifier

1559

1567

   specifies an exact number of required matches. Thus

1560

1568

1561

1569

   <literal>[aeiou]{3,}</literal>

1562

1570

1563

   matches at least 3 successive vowels,  but  may  match  many

1571

   matches at least 3 successive vowels, but may match many

1564

1572

   more, while

1565

1573

1566

1574

   <literal>\d{8}</literal>

1567

1575

1568

   matches exactly 8 digits.  An  opening  curly  bracket  that

1569

   appears  in a position where a quantifier is not allowed, or

1570

   one that does not match the syntax of a quantifier, is taken

1571

   as  a literal character. For example, {,6} is not a quantifier,

1572

   but a literal string of four characters.

1576

   matches exactly 8 digits.

1577

1573

1578

  </para>
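  <para>
   The three quantifier forms described above can be tried out with a short
   sketch such as this one. The sample subjects and the anchors are
   assumptions chosen for illustration.
  </para>

```php
<?php
// {2,4}: between two and four repetitions.
$bounded = [];
foreach (['z', 'zz', 'zzzz', 'zzzzz'] as $candidate) {
    $bounded[$candidate] = preg_match('/^z{2,4}$/', $candidate);
}

// {3,}: at least three repetitions, no upper limit.
preg_match('/[aeiou]{3,}/', 'queueing', $vowels);

// {8}: exactly eight repetitions.
$exact = [
    '20240131' => preg_match('/^\d{8}$/', '20240131'),
    '2024013'  => preg_match('/^\d{8}$/', '2024013'),
];
```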

1579

  <simpara>

1580

   Prior to PHP 8.4.0, an opening curly bracket that

1581

   appears in a position where a quantifier is not allowed, or

1582

   one that does not match the syntax of a quantifier, is taken

1583

   as a literal character. For example, <literal>{,6}</literal>

1584

   is not a quantifier, but a literal string of four characters.

1585

1586

   As of PHP 8.4.0, the PCRE extension is bundled with PCRE2 version 10.44,

1587

   which accepts patterns such as <literal>\d{,8}</literal> and

1588

   interprets them as <literal>\d{0,8}</literal>.

1589

1590

   Further, as of PHP 8.4.0, space characters inside the braces of a quantifier, as in

1591

   <literal>\d{0 , 8}</literal> and <literal>\d{ 0 , 8 }</literal>, are allowed.

1592

  </simpara>

1574

1593

  <para>

1575

   The quantifier {0} is permitted, causing the  expression  to

1576

   behave  as  if the previous item and the quantifier were not

1594

   The quantifier {0} is permitted, causing the expression to

1595

   behave as if the previous item and the quantifier were not

1577

1596

   present.

1578

1597

  </para>

1579

1598

  <para>

1580

   For convenience (and  historical  compatibility)  the  three

1599

   For convenience (and historical compatibility) the three

1581

1600

   most common quantifiers have single-character abbreviations:

1582

1601

1583

1602

   <table>

...

@@ -1601,63 +1620,63 @@

1601

1620

   </table>

1602

1621

  </para>

1603

1622

  <para>

1604

   It is possible to construct infinite loops  by  following  a

1605

   subpattern  that  can  match no characters with a quantifier

1623

   It is possible to construct infinite loops by following a

1624

   subpattern that can match no characters with a quantifier

1606

1625

   that has no upper limit, for example:

1607

1626

1608

1627

   <literal>(a?)*</literal>

1609

1628

  </para>

1610

1629

  <para>

1611

   Earlier versions of Perl and PCRE used to give an  error  at

1612

   compile  time  for such patterns. However, because there are

1613

   cases where this  can  be  useful,  such  patterns  are  now

1614

   accepted,  but  if  any repetition of the subpattern does in

1630

   Earlier versions of Perl and PCRE used to give an error at

1631

   compile time for such patterns. However, because there are

1632

   cases where this can be useful, such patterns are now

1633

   accepted, but if any repetition of the subpattern does in

1615

1634

   fact match no characters, the loop is forcibly broken.

1616

1635

  </para>

1617

1636

  <para>

1618

   By default, the quantifiers  are  "greedy",  that  is,  they

1619

   match  as much as possible (up to the maximum number of permitted

1620

   times), without causing the rest of  the  pattern  to

1637

   By default, the quantifiers are "greedy", that is, they

1638

   match as much as possible (up to the maximum number of permitted

1639

   times), without causing the rest of the pattern to

1621

1640

   fail. The classic example of where this gives problems is in

1622

1641

   trying to match comments in C programs. These appear between

1623

   the  sequences /* and */ and within the sequence, individual

1624

   * and / characters may appear. An attempt to  match  C  comments

1642

   the sequences /* and */ and within the sequence, individual

1643

   * and / characters may appear. An attempt to match C comments

1625

1644

   by applying the pattern

1626

1645

1627

1646

   <literal>/\*.*\*/</literal>

1628

1647

1629

1648

   to the string

1630

1649

1631

   <literal>/* first comment */  not comment  /* second comment */</literal>

1650

   <literal>/* first comment */ not comment /* second comment */</literal>

1632

1651

1633

   fails, because it matches  the  entire  string  due  to  the

1634

   greediness of the .*  item.

1652

   fails, because it matches the entire string due to the

1653

   greediness of the .* item.

1635

1654

  </para>

1636

1655

  <para>

1637

   However, if a quantifier is followed  by  a  question  mark,

1656

   However, if a quantifier is followed by a question mark,

1638

1657

   then it becomes lazy, and instead matches the minimum

1639

1658

   number of times possible, so the pattern

1640

1659

1641

1660

   <literal>/\*.*?\*/</literal>

1642

1661

1643

1662

   does the right thing with the C comments. The meaning of the

1644

   various  quantifiers is not otherwise changed, just the preferred

1645

   number of matches.  Do not confuse this use of

1646

   question  mark  with  its  use as a quantifier in its own right.

1663

   various quantifiers is not otherwise changed, just the preferred

1664

   number of matches. Do not confuse this use of

1665

   question mark with its use as a quantifier in its own right.

1647

1666

   Because it has two uses, it can sometimes appear doubled, as

1648

1667

   in

1649

1668

1650

1669

   <literal>\d??\d</literal>

1651

1670

1652

   which matches one digit by preference, but can match two  if

1671

   which matches one digit by preference, but can match two if

1653

1672

   that is the only way the rest of the pattern matches.

1654

1673

  </para>
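  <para>
   The difference between the greedy and lazy forms can be seen by running
   both comment patterns against the subject used above. This sketch uses
   <literal>#</literal> as the pattern delimiter, an arbitrary choice that
   avoids escaping the slashes.
  </para>

```php
<?php
$subject = '/* first comment */ not comment /* second comment */';

// Greedy .* runs to the last "*/" in the subject.
preg_match('#/\*.*\*/#', $subject, $greedy);

// Lazy .*? stops at the first "*/" that lets the pattern match.
preg_match('#/\*.*?\*/#', $subject, $lazy);
```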

1655

1674

  <para>

1656

1675

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>

1657

   option is set (an option which  is  not

1658

   available  in  Perl)  then the quantifiers are not greedy by

1676

   option is set (an option which is not

1677

   available in Perl) then the quantifiers are not greedy by

1659

1678

   default, but individual ones can be made greedy by following

1660

   them  with  a  question mark. In other words, it inverts the

1679

   them with a question mark. In other words, it inverts the

1661

1680

   default behaviour.

1662

1681

  </para>

1663

1682

  <para>

...

@@ -1669,41 +1688,41 @@

1669

1688

  </para>

1670

1689

  <para>

1671

1690

   When a parenthesized subpattern is quantified with a minimum

1672

   repeat  count  that is greater than 1 or with a limited maximum,

1673

   more store is required for the  compiled  pattern,  in

1691

   repeat count that is greater than 1 or with a limited maximum,

1692

   more memory is required for the compiled pattern, in

1674

1693

   proportion to the size of the minimum or maximum.

1675

1694

  </para>

1676

1695

  <para>

1677

   If a pattern starts with .* or  .{0,}  and  the  <link

1696

   If a pattern starts with .* or .{0,} and the <link

1678

1697

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1679

1698

   option (equivalent to Perl's /s) is set, thus allowing the .

1680

   to match newlines, then the pattern is implicitly  anchored,

1699

   to match newlines, then the pattern is implicitly anchored,

1681

1700

   because whatever follows will be tried against every character

1682

   position in the subject string, so there is no point  in

1683

   retrying  the overall match at any position after the first.

1701

   position in the subject string, so there is no point in

1702

   retrying the overall match at any position after the first.

1684

1703

   PCRE treats such a pattern as though it were preceded by \A.

1685

   In  cases where it is known that the subject string contains

1704

   In cases where it is known that the subject string contains

1686

1705

   no newlines, it is worth setting <link

1687

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  when  the

1706

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the

1688

1707

   pattern begins with .* in order to

1689

1708

   obtain this optimization, or

1690

1709

   alternatively using ^ to indicate anchoring explicitly.

1691

1710

  </para>

1692

1711

  <para>

1693

   When a capturing subpattern is repeated, the value  captured

1712

   When a capturing subpattern is repeated, the value captured

1694

1713

   is the substring that matched the final iteration. For example, after

1695

1714

1696

1715

   <literal>(tweedle[dume]{3}\s*)+</literal>

1697

1716

1698

   has matched "tweedledum tweedledee" the value  of  the  captured

1699

   substring  is  "tweedledee".  However,  if  there are

1700

   nested capturing  subpatterns,  the  corresponding  captured

1701

   values  may  have been set in previous iterations. For example,

1717

   has matched "tweedledum tweedledee" the value of the captured

1718

   substring is "tweedledee". However, if there are

1719

   nested capturing subpatterns, the corresponding captured

1720

   values may have been set in previous iterations. For example,

1702

1721

   after

1703

1722

1704

1723

   <literal>/(a|(b))+/</literal>

1705

1724

1706

   matches "aba" the value of the second captured substring  is

1725

   matches "aba" the value of the second captured substring is

1707

1726

   "b".

1708

1727

  </para>
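  <para>
   A minimal sketch of both cases follows, using
   <function>preg_match</function> with hypothetical variable names. It shows
   the outer group keeping only its final iteration, while the nested group
   keeps a value set in an earlier one.
  </para>

```php
<?php
// Group 1 is overwritten on each iteration; only the last value survives.
preg_match('/(tweedle[dume]{3}\s*)+/', 'tweedledum tweedledee', $last);

// Group 2 took part only in the second iteration ("b") and keeps that
// value even though the third iteration matched "a".
preg_match('/(a|(b))+/', 'aba', $nested);
```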

1709

1728

 </section>

...

@@ -1711,74 +1730,74 @@

1711

1730

 <section xml:id="regexp.reference.back-references">

1712

1731

  <title>Back references</title>

1713

1732

  <para>

1714

   Outside a character class, a backslash followed by  a  digit

1715

   greater  than  0  (and  possibly  further  digits) is a back

1716

   reference to a capturing subpattern  earlier  (i.e.  to  its

1717

   left)  in  the  pattern,  provided there have been that many

1733

   Outside a character class, a backslash followed by a digit

1734

   greater than 0 (and possibly further digits) is a back

1735

   reference to a capturing subpattern earlier (i.e. to its

1736

   left) in the pattern, provided there have been that many

1718

1737

   previous capturing left parentheses.

1719

1738

  </para>

1720

1739

  <para>

1721

   However, if the decimal number following  the  backslash  is

1722

   less  than  10,  it is always taken as a back reference, and

1723

   causes an error only if there are not  that  many  capturing

1724

   left  parentheses in the entire pattern. In other words, the

1725

   parentheses that are referenced need not be to the  left  of

1726

   the  reference  for  numbers  less  than 10.

1740

   However, if the decimal number following the backslash is

1741

   less than 10, it is always taken as a back reference, and

1742

   causes an error only if there are not that many capturing

1743

   left parentheses in the entire pattern. In other words, the

1744

   parentheses that are referenced need not be to the left of

1745

   the reference for numbers less than 10.

1727

1746

   A "forward back reference" can make sense when a repetition

1728

1747

   is involved and the subpattern to the right has participated

1729

1748

   in an earlier iteration. See the section

1730

   entitled "Backslash" above for further details of  the  handling

1749

   on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling

1731

1750

   of digits following a backslash.

1732

1751

  </para>

1733

1752

  <para>

1734

   A back reference matches whatever actually matched the  capturing

1753

   A back reference matches whatever actually matched the capturing

1735

1754

   subpattern in the current subject string, rather than

1736

1755

   anything matching the subpattern itself. So the pattern

1737

1756

1738

1757

   <literal>(sens|respons)e and \1ibility</literal>

1739

1758

1740

   matches "sense and sensibility" and "response and  responsibility",

1741

   but  not  "sense  and  responsibility". If case-sensitive (caseful)

1759

   matches "sense and sensibility" and "response and responsibility",

1760

   but not "sense and responsibility". If case-sensitive (caseful)

1742

1761

   matching is in force at the time of the back reference, then

1743

1762

   the case of letters is relevant. For example,

1744

1763

1745

1764

   <literal>((?i)rah)\s+\1</literal>

1746

1765

1747

   matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even

1748

   though  the  original  capturing subpattern is matched

1766

   matches "rah rah" and "RAH RAH", but not "RAH rah", even

1767

   though the original capturing subpattern is matched

1749

1768

   case-insensitively (caselessly).

1750

1769

  </para>
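  <para>
   The two back reference examples above can be exercised as in the sketch
   below. The anchors and the helper closure <literal>$check</literal> are
   additions for illustration only.
  </para>

```php
<?php
// Hypothetical helper: map each subject to the preg_match() result.
$check = function (string $pattern, array $subjects): array {
    $results = [];
    foreach ($subjects as $subject) {
        $results[$subject] = preg_match($pattern, $subject);
    }
    return $results;
};

// \1 must repeat the text that group 1 actually matched.
$titles = $check('/^(sens|respons)e and \1ibility$/', [
    'sense and sensibility',
    'response and responsibility',
    'sense and responsibility',
]);

// (?i) ends with the group, so \1 is compared case-sensitively.
$cheers = $check('/^((?i)rah)\s+\1$/', ['rah rah', 'RAH RAH', 'RAH rah']);
```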

1751

1770

  <para>

1752

   There may be more than one back reference to the  same  subpattern.

1753

   If  a  subpattern  has not actually been used in a

1754

   particular match, then any  back  references  to  it  always

1771

   There may be more than one back reference to the same subpattern.

1772

   If a subpattern has not actually been used in a

1773

   particular match, then any back references to it always

1755

1774

   fail. For example, the pattern

1756

1775

1757

1776

   <literal>(a|(bc))\2</literal>

1758

1777

1759

   always fails if it starts to match  "a"  rather  than  "bc".

1760

   Because  there  may  be up to 99 back references, all digits

1761

   following the backslash are taken as  part  of  a  potential

1778

   always fails if it starts to match "a" rather than "bc".

1779

   Because there may be up to 99 back references, all digits

1780

   following the backslash are taken as part of a potential

1762

1781

   back reference number. If the pattern continues with a digit

1763

1782

   character, then some delimiter must be used to terminate the

1764

1783

   back reference. If the <link

1765

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>  option

1766

   is set, this can be whitespace.  Otherwise an empty comment can be used.

1784

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option

1785

   is set, this can be whitespace. Otherwise an empty comment can be used.

1767

1786

  </para>

1768

1787

  <para>

1769

1788

   A back reference that occurs inside the parentheses to which

1770

   it  refers  fails when the subpattern is first used, so, for

1771

   example, (a\1) never matches.  However, such references  can

1789

   it refers fails when the subpattern is first used, so, for

1790

   example, (a\1) never matches. However, such references can

1772

1791

   be useful inside repeated subpatterns. For example, the pattern

1773

1792

1774

1793

   <literal>(a|b\1)+</literal>

1775

1794

1776

   matches any number of "a"s and also "aba", "ababba" etc.  At

1795

   matches any number of "a"s and also "aba", "ababba" etc. At

1777

1796

   each iteration of the subpattern, the back reference matches

1778

   the character string corresponding to  the  previous  iteration.

1797

   the character string corresponding to the previous iteration.

1779

1798

   In order for this to work, the pattern must be such

1780

   that the first iteration does not need  to  match  the  back

1781

   reference.  This  can  be  done using alternation, as in the

1799

   that the first iteration does not need to match the back

1800

   reference. This can be done using alternation, as in the

1782

1801

   example above, or by a quantifier with a minimum of zero.

1783

1802

  </para>

1784

1803

  <para>

...

@@ -1813,18 +1832,18 @@

1813

1832

 <section xml:id="regexp.reference.assertions">

1814

1833

  <title>Assertions</title>

1815

1834

  <para>

1816

   An assertion is  a  test  on  the  characters  following  or

1817

   preceding  the current matching point that does not actually

1818

   consume any characters. The simple assertions coded  as  \b,

1819

   \B,  \A,  \Z,  \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated

1820

   assertions are coded as  subpatterns.  There  are  two

1821

   kinds:  those that <emphasis>look ahead</emphasis> of the current position in the

1835

   An assertion is a test on the characters following or

1836

   preceding the current matching point that does not actually

1837

   consume any characters. The simple assertions coded as \b,

1838

   \B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated

1839

   assertions are coded as subpatterns. There are two

1840

   kinds: those that <emphasis>look ahead</emphasis> of the current position in the

1822

1841

   subject string, and those that <emphasis>look behind</emphasis> it.

1823

1842

  </para>

1824

1843

  <para>

1825

1844

   An assertion subpattern is matched in the normal way, except

1826

   that  it  does not cause the current matching position to be

1827

   changed. <emphasis>Lookahead</emphasis> assertions start with  (?=  for  positive

1845

   that it does not cause the current matching position to be

1846

   changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive

1828

1847

   assertions and (?! for negative assertions. For example,

1829

1848

1830

1849

   <literal>\w+(?=;)</literal>

...

@@ -1834,27 +1853,27 @@

1834

1853

1835

1854

   <literal>foo(?!bar)</literal>

1836

1855

1837

   matches any occurrence of "foo"  that  is  not  followed  by

1856

   matches any occurrence of "foo" that is not followed by

1838

1857

   "bar". Note that the apparently similar pattern

1839

1858

1840

1859

   <literal>(?!foo)bar</literal>

1841

1860

1842

   does not find an occurrence of "bar"  that  is  preceded  by

1861

   does not find an occurrence of "bar" that is preceded by

1843

1862

   something other than "foo"; it finds any occurrence of "bar"

1844

   whatsoever, because the assertion  (?!foo)  is  always  &true;

1845

   when  the  next  three  characters  are  "bar". A lookbehind

1863

   whatsoever, because the assertion (?!foo) is always &true;

1864

   when the next three characters are "bar". A lookbehind

1846

1865

   assertion is needed to achieve this effect.

1847

1866

  </para>
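  <para>
   The contrast between the two lookahead patterns can be sketched as
   follows. The subjects are assumptions chosen for illustration, and
   <constant>PREG_OFFSET_CAPTURE</constant> is used only to show where the
   second pattern matches.
  </para>

```php
<?php
// "foo" not followed by "bar".
$notFollowed = [
    'foobaz' => preg_match('/foo(?!bar)/', 'foobaz'),
    'foobar' => preg_match('/foo(?!bar)/', 'foobar'),
];

// (?!foo) is tested at the position where "bar" starts, where the next
// three characters are "bar", so the assertion always succeeds there.
preg_match('/(?!foo)bar/', 'foobar', $found, PREG_OFFSET_CAPTURE);
```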

1848

1867

  <para>

1849

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;=  for  positive  assertions

1868

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions

1850

1869

   and (?&lt;! for negative assertions. For example,

1851

1870

1852

1871

   <literal>(?&lt;!foo)bar</literal>

1853

1872

1854

   does find an occurrence of "bar" that  is  not  preceded  by

1873

   does find an occurrence of "bar" that is not preceded by

1855

1874

   "foo". The contents of a lookbehind assertion are restricted

1856

   such that all the strings  it  matches  must  have  a  fixed

1857

   length.  However, if there are several alternatives, they do

1875

   such that all the strings it matches must have a fixed

1876

   length. However, if there are several alternatives, they do

1858

1877

   not all have to have the same fixed length. Thus

1859

1878

1860

1879

   <literal>(?&lt;=bullock|donkey)</literal>

...

@@ -1863,51 +1882,51 @@

1863

1882

1864

1883

   <literal>(?&lt;!dogs?|cats?)</literal>

1865

1884

1866

   causes an error at compile time. Branches  that  match  different

1885

   causes an error at compile time. Branches that match different

1867

1886

   length strings are permitted only at the top level of

1868

   a lookbehind assertion. This is an extension  compared  with

1869

   Perl  5.005,  which  requires all branches to match the same

1887

   a lookbehind assertion. This is an extension compared with

1888

   Perl 5.005, which requires all branches to match the same

1870

1889

   length of string. An assertion such as

1871

1890

1872

1891

   <literal>(?&lt;=ab(c|de))</literal>

1873

1892

1874

   is not permitted, because its single  top-level  branch  can

1893

   is not permitted, because its single top-level branch can

1875

1894

   match two different lengths, but it is acceptable if rewritten

1876

1895

   to use two top-level branches:

1877

1896

1878

1897

   <literal>(?&lt;=abc|abde)</literal>

1879

1898

1880

   The implementation of lookbehind  assertions  is,  for  each

1881

   alternative,  to  temporarily move the current position back

1882

   by the fixed width and then  try  to  match.  If  there  are

1883

   insufficient  characters  before  the  current position, the

1884

   match is deemed to fail.  Lookbehinds  in  conjunction  with

1885

   once-only  subpatterns can be particularly useful for matching

1886

   at the ends of strings; an example is given at  the  end

1899

   The implementation of lookbehind assertions is, for each

1900

   alternative, to temporarily move the current position back

1901

   by the fixed width and then try to match. If there are

1902

   insufficient characters before the current position, the

1903

   match is deemed to fail. Lookbehinds in conjunction with

1904

   once-only subpatterns can be particularly useful for matching

1905

   at the ends of strings; an example is given at the end

1887

1906

   of the section on once-only subpatterns.

1888

1907

  </para>

1889

1908

  <para>

1890

   Several assertions (of any sort) may  occur  in  succession.

1909

   Several assertions (of any sort) may occur in succession.

1891

1910

   For example,

1892

1911

1893

1912

   <literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>

1894

1913

1895

   matches "foo" preceded by three digits that are  not  "999".

1896

   Notice  that each of the assertions is applied independently

1897

   at the same point in the subject string. First  there  is  a

1898

   check  that  the  previous  three characters are all digits,

1914

   matches "foo" preceded by three digits that are not "999".

1915

   Notice that each of the assertions is applied independently

1916

   at the same point in the subject string. First there is a

1917

   check that the previous three characters are all digits,

1899

1918

   then there is a check that the same three characters are not

1900

   "999".   This  pattern  does not match "foo" preceded by six

1919

   "999". This pattern does not match "foo" preceded by six

1901

1920

   characters, the first three of which are digits and the last three

1902

   of  which  are  not  "999".  For  example,  it doesn't match

1921

   of which are not "999". For example, it doesn't match

1903

1922

   "123abcfoo". A pattern to do that is

1904

1923

1905

1924

   <literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>

1906

1925

  </para>

1907

1926

  <para>

1908

   This time the first assertion looks  at  the  preceding  six

1909

   characters,  checking  that  the first three are digits, and

1910

   then the second assertion checks that  the  preceding  three

1927

   This time the first assertion looks at the preceding six

1928

   characters, checking that the first three are digits, and

1929

   then the second assertion checks that the preceding three

1911

1930

   characters are not "999".

1912

1931

  </para>
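  <para>
   The behaviour of the two stacked lookbehind patterns can be compared with
   a sketch like this one. The sample subjects other than
   <literal>123abcfoo</literal> are assumptions added for illustration.
  </para>

```php
<?php
// Both assertions inspect the same three characters before "foo".
$three = '/(?<=\d{3})(?<!999)foo/';
$threeResults = [
    '123foo'    => preg_match($three, '123foo'),
    '999foo'    => preg_match($three, '999foo'),
    '123abcfoo' => preg_match($three, '123abcfoo'),
];

// The first assertion now inspects six characters, the second still three.
$six = '/(?<=\d{3}...)(?<!999)foo/';
$sixResults = [
    '123abcfoo' => preg_match($six, '123abcfoo'),
    '123999foo' => preg_match($six, '123999foo'),
    'abcfoo'    => preg_match($six, 'abcfoo'),
];
```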

1913

1932

  <para>

...

@@ -1915,26 +1934,26 @@

1915

1934

1916

1935

   <literal>(?&lt;=(?&lt;!foo)bar)baz</literal>

1917

1936

1918

   matches an occurrence of "baz" that  is  preceded  by  "bar"

1937

   matches an occurrence of "baz" that is preceded by "bar"

1919

1938

   which in turn is not preceded by "foo", while

1920

1939

1921

1940

   <literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>

1922

1941

1923

   is another pattern which matches  "foo"  preceded  by  three

1942

   is another pattern which matches "foo" preceded by three

1924

1943

   digits and any three characters that are not "999".

1925

1944

  </para>

1926

1945

  <para>

1927

1946

   Assertion subpatterns are not capturing subpatterns, and may

1928

   not  be  repeated,  because  it makes no sense to assert the

1929

   same thing several times. If any kind of assertion  contains

1930

   capturing  subpatterns  within it, these are counted for the

1947

   not be repeated, because it makes no sense to assert the

1948

   same thing several times. If any kind of assertion contains

1949

   capturing subpatterns within it, these are counted for the

1931

1950

   purposes of numbering the capturing subpatterns in the whole

1932

   pattern.   However,  substring capturing is carried out only

1933

   for positive assertions, because it does not make sense  for

1951

   pattern. However, substring capturing is carried out only

1952

   for positive assertions, because it does not make sense for

1934

1953

   negative assertions.

1935

1954

  </para>

1936

1955

  <para>

1937

   Assertions count towards the maximum  of  200  parenthesized

1956

   Assertions count towards the maximum of 200 parenthesized

1938

1957

   subpatterns.

1939

1958

  </para>
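  <para>
   The nested lookbehind patterns described above can be tried out with
   <function>preg_match</function>. The following is a minimal sketch; the
   subject strings are illustrative assumptions chosen for this example.
  </para>
  <para>
   <example>
    <title>Nested lookbehind assertions (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

/* "baz" preceded by "bar", which in turn is not preceded by "foo" */
var_dump(preg_match('/(?<=(?<!foo)bar)baz/', 'barbaz'));
var_dump(preg_match('/(?<=(?<!foo)bar)baz/', 'foobarbaz'));

/* "foo" preceded by three digits and three characters that are not "999" */
var_dump(preg_match('/(?<=\d{3}...(?<!999))foo/', '123abcfoo'));
var_dump(preg_match('/(?<=\d{3}...(?<!999))foo/', '123999foo'));
]]>
    </programlisting>
    &example.outputs;
    <screen>
<![CDATA[
int(1)
int(0)
int(1)
int(0)
]]>
    </screen>
   </example>
  </para>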

1940

1959

 </section>

...

@@ -1942,17 +1961,17 @@

1942

1961

 <section xml:id="regexp.reference.onlyonce">

1943

1962

  <title>Once-only subpatterns</title>

1944

1963

  <para>

1945

   With both maximizing and minimizing repetition,  failure  of

1946

   what  follows  normally  causes  the repeated item to be

1964

   With both maximizing and minimizing repetition, failure of

1965

   what follows normally causes the repeated item to be

1947

1966

   re-evaluated to see if a different number of repeats allows the

1948

   rest  of  the  pattern  to  match. Sometimes it is useful to

1949

   prevent this, either to change the nature of the  match,  or

1950

   to  cause  it fail earlier than it otherwise might, when the

1951

   author of the pattern knows there is no  point  in  carrying

1967

   rest of the pattern to match. Sometimes it is useful to

1968

   prevent this, either to change the nature of the match, or

1969

   to cause it to fail earlier than it otherwise might, when the

1970

   author of the pattern knows there is no point in carrying

1952

1971

   on.

1953

1972

  </para>

1954

1973

  <para>

1955

   Consider, for example, the pattern \d+foo  when  applied  to

1974

   Consider, for example, the pattern \d+foo when applied to

1956

1975

   the subject line

1957

1976

1958

1977

   <literal>123456bar</literal>

...

@@ -1960,108 +1979,108 @@

1960

1979

  <para>

1961

1980

   After matching all 6 digits and then failing to match "foo",

1962

1981

   the normal action of the matcher is to try again with only 5

1963

   digits matching the \d+ item, and then with 4,  and  so  on,

1982

   digits matching the \d+ item, and then with 4, and so on,

1964

1983

   before ultimately failing. Once-only subpatterns provide the

1965

   means for specifying that once a portion of the pattern  has

1966

   matched,  it  is  not to be re-evaluated in this way, so the

1967

   matcher would give up immediately on failing to match  "foo"

1968

   the  first  time.  The  notation  is another kind of special

1984

   means for specifying that once a portion of the pattern has

1985

   matched, it is not to be re-evaluated in this way, so the

1986

   matcher would give up immediately on failing to match "foo"

1987

   the first time. The notation is another kind of special

1969

1988

   parenthesis, starting with (?&gt; as in this example:

1970

1989

1971

1990

   <literal>(?&gt;\d+)foo</literal>

1972

1991

  </para>

1973

1992

  <para>

1974

   This kind of parenthesis "locks up" the  part of the pattern

1975

   it  contains once it has matched, and a failure further into

1976

   the pattern is prevented from backtracking  into  it.

1977

   Backtracking  past  it to previous items, however, works as normal.

1993

   This kind of parenthesis "locks up" the part of the pattern

1994

   it contains once it has matched, and a failure further into

1995

   the pattern is prevented from backtracking into it.

1996

   Backtracking past it to previous items, however, works as normal.

1978

1997

  </para>

1979

1998

  <para>

1980

1999

   An alternative description is that a subpattern of this type

1981

   matches  the  string  of  characters that an identical standalone

2000

   matches the string of characters that an identical standalone

1982

2001

   pattern would match, if anchored at the current point

1983

2002

   in the subject string.

1984

2003

  </para>

1985

2004

  <para>

1986

   Once-only subpatterns are not capturing subpatterns.  Simple

1987

   cases  such as the above example can be thought of as a maximizing

1988

   repeat that must  swallow  everything  it  can.  So,

2005

   Once-only subpatterns are not capturing subpatterns. Simple

2006

   cases such as the above example can be thought of as a maximizing

2007

   repeat that must swallow everything it can. So,

1989

2008

   while both \d+ and \d+? are prepared to adjust the number of

1990

   digits they match in order to make the rest of  the  pattern

2009

   digits they match in order to make the rest of the pattern

1991

2010

   match, (?&gt;\d+) can only match an entire sequence of digits.

1992

2011

  </para>

1993

2012

  <para>

1994

   This construction can of course contain arbitrarily  complicated

2013

   This construction can of course contain arbitrarily complicated

1995

2014

   subpatterns, and it can be nested.

1996

2015

  </para>

1997

2016

  <para>

1998

2017

   Once-only subpatterns can be used in conjunction with

1999

   lookbehind assertions  to specify efficient matching at the end

2018

   lookbehind assertions to specify efficient matching at the end

2000

2019

   of the subject string. Consider a simple pattern such as

2001

2020

2002

2021

   <literal>abcd$</literal>

2003

2022

2004

   when applied to a long string which does not match.  Because

2005

   matching  proceeds  from  left  to right, PCRE will look for

2023

   when applied to a long string which does not match. Because

2024

   matching proceeds from left to right, PCRE will look for

2006

2025

   each "a" in the subject and then see if what follows matches

2007

2026

   the rest of the pattern. If the pattern is specified as

2008

2027

2009

2028

   <literal>^.*abcd$</literal>

2010

2029

2011

   then the initial .* matches the entire string at first,  but

2012

   when  this  fails  (because  there  is no following "a"), it

2030

   then the initial .* matches the entire string at first, but

2031

   when this fails (because there is no following "a"), it

2013

2032

   backtracks to match all but the last character, then all but

2014

   the  last  two  characters, and so on. Once again the search

2015

   for "a" covers the entire string, from right to left, so  we

2033

   the last two characters, and so on. Once again the search

2034

   for "a" covers the entire string, from right to left, so we

2016

2035

   are no better off. However, if the pattern is written as

2017

2036

2018

2037

   <literal>^(?>.*)(?&lt;=abcd)</literal>

2019

2038

2020

   then there can be no backtracking for the .*  item;  it  can

2021

   match  only  the  entire  string.  The subsequent lookbehind

2039

   then there can be no backtracking for the .* item; it can

2040

   match only the entire string. The subsequent lookbehind

2022

2041

   assertion does a single test on the last four characters. If

2023

   it  fails,  the  match  fails immediately. For long strings,

2042

   it fails, the match fails immediately. For long strings,

2024

2043

   this approach makes a significant difference to the processing time.

2025

2044

  </para>

2026

2045

  <para>

2027

2046

   When a pattern contains an unlimited repeat inside a subpattern

2028

2047

   that can itself be repeated an unlimited number of

2029

   times, the use of a once-only subpattern is the only way  to

2030

   avoid  some  failing matches taking a very long time indeed.

2048

   times, the use of a once-only subpattern is the only way to

2049

   avoid some failing matches taking a very long time indeed.

2031

2050

   The pattern

2032

2051

2033

2052

   <literal>(\D+|&lt;\d+>)*[!?]</literal>

2034

2053

2035

   matches an unlimited number of substrings that  either  consist

2036

   of  non-digits,  or digits enclosed in &lt;>, followed by

2054

   matches an unlimited number of substrings that either consist

2055

   of non-digits, or digits enclosed in &lt;>, followed by

2037

2056

   either ! or ?. When it matches, it runs quickly. However, if

2038

2057

   it is applied to

2039

2058

2040

2059

   <literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>

2041

2060

2042

   it takes a long  time  before  reporting  failure.  This  is

2061

   it takes a long time before reporting failure. This is

2043

2062

   because the string can be divided between the two repeats in

2044

2063

   a large number of ways, and all have to be tried. (The example

2045

   used  [!?]  rather  than a single character at the end,

2046

   because both PCRE and Perl have an optimization that  allows

2047

   for  fast  failure  when  a  single  character is used. They

2048

   remember the last single character that is  required  for  a

2049

   match,  and  fail early if it is not present in the string.)

2064

   used [!?] rather than a single character at the end,

2065

   because both PCRE and Perl have an optimization that allows

2066

   for fast failure when a single character is used. They

2067

   remember the last single character that is required for a

2068

   match, and fail early if it is not present in the string.)

2050

2069

   If the pattern is changed to

2051

2070

2052

2071

   <literal>((?>\D+)|&lt;\d+>)*[!?]</literal>

2053

2072

2054

   sequences of non-digits cannot be broken, and  failure  happens quickly.

2073

   sequences of non-digits cannot be broken, and failure happens quickly.

2055

2074

  </para>
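  <para>
   The effect of a once-only subpattern can be observed from PHP. The
   following is a minimal sketch rather than a benchmark: the patterns ending
   in <literal>6bar</literal> and the subjects for the lookbehind pattern are
   assumptions added for illustration, and they show the change in match
   result rather than the difference in processing time.
  </para>
  <para>
   <example>
    <title>Once-only subpatterns (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

$subject = '123456bar';

/* Both patterns fail, but the once-only version does not retry
   shorter runs of digits before giving up */
var_dump(preg_match('/\d+foo/', $subject));
var_dump(preg_match('/(?>\d+)foo/', $subject));

/* \d+ can give back the final "6"; (?>\d+) cannot */
var_dump(preg_match('/\d+6bar/', $subject));
var_dump(preg_match('/(?>\d+)6bar/', $subject));

/* A single lookbehind test on the last four characters of the subject */
var_dump(preg_match('/^(?>.*)(?<=abcd)/', 'xyzabcd'));
var_dump(preg_match('/^(?>.*)(?<=abcd)/', 'abcdxyz'));
]]>
    </programlisting>
    &example.outputs;
    <screen>
<![CDATA[
int(0)
int(0)
int(1)
int(0)
int(1)
int(0)
]]>
    </screen>
   </example>
  </para>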

2056

2075

 </section>

2057

2076

2058

2077

 <section xml:id="regexp.reference.conditional">

2059

2078

  <title>Conditional subpatterns</title>

2060

2079

  <para>

2061

   It is possible to cause the matching process to obey a  subpattern

2062

   conditionally  or to choose between two alternative

2063

   subpatterns, depending on the result  of  an  assertion,  or

2064

   whether  a previous capturing subpattern matched or not. The

2080

   It is possible to cause the matching process to obey a subpattern

2081

   conditionally or to choose between two alternative

2082

   subpatterns, depending on the result of an assertion, or

2083

   whether a previous capturing subpattern matched or not. The

2065

2084

   two possible forms of conditional subpattern are

2066

2085

  </para>

2067

2086

...

@@ -2075,39 +2094,39 @@

2075

2094

  </informalexample>

2076

2095

  <para>

2077

2096

   If the condition is satisfied, the yes-pattern is used; otherwise

2078

   the  no-pattern  (if  present) is used. If there are

2097

   the no-pattern (if present) is used. If there are

2079

2098

   more than two alternatives in the subpattern, a compile-time

2080

2099

   error occurs.

2081

2100

  </para>

2082

2101

  <para>

2083

   There are two kinds of condition. If the  text  between  the

2084

   parentheses  consists  of  a  sequence  of  digits, then the

2085

   condition is satisfied if the capturing subpattern  of  that

2086

   number  has  previously matched. Consider the following pattern,

2087

   which contains non-significant white space to make  it

2088

   more  readable  (assume  the  <link

2102

   There are two kinds of condition. If the text between the

2103

   parentheses consists of a sequence of digits, then the

2104

   condition is satisfied if the capturing subpattern of that

2105

   number has previously matched. Consider the following pattern,

2106

   which contains non-significant white space to make it

2107

   more readable (assume the <link

2089

2108

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2090

   option)  and to divide it into three parts for ease of discussion:

2109

   option) and to divide it into three parts for ease of discussion:

2091

2110

  </para>

2092

2111

  <informalexample>

2093

2112

   <programlisting>

2094

2113

<![CDATA[

2095

( \( )?    [^()]+    (?(1) \) )

2114

( \( )? [^()]+ (?(1) \) )

2096

2115

]]>

2097

2116

   </programlisting>

2098

2117

  </informalexample>

2099

2118

  <para>

2100

   The first part matches an optional opening parenthesis,  and

2101

   if  that character is present, sets it as the first captured

2102

   substring. The second part matches one  or  more  characters

2103

   that  are  not  parentheses. The third part is a conditional

2104

   subpattern that tests whether the first set  of  parentheses

2105

   matched  or  not.  If  they did, that is, if subject started

2106

   with an opening parenthesis, the condition is &true;,  and  so

2107

   the  yes-pattern  is  executed  and a closing parenthesis is

2108

   required. Otherwise, since no-pattern is  not  present,  the

2109

   subpattern  matches  nothing.  In  other words, this pattern

2110

   matches a sequence of non-parentheses,  optionally  enclosed

2119

   The first part matches an optional opening parenthesis, and

2120

   if that character is present, sets it as the first captured

2121

   substring. The second part matches one or more characters

2122

   that are not parentheses. The third part is a conditional

2123

   subpattern that tests whether the first set of parentheses

2124

   matched or not. If they did, that is, if the subject started

2125

   with an opening parenthesis, the condition is &true;, and so

2126

   the yes-pattern is executed and a closing parenthesis is

2127

   required. Otherwise, since no-pattern is not present, the

2128

   subpattern matches nothing. In other words, this pattern

2129

   matches a sequence of non-parentheses, optionally enclosed

2111

2130

   in parentheses.

2112

2131

  </para>

2113

2132

  <para>

...

@@ -2116,10 +2135,10 @@

2116

2135

   level", the condition is false.

2117

2136

  </para>

2118

2137

  <para>

2119

   If the condition is not a sequence of digits or (R), it must be  an

2120

   assertion.  This  may be a positive or negative lookahead or

2121

   lookbehind assertion. Consider this pattern, again  containing

2122

   non-significant  white space, and with the two alternatives on

2138

   If the condition is not a sequence of digits or (R), it must be an

2139

   assertion. This may be a positive or negative lookahead or

2140

   lookbehind assertion. Consider this pattern, again containing

2141

   non-significant white space, and with the two alternatives on

2123

2142

   the second line:

2124

2143

  </para>

2125

2144

...

@@ -2127,18 +2146,18 @@

2127

2146

   <programlisting>

2128

2147

<![CDATA[

2129

2148

(?(?=[^a-z]*[a-z])

2130

\d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

2149

\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

2131

2150

]]>

2132

2151

   </programlisting>

2133

2152

  </informalexample>

2134

2153

  <para>

2135

2154

   The condition is a positive lookahead assertion that matches

2136

2155

   an optional sequence of non-letters followed by a letter. In

2137

   other words, it tests for  the  presence  of  at  least  one

2138

   letter  in the subject. If a letter is found, the subject is

2139

   matched against  the  first  alternative;  otherwise  it  is

2140

   matched  against the second. This pattern matches strings in

2141

   one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are

2156

   other words, it tests for the presence of at least one

2157

   letter in the subject. If a letter is found, the subject is

2158

   matched against the first alternative; otherwise it is

2159

   matched against the second. This pattern matches strings in

2160

   one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are

2142

2161

   letters and dd are digits.

2143

2162

  </para>
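  <para>
   Both kinds of condition can be exercised with
   <function>preg_match</function>. In the following minimal sketch, the
   <literal>^</literal> and <literal>$</literal> anchors and the subject
   strings are assumptions added for illustration, so that each subject has
   to match as a whole; the <literal>x</literal> modifier enables
   PCRE_EXTENDED.
  </para>
  <para>
   <example>
    <title>Conditional subpatterns (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

/* Condition on capturing subpattern 1: a closing parenthesis is
   required only if an opening parenthesis was matched */
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';
foreach (['abc', '(abc)', '(abc', 'abc)'] as $subject) {
    var_dump(preg_match($pattern, $subject));
}

/* Condition is a lookahead assertion: dd-aaa-dd if the subject
   contains a letter, dd-dd-dd otherwise */
$pattern = '/^ (?(?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) $/x';
foreach (['12-abc-34', '12-34-56', '12-ab-34'] as $subject) {
    var_dump(preg_match($pattern, $subject));
}
]]>
    </programlisting>
    &example.outputs;
    <screen>
<![CDATA[
int(1)
int(1)
int(0)
int(0)
int(1)
int(1)
int(0)
]]>
    </screen>
   </example>
  </para>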

2144

2163

 </section>

...

@@ -2146,31 +2165,66 @@

2146

2165

 <section xml:id="regexp.reference.comments">

2147

2166

  <title>Comments</title>

2148

2167

  <para>

2149

   The  sequence  (?#  marks  the  start  of  a  comment  which

2150

   continues   up  to  the  next  closing  parenthesis.  Nested

2168

   The sequence (?# marks the start of a comment which

2169

   continues up to the next closing parenthesis. Nested

2151

2170

   parentheses are not permitted. The characters that make up a

2152

2171

   comment play no part in the pattern matching at all.

2153

2172

  </para>

2154

2173

  <para>

2155

2174

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2156

   option is set, an unescaped # character outside  a character class

2175

   option is set, an unescaped # character outside a character class

2157

2176

   introduces a comment that continues up to the next newline character

2158

2177

   in the pattern.

2159

2178

  </para>

2179

  <para>

2180

   <example>

2181

    <title>Usage of comments in PCRE pattern</title>

2182

    <programlisting role="php">

2183

<![CDATA[

2184

<?php

2185

2186

$subject = 'test';

2187

2188

/* (?# can be used to add comments without enabling PCRE_EXTENDED */

2189

$match = preg_match('/te(?# this is a comment)st/', $subject);

2190

var_dump($match);

2191

2192

/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */

2193

$match = preg_match('/te   #~~~~

2194

st/', $subject);

2195

var_dump($match);

2196

2197

/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything

2198

   that follows an unescaped # on the same line is ignored */

2199

$match = preg_match('/te    #~~~~

2200

st/x', $subject);

2201

var_dump($match);

2202

]]>

2203

    </programlisting>

2204

    &example.outputs;

2205

    <screen>

2206

<![CDATA[

2207

int(1)

2208

int(0)

2209

int(1)

2210

]]>

2211

    </screen>

2212

   </example>

2213

  </para>

2160

2214

 </section>

2161

2215

2162

2216

 <section xml:id="regexp.reference.recursive">

2163

2217

  <title>Recursive patterns</title>

2164

2218

  <para>

2165

   Consider the problem of matching a  string  in  parentheses,

2166

   allowing  for  unlimited nested parentheses. Without the use

2167

   of recursion, the best that can be done is to use a  pattern

2168

   that  matches  up  to some fixed depth of nesting. It is not

2169

   possible to handle an arbitrary nesting depth. Perl 5.6  has

2170

   provided   an  experimental  facility  that  allows  regular

2171

   expressions to recurse (among other things).  The  special

2172

   item (?R) is  provided for  the specific  case of recursion.

2173

   This PCRE  pattern  solves the  parentheses  problem (assume

2219

   Consider the problem of matching a string in parentheses,

2220

   allowing for unlimited nested parentheses. Without the use

2221

   of recursion, the best that can be done is to use a pattern

2222

   that matches up to some fixed depth of nesting. It is not

2223

   possible to handle an arbitrary nesting depth. Perl 5.6 has

2224

   provided an experimental facility that allows regular

2225

   expressions to recurse (among other things). The special

2226

   item (?R) is provided for the specific case of recursion.

2227

   This PCRE pattern solves the parentheses problem (assume

2174

2228

   the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2175

2229

   option is set so that white space is

2176

2230

   ignored):

...

@@ -2179,45 +2233,45 @@

2179

2233

  </para>

2180

2234

  <para>

2181

2235

   First it matches an opening parenthesis. Then it matches any

2182

   number  of substrings which can either be a sequence of

2183

   non-parentheses, or a recursive  match  of  the  pattern  itself

2236

   number of substrings which can either be a sequence of

2237

   non-parentheses, or a recursive match of the pattern itself

2184

2238

   (i.e. a correctly parenthesized substring). Finally there is

2185

2239

   a closing parenthesis.

2186

2240

  </para>

2187

2241

  <para>

2188

   This particular example pattern  contains  nested  unlimited

2242

   This particular example pattern contains nested unlimited

2189

2243

   repeats, and so the use of a once-only subpattern for matching

2190

   strings of non-parentheses is  important  when  applying

2191

   the  pattern to strings that do not match. For example, when

2244

   strings of non-parentheses is important when applying

2245

   the pattern to strings that do not match. For example, when

2192

2246

   it is applied to

2193

2247

2194

2248

   <literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>

2195

2249

2196

   it yields "no match" quickly. However, if a  once-only  subpattern

2197

   is  not  used,  the match runs for a very long time

2198

   indeed because there are so many different ways the + and  *

2199

   repeats  can carve up the subject, and all have to be tested

2250

   it yields "no match" quickly. However, if a once-only subpattern

2251

   is not used, the match runs for a very long time

2252

   indeed because there are so many different ways the + and *

2253

   repeats can carve up the subject, and all have to be tested

2200

2254

   before failure can be reported.

2201

2255

  </para>

2202

2256

  <para>

2203

   The values set for any capturing subpatterns are those  from

2257

   The values set for any capturing subpatterns are those from

2204

2258

   the outermost level of the recursion at which the subpattern

2205

2259

   value is set. If the pattern above is matched against

2206

2260

2207

2261

   <literal>(ab(cd)ef)</literal>

2208

2262

2209

   the value for the capturing parentheses is  "ef",  which  is

2210

   the  last  value  taken  on  at the top level. If additional

2263

   the value for the capturing parentheses is "ef", which is

2264

   the last value taken on at the top level. If additional

2211

2265

   parentheses are added, giving

2212

2266

2213

2267

   <literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>

2214

2268

   then the string they capture

2215

2269

   is "ab(cd)ef", the contents of the top level parentheses. If

2216

   there are more than 15 capturing parentheses in  a  pattern,

2217

   PCRE  has  to  obtain  extra  memory  to store data during a

2218

   recursion, which it does by using  pcre_malloc,  freeing  it

2219

   via  pcre_free  afterwards. If no memory can be obtained, it

2220

   saves data for the first 15 capturing parentheses  only,  as

2270

   there are more than 15 capturing parentheses in a pattern,

2271

   PCRE has to obtain extra memory to store data during a

2272

   recursion, which it does by using pcre_malloc, freeing it

2273

   via pcre_free afterwards. If no memory can be obtained, it

2274

   saves data for the first 15 capturing parentheses only, as

2221

2275

   there is no way to give an out-of-memory error from within a

2222

2276

   recursion.

2223

2277

  </para>
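  <para>
   The recursive pattern with the additional capturing parentheses can be
   checked from PHP. The following is a minimal sketch; the unbalanced
   subject in the last call is an assumption added for illustration.
  </para>
  <para>
   <example>
    <title>Matching nested parentheses with (?R) (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

/* The x modifier enables PCRE_EXTENDED, so white space is ignored */
$pattern = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';

preg_match($pattern, '(ab(cd)ef)', $matches);
var_dump($matches[0]); // the whole parenthesized string
var_dump($matches[1]); // the contents of the top level parentheses

/* No balanced pair of parentheses is present in this subject */
var_dump(preg_match($pattern, '(ab(cd'));
]]>
    </programlisting>
    &example.outputs;
    <screen>
<![CDATA[
string(10) "(ab(cd)ef)"
string(8) "ab(cd)ef"
int(0)
]]>
    </screen>
   </example>
  </para>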

...

@@ -2256,75 +2310,75 @@

2256

2310

  <title>Performance</title>

2257

2311

  <para>

2258

2312

   Certain items that may appear in patterns are more efficient

2259

   than  others.  It is more efficient to use a character class

2313

   than others. It is more efficient to use a character class

2260

2314

   like [aeiou] than a set of alternatives such as (a|e|i|o|u).

2261

   In  general,  the  simplest  construction  that provides the

2262

   required behaviour is usually the  most  efficient.  Jeffrey

2263

   Friedl's  book contains a lot of discussion about optimizing

2315

   In general, the simplest construction that provides the

2316

   required behaviour is usually the most efficient. Jeffrey

2317

   Friedl's book contains a lot of discussion about optimizing

2264

2318

   regular expressions for efficient performance.

2265

2319

  </para>

2266

2320

  <para>

2267

2321

   When a pattern begins with .* and the <link

2268

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  option  is

2269

   set,  the  pattern  is implicitly anchored by PCRE, since it

2322

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is

2323

   set, the pattern is implicitly anchored by PCRE, since it

2270

2324

   can match only at the start of a subject string. However, if

2271

2325

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

2272

2326

   is not set, PCRE cannot make this optimization,

2273

   because the . metacharacter does not then match  a  newline,

2327

   because the . metacharacter does not then match a newline,

2274

2328

   and if the subject string contains newlines, the pattern may

2275

   match from the character immediately following one  of  them

2329

   match from the character immediately following one of them

2276

2330

   instead of from the very start. For example, the pattern

2277

2331

2278

2332

   <literal>(.*) second</literal>

2279

2333

2280

2334

   matches the subject "first\nand second" (where \n stands for

2281

2335

   a newline character) with the first captured substring being

2282

   "and". In order to do this, PCRE  has  to  retry  the  match

2336

   "and". In order to do this, PCRE has to retry the match

2283

2337

   starting after every newline in the subject.

2284

2338

  </para>
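  <para>
   The difference that PCRE_DOTALL makes to this pattern can be seen from
   PHP, where the option is set with the <literal>s</literal> modifier. The
   following minimal sketch shows only the captured substrings, not the
   difference in processing time.
  </para>
  <para>
   <example>
    <title>A leading .* with and without PCRE_DOTALL (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

$subject = "first\nand second";

/* Without PCRE_DOTALL the match starts after the newline */
preg_match('/(.*) second/', $subject, $matches);
var_dump($matches[1]);

/* With PCRE_DOTALL the match starts at the beginning of the subject */
preg_match('/(.*) second/s', $subject, $matches);
var_dump($matches[1]);
]]>
    </programlisting>
    &example.outputs;
    <screen>
<![CDATA[
string(3) "and"
string(9) "first
and"
]]>
    </screen>
   </example>
  </para>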

2285

2339

  <para>

2286

2340

   If you are using such a pattern with subject strings that do

2287

   not  contain  newlines,  the best performance is obtained by

2341

   not contain newlines, the best performance is obtained by

2288

2342

   setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,

2289

   or starting the  pattern  with  ^.*  to

2290

   indicate  explicit anchoring. That saves PCRE from having to

2343

   or starting the pattern with ^.* to

2344

   indicate explicit anchoring. That saves PCRE from having to

2291

2345

   scan along the subject looking for a newline to restart at.

2292

2346

  </para>

2293

2347

  <para>

2294

   Beware of patterns that contain nested  indefinite  repeats.

2295

   These  can  take a long time to run when applied to a string

2348

   Beware of patterns that contain nested indefinite repeats.

2349

   These can take a long time to run when applied to a string

2296

2350

   that does not match. Consider the pattern fragment

2297

2351

2298

2352

   <literal>(a+)*</literal>

2299

2353

  </para>

2300

2354

  <para>

2301

   This can match "aaaa" in 33 different ways, and this  number

2302

   increases  very  rapidly  as  the string gets longer. (The *

2303

   repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of

2304

   those  cases other than 0, the + repeats can match different

2355

   This can match "aaaa" in 16 different ways, and this number

2356

   increases very rapidly as the string gets longer. (The *

2357

   repeat can match 0, 1, 2, 3, or 4 times, and for each of

2358

   those cases other than 0, the + repeats can match different

2305

2359

   numbers of times.) When the remainder of the pattern is such

2306

   that  the entire match is going to fail, PCRE has in principle

2307

   to try every possible variation, and this  can  take  an

2360

   that the entire match is going to fail, PCRE has in principle

2361

   to try every possible variation, and this can take an

2308

2362

   extremely long time.

2309

2363

  </para>

2310

2364

  <para>

2311

   An optimization catches some of the more simple  cases  such

2365

   An optimization catches some of the more simple cases such

2312

2366

   as

2313

2367

2314

2368

   <literal>(a+)*b</literal>

2315

2369

2316

   where a literal character follows. Before embarking  on  the

2370

   where a literal character follows. Before embarking on the

2317

2371

   standard matching procedure, PCRE checks that there is a "b"

2318

   later in the subject string, and if there is not,  it  fails

2319

   the  match  immediately. However, when there is no following

2320

   literal this optimization cannot be used. You  can  see  the

2372

   later in the subject string, and if there is not, it fails

2373

   the match immediately. However, when there is no following

2374

   literal this optimization cannot be used. You can see the

2321

2375

   difference by comparing the behaviour of

2322

2376

2323

2377

   <literal>(a+)*\d</literal>

2324

2378

2325

   with the pattern above. The former gives  a  failure  almost

2326

   instantly  when  applied  to a whole line of "a" characters,

2327

   whereas the latter takes an appreciable  time  with  strings

2379

   with the pattern above. The former gives a failure almost

2380

   instantly when applied to a whole line of "a" characters,

2381

   whereas the latter takes an appreciable time with strings

2328

2382

   longer than about 20 characters.

2329

2383

  </para>

2330

2384

 </section>

2331

2385

Generated: 04 Jul 2025 01:02:40
