PHP: Documentation Tools

reference/pcre/pattern.syntax.xml
77fe733a1ba9c961424adcb7c9af00c1f5443a77

...

@@ -8,21 +8,21 @@

 <section xml:id="regexp.introduction">

  <title>Introduction</title>

  <para>

   The syntax and semantics of  the  regular  expressions

   supported  by PCRE are described below. Regular expressions are

   also described in the Perl documentation and in a number  of

   other  books,  some  of which have copious examples. Jeffrey

   Friedl's  "Mastering  Regular  Expressions",  published   by

   O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.

   The syntax and semantics of the regular expressions

   supported by PCRE are described below. Regular expressions are

   also described in the Perl documentation and in a number of

   other books, some of which have copious examples. Jeffrey

   Friedl's "Mastering Regular Expressions", published by

   O'Reilly (ISBN 1-56592-257-3), covers them in great detail.

   The description here is intended as reference documentation.

  </para>

  <para>

   A regular expression is a pattern that is matched against  a

   A regular expression is a pattern that is matched against a

   subject string from left to right. Most characters stand for

   themselves in a pattern, and match the corresponding

   characters in the subject. As a trivial example, the pattern

   <literal>The quick brown fox</literal>

   matches a portion of a subject string that is  identical  to

   matches a portion of a subject string that is identical to

   itself.

  </para>
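  <para>
   The trivial example above can be sketched as follows. This is only an
   illustration: it uses Python's re module as a stand-in for PCRE (an
   assumption of this example; PHP code would call a function such as
   preg_match), and the subject string is hypothetical.
  </para>

```python
import re

# A pattern made only of ordinary characters matches the identical text
# wherever it first occurs in the subject, scanning left to right.
pattern = re.compile(r"The quick brown fox")
subject = "Yesterday: The quick brown fox jumped."  # hypothetical subject

match = pattern.search(subject)
matched_text = match.group(0) if match else None
start_offset = match.start() if match else None

# Matching is case-sensitive by default, so this subject does not match.
lowercase_match = pattern.search("the quick brown fox")
```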

 </section>

...

@@ -102,15 +102,15 @@

102

 <section xml:id="regexp.reference.meta">

103

  <title>Meta-characters</title>

104

  <para>

105

   The  power  of  regular  expressions comes from the

105

   The power of regular expressions comes from the

106

   ability to include alternatives and repetitions in the

107

   pattern.  These  are encoded in the pattern by the use of

108

   <emphasis>meta-characters</emphasis>, which do not stand for  themselves  but  instead

107

   pattern. These are encoded in the pattern by the use of

108

   <emphasis>meta-characters</emphasis>, which do not stand for themselves but instead

109

   are interpreted in some special way.

110

  </para>

111

  <para>

112

   There are two different sets of meta-characters: those  that

113

   are  recognized anywhere in the pattern except within square

112

   There are two different sets of meta-characters: those that

113

   are recognized anywhere in the pattern except within square

114

   brackets, and those that are recognized in square brackets.

115

   Outside square brackets, the meta-characters are as follows:

116

...

@@ -130,7 +130,8 @@

130

       <entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>

131

      </row>

132

      <row>

133

       <entry>$</entry><entry>assert end of subject or before a terminating newline (or end of line, in multiline mode)</entry>

133

       <entry>$</entry><entry>assert end of subject or before a terminating newline (or

134

        end of line, in multiline mode)</entry>

134

135

      </row>

135

136

      <row>

136

137

       <entry>.</entry><entry>match any character except newline (by default)</entry>

...

@@ -204,9 +205,9 @@

204

205

 <section xml:id="regexp.reference.escape">

205

206

  <title>Escape sequences</title>

206

207

  <para>

207

   The backslash character has several uses. Firstly, if it  is

208

   The backslash character has several uses. Firstly, if it is

208

209

   followed by a non-alphanumeric character, it takes away any

209

   special  meaning that character may have. This use of

210

   special meaning that character may have. This use of

210

211

   backslash as an escape character applies both inside and

211

212

   outside character classes.

212

213

  </para>

...

@@ -215,7 +216,7 @@

215

216

   "\*" in the pattern. This applies whether or not the

216

217

   following character would otherwise be interpreted as a

217

218

   meta-character, so it is always safe to precede a non-alphanumeric

218

   with "\" to specify that it stands for itself.  In

219

   with "\" to specify that it stands for itself. In

219

220

   particular, if you want to match a backslash, you write "\\".

220

221

  </para>
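  <para>
   A minimal sketch of the escaping rule described above, assuming Python's
   re module as a stand-in for PCRE (the two agree on these escapes). The
   subject strings are hypothetical.
  </para>

```python
import re

# "\*" matches a literal asterisk; an unescaped "*" would be a quantifier.
star = re.search(r"2\*3", "area = 2*3")

# "\\" in the pattern matches one literal backslash in the subject.
backslash = re.search(r"C:\\temp", r"path is C:\temp")

# Escaping a non-alphanumeric character that has no special meaning is harmless.
harmless = re.search(r"a\%b", "a%b")
```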

221

222

  <note>

...

@@ -237,10 +238,10 @@

237

238

  <para>

238

239

   A second use of backslash provides a way of encoding

239

240

   non-printing characters in patterns in a visible manner. There

240

   is no restriction on the appearance of non-printing  characters,

241

   is no restriction on the appearance of non-printing characters,

241

242

   apart from the binary zero that terminates a pattern,

242

243

   but when a pattern is being prepared by text editing, it is

243

   usually  easier to use one of the following escape sequences

244

   usually easier to use one of the following escape sequences

244

245

   than the binary character it represents:

245

246

  </para>

246

247

  <para>

...

@@ -331,9 +332,9 @@

331

332

  </para>

332

333

  <para>

333

334

   The precise effect of "<literal>\cx</literal>" is as follows:

334

   if "<literal>x</literal>" is a lower case  letter, it is converted

335

   if "<literal>x</literal>" is a lower case letter, it is converted

335

336

   to upper case. Then bit 6 of the character (hex 40) is inverted.

336

   Thus "<literal>\cz</literal>" becomes  hex 1A, but

337

   Thus "<literal>\cz</literal>" becomes hex 1A, but

337

338

   "<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"

338

339

   becomes hex 7B.

339

340

  </para>

...

@@ -349,7 +350,7 @@

349

350

  </para>

350

351

  <para>

351

352

   After "<literal>\0</literal>" up to two further octal digits are read.

352

   In  both cases,  if  there are fewer than two digits, just those that

353

   In both cases, if there are fewer than two digits, just those that

353

354

   are present are used. Thus the sequence "<literal>\0\x\07</literal>"

354

355

   specifies two binary zeros followed by a BEL character. Make sure you

355

356

   supply two digits after the initial zero if the character

...

@@ -358,20 +359,20 @@

358

359

  <para>

359

360

   The handling of a backslash followed by a digit other than 0

360

361

   is complicated. Outside a character class, PCRE reads it

361

   and any following digits as a decimal number. If the  number

362

   is  less  than  10, or if there have been at least that many

363

   previous capturing left parentheses in the  expression,  the

364

   entire  sequence is taken as a <emphasis>back reference</emphasis>. A description

365

   of how this works is given later, following  the  discussion

362

   and any following digits as a decimal number. If the number

363

   is less than 10, or if there have been at least that many

364

   previous capturing left parentheses in the expression, the

365

   entire sequence is taken as a <emphasis>back reference</emphasis>. A description

366

   of how this works is given later, following the discussion

366

367

   of parenthesized subpatterns.

367

368

  </para>

368

369

  <para>

369

   Inside a character  class,  or  if  the  decimal  number  is

370

   Inside a character class, or if the decimal number is

370

371

   greater than 9 and there have not been that many capturing

371

372

   subpatterns, PCRE re-reads up to three octal digits following

372

373

   the backslash, and generates a single byte from the

373

374

   least significant 8 bits of the value. Any subsequent digits

374

   stand for themselves.  For example:

375

   stand for themselves. For example:

375

376

  </para>

376

377

  <para>

377

378

   <variablelist>

...

@@ -439,7 +440,7 @@

439

440

   digits are ever read.

440

441

  </para>

441

442

  <para>

442

   All the sequences that define a single byte value can  be

443

   All the sequences that define a single byte value can be

443

444

   used both inside and outside character classes. In addition,

444

445

   inside a character class, the sequence "<literal>\b</literal>"

445

446

   is interpreted as the backspace character (hex 08). Outside a character

...

@@ -506,7 +507,7 @@

506

507

  </para>

507

508

  <para>

508

509

   A "word" character is any letter or digit or the underscore

509

   character,  that  is,  any  character which can be part of a

510

   character, that is, any character which can be part of a

510

511

   Perl "<emphasis>word</emphasis>". The definition of letters and digits is

511

512

   controlled by PCRE's character tables, and may vary if locale-specific

512

513

   matching is taking place. For example, in the "fr" (French) locale, some

...

@@ -515,15 +516,15 @@

515

516

  </para>

516

517

  <para>

517

518

   These character type sequences can appear both inside and

518

   outside  character classes. They each match one character of

519

   the appropriate type. If the current matching  point is at

519

   outside character classes. They each match one character of

520

   the appropriate type. If the current matching point is at

520

521

   the end of the subject string, all of them fail, since there

521

522

   is no character to match.

522

523

  </para>

523

524

  <para>

524

   The fourth use of backslash is  for  certain  simple

525

   The fourth use of backslash is for certain simple

525

526

   assertions. An assertion specifies a condition that has to be met

526

   at a particular point in  a match, without consuming any

527

   at a particular point in a match, without consuming any

527

528

   characters from the subject string. The use of subpatterns

528

529

   for more complicated assertions is described below. The

529

530

   backslashed assertions are

...

@@ -562,7 +563,7 @@

562

563

   </variablelist>

563

564

  </para>

564

565

  <para>

565

   These assertions may not appear in  character  classes  (but

566

   These assertions may not appear in character classes (but

566

567

   note that "<literal>\b</literal>" has a different meaning, namely the backspace

567

568

   character, inside a character class).

568

569

  </para>

...

@@ -570,20 +571,20 @@

570

571

   A word boundary is a position in the subject string where

571

572

   the current character and the previous character do not both

572

573

   match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches

573

   <literal>\w</literal> and  the  other  matches

574

   <literal>\w</literal> and the other matches

574

575

   <literal>\W</literal>), or the start or end of the string if the first

575

576

   or last character matches <literal>\w</literal>, respectively.

576

577

  </para>
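  <para>
   The word boundary definition above can be illustrated with a short sketch.
   It assumes Python's re module as a stand-in for PCRE (for ASCII text the
   two treat \b and \B alike), and the subject string is hypothetical.
  </para>

```python
import re

subject = "cat concat cat, bobcat"  # hypothetical subject

# \bcat\b needs a \w / \W transition (or a string edge) on both sides of "cat".
whole_words = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \B is the opposite assertion: here "cat" must follow another word character.
inside_words = [m.start() for m in re.finditer(r"\Bcat", subject)]
```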

577

578

  <para>

578

579

   The <literal>\A</literal>, <literal>\Z</literal>, and

579

   <literal>\z</literal> assertions differ  from  the  traditional

580

   circumflex  and  dollar  (described in <link linkend="regexp.reference.anchors">anchors</link>) in that they only

581

   ever match at the very start and end of the subject  string,

582

   whatever  options  are  set.  They  are  not affected by the

580

   <literal>\z</literal> assertions differ from the traditional

581

   circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)

582

   in that they only ever match at the very start and end of the subject string,

583

   whatever options are set. They are not affected by the

583

584

   <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or

584

585

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>

585

   options. The  difference  between <literal>\Z</literal> and

586

   <literal>\z</literal>  is that <literal>\Z</literal> matches before a

586

   options. The difference between <literal>\Z</literal> and

587

   <literal>\z</literal> is that <literal>\Z</literal> matches before a

587

588

   newline that is the last character of the string as well as at the end of

588

589

   the string, whereas <literal>\z</literal> matches only at the end.

589

590

  </para>

...

@@ -873,8 +874,8 @@

873

874

   For example, <literal>\p{Lu}</literal> always matches only upper case letters.

874

875

  </para>

875

876

  <para>

876

   Sets of Unicode characters are defined as belonging to certain scripts.  A

877

   character from one of these sets can be matched using a script name.  For

877

   Sets of Unicode characters are defined as belonging to certain scripts. A

878

   character from one of these sets can be matched using a script name. For

878

879

   example:

879

880

  </para>

880

881

  <itemizedlist>

...

@@ -886,7 +887,7 @@

886

887

   </listitem>

887

888

  </itemizedlist>

888

889

  <para>

889

   Those that are not part of an identified script are lumped together  as

890

   Those that are not part of an identified script are lumped together as

890

891

   <literal>Common</literal>. The current list of scripts is:

891

892

  </para>

892

893

  <table>

...

@@ -1055,7 +1056,7 @@

1055

1056

  <para>

1056

1057

   In versions of PCRE older than 8.32 (which corresponds to PHP versions

1057

1058

   before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>

1058

   is equivalent to <literal>(?>\PM\pM*)</literal>.  That is, it matches a

1059

   is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a

1059

1060

   character without the "mark" property, followed by zero or more characters

1060

1061

   with the "mark" property, and treats the sequence as an atomic group (see

1061

1062

   below). Characters with the "mark" property are typically accents that

...

@@ -1075,8 +1076,8 @@

1075

1076

  <para>

1076

1077

   Outside a character class, in the default matching mode, the

1077

1078

   circumflex character (<literal>^</literal>) is an assertion which

1078

   is true only if the current matching point is at the start  of

1079

   the  subject string. Inside a character class, circumflex (<literal>^</literal>)

1079

   is true only if the current matching point is at the start of

1080

   the subject string. Inside a character class, circumflex (<literal>^</literal>)

1080

1081

   has an entirely different meaning (see below).

1081

1082

  </para>

1082

1083

  <para>

...

@@ -1091,12 +1092,12 @@

1091

1092

  </para>

1092

1093

  <para>

1093

1094

   A dollar character (<literal>$</literal>) is an assertion which is

1094

   &true; only if the current  matching point is at the end of the subject

1095

   string, or immediately before a newline character that is  the  last

1095

   &true; only if the current matching point is at the end of the subject

1096

   string, or immediately before a newline character that is the last

1096

1097

   character in the string (by default). Dollar (<literal>$</literal>)

1097

   need not be the last character of the pattern if a  number  of

1098

   alternatives are  involved,  but it should be the last item in any branch

1099

   in which it appears. Dollar has no  special  meaning  in  a

1098

   need not be the last character of the pattern if a number of

1099

   alternatives are involved, but it should be the last item in any branch

1100

   in which it appears. Dollar has no special meaning in a

1100

1101

   character class.

1101

1102

  </para>
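  <para>
   A sketch of the default behaviour of dollar described above. It assumes
   Python's re module as a stand-in for PCRE; re.MULTILINE plays the role of
   PCRE_MULTILINE here, and the subject strings are hypothetical.
  </para>

```python
import re

# By default "$" matches at the very end, or just before a final newline.
at_end = re.search(r"fox$", "quick fox")
before_final_newline = re.search(r"fox$", "quick fox\n")

# A newline that is not the last character does not satisfy "$" by default.
mid_subject = re.search(r"fox$", "quick fox\nlazy dog")

# In multiline mode "$" also matches before internal newlines.
multiline = re.search(r"fox$", "quick fox\nlazy dog", re.MULTILINE)

# Inside a character class the dollar is an ordinary character.
literal_dollar = re.findall(r"[$]", "cost: $5")
```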

1102

1103

  <para>

...

@@ -1122,9 +1123,9 @@

1122

1123

   set.

1123

1124

  </para>

1124

1125

  <para>

1125

   Note that the sequences \A, \Z, and \z can be used to  match

1126

   the  start  and end of the subject in both modes, and if all

1127

   branches of a pattern start with \A it is  always  anchored,

1126

   Note that the sequences \A, \Z, and \z can be used to match

1127

   the start and end of the subject in both modes, and if all

1128

   branches of a pattern start with \A it is always anchored,

1128

1129

   whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

1129

1130

   is set or not.

1130

1131

  </para>

...

@@ -1133,14 +1134,14 @@

1133

1134

 <section xml:id="regexp.reference.dot">

1134

1135

  <title>Dot</title>

1135

1136

  <para>

1136

   Outside a character class, a dot in the pattern matches  any

1137

   one  character  in  the  subject,  including  a non-printing

1138

   character, but not (by default) newline.  If the

1137

   Outside a character class, a dot in the pattern matches any

1138

   one character in the subject, including a non-printing

1139

   character, but not (by default) newline. If the

1139

1140

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1140

   option  is  set,  then dots match newlines as well. The

1141

   option is set, then dots match newlines as well. The

1141

1142

   handling of dot is entirely independent of the handling of

1142

   circumflex  and  dollar,  the only relationship being that they

1143

   both involve newline characters.  Dot has no special meaning

1143

   circumflex and dollar, the only relationship being that they

1144

   both involve newline characters. Dot has no special meaning

1144

1145

   in a character class.

1145

1146

  </para>
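  <para>
   The behaviour of dot described above can be sketched as follows, assuming
   Python's re module as a stand-in for PCRE, with re.DOTALL in place of
   PCRE_DOTALL. The subject strings are hypothetical.
  </para>

```python
import re

subject = "a\nc"  # hypothetical subject containing a newline

# By default the dot does not match a newline.
default_mode = re.search(r"a.c", subject)

# With the dot-all option set, the dot matches the newline as well.
dotall_mode = re.search(r"a.c", subject, re.DOTALL)

# A non-printing character such as a tab is matched by the dot in either mode.
tab_match = re.search(r"a.c", "a\tc")

# Inside a character class the dot has no special meaning.
literal_dot = re.findall(r"[.]", "v1.2x")
```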

1146

1147

  <para>

...

@@ -1154,29 +1155,29 @@

1154

1155

  <title>Character classes</title>

1155

1156

  <para>

1156

1157

   An opening square bracket introduces a character class,

1157

   terminated  by  a  closing  square  bracket.  A  closing square

1158

   bracket on its own is  not  special.  If  a  closing  square

1159

   bracket  is  required as a member of the class, it should be

1158

   terminated by a closing square bracket. A closing square

1159

   bracket on its own is not special. If a closing square

1160

   bracket is required as a member of the class, it should be

1160

1161

   the first data character in the class (after an initial

1161

1162

   circumflex, if present) or escaped with a backslash.

1162

1163

  </para>

1163

1164

  <para>

1164

1165

   A character class matches a single character in the subject;

1165

   the  character  must  be in the set of characters defined by

1166

   the character must be in the set of characters defined by

1166

1167

   the class, unless the first character in the class is a

1167

   circumflex,  in which case the subject character must not be in

1168

   the set defined by the class. If a  circumflex  is  actually

1169

   required  as  a  member  of  the class, ensure it is not the

1168

   circumflex, in which case the subject character must not be in

1169

   the set defined by the class. If a circumflex is actually

1170

   required as a member of the class, ensure it is not the

1170

1171

   first character, or escape it with a backslash.

1171

1172

  </para>

1172

1173

  <para>

1173

   For example, the character class [aeiou] matches  any  lower

1174

   For example, the character class [aeiou] matches any lower

1174

1175

   case vowel, while [^aeiou] matches any character that is not

1175

   a lower case vowel. Note that a circumflex is  just  a

1176

   convenient  notation for specifying the characters which are in

1177

   the class by enumerating those that are not. It  is  not  an

1178

   assertion:  it  still  consumes a character from the subject

1179

   string, and fails if the current pointer is at  the  end  of

1176

   a lower case vowel. Note that a circumflex is just a

1177

   convenient notation for specifying the characters which are in

1178

   the class by enumerating those that are not. It is not an

1179

   assertion: it still consumes a character from the subject

1180

   string, and fails if the current pointer is at the end of

1180

1181

   the string.

1181

1182

  </para>
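  <para>
   A brief sketch of the [aeiou] and [^aeiou] classes discussed above. It
   assumes Python's re module as a stand-in for PCRE, and the subject strings
   are hypothetical.
  </para>

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
non_vowels = re.findall(r"[^aeiou]", "regular")

# A negated class is not an assertion: it still has to consume a character,
# so it fails when the current position is at the end of the subject.
at_end = re.search(r"x[^aeiou]", "x")

# A circumflex that is not the first character is a literal member of the class.
caret_member = re.findall(r"[a^]", "a^b")
```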

1182

1183

  <para>

...

@@ -1188,61 +1189,62 @@

1188

1189

  </para>

1189

1190

  <para>

1190

1191

   The newline character is never treated in any special way in

1191

   character  classes,  whatever the setting of the <link

1192

   character classes, whatever the setting of the <link

1192

1193

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1193

1194

   or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

1194

1195

   options is. A class such as [^a] will always match a newline.

1195

1196

  </para>

1196

1197

  <para>

1197

   The minus (hyphen) character can be used to specify a  range

1198

   of  characters  in  a  character  class.  For example, [d-m]

1199

   matches any letter between d and m, inclusive.  If  a  minus

1200

   character  is required in a class, it must be escaped with a

1198

   The minus (hyphen) character can be used to specify a range

1199

   of characters in a character class. For example, [d-m]

1200

   matches any letter between d and m, inclusive. If a minus

1201

   character is required in a class, it must be escaped with a

1201

1202

   backslash or appear in a position where it cannot be

1202

1203

   interpreted as indicating a range, typically as the first or last

1203

1204

   character in the class.

1204

1205

  </para>

1205

1206

  <para>

1206

   It is not possible to have the literal character "]" as  the

1207

   end  character  of  a  range.  A  pattern such as [W-]46] is

1207

   It is not possible to have the literal character "]" as the

1208

   end character of a range. A pattern such as [W-]46] is

1208

1209

   interpreted as a class of two characters ("W" and "-")

1209

1210

   followed by a literal string "46]", so it would match "W46]" or

1210

   "-46]". However, if the "]" is escaped with a  backslash  it

1211

   is  interpreted  as  the end of range, so [W-\]46] is

1212

   interpreted as a single class containing a range followed by  two

1211

   "-46]". However, if the "]" is escaped with a backslash it

1212

   is interpreted as the end of range, so [W-\]46] is

1213

   interpreted as a single class containing a range followed by two

1213

1214

   separate characters. The octal or hexadecimal representation

1214

1215

   of "]" can also be used to end a range.

1215

1216

  </para>
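  <para>
   The two readings of [W-]46] and [W-\]46] described above can be checked
   with a small sketch. It assumes Python's re module as a stand-in for PCRE
   (both parse these classes the same way); the test strings are hypothetical.
  </para>

```python
import re

# "[W-]" is a class of two characters; the "46]" after it is literal text.
two_char_class = [
    bool(re.fullmatch(r"[W-]46]", s)) for s in ("W46]", "-46]", "X46]")
]

# With "]" escaped, the class holds the range W..] plus "4" and "6".
range_class = [
    bool(re.fullmatch(r"[W-\]46]", s)) for s in ("W", "Z", "]", "4", "6", "5")
]

# The hexadecimal representation of "]" (\x5d) can end the range as well.
hex_range = [
    bool(re.fullmatch(r"[W-\x5d46]", s)) for s in ("]", "4", "5")
]
```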

1216

1217

  <para>

1217

1218

   Ranges operate in ASCII collating sequence. They can also be

1218

   used  for  characters  specified  numerically,  for  example

1219

   [\000-\037]. If a range that includes letters is  used  when

1220

   case-insensitive (caseless)  matching  is set, it matches the

1221

   letters in either case. For example, [W-c] is equivalent  to

1219

   used for characters specified numerically, for example

1220

   [\000-\037]. If a range that includes letters is used when

1221

   case-insensitive (caseless) matching is set, it matches the

1222

   letters in either case. For example, [W-c] is equivalent to

1222

1223

   [][\^_`wxyzabc], matched case-insensitively, and if character

1223

1224

   tables for the "fr" locale are in use, [\xc8-\xcb] matches

1224

1225

   accented E characters in both cases.

1225

1226

  </para>

1226

1227

  <para>

1227

   The character types \d, \D, \s, \S,  \w,  and  \W  may  also

1228

   appear  in  a  character  class, and add the characters that

1228

   The character types \d, \D, \s, \S, \w, and \W may also

1229

   appear in a character class, and add the characters that

1229

1230

   they match to the class. For example, [\dABCDEF] matches any

1230

   hexadecimal  digit.  A  circumflex  can conveniently be used

1231

   with the upper case character types to specify a  more

1231

   hexadecimal digit. A circumflex can conveniently be used

1232

   with the upper case character types to specify a more

1232

1233

   restricted set of characters than the matching lower case type.

1233

   For example, the class [^\W_] matches any letter  or  digit,

1234

   For example, the class [^\W_] matches any letter or digit,

1234

1235

   but not underscore.

1235

1236

  </para>
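  <para>
   A short sketch of character types used inside classes, as described above.
   It assumes Python's re module as a stand-in for PCRE, and the subject
   strings are hypothetical.
  </para>

```python
import re

# \d inside a class adds the digits to the explicitly listed letters.
hex_digits = re.fullmatch(r"[\dABCDEF]+", "1F3A")
not_hex = re.fullmatch(r"[\dABCDEF]+", "1G")

# [^\W_] is "word characters minus underscore": letters and digits only.
alnum_runs = re.findall(r"[^\W_]+", "snake_case_42")
```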

1236

1237

  <para>

1237

   All non-alphanumeric characters other than \,  -,  ^  (at  the

1238

   start)  and  the  terminating ] are non-special in character

1238

   All non-alphanumeric characters other than \, -, ^ (at the

1239

   start) and the terminating ] are non-special in character

1239

1240

   classes, but it does no harm if they are escaped. The pattern

1240

1241

   terminator is always special and must be escaped when used

1241

1242

   within an expression.

1242

1243

  </para>

1243

1244

  <para>

1244

1245

   Perl supports the POSIX notation for character classes. This uses names

1245

   enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also

1246

   enclosed by <literal>[:</literal> and <literal>:]</literal> within

1247

   the enclosing square brackets. PCRE also

1246

1248

   supports this notation. For example, <literal>[01[:alpha:]%]</literal>

1247

1249

   matches "0", "1", any alphabetic character, or "%". The supported class

1248

1250

   names are:

...

@@ -1297,16 +1299,16 @@

1297

1299

 <section xml:id="regexp.reference.alternation">

1298

1300

  <title>Alternation</title>

1299

1301

  <para>

1300

   Vertical bar characters are  used  to  separate  alternative

1302

   Vertical bar characters are used to separate alternative

1301

1303

   patterns. For example, the pattern

1302

1304

   <literal>gilbert|sullivan</literal>

1303

1305

   matches either "gilbert" or "sullivan". Any number of alternatives

1304

   may  appear,  and an empty alternative is permitted

1305

   (matching the empty string).   The  matching  process  tries

1306

   each  alternative in turn, from left to right, and the first

1307

   one that succeeds is used. If the alternatives are within  a

1308

   subpattern  (defined  below),  "succeeds" means matching the

1309

   rest of the main pattern as well as the alternative  in  the

1306

   may appear, and an empty alternative is permitted

1307

   (matching the empty string). The matching process tries

1308

   each alternative in turn, from left to right, and the first

1309

   one that succeeds is used. If the alternatives are within a

1310

   subpattern (defined below), "succeeds" means matching the

1311

   rest of the main pattern as well as the alternative in the

1310

1312

   subpattern.

1311

1313

  </para>
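  <para>
   The left-to-right ordering of alternatives can be illustrated with the
   sketch below. It assumes Python's re module as a stand-in for PCRE; the
   patterns other than gilbert|sullivan are hypothetical examples.
  </para>

```python
import re

either = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# Alternatives are tried left to right; the first one that succeeds is used.
first_wins = re.match(r"a|ab", "ab").group(0)

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the second alternative is chosen when "c" has to follow.
rest_must_match = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alternative = re.fullmatch(r"colou?r|", "") is not None
```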

1312

1314

 </section>

...

@@ -1321,7 +1323,7 @@

1321

1323

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,

1322

1324

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1323

1325

   and PCRE_DUPNAMES can be changed from within the pattern by

1324

   a sequence of Perl option letters enclosed between "(?"  and

1326

   a sequence of Perl option letters enclosed between "(?" and

1325

1327

   ")". The option letters are:

1326

1328

1327

1329

   <table>

...

@@ -1350,7 +1352,8 @@

1350

1352

      </row>

1351

1353

      <row>

1352

1354

       <entry><literal>X</literal></entry>

1353

       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> (no longer supported as of PHP 7.3.0)</entry>

1355

       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>

1356

        (no longer supported as of PHP 7.3.0)</entry>

1354

1357

      </row>

1355

1358

      <row>

1356

1359

       <entry><literal>J</literal></entry>

...

@@ -1361,16 +1364,16 @@

1361

1364

   </table>

1362

1365

  </para>

1363

1366

  <para>

1364

   For example, (?im) sets case-insensitive (caseless), multiline matching. It  is

1367

   For example, (?im) sets case-insensitive (caseless), multiline matching. It is

1365

1368

   also possible to unset these options by preceding the letter

1366

   with a hyphen, and a combined setting and unsetting such  as

1367

   (?im-sx),  which sets <link

1369

   with a hyphen, and a combined setting and unsetting such as

1370

   (?im-sx), which sets <link

1368

1371

   linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and

1369

1372

   <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

1370

1373

   while unsetting <link

1371

1374

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and

1372

1375

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,

1373

   is also  permitted. If  a  letter  appears both before and after the

1376

   is also permitted. If a letter appears both before and after the

1374

1377

   hyphen, the option is unset.

1375

1378

  </para>

1376

1379

  <para>

...

@@ -1380,14 +1383,14 @@

1380

1383

   and "abC".

1381

1384

  </para>

1382

1385

  <para>

1383

   If an option change occurs inside a subpattern,  the  effect

1384

   is  different.  This is a change of behaviour in Perl 5.005.

1385

   An option change inside a subpattern affects only that  part

1386

   If an option change occurs inside a subpattern, the effect

1387

   is different. This is a change of behaviour in Perl 5.005.

1388

   An option change inside a subpattern affects only that part

1386

1389

   of the subpattern that follows it, so

1387

1390

1388

1391

   <literal>(a(?i)b)c</literal>

1389

1392

1390

   matches  abc  and  aBc  and  no  other   strings   (assuming <link

1393

   matches "abc" and "aBc" and no other strings (assuming <link

1391

1394

   linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not

1392

1395

   used). By this means, options can be made to have different settings in

1393

1396

   different parts of the pattern. Any changes made in one alternative do

...

@@ -1396,18 +1399,18 @@

1396

1399

1397

1400

   <literal>(a(?i)b|c)</literal>

1398

1401

1399

   matches "ab", "aB", "c", and "C", even though when  matching

1402

   matches "ab", "aB", "c", and "C", even though when matching

1400

1403

   "C" the first branch is abandoned before the option setting.

1401

   This is because the effects of  option  settings  happen  at

1402

   compile  time. There would be some very weird behaviour otherwise.

1404

   This is because the effects of option settings happen at

1405

   compile time. There would be some very weird behaviour otherwise.

1403

1406

  </para>

1404

1407

  <para>

1405

1408

   The PCRE-specific options <link

1406

   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>  and

1407

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>   can

1409

   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and

1410

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can

1408

1411

   be changed in the same way as the Perl-compatible options by

1409

   using the characters U and X  respectively.  The  (?X)  flag

1410

   setting  is  special in that it must always occur earlier in

1412

   using the characters U and X respectively. The (?X) flag

1413

   setting is special in that it must always occur earlier in

1411

1414

   the pattern than any of the additional features it turns on,

1412

1415

   even when it is at top level. It is best put at the start.

1413

1416

  </para>

...

@@ -1416,8 +1419,8 @@

1416

1419

 <section xml:id="regexp.reference.subpatterns">

1417

1420

  <title>Subpatterns</title>

1418

1421

  <para>

1419

   Subpatterns are delimited by parentheses  (round  brackets),

1420

   which can be nested.  Marking part of a pattern as a subpattern

1422

   Subpatterns are delimited by parentheses (round brackets),

1423

   which can be nested. Marking part of a pattern as a subpattern

1421

1424

   does two things:

1422

1425

  </para>

1423

1426

  <orderedlist>

...

@@ -1446,30 +1449,30 @@

1446

1449

1447

1450

   <literal>the ((red|white) (king|queen))</literal>

1448

1451

1449

   the captured substrings are "red king", "red",  and  "king",

1452

   the captured substrings are "red king", "red", and "king",

1450

1453

   and are numbered 1, 2, and 3.

1451

1454

  </para>

1452

1455

  <para>

1453

   The fact that plain parentheses fulfill two functions is  not

1454

   always  helpful.  There are often times when a grouping subpattern

1455

   is required without a capturing requirement.  If  an

1456

   The fact that plain parentheses fulfill two functions is not

1457

   always helpful. There are often times when a grouping subpattern

1458

   is required without a capturing requirement. If an

1456

1459

   opening parenthesis is followed by "?:", the subpattern does

1457

   not do any capturing, and is not counted when computing  the

1460

   not do any capturing, and is not counted when computing the

1458

1461

   number of any subsequent capturing subpatterns. For example,

1459

   if the string "the  white  queen"  is  matched  against  the

1462

   if the string "the white queen" is matched against the

1460

1463

   pattern

1461

1464

1462

1465

   <literal>the ((?:red|white) (king|queen))</literal>

1463

1466

1464

   the captured substrings are "white queen" and  "queen",  and

1465

   are  numbered  1  and 2. The maximum number of captured substrings

1467

   the captured substrings are "white queen" and "queen", and

1468

   are numbered 1 and 2. The maximum number of captured substrings

1466

1469

   is 65535. It may not be possible to compile such large patterns,

1467

1470

   however, depending on the configuration options of libpcre.

1468

1471

  </para>
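The capture numbering described above can be checked with a short sketch. It uses Python's `re` module as a stand-in for PCRE, on the assumption that the two engines agree for plain and `?:` groups; the helper name `captured` is made up for this illustration.

```python
import re

# Plain parentheses both group and capture.
capturing = re.compile(r"the ((red|white) (king|queen))")
# "?:" keeps the grouping but drops the capture, so later groups are renumbered.
non_capturing = re.compile(r"the ((?:red|white) (king|queen))")


def captured(pattern, subject):
    """Return the tuple of captured substrings, or None if nothing matches."""
    match = pattern.search(subject)
    return match.groups() if match else None


print(captured(capturing, "the red king"))         # ('red king', 'red', 'king')
print(captured(non_capturing, "the white queen"))  # ('white queen', 'queen')
```

With the non-capturing form, the colour alternative no longer occupies a group number, so "queen" moves from position 3 to position 2.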

1469

1472

  <para>

1470

   As a  convenient  shorthand,  if  any  option  settings  are

1471

   required  at  the  start  of a non-capturing subpattern, the

1472

   option letters may appear between the "?" and the ":".  Thus

1473

   As a convenient shorthand, if any option settings are

1474

   required at the start of a non-capturing subpattern, the

1475

   option letters may appear between the "?" and the ":". Thus

1473

1476

   the two patterns

1474

1477

  </para>

1475

1478

...

@@ -1483,10 +1486,10 @@

1483

1486

  </informalexample>

1484

1487

1485

1488

  <para>

1486

   match exactly the same set of strings.  Because  alternative

1487

   branches  are  tried from left to right, and options are not

1488

   reset until the end of the subpattern is reached, an  option

1489

   setting  in  one  branch does affect subsequent branches, so

1489

   match exactly the same set of strings. Because alternative

1490

   branches are tried from left to right, and options are not

1491

   reset until the end of the subpattern is reached, an option

1492

   setting in one branch does affect subsequent branches, so

1490

1493

   the above patterns match "SUNDAY" as well as "Saturday".

1491

1494

  </para>

1492

1495

...

@@ -1515,9 +1518,10 @@

1515

1518

1516

1519

  <para>

1517

1520

   Here <literal>Sun</literal> is stored in backreference 2, while

1518

   backreference 1 is empty. Matching yields <literal>Sat</literal> in

1519

   backreference 1 while backreference 2 does not exist. Changing the pattern

1520

   to use the <literal>(?|</literal> fixes this problem:

1521

   backreference 1 is empty. Matching <literal>Saturday</literal> yields

1522

   <literal>Sat</literal> in backreference 1 while backreference 2 does

1523

   not exist. Changing the pattern to use the <literal>(?|</literal> fixes

1524

   this problem:

1521

1525

  </para>

1522

1526

1523

1527

  <informalexample>

...

@@ -1543,45 +1547,45 @@

1543

1547

    <listitem><simpara>the . metacharacter</simpara></listitem>

1544

1548

    <listitem><simpara>a character class</simpara></listitem>

1545

1549

    <listitem><simpara>a back reference (see next section)</simpara></listitem>

1546

    <listitem><simpara>a parenthesized subpattern (unless it is  an  assertion  -

1550

    <listitem><simpara>a parenthesized subpattern (unless it is an assertion -

1547

1551

     see below)</simpara></listitem>

1548

1552

   </itemizedlist>

1549

1553

  </para>

1550

1554

  <para>

1551

   The general repetition quantifier specifies  a  minimum  and

1552

   maximum  number  of  permitted  matches,  by  giving the two

1553

   numbers in curly brackets (braces), separated  by  a  comma.

1554

   The  numbers  must be less than 65536, and the first must be

1555

   The general repetition quantifier specifies a minimum and

1556

   maximum number of permitted matches, by giving the two

1557

   numbers in curly brackets (braces), separated by a comma.

1558

   The numbers must be less than 65536, and the first must be

1555

1559

   less than or equal to the second. For example:

1556

1560

1557

1561

   <literal>z{2,4}</literal>

1558

1562

1559

   matches "zz", "zzz", or "zzzz". A closing brace on  its  own

1563

   matches "zz", "zzz", or "zzzz". A closing brace on its own

1560

1564

   is not a special character. If the second number is omitted,

1561

   but the comma is present, there is no upper  limit;  if  the

1565

   but the comma is present, there is no upper limit; if the

1562

1566

   second number and the comma are both omitted, the quantifier

1563

1567

   specifies an exact number of required matches. Thus

1564

1568

1565

1569

   <literal>[aeiou]{3,}</literal>

1566

1570

1567

   matches at least 3 successive vowels,  but  may  match  many

1571

   matches at least 3 successive vowels, but may match many

1568

1572

   more, while

1569

1573

1570

1574

   <literal>\d{8}</literal>

1571

1575

1572

   matches exactly 8 digits.  An  opening  curly  bracket  that

1573

   appears  in a position where a quantifier is not allowed, or

1576

   matches exactly 8 digits. An opening curly bracket that

1577

   appears in a position where a quantifier is not allowed, or

1574

1578

   one that does not match the syntax of a quantifier, is taken

1575

   as  a literal character. For example, {,6} is not a quantifier,

1579

   as a literal character. For example, {,6} is not a quantifier,

1576

1580

   but a literal string of four characters.

1577

1581

  </para>
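A minimal sketch of the three quantifier forms above, again using Python's `re` module as an assumed stand-in for PCRE. The `{,6}` case is left out on purpose, because Python reads that form as a quantifier rather than as literal text. The helper `fully_matches` and the sample subjects are invented for the illustration.

```python
import re


def fully_matches(pattern, subject):
    """True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None


# {2,4}: between two and four repetitions.
accepted_lengths = [n for n in range(1, 6) if fully_matches(r"z{2,4}", "z" * n)]
print(accepted_lengths)  # [2, 3, 4]

# {3,}: at least three, with no upper limit, so all five vowels are taken.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()
print(vowel_run)  # ueuei

# {8}: exactly eight digits.
print(fully_matches(r"\d{8}", "20240131"), fully_matches(r"\d{8}", "2024013"))  # True False
```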

1578

1582

  <para>

1579

   The quantifier {0} is permitted, causing the  expression  to

1580

   behave  as  if the previous item and the quantifier were not

1583

   The quantifier {0} is permitted, causing the expression to

1584

   behave as if the previous item and the quantifier were not

1581

1585

   present.

1582

1586

  </para>

1583

1587

  <para>

1584

   For convenience (and  historical  compatibility)  the  three

1588

   For convenience (and historical compatibility) the three

1585

1589

   most common quantifiers have single-character abbreviations:

1586

1590

1587

1591

   <table>

...

@@ -1605,63 +1609,63 @@

1605

1609

   </table>

1606

1610

  </para>

1607

1611

  <para>

1608

   It is possible to construct infinite loops  by  following  a

1609

   subpattern  that  can  match no characters with a quantifier

1612

   It is possible to construct infinite loops by following a

1613

   subpattern that can match no characters with a quantifier

1610

1614

   that has no upper limit, for example:

1611

1615

1612

1616

   <literal>(a?)*</literal>

1613

1617

  </para>

1614

1618

  <para>

1615

   Earlier versions of Perl and PCRE used to give an  error  at

1616

   compile  time  for such patterns. However, because there are

1617

   cases where this  can  be  useful,  such  patterns  are  now

1618

   accepted,  but  if  any repetition of the subpattern does in

1619

   Earlier versions of Perl and PCRE used to give an error at

1620

   compile time for such patterns. However, because there are

1621

   cases where this can be useful, such patterns are now

1622

   accepted, but if any repetition of the subpattern does in

1619

1623

   fact match no characters, the loop is forcibly broken.

1620

1624

  </para>

1621

1625

  <para>

1622

   By default, the quantifiers  are  "greedy",  that  is,  they

1623

   match  as much as possible (up to the maximum number of permitted

1624

   times), without causing the rest of  the  pattern  to

1626

   By default, the quantifiers are "greedy", that is, they

1627

   match as much as possible (up to the maximum number of permitted

1628

   times), without causing the rest of the pattern to

1625

1629

   fail. The classic example of where this gives problems is in

1626

1630

   trying to match comments in C programs. These appear between

1627

   the  sequences /* and */ and within the sequence, individual

1628

   * and / characters may appear. An attempt to  match  C  comments

1631

   the sequences /* and */ and within the sequence, individual

1632

   * and / characters may appear. An attempt to match C comments

1629

1633

   by applying the pattern

1630

1634

1631

1635

   <literal>/\*.*\*/</literal>

1632

1636

1633

1637

   to the string

1634

1638

1635

   <literal>/* first comment */  not comment  /* second comment */</literal>

1639

   <literal>/* first comment */ not comment /* second comment */</literal>

1636

1640

1637

   fails, because it matches  the  entire  string  due  to  the

1638

   greediness of the .*  item.

1641

   fails, because it matches the entire string due to the

1642

   greediness of the .* item.

1639

1643

  </para>

1640

1644

  <para>

1641

   However, if a quantifier is followed  by  a  question  mark,

1645

   However, if a quantifier is followed by a question mark,

1642

1646

   then it becomes lazy, and instead matches the minimum

1643

1647

   number of times possible, so the pattern

1644

1648

1645

1649

   <literal>/\*.*?\*/</literal>

1646

1650

1647

1651

   does the right thing with the C comments. The meaning of the

1648

   various  quantifiers is not otherwise changed, just the preferred

1649

   number of matches.  Do not confuse this use of

1650

   question  mark  with  its  use as a quantifier in its own right.

1652

   various quantifiers is not otherwise changed, just the preferred

1653

   number of matches. Do not confuse this use of

1654

   question mark with its use as a quantifier in its own right.

1651

1655

   Because it has two uses, it can sometimes appear doubled, as

1652

1656

   in

1653

1657

1654

1658

   <literal>\d??\d</literal>

1655

1659

1656

   which matches one digit by preference, but can match two  if

1660

   which matches one digit by preference, but can match two if

1657

1661

   that is the only way the rest of the pattern matches.

1658

1662

  </para>
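The difference between greedy and lazy matching on the C comment example can be seen in a few lines. This is a sketch using Python's `re` module, assumed here to treat `*`, `*?` and `??` the same way PCRE does; the variable names are illustrative only.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/" in the subject.
greedy = re.findall(r"/\*.*\*/", subject)
# Lazy: .*? stops at the first "*/" it can reach.
lazy = re.findall(r"/\*.*?\*/", subject)

print(len(greedy))  # 1
print(lazy)         # ['/* first comment */', '/* second comment */']

# A doubled question mark: an optional digit that prefers to be skipped.
prefers_one = re.match(r"\d??\d", "12").group()
# When the whole subject must match, the optional digit is used after all.
forced_two = re.fullmatch(r"\d??\d", "12").group()
print(prefers_one, forced_two)  # 1 12
```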

1659

1663

  <para>

1660

1664

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>

1661

   option is set (an option which  is  not

1662

   available  in  Perl)  then the quantifiers are not greedy by

1665

   option is set (an option which is not

1666

   available in Perl) then the quantifiers are not greedy by

1663

1667

   default, but individual ones can be made greedy by following

1664

   them  with  a  question mark. In other words, it inverts the

1668

   them with a question mark. In other words, it inverts the

1665

1669

   default behaviour.

1666

1670

  </para>

1667

1671

  <para>

...

@@ -1673,41 +1677,41 @@

1673

1677

  </para>

1674

1678

  <para>

1675

1679

   When a parenthesized subpattern is quantified with a minimum

1676

   repeat  count  that is greater than 1 or with a limited maximum,

1677

   more store is required for the  compiled  pattern,  in

1680

   repeat count that is greater than 1 or with a limited maximum,

1681

   more memory is required for the compiled pattern, in

1678

1682

   proportion to the size of the minimum or maximum.

1679

1683

  </para>

1680

1684

  <para>

1681

   If a pattern starts with .* or  .{0,}  and  the  <link

1685

   If a pattern starts with .* or .{0,} and the <link

1682

1686

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1683

1687

   option (equivalent to Perl's /s) is set, thus allowing the .

1684

   to match newlines, then the pattern is implicitly  anchored,

1688

   to match newlines, then the pattern is implicitly anchored,

1685

1689

   because whatever follows will be tried against every character

1686

   position in the subject string, so there is no point  in

1687

   retrying  the overall match at any position after the first.

1690

   position in the subject string, so there is no point in

1691

   retrying the overall match at any position after the first.

1688

1692

   PCRE treats such a pattern as though it were preceded by \A.

1689

   In  cases where it is known that the subject string contains

1693

   In cases where it is known that the subject string contains

1690

1694

   no newlines, it is worth setting <link

1691

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  when  the

1695

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the

1692

1696

   pattern begins with .* in order to

1693

1697

   obtain this optimization, or

1694

1698

   alternatively using ^ to indicate anchoring explicitly.

1695

1699

  </para>

1696

1700

  <para>

1697

   When a capturing subpattern is repeated, the value  captured

1701

   When a capturing subpattern is repeated, the value captured

1698

1702

   is the substring that matched the final iteration. For example, after

1699

1703

1700

1704

   <literal>(tweedle[dume]{3}\s*)+</literal>

1701

1705

1702

   has matched "tweedledum tweedledee" the value  of  the  captured

1703

   substring  is  "tweedledee".  However,  if  there are

1704

   nested capturing  subpatterns,  the  corresponding  captured

1705

   values  may  have been set in previous iterations. For example,

1706

   has matched "tweedledum tweedledee" the value of the captured

1707

   substring is "tweedledee". However, if there are

1708

   nested capturing subpatterns, the corresponding captured

1709

   values may have been set in previous iterations. For example,

1706

1710

   after

1707

1711

1708

1712

   <literal>/(a|(b))+/</literal>

1709

1713

1710

   matches "aba" the value of the second captured substring  is

1714

   matches "aba" the value of the second captured substring is

1711

1715

   "b".

1712

1716

  </para>
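Both examples above can be reproduced with the following sketch. It relies on Python's `re` module, which, like the behaviour described here, keeps the value a nested group captured in an earlier iteration; treating that as equivalent to PCRE is an assumption of the sketch.

```python
import re

# The outer group is overwritten on every iteration, so the last one wins.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(repeated.group(0))  # tweedledum tweedledee
print(repeated.group(1))  # tweedledee

# The final iteration matches "a" through the first branch, but group 2
# still holds the "b" captured during the second iteration.
nested = re.match(r"(a|(b))+", "aba")
print(nested.group(1), nested.group(2))  # a b
```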

1713

1717

 </section>

...

@@ -1715,74 +1719,74 @@

1715

1719

 <section xml:id="regexp.reference.back-references">

1716

1720

  <title>Back references</title>

1717

1721

  <para>

1718

   Outside a character class, a backslash followed by  a  digit

1719

   greater  than  0  (and  possibly  further  digits) is a back

1720

   reference to a capturing subpattern  earlier  (i.e.  to  its

1721

   left)  in  the  pattern,  provided there have been that many

1722

   Outside a character class, a backslash followed by a digit

1723

   greater than 0 (and possibly further digits) is a back

1724

   reference to a capturing subpattern earlier (i.e. to its

1725

   left) in the pattern, provided there have been that many

1722

1726

   previous capturing left parentheses.

1723

1727

  </para>

1724

1728

  <para>

1725

   However, if the decimal number following  the  backslash  is

1726

   less  than  10,  it is always taken as a back reference, and

1727

   causes an error only if there are not  that  many  capturing

1728

   left  parentheses in the entire pattern. In other words, the

1729

   parentheses that are referenced need not be to the  left  of

1730

   the  reference  for  numbers  less  than 10.

1729

   However, if the decimal number following the backslash is

1730

   less than 10, it is always taken as a back reference, and

1731

   causes an error only if there are not that many capturing

1732

   left parentheses in the entire pattern. In other words, the

1733

   parentheses that are referenced need not be to the left of

1734

   the reference for numbers less than 10.

1731

1735

   A "forward back reference" can make sense when a repetition

1732

1736

   is involved and the subpattern to the right has participated

1733

1737

   in an earlier iteration. See the section

1734

   entitled "Backslash" above for further details of  the  handling

1738

   on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling

1735

1739

   of digits following a backslash.

1736

1740

  </para>

1737

1741

  <para>

1738

   A back reference matches whatever actually matched the  capturing

1742

   A back reference matches whatever actually matched the capturing

1739

1743

   subpattern in the current subject string, rather than

1740

1744

   anything matching the subpattern itself. So the pattern

1741

1745

1742

1746

   <literal>(sens|respons)e and \1ibility</literal>

1743

1747

1744

   matches "sense and sensibility" and "response and  responsibility",

1745

   but  not  "sense  and  responsibility". If case-sensitive (caseful)

1748

   matches "sense and sensibility" and "response and responsibility",

1749

   but not "sense and responsibility". If case-sensitive (caseful)

1746

1750

   matching is in force at the time of the back reference, then

1747

1751

   the case of letters is relevant. For example,

1748

1752

1749

1753

   <literal>((?i)rah)\s+\1</literal>

1750

1754

1751

   matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even

1752

   though  the  original  capturing subpattern is matched

1755

   matches "rah rah" and "RAH RAH", but not "RAH rah", even

1756

   though the original capturing subpattern is matched

1753

1757

   case-insensitively (caselessly).

1754

1758

  </para>
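A sketch of the two back reference examples, written with Python's `re` module as an assumed stand-in for PCRE. One deliberate substitution: recent Python versions reject a bare `(?i)` in the middle of a pattern, so the second pattern uses the scoped form `(?i:rah)` instead of the `(?i)rah` shown above. The effect on the back reference is the same, since `\1` lies outside the case-insensitive scope.

```python
import re

# \1 must repeat the text that group 1 actually matched.
sense = re.compile(r"(sens|respons)e and \1ibility")
for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject, "->", sense.fullmatch(subject) is not None)

# Group 1 is matched caselessly, but the back reference is compared
# case-sensitively because it sits outside the (?i:...) scope.
rah = re.compile(r"((?i:rah))\s+\1")
for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, "->", rah.fullmatch(subject) is not None)
```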

1755

1759

  <para>

1756

   There may be more than one back reference to the  same  subpattern.

1757

   If  a  subpattern  has not actually been used in a

1758

   particular match, then any  back  references  to  it  always

1760

   There may be more than one back reference to the same subpattern.

1761

   If a subpattern has not actually been used in a

1762

   particular match, then any back references to it always

1759

1763

   fail. For example, the pattern

1760

1764

1761

1765

   <literal>(a|(bc))\2</literal>

1762

1766

1763

   always fails if it starts to match  "a"  rather  than  "bc".

1764

   Because  there  may  be up to 99 back references, all digits

1765

   following the backslash are taken as  part  of  a  potential

1767

   always fails if it starts to match "a" rather than "bc".

1768

   Because there may be up to 99 back references, all digits

1769

   following the backslash are taken as part of a potential

1766

1770

   back reference number. If the pattern continues with a digit

1767

1771

   character, then some delimiter must be used to terminate the

1768

1772

   back reference. If the <link

1769

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>  option

1770

   is set, this can be whitespace.  Otherwise an empty comment can be used.

1773

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option

1774

   is set, this can be whitespace. Otherwise an empty comment can be used.

1771

1775

  </para>
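The failure of a back reference to an unused subpattern can be illustrated as follows. The sketch uses Python's `re` module, which is assumed to share this behaviour with PCRE; the subject strings are invented.

```python
import re

unset_reference = re.compile(r"(a|(bc))\2")

# Group 2 takes part in the match, so \2 has something to repeat.
print(unset_reference.fullmatch("bcbc") is not None)  # True

# The first branch matches "a", group 2 stays unset, and \2 fails.
print(unset_reference.search("aa") is None)  # True
```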

1772

1776

  <para>

1773

1777

   A back reference that occurs inside the parentheses to which

1774

   it  refers  fails when the subpattern is first used, so, for

1775

   example, (a\1) never matches.  However, such references  can

1778

   it refers fails when the subpattern is first used, so, for

1779

   example, (a\1) never matches. However, such references can

1776

1780

   be useful inside repeated subpatterns. For example, the pattern

1777

1781

1778

1782

   <literal>(a|b\1)+</literal>

1779

1783

1780

   matches any number of "a"s and also "aba", "ababba" etc.  At

1784

   matches any number of "a"s and also "aba", "ababba" etc. At

1781

1785

   each iteration of the subpattern, the back reference matches

1782

   the character string corresponding to  the  previous  iteration.

1786

   the character string corresponding to the previous iteration.

1783

1787

   In order for this to work, the pattern must be such

1784

   that the first iteration does not need  to  match  the  back

1785

   reference.  This  can  be  done using alternation, as in the

1788

   that the first iteration does not need to match the back

1789

   reference. This can be done using alternation, as in the

1786

1790

   example above, or by a quantifier with a minimum of zero.

1787

1791

  </para>

1788

1792

  <para>

...

@@ -1817,18 +1821,18 @@

1817

1821

 <section xml:id="regexp.reference.assertions">

1818

1822

  <title>Assertions</title>

1819

1823

  <para>

1820

   An assertion is  a  test  on  the  characters  following  or

1821

   preceding  the current matching point that does not actually

1822

   consume any characters. The simple assertions coded  as  \b,

1823

   \B,  \A,  \Z,  \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated

1824

   assertions are coded as  subpatterns.  There  are  two

1825

   kinds:  those that <emphasis>look ahead</emphasis> of the current position in the

1824

   An assertion is a test on the characters following or

1825

   preceding the current matching point that does not actually

1826

   consume any characters. The simple assertions coded as \b,

1827

   \B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated

1828

   assertions are coded as subpatterns. There are two

1829

   kinds: those that <emphasis>look ahead</emphasis> of the current position in the

1826

1830

   subject string, and those that <emphasis>look behind</emphasis> it.

1827

1831

  </para>

1828

1832

  <para>

1829

1833

   An assertion subpattern is matched in the normal way, except

1830

   that  it  does not cause the current matching position to be

1831

   changed. <emphasis>Lookahead</emphasis> assertions start with  (?=  for  positive

1834

   that it does not cause the current matching position to be

1835

   changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive

1832

1836

   assertions and (?! for negative assertions. For example,

1833

1837

1834

1838

   <literal>\w+(?=;)</literal>

...

@@ -1838,27 +1842,27 @@

1838

1842

1839

1843

   <literal>foo(?!bar)</literal>

1840

1844

1841

   matches any occurrence of "foo"  that  is  not  followed  by

1845

   matches any occurrence of "foo" that is not followed by

1842

1846

   "bar". Note that the apparently similar pattern

1843

1847

1844

1848

   <literal>(?!foo)bar</literal>

1845

1849

1846

   does not find an occurrence of "bar"  that  is  preceded  by

1850

   does not find an occurrence of "bar" that is preceded by

1847

1851

   something other than "foo"; it finds any occurrence of "bar"

1848

   whatsoever, because the assertion  (?!foo)  is  always  &true;

1849

   when  the  next  three  characters  are  "bar". A lookbehind

1852

   whatsoever, because the assertion (?!foo) is always &true;

1853

   when the next three characters are "bar". A lookbehind

1850

1854

   assertion is needed to achieve this effect.

1851

1855

  </para>
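The contrast between the two lookahead patterns above is easy to verify. This sketch uses Python's `re` module, whose lookahead syntax is assumed to match PCRE here; the subject strings are illustrative.

```python
import re

# "foo" not followed by "bar": only the second "foo" (offset 7) qualifies.
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]
print(foo_positions)  # [7]

# (?!foo)bar looks ahead from the position where "bar" starts, so the
# assertion is trivially true and the "bar" inside "foobar" still matches.
bar_match = re.search(r"(?!foo)bar", "foobar")
print(bar_match.start())  # 3
```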

1852

1856

  <para>

1853

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;=  for  positive  assertions

1857

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions

1854

1858

   and (?&lt;! for negative assertions. For example,

1855

1859

1856

1860

   <literal>(?&lt;!foo)bar</literal>

1857

1861

1858

   does find an occurrence of "bar" that  is  not  preceded  by

1862

   does find an occurrence of "bar" that is not preceded by

1859

1863

   "foo". The contents of a lookbehind assertion are restricted

1860

   such that all the strings  it  matches  must  have  a  fixed

1861

   length.  However, if there are several alternatives, they do

1864

   such that all the strings it matches must have a fixed

1865

   length. However, if there are several alternatives, they do

1862

1866

   not all have to have the same fixed length. Thus

1863

1867

1864

1868

   <literal>(?&lt;=bullock|donkey)</literal>

...

@@ -1867,51 +1871,51 @@

1867

1871

1868

1872

   <literal>(?&lt;!dogs?|cats?)</literal>

1869

1873

1870

   causes an error at compile time. Branches  that  match  different

1874

   causes an error at compile time. Branches that match different

1871

1875

   length strings are permitted only at the top level of

1872

   a lookbehind assertion. This is an extension  compared  with

1873

   Perl  5.005,  which  requires all branches to match the same

1876

   a lookbehind assertion. This is an extension compared with

1877

   Perl 5.005, which requires all branches to match the same

1874

1878

   length of string. An assertion such as

1875

1879

1876

1880

   <literal>(?&lt;=ab(c|de))</literal>

1877

1881

1878

   is not permitted, because its single  top-level  branch  can

1882

   is not permitted, because its single top-level branch can

1879

1883

   match two different lengths, but it is acceptable if rewritten

1880

1884

   to use two top-level branches:

1881

1885

1882

1886

   <literal>(?&lt;=abc|abde)</literal>

1883

1887

1884

   The implementation of lookbehind  assertions  is,  for  each

1885

   alternative,  to  temporarily move the current position back

1886

   by the fixed width and then  try  to  match.  If  there  are

1887

   insufficient  characters  before  the  current position, the

1888

   match is deemed to fail.  Lookbehinds  in  conjunction  with

1889

   once-only  subpatterns can be particularly useful for matching

1890

   at the ends of strings; an example is given at  the  end

1888

   The implementation of lookbehind assertions is, for each

1889

   alternative, to temporarily move the current position back

1890

   by the fixed width and then try to match. If there are

1891

   insufficient characters before the current position, the

1892

   match is deemed to fail. Lookbehinds in conjunction with

1893

   once-only subpatterns can be particularly useful for matching

1894

   at the ends of strings; an example is given at the end

1891

1895

   of the section on once-only subpatterns.

1892

1896

  </para>
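A small sketch of lookbehind behaviour, including the case where too few characters precede the current position. It uses Python's `re` module and stays with single fixed-width alternatives, because Python does not accept top-level alternatives of different lengths inside a lookbehind the way PCRE does.

```python
import re

not_after_foo = re.compile(r"(?<!foo)bar")
after_foo = re.compile(r"(?<=foo)bar")

print(not_after_foo.search("foobar"))          # None
print(not_after_foo.search("bazbar").start())  # 3

# Only zero characters precede "bar" here, so the three-character step
# back cannot be made: the negative assertion succeeds and the positive
# one fails.
print(not_after_foo.search("bar").start())  # 0
print(after_foo.search("bar"))              # None
```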

1893

1897

  <para>

1894

   Several assertions (of any sort) may  occur  in  succession.

1898

   Several assertions (of any sort) may occur in succession.

1895

1899

   For example,

1896

1900

1897

1901

   <literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>

1898

1902

1899

   matches "foo" preceded by three digits that are  not  "999".

1900

   Notice  that each of the assertions is applied independently

1901

   at the same point in the subject string. First  there  is  a

1902

   check  that  the  previous  three characters are all digits,

1903

   matches "foo" preceded by three digits that are not "999".

1904

   Notice that each of the assertions is applied independently

1905

   at the same point in the subject string. First there is a

1906

   check that the previous three characters are all digits,

1903

1907

   then there is a check that the same three characters are not

1904

   "999".   This  pattern  does not match "foo" preceded by six

1908

   "999". This pattern does not match "foo" preceded by six

1905

1909

   characters, the first three of which are digits and the last three

1906

   of  which  are  not  "999".  For  example,  it doesn't match

1910

   of which are not "999". For example, it doesn't match

1907

1911

   "123abcfoo". A pattern to do that is

1908

1912

1909

1913

   <literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>

1910

1914

  </para>

1911

1915

  <para>

1912

   This time the first assertion looks  at  the  preceding  six

1913

   characters,  checking  that  the first three are digits, and

1914

   then the second assertion checks that  the  preceding  three

1916

   This time the first assertion looks at the preceding six

1917

   characters, checking that the first three are digits, and

1918

   then the second assertion checks that the preceding three

1915

1919

   characters are not "999".

1916

1920

  </para>
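The two stacked-assertion patterns can be compared directly. The sketch below uses Python's `re` module, assumed to agree with PCRE for these fixed-width lookbehinds; the names `adjacent` and `spaced` are made up for the illustration.

```python
import re

# Both assertions inspect the same three characters before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")
# The first assertion now inspects six characters: three digits, then any three.
spaced = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject,
          "adjacent:", adjacent.search(subject) is not None,
          "spaced:", spaced.search(subject) is not None)
```

Only "123foo" satisfies the first pattern and only "123abcfoo" satisfies the second, which is the distinction the text draws.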

1917

1921

  <para>

...

@@ -1919,26 +1923,26 @@

1919

1923

1920

1924

   <literal>(?&lt;=(?&lt;!foo)bar)baz</literal>

1921

1925

1922

   matches an occurrence of "baz" that  is  preceded  by  "bar"

1926

   matches an occurrence of "baz" that is preceded by "bar"

1923

1927

   which in turn is not preceded by "foo", while

1924

1928

1925

1929

   <literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>

1926

1930

1927

   is another pattern which matches  "foo"  preceded  by  three

1931

   is another pattern which matches "foo" preceded by three

1928

1932

   digits and any three characters that are not "999".

1929

1933

  </para>

1930

1934

  <para>

1931

1935

   Assertion subpatterns are not capturing subpatterns, and may

1932

   not  be  repeated,  because  it makes no sense to assert the

1933

   same thing several times. If any kind of assertion  contains

1934

   capturing  subpatterns  within it, these are counted for the

1936

   not be repeated, because it makes no sense to assert the

1937

   same thing several times. If any kind of assertion contains

1938

   capturing subpatterns within it, these are counted for the

1935

1939

   purposes of numbering the capturing subpatterns in the whole

1936

   pattern.   However,  substring capturing is carried out only

1937

   for positive assertions, because it does not make sense  for

1940

   pattern. However, substring capturing is carried out only

1941

   for positive assertions, because it does not make sense for

1938

1942

   negative assertions.

1939

1943

  </para>

1940

1944

  <para>

1941

   Assertions count towards the maximum  of  200  parenthesized

1945

   Assertions count towards the maximum of 200 parenthesized

1942

1946

   subpatterns.

1943

1947

  </para>

1944

1948

 </section>

...

@@ -1946,17 +1950,17 @@

1946

1950

 <section xml:id="regexp.reference.onlyonce">

1947

1951

  <title>Once-only subpatterns</title>

1948

1952

  <para>

1949

   With both maximizing and minimizing repetition,  failure  of

1950

   what  follows  normally  causes  the repeated item to be

1953

   With both maximizing and minimizing repetition, failure of

1954

   what follows normally causes the repeated item to be

1951

1955

   re-evaluated to see if a different number of repeats allows the

1952

   rest  of  the  pattern  to  match. Sometimes it is useful to

1953

   prevent this, either to change the nature of the  match,  or

1954

   to  cause  it fail earlier than it otherwise might, when the

1955

   author of the pattern knows there is no  point  in  carrying

1956

   rest of the pattern to match. Sometimes it is useful to

1957

   prevent this, either to change the nature of the match, or

1958

   to cause it to fail earlier than it otherwise might, when the

1959

   author of the pattern knows there is no point in carrying

1956

1960

   on.

1957

1961

  </para>

1958

1962

  <para>

1959

   Consider, for example, the pattern \d+foo  when  applied  to

1963

   Consider, for example, the pattern \d+foo when applied to

1960

1964

   the subject line

1961

1965

1962

1966

   <literal>123456bar</literal>

...

@@ -1964,108 +1968,108 @@

1964

1968

  <para>

1965

1969

   After matching all 6 digits and then failing to match "foo",

1966

1970

   the normal action of the matcher is to try again with only 5

1967

   digits matching the \d+ item, and then with 4,  and  so  on,

1971

   digits matching the \d+ item, and then with 4, and so on,

1968

1972

   before ultimately failing. Once-only subpatterns provide the

1969

   means for specifying that once a portion of the pattern  has

1970

   matched,  it  is  not to be re-evaluated in this way, so the

1971

   matcher would give up immediately on failing to match  "foo"

1972

   the  first  time.  The  notation  is another kind of special

1973

   means for specifying that once a portion of the pattern has

1974

   matched, it is not to be re-evaluated in this way, so the

1975

   matcher would give up immediately on failing to match "foo"

1976

   the first time. The notation is another kind of special

1973

1977

   parenthesis, starting with (?&gt; as in this example:

1974

1978

1975

1979

   <literal>(?&gt;\d+)foo</literal>

1976

1980

  </para>

1977

1981

  <para>

1978

   This kind of parenthesis "locks up" the  part of the pattern

1979

   it  contains once it has matched, and a failure further into

1980

   the pattern is prevented from backtracking  into  it.

1981

   Backtracking  past  it to previous items, however, works as normal.

1982

   This kind of parenthesis "locks up" the part of the pattern

1983

   it contains once it has matched, and a failure further into

1984

   the pattern is prevented from backtracking into it.

1985

   Backtracking past it to previous items, however, works as normal.

1982

1986

  </para>

1983

1987

  <para>

1984

1988

   An alternative description is that a subpattern of this type

1985

   matches  the  string  of  characters that an identical standalone

1989

   matches the string of characters that an identical standalone

1986

1990

   pattern would match, if anchored at the current point

1987

1991

   in the subject string.

1988

1992

  </para>

1989

1993

  <para>

1990

   Once-only subpatterns are not capturing subpatterns.  Simple

1991

   cases  such as the above example can be thought of as a maximizing

1992

   repeat that must  swallow  everything  it  can.  So,

1994

   Once-only subpatterns are not capturing subpatterns. Simple

1995

   cases such as the above example can be thought of as a maximizing

1996

   repeat that must swallow everything it can. So,

1993

1997

   while both \d+ and \d+? are prepared to adjust the number of

1994

   digits they match in order to make the rest of  the  pattern

1998

   digits they match in order to make the rest of the pattern

1995

1999

   match, (?&gt;\d+) can only match an entire sequence of digits.

1996

2000

  </para>
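  <para>
   The difference can be seen with a small sketch using
   <function>preg_match</function>. The subject string and the extra literal
   <literal>6</literal> in the patterns are assumptions chosen for
   illustration; they are not part of the original example.
  </para>

```php
<?php
// Hypothetical subject chosen for illustration.
$subject = '123456foo';

// Ordinary greedy repeat: \d+ first takes "123456", then gives back the
// final "6" so that the literal 6 in the pattern can match.
$greedy = preg_match('/\d+6foo/', $subject);

// Once-only group: (?>\d+) keeps every digit it matched, so the literal 6
// that follows can never match and the whole pattern fails.
$atomic = preg_match('/(?>\d+)6foo/', $subject);

// The once-only group does not capture: only the whole match is reported.
preg_match('/(?>\d+)foo/', $subject, $matches);

var_dump($greedy, $atomic, count($matches));
```

  <para>
   Under these assumptions the greedy form matches, the once-only form does
   not, and <literal>$matches</literal> holds a single element for the whole
   match.
  </para>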

1997

2001

  <para>

1998

   This construction can of course contain arbitrarily  complicated

2002

   This construction can of course contain arbitrarily complicated

1999

2003

   subpatterns, and it can be nested.

2000

2004

  </para>

2001

2005

  <para>

2002

2006

   Once-only subpatterns can be used in conjunction with

2003

   lookbehind assertions  to specify efficient matching at the end

2007

   lookbehind assertions to specify efficient matching at the end

2004

2008

   of the subject string. Consider a simple pattern such as

2005

2009

2006

2010

   <literal>abcd$</literal>

2007

2011

2008

   when applied to a long string which does not match.  Because

2009

   matching  proceeds  from  left  to right, PCRE will look for

2012

   when applied to a long string which does not match. Because

2013

   matching proceeds from left to right, PCRE will look for

2010

2014

   each "a" in the subject and then see if what follows matches

2011

2015

   the rest of the pattern. If the pattern is specified as

2012

2016

2013

2017

   <literal>^.*abcd$</literal>

2014

2018

2015

   then the initial .* matches the entire string at first,  but

2016

   when  this  fails  (because  there  is no following "a"), it

2019

   then the initial .* matches the entire string at first, but

2020

   when this fails (because there is no following "a"), it

2017

2021

   backtracks to match all but the last character, then all but

2018

   the  last  two  characters, and so on. Once again the search

2019

   for "a" covers the entire string, from right to left, so  we

2022

   the last two characters, and so on. Once again the search

2023

   for "a" covers the entire string, from right to left, so we

2020

2024

   are no better off. However, if the pattern is written as

2021

2025

2022

2026

   <literal>^(?>.*)(?&lt;=abcd)</literal>

2023

2027

2024

   then there can be no backtracking for the .*  item;  it  can

2025

   match  only  the  entire  string.  The subsequent lookbehind

2028

   then there can be no backtracking for the .* item; it can

2029

   match only the entire string. The subsequent lookbehind

2026

2030

   assertion does a single test on the last four characters. If

2027

   it  fails,  the  match  fails immediately. For long strings,

2031

   it fails, the match fails immediately. For long strings,

2028

2032

   this approach makes a significant difference to the processing time.

2029

2033

  </para>
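  <para>
   As a minimal sketch of the lookbehind form described above, the following
   applies the pattern to two short, made-up subjects. Real gains are only
   expected on long strings; the short subjects here just show which inputs
   match.
  </para>

```php
<?php
// The once-only group swallows the whole line; the lookbehind then makes a
// single check on the last four characters.
$pattern = '/^(?>.*)(?<=abcd)/';

// Hypothetical subjects, without newlines.
$endsWithAbcd = preg_match($pattern, 'xxxxxxxxabcd');
$startsWithAbcd = preg_match($pattern, 'abcdxxxxxxxx');

var_dump($endsWithAbcd, $startsWithAbcd);
```

  <para>
   Only the first subject ends in <literal>abcd</literal>, so only the first
   call reports a match.
  </para>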

2030

2034

  <para>

2031

2035

   When a pattern contains an unlimited repeat inside a subpattern

2032

2036

   that can itself be repeated an unlimited number of

2033

   times, the use of a once-only subpattern is the only way  to

2034

   avoid  some  failing matches taking a very long time indeed.

2037

   times, the use of a once-only subpattern is the only way to

2038

   avoid some failing matches taking a very long time indeed.

2035

2039

   The pattern

2036

2040

2037

2041

   <literal>(\D+|&lt;\d+>)*[!?]</literal>

2038

2042

2039

   matches an unlimited number of substrings that  either  consist

2040

   of  non-digits,  or digits enclosed in &lt;>, followed by

2043

   matches an unlimited number of substrings that either consist

2044

   of non-digits, or digits enclosed in &lt;>, followed by

2041

2045

   either ! or ?. When it matches, it runs quickly. However, if

2042

2046

   it is applied to

2043

2047

2044

2048

   <literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>

2045

2049

2046

   it takes a long  time  before  reporting  failure.  This  is

2050

   it takes a long time before reporting failure. This is

2047

2051

   because the string can be divided between the two repeats in

2048

2052

   a large number of ways, and all have to be tried. (The example

2049

   used  [!?]  rather  than a single character at the end,

2050

   because both PCRE and Perl have an optimization that  allows

2051

   for  fast  failure  when  a  single  character is used. They

2052

   remember the last single character that is  required  for  a

2053

   match,  and  fail early if it is not present in the string.)

2053

   used [!?] rather than a single character at the end,

2054

   because both PCRE and Perl have an optimization that allows

2055

   for fast failure when a single character is used. They

2056

   remember the last single character that is required for a

2057

   match, and fail early if it is not present in the string.)

2054

2058

   If the pattern is changed to

2055

2059

2056

2060

   <literal>((?>\D+)|&lt;\d+>)*[!?]</literal>

2057

2061

2058

   sequences of non-digits cannot be broken, and  failure  happens quickly.

2062

   sequences of non-digits cannot be broken, and failure happens quickly.

2059

2063

  </para>
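  <para>
   The once-only variant can be tried as follows. This is a sketch only: the
   subject length of 52 is an assumption, and the unprotected pattern is
   deliberately not run, because its behaviour on a failing subject depends
   on the configured backtracking limits and on whether JIT is enabled.
  </para>

```php
<?php
// A run of "a" characters with no "!" or "?" anywhere (assumed length).
$subject = str_repeat('a', 52);

// Each sequence of non-digits is matched once and never re-divided.
$pattern = '/((?>\D+)|<\d+>)*[!?]/';

$result = preg_match($pattern, $subject);
$noError = preg_last_error() === PREG_NO_ERROR;

var_dump($result, $noError);
```

  <para>
   A return value of <literal>0</literal> together with
   <constant>PREG_NO_ERROR</constant> indicates an ordinary "no match" rather
   than an aborted match.
  </para>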

2060

2064

 </section>

2061

2065

2062

2066

 <section xml:id="regexp.reference.conditional">

2063

2067

  <title>Conditional subpatterns</title>

2064

2068

  <para>

2065

   It is possible to cause the matching process to obey a  subpattern

2066

   conditionally  or to choose between two alternative

2067

   subpatterns, depending on the result  of  an  assertion,  or

2068

   whether  a previous capturing subpattern matched or not. The

2069

   It is possible to cause the matching process to obey a subpattern

2070

   conditionally or to choose between two alternative

2071

   subpatterns, depending on the result of an assertion, or

2072

   whether a previous capturing subpattern matched or not. The

2069

2073

   two possible forms of conditional subpattern are

2070

2074

  </para>

2071

2075

...

@@ -2079,39 +2083,39 @@

2079

2083

  </informalexample>

2080

2084

  <para>

2081

2085

   If the condition is satisfied, the yes-pattern is used; otherwise

2082

   the  no-pattern  (if  present) is used. If there are

2086

   the no-pattern (if present) is used. If there are

2083

2087

   more than two alternatives in the subpattern, a compile-time

2084

2088

   error occurs.

2085

2089

  </para>

2086

2090

  <para>

2087

   There are two kinds of condition. If the  text  between  the

2088

   parentheses  consists  of  a  sequence  of  digits, then the

2089

   condition is satisfied if the capturing subpattern  of  that

2090

   number  has  previously matched. Consider the following pattern,

2091

   which contains non-significant white space to make  it

2092

   more  readable  (assume  the  <link

2091

   There are two kinds of condition. If the text between the

2092

   parentheses consists of a sequence of digits, then the

2093

   condition is satisfied if the capturing subpattern of that

2094

   number has previously matched. Consider the following pattern,

2095

   which contains non-significant white space to make it

2096

   more readable (assume the <link

2093

2097

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2094

   option)  and to divide it into three parts for ease of discussion:

2098

   option) and to divide it into three parts for ease of discussion:

2095

2099

  </para>

2096

2100

  <informalexample>

2097

2101

   <programlisting>

2098

2102

<![CDATA[

2099

( \( )?    [^()]+    (?(1) \) )

2103

( \( )? [^()]+ (?(1) \) )

2100

2104

]]>

2101

2105

   </programlisting>

2102

2106

  </informalexample>

2103

2107

  <para>

2104

   The first part matches an optional opening parenthesis,  and

2105

   if  that character is present, sets it as the first captured

2106

   substring. The second part matches one  or  more  characters

2107

   that  are  not  parentheses. The third part is a conditional

2108

   subpattern that tests whether the first set  of  parentheses

2109

   matched  or  not.  If  they did, that is, if the subject started

2110

   with an opening parenthesis, the condition is &true;,  and  so

2111

   the  yes-pattern  is  executed  and a closing parenthesis is

2112

   required. Otherwise, since no-pattern is  not  present,  the

2113

   subpattern  matches  nothing.  In  other words, this pattern

2114

   matches a sequence of non-parentheses,  optionally  enclosed

2108

   The first part matches an optional opening parenthesis, and

2109

   if that character is present, sets it as the first captured

2110

   substring. The second part matches one or more characters

2111

   that are not parentheses. The third part is a conditional

2112

   subpattern that tests whether the first set of parentheses

2113

   matched or not. If they did, that is, if the subject started

2114

   with an opening parenthesis, the condition is &true;, and so

2115

   the yes-pattern is executed and a closing parenthesis is

2116

   required. Otherwise, since no-pattern is not present, the

2117

   subpattern matches nothing. In other words, this pattern

2118

   matches a sequence of non-parentheses, optionally enclosed

2115

2119

   in parentheses.

2116

2120

  </para>
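  <para>
   The behaviour described above can be checked with a short sketch. The
   <literal>^</literal> and <literal>$</literal> anchors are an addition for
   this illustration, so that unbalanced subjects are rejected instead of
   matching on an inner substring; the subjects are made up.
  </para>

```php
<?php
// The x modifier corresponds to PCRE_EXTENDED, so the spaces are ignored.
// The anchors ^ and $ are added for this example only.
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

$results = [];
foreach (['abc', '(abc)', '(abc', 'abc)'] as $subject) {
    // 1 means the subject matched, 0 means it did not.
    $results[$subject] = preg_match($pattern, $subject);
}

print_r($results);
```

  <para>
   The closing parenthesis is required only when the opening one was
   captured, so the two unbalanced subjects are rejected.
  </para>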

2117

2121

  <para>

...

@@ -2120,10 +2124,10 @@

2120

2124

   level", the condition is false.

2121

2125

  </para>

2122

2126

  <para>

2123

   If the condition is not a sequence of digits or (R), it must be  an

2124

   assertion.  This  may be a positive or negative lookahead or

2125

   lookbehind assertion. Consider this pattern, again  containing

2126

   non-significant  white space, and with the two alternatives on

2127

   If the condition is not a sequence of digits or (R), it must be an

2128

   assertion. This may be a positive or negative lookahead or

2129

   lookbehind assertion. Consider this pattern, again containing

2130

   non-significant white space, and with the two alternatives on

2127

2131

   the second line:

2128

2132

  </para>

2129

2133

...

@@ -2131,18 +2135,18 @@

2131

2135

   <programlisting>

2132

2136

<![CDATA[

2133

2137

(?(?=[^a-z]*[a-z])

2134

\d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

2138

\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

2135

2139

]]>

2136

2140

   </programlisting>

2137

2141

  </informalexample>

2138

2142

  <para>

2139

2143

   The condition is a positive lookahead assertion that matches

2140

2144

   an optional sequence of non-letters followed by a letter. In

2141

   other words, it tests for  the  presence  of  at  least  one

2142

   letter  in the subject. If a letter is found, the subject is

2143

   matched against  the  first  alternative;  otherwise  it  is

2144

   matched  against the second. This pattern matches strings in

2145

   one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are

2145

   other words, it tests for the presence of at least one

2146

   letter in the subject. If a letter is found, the subject is

2147

   matched against the first alternative; otherwise it is

2148

   matched against the second. This pattern matches strings in

2149

   one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are

2146

2150

   letters and dd are digits.

2147

2151

  </para>
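  <para>
   A minimal sketch of the assertion condition follows. As before, the
   anchors and the sample subjects are assumptions added for illustration.
  </para>

```php
<?php
// x modifier: white space and the line break inside the pattern are ignored.
// The lookahead decides which of the two alternatives is tried.
$pattern = '/^ (?(?=[^a-z]*[a-z])
      \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) $/x';

$withLetters = preg_match($pattern, '12-abc-34'); // letter present: first form
$digitsOnly  = preg_match($pattern, '12-34-56');  // no letter: second form
$tooShort    = preg_match($pattern, '12-ab-34');  // letter present, wrong shape

var_dump($withLetters, $digitsOnly, $tooShort);
```

  <para>
   In the third case the condition selects the first alternative, which
   fails, and the second alternative is not tried.
  </para>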

2148

2152

 </section>

...

@@ -2150,14 +2154,14 @@

2150

2154

 <section xml:id="regexp.reference.comments">

2151

2155

  <title>Comments</title>

2152

2156

  <para>

2153

   The  sequence  (?#  marks  the  start  of  a  comment  which

2154

   continues   up  to  the  next  closing  parenthesis.  Nested

2157

   The sequence (?# marks the start of a comment which

2158

   continues up to the next closing parenthesis. Nested

2155

2159

   parentheses are not permitted. The characters that make up a

2156

2160

   comment play no part in the pattern matching at all.

2157

2161

  </para>

2158

2162

  <para>

2159

2163

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2160

   option is set, an unescaped # character outside  a character class

2164

   option is set, an unescaped # character outside a character class

2161

2165

   introduces a comment that continues up to the next newline character

2162

2166

   in the pattern.

2163

2167

  </para>
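  <para>
   Both comment styles can be sketched as follows; the patterns and subjects
   are illustrative and not taken from the text above.
  </para>

```php
<?php
// Inline comment: everything from (?# up to the next ) is ignored.
$inline = '/ab(?# this text is skipped by the matcher)cd/';

// With the x modifier (PCRE_EXTENDED), an unescaped # outside a character
// class starts a comment that runs to the end of the line.
$extended = '/
    \d{2}   # two digits
    -       # a literal hyphen
    \d{2}   # two more digits
/x';

$inlineResult = preg_match($inline, 'abcd');
$extendedResult = preg_match($extended, '12-34');

var_dump($inlineResult, $extendedResult);
```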

...

@@ -2201,15 +2205,15 @@ int(1)

2201

2205

 <section xml:id="regexp.reference.recursive">

2202

2206

  <title>Recursive patterns</title>

2203

2207

  <para>

2204

   Consider the problem of matching a  string  in  parentheses,

2205

   allowing  for  unlimited nested parentheses. Without the use

2206

   of recursion, the best that can be done is to use a  pattern

2207

   that  matches  up  to some fixed depth of nesting. It is not

2208

   possible to handle an arbitrary nesting depth. Perl 5.6  has

2209

   provided   an  experimental  facility  that  allows  regular

2210

   expressions to recurse (among other things).  The  special

2211

   item (?R) is  provided for  the specific  case of recursion.

2212

   This PCRE  pattern  solves the  parentheses  problem (assume

2208

   Consider the problem of matching a string in parentheses,

2209

   allowing for unlimited nested parentheses. Without the use

2210

   of recursion, the best that can be done is to use a pattern

2211

   that matches up to some fixed depth of nesting. It is not

2212

   possible to handle an arbitrary nesting depth. Perl 5.6 has

2213

   provided an experimental facility that allows regular

2214

   expressions to recurse (among other things). The special

2215

   item (?R) is provided for the specific case of recursion.

2216

   This PCRE pattern solves the parentheses problem (assume

2213

2217

   the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2214

2218

   option is set so that white space is

2215

2219

   ignored):

...

@@ -2218,45 +2222,45 @@ int(1)

2218

2222

  </para>

2219

2223

  <para>

2220

2224

   First it matches an opening parenthesis. Then it matches any

2221

   number  of substrings which can either be a sequence of

2222

   non-parentheses, or a recursive  match  of  the  pattern  itself

2225

   number of substrings which can either be a sequence of

2226

   non-parentheses, or a recursive match of the pattern itself

2223

2227

   (i.e. a correctly parenthesized substring). Finally there is

2224

2228

   a closing parenthesis.

2225

2229

  </para>

2226

2230

  <para>

2227

   This particular example pattern  contains  nested  unlimited

2231

   This particular example pattern contains nested unlimited

2228

2232

   repeats, and so the use of a once-only subpattern for matching

2229

   strings of non-parentheses is  important  when  applying

2230

   the  pattern to strings that do not match. For example, when

2233

   strings of non-parentheses is important when applying

2234

   the pattern to strings that do not match. For example, when

2231

2235

   it is applied to

2232

2236

2233

2237

   <literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>

2234

2238

2235

   it yields "no match" quickly. However, if a  once-only  subpattern

2236

   is  not  used,  the match runs for a very long time

2237

   indeed because there are so many different ways the + and  *

2238

   repeats  can carve up the subject, and all have to be tested

2239

   it yields "no match" quickly. However, if a once-only subpattern

2240

   is not used, the match runs for a very long time

2241

   indeed because there are so many different ways the + and *

2242

   repeats can carve up the subject, and all have to be tested

2239

2243

   before failure can be reported.

2240

2244

  </para>

2241

2245

  <para>

2242

   The values set for any capturing subpatterns are those  from

2246

   The values set for any capturing subpatterns are those from

2243

2247

   the outermost level of the recursion at which the subpattern

2244

2248

   value is set. If the pattern above is matched against

2245

2249

2246

2250

   <literal>(ab(cd)ef)</literal>

2247

2251

2248

   the value for the capturing parentheses is  "ef",  which  is

2249

   the  last  value  taken  on  at the top level. If additional

2252

   the value for the capturing parentheses is "ef", which is

2253

   the last value taken on at the top level. If additional

2250

2254

   parentheses are added, giving

2251

2255

2252

2256

   <literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>

2253

2257

   then the string they capture

2254

2258

   is "ab(cd)ef", the contents of the top level parentheses. If

2255

   there are more than 15 capturing parentheses in  a  pattern,

2256

   PCRE  has  to  obtain  extra  memory  to store data during a

2257

   recursion, which it does by using  pcre_malloc,  freeing  it

2258

   via  pcre_free  afterwards. If no memory can be obtained, it

2259

   saves data for the first 15 capturing parentheses  only,  as

2259

   there are more than 15 capturing parentheses in a pattern,

2260

   PCRE has to obtain extra memory to store data during a

2261

   recursion, which it does by using pcre_malloc, freeing it

2262

   via pcre_free afterwards. If no memory can be obtained, it

2263

   saves data for the first 15 capturing parentheses only, as

2260

2264

   there is no way to give an out-of-memory error from within a

2261

2265

   recursion.

2262

2266

  </para>
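  <para>
   The capture values described above can be inspected with
   <function>preg_match</function>. This sketch uses the pattern with the
   additional parentheses and the subject from the text; the variable names
   are arbitrary.
  </para>

```php
<?php
// x modifier (PCRE_EXTENDED): white space in the pattern is ignored.
// Group 1 is the added outer pair, group 2 is the repeated inner pair.
$pattern = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';

preg_match($pattern, '(ab(cd)ef)', $matches);

print_r($matches);
```

  <para>
   Group 1 holds the contents of the top level parentheses, while group 2
   holds the last value it took on at the top level.
  </para>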

...

@@ -2295,75 +2299,75 @@ int(1)

2295

2299

  <title>Performance</title>

2296

2300

  <para>

2297

2301

   Certain items that may appear in patterns are more efficient

2298

   than  others.  It is more efficient to use a character class

2302

   than others. It is more efficient to use a character class

2299

2303

   like [aeiou] than a set of alternatives such as (a|e|i|o|u).

2300

   In  general,  the  simplest  construction  that provides the

2301

   required behaviour is usually the  most  efficient.  Jeffrey

2302

   Friedl's  book contains a lot of discussion about optimizing

2304

   In general, the simplest construction that provides the

2305

   required behaviour is usually the most efficient. Jeffrey

2306

   Friedl's book contains a lot of discussion about optimizing

2303

2307

   regular expressions for efficient performance.

2304

2308

  </para>

2305

2309

  <para>

2306

2310

   When a pattern begins with .* and the <link

2307

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  option  is

2308

   set,  the  pattern  is implicitly anchored by PCRE, since it

2311

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is

2312

   set, the pattern is implicitly anchored by PCRE, since it

2309

2313

   can match only at the start of a subject string. However, if

2310

2314

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

2311

2315

   is not set, PCRE cannot make this optimization,

2312

   because the . metacharacter does not then match  a  newline,

2316

   because the . metacharacter does not then match a newline,

2313

2317

   and if the subject string contains newlines, the pattern may

2314

   match from the character immediately following one  of  them

2318

   match from the character immediately following one of them

2315

2319

   instead of from the very start. For example, the pattern

2316

2320

2317

2321

   <literal>(.*) second</literal>

2318

2322

2319

2323

   matches the subject "first\nand second" (where \n stands for

2320

2324

   a newline character) with the first captured substring being

2321

   "and". In order to do this, PCRE  has  to  retry  the  match

2325

   "and". In order to do this, PCRE has to retry the match

2322

2326

   starting after every newline in the subject.

2323

2327

  </para>
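  <para>
   The effect of the option on where the match starts can be seen in this
   sketch, which applies the pattern above with and without the
   <literal>s</literal> modifier (the PHP spelling of PCRE_DOTALL). The
   variable names are arbitrary.
  </para>

```php
<?php
$subject = "first\nand second";

// Without DOTALL the dot stops at the newline, so the match has to start
// after it and the captured text is only the second line's first word.
preg_match('/(.*) second/', $subject, $withoutDotall);

// With DOTALL the dot also matches the newline, so the match starts at the
// very beginning of the subject.
preg_match('/(.*) second/s', $subject, $withDotall);

var_dump($withoutDotall[1], $withDotall[1]);
```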

2324

2328

  <para>

2325

2329

   If you are using such a pattern with subject strings that do

2326

   not  contain  newlines,  the best performance is obtained by

2330

   not contain newlines, the best performance is obtained by

2327

2331

   setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,

2328

   or starting the  pattern  with  ^.*  to

2329

   indicate  explicit anchoring. That saves PCRE from having to

2332

   or starting the pattern with ^.* to

2333

   indicate explicit anchoring. That saves PCRE from having to

2330

2334

   scan along the subject looking for a newline to restart at.

2331

2335

  </para>

2332

2336

  <para>

2333

   Beware of patterns that contain nested  indefinite  repeats.

2334

   These  can  take a long time to run when applied to a string

2337

   Beware of patterns that contain nested indefinite repeats.

2338

   These can take a long time to run when applied to a string

2335

2339

   that does not match. Consider the pattern fragment

2336

2340

2337

2341

   <literal>(a+)*</literal>

2338

2342

  </para>

2339

2343

  <para>

2340

   This can match "aaaa" in 16 different ways, and this  number

2341

   increases  very  rapidly  as  the string gets longer. (The *

2342

   repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of

2343

   those  cases other than 0, the + repeats can match different

2344

   This can match "aaaa" in 16 different ways, and this number

2345

   increases very rapidly as the string gets longer. (The *

2346

   repeat can match 0, 1, 2, 3, or 4 times, and for each of

2347

   those cases other than 0, the + repeats can match different

2344

2348

   numbers of times.) When the remainder of the pattern is such

2345

   that  the entire match is going to fail, PCRE has in principle

2346

   to try every possible variation, and this  can  take  an

2349

   that the entire match is going to fail, PCRE has in principle

2350

   to try every possible variation, and this can take an

2347

2351

   extremely long time.

2348

2352

  </para>

2349

2353

  <para>

2350

   An optimization catches some of the more simple  cases  such

2354

   An optimization catches some of the more simple cases such

2351

2355

   as

2352

2356

2353

2357

   <literal>(a+)*b</literal>

2354

2358

2355

   where a literal character follows. Before embarking  on  the

2359

   where a literal character follows. Before embarking on the

2356

2360

   standard matching procedure, PCRE checks that there is a "b"

2357

   later in the subject string, and if there is not,  it  fails

2358

   the  match  immediately. However, when there is no following

2359

   literal this optimization cannot be used. You  can  see  the

2361

   later in the subject string, and if there is not, it fails

2362

   the match immediately. However, when there is no following

2363

   literal this optimization cannot be used. You can see the

2360

2364

   difference by comparing the behaviour of

2361

2365

2362

2366

   <literal>(a+)*\d</literal>

2363

2367

2364

   with the pattern above. The former gives  a  failure  almost

2365

   instantly  when  applied  to a whole line of "a" characters,

2366

   whereas the latter takes an appreciable  time  with  strings

2368

   with the pattern above. The former gives a failure almost

2369

   instantly when applied to a whole line of "a" characters,

2370

   whereas the latter takes an appreciable time with strings

2367

2371

   longer than about 20 characters.

2368

2372

  </para>

2369

2373

 </section>

2370

2374

Generated: 19 Apr 2024 11:18:56
