PHP: Documentation Tools

reference/pcre/pattern.syntax.xml
77fe733a1ba9c961424adcb7c9af00c1f5443a77

...

@@ -8,21 +8,21 @@

 <section xml:id="regexp.introduction">

  <title>Introduction</title>

  <para>

   The syntax and semantics of  the  regular  expressions

   supported  by PCRE are described below. Regular expressions are

   also described in the Perl documentation and in a number  of

   other  books,  some  of which have copious examples. Jeffrey

   Friedl's  "Mastering  Regular  Expressions",  published   by

   O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.

   The syntax and semantics of the regular expressions

   supported by PCRE are described below. Regular expressions are

   also described in the Perl documentation and in a number of

   other books, some of which have copious examples. Jeffrey

   Friedl's "Mastering Regular Expressions", published by

   O'Reilly (ISBN 1-56592-257-3), covers them in great detail.

   The description here is intended as reference documentation.

  </para>

  <para>

   A regular expression is a pattern that is matched against  a

   A regular expression is a pattern that is matched against a

   subject string from left to right. Most characters stand for

   themselves in a pattern, and match the corresponding

   characters in the subject. As a trivial example, the pattern

   <literal>The quick brown fox</literal>

   matches a portion of a subject string that is  identical  to

   matches a portion of a subject string that is identical to

   itself.

  </para>

 </section>

...

@@ -32,6 +32,7 @@

   When using the PCRE functions, it is required that the pattern is enclosed

   by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,

   non-backslash, non-whitespace character.

   Leading whitespace before a valid delimiter is silently ignored.

  </para>
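  <para>
   The delimiter rules above can be sketched as follows. This is a minimal
   illustration, not part of the original text: the subject string and the
   variable names are assumptions chosen for the example.
  </para>

```php
<?php
// Illustrative subject string (an assumption for this sketch).
$subject = 'foo/bar';

// The same expression written with three different delimiters.
$withSlashes = preg_match('/foo\/bar/', $subject); // the inner slash must be escaped
$withHashes  = preg_match('#foo/bar#', $subject);  // no escaping needed
$withTildes  = preg_match('~foo/bar~', $subject);

// Leading whitespace before the opening delimiter is skipped.
$withLeadingSpace = preg_match('  #foo/bar#', $subject);

var_dump($withSlashes, $withHashes, $withTildes, $withLeadingSpace);
```

  <para>
   Picking a delimiter that does not occur in the pattern avoids having to
   escape it inside the expression.
  </para>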

  <para>

   Often used delimiters are forward slashes (<literal>/</literal>), hash

...

@@ -101,15 +102,15 @@

101

102

 <section xml:id="regexp.reference.meta">

102

103

  <title>Meta-characters</title>

103

104

  <para>

104

   The  power  of  regular  expressions comes from the

105

   The power of regular expressions comes from the

105

106

   ability to include alternatives and repetitions in the

106

   pattern.  These  are encoded in the pattern by the use of

107

   <emphasis>meta-characters</emphasis>, which do not stand for  themselves  but  instead

107

   pattern. These are encoded in the pattern by the use of

108

   <emphasis>meta-characters</emphasis>, which do not stand for themselves but instead

108

109

   are interpreted in some special way.

109

110

  </para>

110

111

  <para>

111

   There are two different sets of meta-characters: those  that

112

   are  recognized anywhere in the pattern except within square

112

   There are two different sets of meta-characters: those that

113

   are recognized anywhere in the pattern except within square

113

114

   brackets, and those that are recognized in square brackets.

114

115

   Outside square brackets, the meta-characters are as follows:

115

116

...

@@ -129,7 +130,8 @@

129

130

       <entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>

130

131

      </row>

131

132

      <row>

132

       <entry>$</entry><entry>assert end of subject or before a terminating newline (or end of line, in multiline mode)</entry>

133

       <entry>$</entry><entry>assert end of subject or before a terminating newline (or

134

        end of line, in multiline mode)</entry>

133

135

      </row>

134

136

      <row>

135

137

       <entry>.</entry><entry>match any character except newline (by default)</entry>

...

@@ -203,9 +205,9 @@

203

205

 <section xml:id="regexp.reference.escape">

204

206

  <title>Escape sequences</title>

205

207

  <para>

206

   The backslash character has several uses. Firstly, if it  is

208

   The backslash character has several uses. Firstly, if it is

207

209

   followed by a non-alphanumeric character, it takes away any

208

   special  meaning that character may have. This use of

210

   special meaning that character may have. This use of

209

211

   backslash as an escape character applies both inside and

210

212

   outside character classes.

211

213

  </para>

...

@@ -214,7 +216,7 @@

214

216

   "\*" in the pattern. This applies whether or not the

215

217

   following character would otherwise be interpreted as a

216

218

   meta-character, so it is always safe to precede a non-alphanumeric

217

   with "\" to specify that it stands for itself.  In

219

   with "\" to specify that it stands for itself. In

218

220

   particular, if you want to match a backslash, you write "\\".

219

221

  </para>
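  <para>
   A short sketch of the escaping rule described above. The subject strings
   and variable names are illustrative assumptions. Note that in a
   single-quoted PHP string, two backslashes produce one, so the pattern
   <literal>\\</literal> has to be written with four.
  </para>

```php
<?php
// Escaped, "*" is a literal asterisk; unescaped, it is a quantifier.
$literalStar    = preg_match('/^2\*3$/', '2*3');  // 1
$quantifierStar = preg_match('/^2*3$/', '2*3');   // 0: means "zero or more 2s, then 3"
$repeatedTwos   = preg_match('/^2*3$/', '223');   // 1

// Escaping a non-alphanumeric that has no special meaning is harmless.
$escapedPercent = preg_match('/^\%$/', '%');      // 1

// Matching one backslash: the pattern is \\ once PHP has parsed the string.
$backslash = preg_match('/\\\\/', 'C:\\dir');     // 1
```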

220

222

  <note>

...

@@ -236,10 +238,10 @@

236

238

  <para>

237

239

   A second use of backslash provides a way of encoding

238

240

   non-printing characters in patterns in a visible manner. There

239

   is no restriction on the appearance of non-printing  characters,

241

   is no restriction on the appearance of non-printing characters,

240

242

   apart from the binary zero that terminates a pattern,

241

243

   but when a pattern is being prepared by text editing, it is

242

   usually  easier to use one of the following escape sequences

244

   usually easier to use one of the following escape sequences

243

245

   than the binary character it represents:

244

246

  </para>

245

247

  <para>

...

@@ -330,9 +332,9 @@

330

332

  </para>

331

333

  <para>

332

334

   The precise effect of "<literal>\cx</literal>" is as follows:

333

   if "<literal>x</literal>" is a lower case  letter, it is converted

335

   if "<literal>x</literal>" is a lower case letter, it is converted

334

336

   to upper case. Then bit 6 of the character (hex 40) is inverted.

335

   Thus "<literal>\cz</literal>" becomes  hex 1A, but

337

   Thus "<literal>\cz</literal>" becomes hex 1A, but

336

338

   "<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"

337

339

   becomes hex 7B.

338

340

  </para>
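  <para>
   The arithmetic behind <literal>\cx</literal> can be reproduced in a few
   lines. This is only a sketch: <literal>controlCharFor()</literal> is a
   hypothetical helper written for this example, not a PHP or PCRE function.
  </para>

```php
<?php
// Hypothetical helper: applies the rule from the text to a single character.
function controlCharFor(string $x): string
{
    // A lower case letter is first converted to upper case.
    if (ctype_lower($x)) {
        $x = strtoupper($x);
    }
    // Then bit 6 (hex 40) of the character is inverted.
    return chr(ord($x) ^ 0x40);
}

printf(
    "%02X %02X %02X\n",
    ord(controlCharFor('z')),  // 1A
    ord(controlCharFor('{')),  // 3B
    ord(controlCharFor(';'))   // 7B
);

// The pattern \cz therefore matches the byte 0x1A.
$matchesCtrlZ = preg_match('/^\cz$/', "\x1A");  // 1
```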

...

@@ -348,7 +350,7 @@

348

350

  </para>

349

351

  <para>

350

352

   After "<literal>\0</literal>" up to two further octal digits are read.

351

   In  both cases,  if  there are fewer than two digits, just those that

353

   In both cases, if there are fewer than two digits, just those that

352

354

   are present are used. Thus the sequence "<literal>\0\x\07</literal>"

353

355

   specifies two binary zeros followed by a BEL character. Make sure you

354

356

   supply two digits after the initial zero if the character

...

@@ -357,20 +359,20 @@

357

359

  <para>

358

360

   The handling of a backslash followed by a digit other than 0

359

361

   is complicated. Outside a character class, PCRE reads it

360

   and any following digits as a decimal number. If the  number

361

   is  less  than  10, or if there have been at least that many

362

   previous capturing left parentheses in the  expression,  the

363

   entire  sequence is taken as a <emphasis>back reference</emphasis>. A description

364

   of how this works is given later, following  the  discussion

362

   and any following digits as a decimal number. If the number

363

   is less than 10, or if there have been at least that many

364

   previous capturing left parentheses in the expression, the

365

   entire sequence is taken as a <emphasis>back reference</emphasis>. A description

366

   of how this works is given later, following the discussion

365

367

   of parenthesized subpatterns.

366

368

  </para>

367

369

  <para>

368

   Inside a character  class,  or  if  the  decimal  number  is

370

   Inside a character class, or if the decimal number is

369

371

   greater than 9 and there have not been that many capturing

370

372

   subpatterns, PCRE re-reads up to three octal digits following

371

373

   the backslash, and generates a single byte from the

372

374

   least significant 8 bits of the value. Any subsequent digits

373

   stand for themselves.  For example:

375

   stand for themselves. For example:

374

376

  </para>

375

377

  <para>

376

378

   <variablelist>

...

@@ -438,7 +440,7 @@

438

440

   digits are ever read.

439

441

  </para>

440

442

  <para>

441

   All the sequences that define a single byte value can  be

443

   All the sequences that define a single byte value can be

442

444

   used both inside and outside character classes. In addition,

443

445

   inside a character class, the sequence "<literal>\b</literal>"

444

446

   is interpreted as the backspace character (hex 08). Outside a character

...

@@ -460,11 +462,11 @@

460

462

    </varlistentry>

461

463

    <varlistentry>

462

464

     <term><emphasis>\h</emphasis></term>

463

     <listitem><simpara>any horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>

465

     <listitem><simpara>any horizontal whitespace character</simpara></listitem>

464

466

    </varlistentry>

465

467

    <varlistentry>

466

468

     <term><emphasis>\H</emphasis></term>

467

     <listitem><simpara>any character that is not a horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>

469

     <listitem><simpara>any character that is not a horizontal whitespace character</simpara></listitem>

468

470

    </varlistentry>

469

471

    <varlistentry>

470

472

     <term><emphasis>\s</emphasis></term>

...

@@ -476,11 +478,11 @@

476

478

    </varlistentry>

477

479

    <varlistentry>

478

480

     <term><emphasis>\v</emphasis></term>

479

     <listitem><simpara>any vertical whitespace character (since PHP 5.2.4)</simpara></listitem>

481

     <listitem><simpara>any vertical whitespace character</simpara></listitem>

480

482

    </varlistentry>

481

483

    <varlistentry>

482

484

     <term><emphasis>\V</emphasis></term>

483

     <listitem><simpara>any character that is not a vertical whitespace character (since PHP 5.2.4)</simpara></listitem>

485

     <listitem><simpara>any character that is not a vertical whitespace character</simpara></listitem>

484

486

    </varlistentry>

485

487

    <varlistentry>

486

488

     <term><emphasis>\w</emphasis></term>

...

@@ -505,7 +507,7 @@

505

507

  </para>

506

508

  <para>

507

509

   A "word" character is any letter or digit or the underscore

508

   character,  that  is,  any  character which can be part of a

510

   character, that is, any character which can be part of a

509

511

   Perl "<emphasis>word</emphasis>". The definition of letters and digits is

510

512

   controlled by PCRE's character tables, and may vary if locale-specific

511

513

   matching is taking place. For example, in the "fr" (French) locale, some

...

@@ -514,15 +516,15 @@

514

516

  </para>

515

517

  <para>

516

518

   These character type sequences can appear both inside and

517

   outside  character classes. They each match one character of

518

   the appropriate type. If the current matching  point is at

519

   outside character classes. They each match one character of

520

   the appropriate type. If the current matching point is at

519

521

   the end of the subject string, all of them fail, since there

520

522

   is no character to match.

521

523

  </para>

522

524

  <para>

523

   The fourth use of backslash is  for  certain  simple

525

   The fourth use of backslash is for certain simple

524

526

   assertions. An assertion specifies a condition that has to be met

525

   at a particular point in  a match, without consuming any

527

   at a particular point in a match, without consuming any

526

528

   characters from the subject string. The use of subpatterns

527

529

   for more complicated assertions is described below. The

528

530

   backslashed assertions are

...

@@ -561,7 +563,7 @@

561

563

   </variablelist>

562

564

  </para>

563

565

  <para>

564

   These assertions may not appear in  character  classes  (but

566

   These assertions may not appear in character classes (but

565

567

   note that "<literal>\b</literal>" has a different meaning, namely the backspace

566

568

   character, inside a character class).

567

569

  </para>

...

@@ -569,20 +571,20 @@

569

571

   A word boundary is a position in the subject string where

570

572

   the current character and the previous character do not both

571

573

   match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches

572

   <literal>\w</literal> and  the  other  matches

574

   <literal>\w</literal> and the other matches

573

575

   <literal>\W</literal>), or the start or end of the string if the first

574

576

   or last character matches <literal>\w</literal>, respectively.

575

577

  </para>
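  <para>
   The word boundary definition above can be checked with a small sketch.
   The subject string and variable names are assumptions made for
   illustration.
  </para>

```php
<?php
$subject = 'cat concat cat.';  // illustrative subject

// \b requires a \w character on one side and a \W character (or the string edge) on the other.
$wholeWords = preg_match_all('/\bcat\b/', $subject);  // 2: "concat" is skipped
$anywhere   = preg_match_all('/cat/', $subject);      // 3

// Inside a character class, \b is the backspace character (hex 08) instead.
$backspace = preg_match('/^[\b]$/', "\x08");          // 1
```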

576

578

  <para>

577

579

   The <literal>\A</literal>, <literal>\Z</literal>, and

578

   <literal>\z</literal> assertions differ  from  the  traditional

579

   circumflex  and  dollar  (described in <link linkend="regexp.reference.anchors">anchors</link>) in that they only

580

   ever match at the very start and end of the subject  string,

581

   whatever  options  are  set.  They  are  not affected by the

580

   <literal>\z</literal> assertions differ from the traditional

581

   circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)

582

   in that they only ever match at the very start and end of the subject string,

583

   whatever options are set. They are not affected by the

582

584

   <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or

583

585

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>

584

   options. The  difference  between <literal>\Z</literal> and

585

   <literal>\z</literal>  is that <literal>\Z</literal> matches before a

586

   options. The difference between <literal>\Z</literal> and

587

   <literal>\z</literal> is that <literal>\Z</literal> matches before a

586

588

   newline that is the last character of the string as well as at the end of

587

589

   the string, whereas <literal>\z</literal> matches only at the end.

588

590

  </para>
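  <para>
   A minimal sketch of how <literal>\A</literal>, <literal>\Z</literal> and
   <literal>\z</literal> differ from circumflex and dollar. The subject string
   is an assumption for the example; the <literal>m</literal> modifier is the
   PHP spelling of PCRE_MULTILINE.
  </para>

```php
<?php
$subject = "first\nlast\n";  // illustrative two-line subject ending in a newline

// In multiline mode ^ and $ match at every line, but \A still matches only at the start.
$perLine     = preg_match_all('/^\w+$/m', $subject);   // 2
$startOnly   = preg_match_all('/\A\w+$/m', $subject);  // 1

// \Z also matches before a final newline; \z matches only at the very end.
$upperZ          = preg_match('/last\Z/', $subject);    // 1
$lowerZ          = preg_match('/last\z/', $subject);    // 0
$lowerZAfterEol  = preg_match('/last\n\z/', $subject);  // 1
```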

...

@@ -599,12 +601,16 @@

599

601

   regexp metacharacters in the pattern. For example:

600

602

   <literal>\w+\Q.$.\E$</literal> will match one or more word characters,

601

603

   followed by literals <literal>.$.</literal> and anchored at the end of

602

   the string.

604

   the string. Note that this does not change the behavior of 

605

   delimiters; for instance the pattern <literal>#\Q#\E#$</literal>

606

   is not valid, because the second <literal>#</literal> marks the end

607

   of the pattern, and the remaining <literal>\E#$</literal> is interpreted as invalid

608

   modifiers.

603

609

  </para>
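  <para>
   The <literal>\Q</literal> ... <literal>\E</literal> behavior can be
   sketched as below. The subject strings are assumptions; the
   <literal>@</literal> operator is used only to silence the warning PHP
   raises for the invalid pattern.
  </para>

```php
<?php
// Between \Q and \E, "." and "$" are literal; the final "$" is still an anchor.
$pattern = '/\w+\Q.$.\E$/';

$literalMatch = preg_match($pattern, 'total.$.');  // 1
$noMatch      = preg_match($pattern, 'totalX$Y');  // 0: "." only matches a literal dot here

// The delimiter still ends the pattern inside \Q...\E, so this pattern is invalid
// and preg_match() returns false instead of 0 or 1.
$invalid = @preg_match('#\Q#\E#$', 'a#');
var_dump($invalid);  // bool(false)
```

  <para>
   When the text to be matched literally comes from a variable,
   <function>preg_quote</function> is usually a safer choice, because it can
   escape the delimiter as well.
  </para>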

604

610

605

611

  <para>

606

   <literal>\K</literal> can be used to reset the match start since

607

   PHP 5.2.4. For example, the pattern <literal>foo\Kbar</literal> matches

612

   <literal>\K</literal> can be used to reset the match start. 

613

   For example, the pattern <literal>foo\Kbar</literal> matches

608

614

   "foobar", but reports that it has matched "bar". The use of

609

615

   <literal>\K</literal> does not interfere with the setting of captured

610

616

   substrings. For example, when the pattern <literal>(foo)\Kbar</literal>

...

@@ -868,8 +874,8 @@

868

874

   For example, <literal>\p{Lu}</literal> always matches only upper case letters.

869

875

  </para>

870

876

  <para>

871

   Sets of Unicode characters are defined as belonging to certain scripts.  A

872

   character from one of these sets can be matched using a script name.  For

877

   Sets of Unicode characters are defined as belonging to certain scripts. A

878

   character from one of these sets can be matched using a script name. For

873

879

   example:

874

880

  </para>

875

881

  <itemizedlist>

...

@@ -881,7 +887,7 @@

881

887

   </listitem>

882

888

  </itemizedlist>

883

889

  <para>

884

   Those that are not part of an identified script are lumped together  as

890

   Those that are not part of an identified script are lumped together as

885

891

   <literal>Common</literal>. The current list of scripts is:

886

892

  </para>

887

893

  <table>

...

@@ -1050,7 +1056,7 @@

1050

1056

  <para>

1051

1057

   In versions of PCRE older than 8.32 (which corresponds to PHP versions

1052

1058

   before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>

1053

   is equivalent to <literal>(?>\PM\pM*)</literal>.  That is, it matches a

1059

   is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a

1054

1060

   character without the "mark" property, followed by zero or more characters

1055

1061

   with the "mark" property, and treats the sequence as an atomic group (see

1056

1062

   below). Characters with the "mark" property are typically accents that

...

@@ -1070,8 +1076,8 @@

1070

1076

  <para>

1071

1077

   Outside a character class, in the default matching mode, the

1072

1078

   circumflex character (<literal>^</literal>) is an assertion which

1073

   is true only if the current matching point is at the start  of

1074

   the  subject string. Inside a character class, circumflex (<literal>^</literal>)

1079

   is true only if the current matching point is at the start of

1080

   the subject string. Inside a character class, circumflex (<literal>^</literal>)

1075

1081

   has an entirely different meaning (see below).

1076

1082

  </para>

1077

1083

  <para>

...

@@ -1086,12 +1092,12 @@

1086

1092

  </para>

1087

1093

  <para>

1088

1094

   A dollar character (<literal>$</literal>) is an assertion which is

1089

   &true; only if the current  matching point is at the end of the subject

1090

   string, or immediately before a newline character that is  the  last

1095

   &true; only if the current matching point is at the end of the subject

1096

   string, or immediately before a newline character that is the last

1091

1097

   character in the string (by default). Dollar (<literal>$</literal>)

1092

   need not be the last character of the pattern if a  number  of

1093

   alternatives are  involved,  but it should be the last item in any branch

1094

   in which it appears. Dollar has no  special  meaning  in  a

1098

   need not be the last character of the pattern if a number of

1099

   alternatives are involved, but it should be the last item in any branch

1100

   in which it appears. Dollar has no special meaning in a

1095

1101

   character class.

1096

1102

  </para>
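  <para>
   A brief sketch of the dollar assertion. The subject strings are
   assumptions; the <literal>D</literal> modifier is the PHP spelling of
   PCRE_DOLLAR_ENDONLY.
  </para>

```php
<?php
$subject = "abc\n";  // illustrative subject with a terminating newline

// By default, $ also matches immediately before a final newline.
$default = preg_match('/abc$/', $subject);          // 1

// With D, $ matches only at the very end of the subject.
$endOnly = preg_match('/abc$/D', $subject);         // 0

// $ may end each branch of an alternation rather than the whole pattern.
$inBranch = preg_match('/^(abc$|xyz$)/', 'xyz');    // 1

// Inside a character class, $ is an ordinary character.
$inClass = preg_match('/^[$]$/', '$');              // 1
```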

1097

1103

  <para>

...

@@ -1117,9 +1123,9 @@

1117

1123

   set.

1118

1124

  </para>

1119

1125

  <para>

1120

   Note that the sequences \A, \Z, and \z can be used to  match

1121

   the  start  and end of the subject in both modes, and if all

1122

   branches of a pattern start with \A it is  always  anchored,

1126

   Note that the sequences \A, \Z, and \z can be used to match

1127

   the start and end of the subject in both modes, and if all

1128

   branches of a pattern start with \A it is always anchored,

1123

1129

   whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

1124

1130

   is set or not.

1125

1131

  </para>

...

@@ -1128,14 +1134,14 @@

1128

1134

 <section xml:id="regexp.reference.dot">

1129

1135

  <title>Dot</title>

1130

1136

  <para>

1131

   Outside a character class, a dot in the pattern matches  any

1132

   one  character  in  the  subject,  including  a non-printing

1133

   character, but not (by default) newline.  If the

1137

   Outside a character class, a dot in the pattern matches any

1138

   one character in the subject, including a non-printing

1139

   character, but not (by default) newline. If the

1134

1140

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1135

   option  is  set,  then dots match newlines as well. The

1141

   option is set, then dots match newlines as well. The

1136

1142

   handling of dot is entirely independent of the handling of

1137

   circumflex  and  dollar,  the only relationship being that they

1138

   both involve newline characters.  Dot has no special meaning

1143

   circumflex and dollar, the only relationship being that they

1144

   both involve newline characters. Dot has no special meaning

1139

1145

   in a character class.

1140

1146

  </para>
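  <para>
   The dot rules above, as a minimal sketch. The subject strings are
   assumptions; the <literal>s</literal> modifier is the PHP spelling of
   PCRE_DOTALL.
  </para>

```php
<?php
// By default, dot does not match a newline.
$default = preg_match('/a.b/', "a\nb");       // 0

// With the s modifier, dot matches newlines as well.
$dotAll = preg_match('/a.b/s', "a\nb");       // 1

// Non-printing characters other than newline are matched by default.
$nonPrinting = preg_match('/a.b/', "a\x01b"); // 1

// Inside a character class, dot is a literal dot.
$classLetter = preg_match('/^[.]$/', 'x');    // 0
$classDot    = preg_match('/^[.]$/', '.');    // 1
```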

1141

1147

  <para>

...

@@ -1149,29 +1155,29 @@

1149

1155

  <title>Character classes</title>

1150

1156

  <para>

1151

1157

   An opening square bracket introduces a character class,

1152

   terminated  by  a  closing  square  bracket.  A  closing square

1153

   bracket on its own is  not  special.  If  a  closing  square

1154

   bracket  is  required as a member of the class, it should be

1158

   terminated by a closing square bracket. A closing square

1159

   bracket on its own is not special. If a closing square

1160

   bracket is required as a member of the class, it should be

1155

1161

   the first data character in the class (after an initial

1156

1162

   circumflex, if present) or escaped with a backslash.

1157

1163

  </para>

1158

1164

  <para>

1159

1165

   A character class matches a single character in the subject;

1160

   the  character  must  be in the set of characters defined by

1166

   the character must be in the set of characters defined by

1161

1167

   the class, unless the first character in the class is a

1162

   circumflex,  in which case the subject character must not be in

1163

   the set defined by the class. If a  circumflex  is  actually

1164

   required  as  a  member  of  the class, ensure it is not the

1168

   circumflex, in which case the subject character must not be in

1169

   the set defined by the class. If a circumflex is actually

1170

   required as a member of the class, ensure it is not the

1165

1171

   first character, or escape it with a backslash.

1166

1172

  </para>

1167

1173

  <para>

1168

   For example, the character class [aeiou] matches  any  lower

1174

   For example, the character class [aeiou] matches any lower

1169

1175

   case vowel, while [^aeiou] matches any character that is not

1170

   a lower case vowel. Note that a circumflex is  just  a

1171

   convenient  notation for specifying the characters which are in

1172

   the class by enumerating those that are not. It  is  not  an

1173

   assertion:  it  still  consumes a character from the subject

1174

   string, and fails if the current pointer is at  the  end  of

1176

   a lower case vowel. Note that a circumflex is just a

1177

   convenient notation for specifying the characters which are in

1178

   the class by enumerating those that are not. It is not an

1179

   assertion: it still consumes a character from the subject

1180

   string, and fails if the current pointer is at the end of

1175

1181

   the string.

1176

1182

  </para>

1177

1183

  <para>

...

@@ -1183,61 +1189,62 @@

1183

1189

  </para>

1184

1190

  <para>

1185

1191

   The newline character is never treated in any special way in

1186

   character  classes,  whatever the setting of the <link

1192

   character classes, whatever the setting of the <link

1187

1193

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1188

1194

   or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

1189

1195

   options is. A class such as [^a] will always match a newline.

1190

1196

  </para>

1191

1197

  <para>

1192

   The minus (hyphen) character can be used to specify a  range

1193

   of  characters  in  a  character  class.  For example, [d-m]

1194

   matches any letter between d and m, inclusive.  If  a  minus

1195

   character  is required in a class, it must be escaped with a

1198

   The minus (hyphen) character can be used to specify a range

1199

   of characters in a character class. For example, [d-m]

1200

   matches any letter between d and m, inclusive. If a minus

1201

   character is required in a class, it must be escaped with a

1196

1202

   backslash or appear in a position where it cannot be

1197

1203

   interpreted as indicating a range, typically as the first or last

1198

1204

   character in the class.

1199

1205

  </para>

1200

1206

  <para>

1201

   It is not possible to have the literal character "]" as  the

1202

   end  character  of  a  range.  A  pattern such as [W-]46] is

1207

   It is not possible to have the literal character "]" as the

1208

   end character of a range. A pattern such as [W-]46] is

1203

1209

   interpreted as a class of two characters ("W" and "-")

1204

1210

   followed by a literal string "46]", so it would match "W46]" or

1205

   "-46]". However, if the "]" is escaped with a  backslash  it

1206

   is  interpreted  as  the end of range, so [W-\]46] is

1207

   interpreted as a single class containing a range followed by  two

1211

   "-46]". However, if the "]" is escaped with a backslash it

1212

   is interpreted as the end of range, so [W-\]46] is

1213

   interpreted as a single class containing a range followed by two

1208

1214

   separate characters. The octal or hexadecimal representation

1209

1215

   of "]" can also be used to end a range.

1210

1216

  </para>
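  <para>
   The two readings of <literal>]</literal> described above can be compared
   directly. This is a sketch; the subject strings are chosen only for
   illustration.
  </para>

```php
<?php
// [W-]46] is the two-character class [W-] followed by the literal string "46]".
$unescaped = '/^[W-]46]$/';
$w46   = preg_match($unescaped, 'W46]');  // 1
$dash  = preg_match($unescaped, '-46]');  // 1
$loneX = preg_match($unescaped, 'X');     // 0

// [W-\]46] is a single class: the range W to ] plus the characters "4" and "6".
$escaped = '/^[W-\]46]$/';
$inRange  = preg_match($escaped, 'X');     // 1: X lies between W and ] in ASCII
$bracket  = preg_match($escaped, ']');     // 1
$digit    = preg_match($escaped, '4');     // 1
$sequence = preg_match($escaped, 'W46]');  // 0: the class matches one character only
```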

1211

1217

  <para>

1212

1218

   Ranges operate in ASCII collating sequence. They can also be

1213

   used  for  characters  specified  numerically,  for  example

1214

   [\000-\037]. If a range that includes letters is  used  when

1215

   case-insensitive (caseless)  matching  is set, it matches the

1216

   letters in either case. For example, [W-c] is equivalent  to

1219

   used for characters specified numerically, for example

1220

   [\000-\037]. If a range that includes letters is used when

1221

   case-insensitive (caseless) matching is set, it matches the

1222

   letters in either case. For example, [W-c] is equivalent to

1217

1223

   [][\^_`wxyzabc], matched case-insensitively, and if character

1218

1224

   tables for the "fr" locale are in use, [\xc8-\xcb] matches

1219

1225

   accented E characters in both cases.

1220

1226

  </para>

1221

1227

  <para>

1222

   The character types \d, \D, \s, \S,  \w,  and  \W  may  also

1223

   appear  in  a  character  class, and add the characters that

1228

   The character types \d, \D, \s, \S, \w, and \W may also

1229

   appear in a character class, and add the characters that

1224

1230

   they match to the class. For example, [\dABCDEF] matches any

1225

   hexadecimal  digit.  A  circumflex  can conveniently be used

1226

   with the upper case character types to specify a  more

1231

   hexadecimal digit. A circumflex can conveniently be used

1232

   with the upper case character types to specify a more

1227

1233

   restricted set of characters than the matching lower case type.

1228

   For example, the class [^\W_] matches any letter  or  digit,

1234

   For example, the class [^\W_] matches any letter or digit,

1229

1235

   but not underscore.

1230

1236

  </para>
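  <para>
   A small sketch of character types used inside classes, based on the two
   classes mentioned above. The subject strings are assumptions for the
   example.
  </para>

```php
<?php
// \d adds the digits to the class, giving the upper case hexadecimal digits.
$hexPattern = '/^[\dABCDEF]+$/';
$validHex   = preg_match($hexPattern, '7F3A');  // 1
$invalidHex = preg_match($hexPattern, '7G');    // 0

// [^\W_] is "not a non-word character and not underscore": letters and digits only.
$alnumPattern   = '/^[^\W_]+$/';
$lettersDigits  = preg_match($alnumPattern, 'abc123');   // 1
$withUnderscore = preg_match($alnumPattern, 'abc_123');  // 0
```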

1231

1237

  <para>

1232

   All non-alphanumeric characters other than \,  -,  ^  (at  the

1233

   start)  and  the  terminating ] are non-special in character

1238

   All non-alphanumeric characters other than \, -, ^ (at the

1239

   start) and the terminating ] are non-special in character

1234

1240

   classes, but it does no harm if they are escaped. The pattern

1235

1241

   terminator is always special and must be escaped when used

1236

1242

   within an expression.

1237

1243

  </para>

1238

1244

  <para>

1239

1245

   Perl supports the POSIX notation for character classes. This uses names

1240

   enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also

1246

   enclosed by <literal>[:</literal> and <literal>:]</literal> within

1247

   the enclosing square brackets. PCRE also

1241

1248

   supports this notation. For example, <literal>[01[:alpha:]%]</literal>

1242

1249

   matches "0", "1", any alphabetic character, or "%". The supported class

1243

1250

   names are:

...

@@ -1276,7 +1283,7 @@

1276

1283

  <para>

1277

1284

   In UTF-8 mode, characters with values greater than 127 do not match any

1278

1285

   of the POSIX character classes.

1279

   As of PHP 5.3.0 and libpcre 8.10 some character classes are changed to use

1286

   As of libpcre 8.10 some character classes are changed to use

1280

1287

   Unicode character properties, in which case the mentioned restriction does

1281

1288

   not apply. Refer to the <link xlink:href="&url.pcre.man;">PCRE(3) manual</link>

1282

1289

   for details.

...

@@ -1292,16 +1299,16 @@

1292

1299

 <section xml:id="regexp.reference.alternation">

1293

1300

  <title>Alternation</title>

1294

1301

  <para>

1295

   Vertical bar characters are  used  to  separate  alternative

1302

   Vertical bar characters are used to separate alternative

1296

1303

   patterns. For example, the pattern

1297

1304

   <literal>gilbert|sullivan</literal>

1298

1305

   matches either "gilbert" or "sullivan". Any number of alternatives

1299

   may  appear,  and an empty alternative is permitted

1300

   (matching the empty string).   The  matching  process  tries

1301

   each  alternative in turn, from left to right, and the first

1302

   one that succeeds is used. If the alternatives are within  a

1303

   subpattern  (defined  below),  "succeeds" means matching the

1304

   rest of the main pattern as well as the alternative  in  the

1306

   may appear, and an empty alternative is permitted

1307

   (matching the empty string). The matching process tries

1308

   each alternative in turn, from left to right, and the first

1309

   one that succeeds is used. If the alternatives are within a

1310

   subpattern (defined below), "succeeds" means matching the

1311

   rest of the main pattern as well as the alternative in the

1305

1312

   subpattern.

1306

1313

  </para>
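  <para>
   The left-to-right behavior of alternation can be sketched as follows. The
   subject strings and variable names are illustrative assumptions.
  </para>

```php
<?php
// Either alternative may match.
preg_match('/gilbert|sullivan/', 'arthur sullivan', $name);
// $name[0] is "sullivan"

// The first alternative that succeeds is used, even if a later one is longer.
preg_match('/a|ab/', 'ab', $first);
// $first[0] is "a"

// Inside a subpattern, "succeeds" includes matching the rest of the pattern,
// so the second alternative is chosen here.
preg_match('/^(a|ab)c$/', 'abc', $sub);
// $sub[1] is "ab"

// An empty alternative matches the empty string.
$emptyAlternative = preg_match('/^(?:yes|)$/', '');  // 1
```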

1307

1314

 </section>

...

@@ -1316,7 +1323,7 @@

1316

1323

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,

1317

1324

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1318

1325

   and PCRE_DUPNAMES can be changed from within the pattern by

1319

   a sequence of Perl option letters enclosed between "(?"  and

1326

   a sequence of Perl option letters enclosed between "(?" and

1320

1327

   ")". The option letters are:

1321

1328

1322

1329

   <table>

...

@@ -1345,7 +1352,8 @@

1345

1352

      </row>

1346

1353

      <row>

1347

1354

       <entry><literal>X</literal></entry>

1348

       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> (no longer supported as of PHP 7.3.0)</entry>

1355

       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>

1356

        (no longer supported as of PHP 7.3.0)</entry>

1349

1357

      </row>

1350

1358

      <row>

1351

1359

       <entry><literal>J</literal></entry>

...

@@ -1356,16 +1364,16 @@

1356

1364

   </table>

1357

1365

  </para>

1358

1366

  <para>

1359

   For example, (?im) sets case-insensitive (caseless), multiline matching. It  is

1367

   For example, (?im) sets case-insensitive (caseless), multiline matching. It is

1360

1368

   also possible to unset these options by preceding the letter

1361

   with a hyphen, and a combined setting and unsetting such  as

1362

   (?im-sx),  which sets <link

1369

   with a hyphen, and a combined setting and unsetting such as

1370

   (?im-sx), which sets <link

1363

1371

   linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and

1364

1372

   <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

1365

1373

   while unsetting <link

1366

1374

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and

1367

1375

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,

1368

   is also  permitted. If  a  letter  appears both before and after the

1376

   is also permitted. If a letter appears both before and after the

1369

1377

   hyphen, the option is unset.

1370

1378

  </para>
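  <para>
   As a rough illustration of the option letters above, the following sketch
   passes patterns with inline option settings to
   <function>preg_match</function>. The subject strings and variable names
   are invented for this example.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// (?i) turns on caseless matching for the rest of the pattern.
$caseless = preg_match('/(?i)abc/', 'ABC');                    // 1

// (?-i) turns it off again: "a" is caseless, "bc" must be lower case.
$mixedOk   = preg_match('/^(?i)a(?-i)bc$/', 'Abc');            // 1
$mixedFail = preg_match('/^(?i)a(?-i)bc$/', 'ABC');            // 0

// (?im-sx) sets i and m while unsetting s and x in a single step.
$combined = preg_match('/(?im-sx)^second$/', "FIRST\nSECOND"); // 1

var_dump($caseless, $mixedOk, $mixedFail, $combined);
?>
]]>
   </programlisting>
  </informalexample>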

1371

1379

  <para>

...

@@ -1375,14 +1383,14 @@

1375

1383

   and "abC".

1376

1384

  </para>

1377

1385

  <para>

1378

   If an option change occurs inside a subpattern,  the  effect

1379

   is  different.  This is a change of behaviour in Perl 5.005.

1380

   An option change inside a subpattern affects only that  part

1386

   If an option change occurs inside a subpattern, the effect

1387

   is different. This is a change of behaviour in Perl 5.005.

1388

   An option change inside a subpattern affects only that part

1381

1389

   of the subpattern that follows it, so

1382

1390

1383

1391

   <literal>(a(?i)b)c</literal>

1384

1392

1385

   matches  abc  and  aBc  and  no  other   strings   (assuming <link

1393

   matches "abc" and "aBc" and no other strings (assuming <link

1386

1394

   linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not

1387

1395

   used). By this means, options can be made to have different settings in

1388

1396

   different parts of the pattern. Any changes made in one alternative do

...

@@ -1391,18 +1399,18 @@

1391

1399

1392

1400

   <literal>(a(?i)b|c)</literal>

1393

1401

1394

   matches "ab", "aB", "c", and "C", even though when  matching

1402

   matches "ab", "aB", "c", and "C", even though when matching

1395

1403

   "C" the first branch is abandoned before the option setting.

1396

   This is because the effects of  option  settings  happen  at

1397

   compile  time. There would be some very weird behaviour otherwise.

1404

   This is because the effects of option settings happen at

1405

   compile time. There would be some very weird behaviour otherwise.

1398

1406

  </para>
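  <para>
   The scoping rules described above can be checked with a short sketch such
   as the following. The anchors <literal>^</literal> and
   <literal>$</literal> are added here only so that each subject has to match
   in full; the array keys are illustrative.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// (?i) inside the group applies only up to the closing parenthesis.
$scoped = [
    'abc' => preg_match('/^(a(?i)b)c$/', 'abc'), // 1
    'aBc' => preg_match('/^(a(?i)b)c$/', 'aBc'), // 1
    'aBC' => preg_match('/^(a(?i)b)c$/', 'aBC'), // 0: "c" lies outside the group
    'Abc' => preg_match('/^(a(?i)b)c$/', 'Abc'), // 0: "a" comes before (?i)
];

// A change made in one alternative carries into the following alternative.
$carried = preg_match('/^(a(?i)b|c)$/', 'C');    // 1

var_dump($scoped, $carried);
?>
]]>
   </programlisting>
  </informalexample>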

1399

1407

  <para>

1400

1408

   The PCRE-specific options <link

1401

   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>  and

1402

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>   can

1409

   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and

1410

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can

1403

1411

   be changed in the same way as the Perl-compatible options by

1404

   using the characters U and X  respectively.  The  (?X)  flag

1405

   setting  is  special in that it must always occur earlier in

1412

   using the characters U and X respectively. The (?X) flag

1413

   setting is special in that it must always occur earlier in

1406

1414

   the pattern than any of the additional features it turns on,

1407

1415

   even when it is at top level. It is best put at the start.

1408

1416

  </para>

...

@@ -1411,8 +1419,8 @@

1411

1419

 <section xml:id="regexp.reference.subpatterns">

1412

1420

  <title>Subpatterns</title>

1413

1421

  <para>

1414

   Subpatterns are delimited by parentheses  (round  brackets),

1415

   which can be nested.  Marking part of a pattern as a subpattern

1422

   Subpatterns are delimited by parentheses (round brackets),

1423

   which can be nested. Marking part of a pattern as a subpattern

1416

1424

   does two things:

1417

1425

  </para>

1418

1426

  <orderedlist>

...

@@ -1441,30 +1449,30 @@

1441

1449

1442

1450

   <literal>the ((red|white) (king|queen))</literal>

1443

1451

1444

   the captured substrings are "red king", "red",  and  "king",

1452

   the captured substrings are "red king", "red", and "king",

1445

1453

   and are numbered 1, 2, and 3.

1446

1454

  </para>

1447

1455

  <para>

1448

   The fact that plain parentheses fulfill two functions is  not

1449

   always  helpful.  There are often times when a grouping subpattern

1450

   is required without a capturing requirement.  If  an

1456

   The fact that plain parentheses fulfill two functions is not

1457

   always helpful. There are often times when a grouping subpattern

1458

   is required without a capturing requirement. If an

1451

1459

   opening parenthesis is followed by "?:", the subpattern does

1452

   not do any capturing, and is not counted when computing  the

1460

   not do any capturing, and is not counted when computing the

1453

1461

   number of any subsequent capturing subpatterns. For example,

1454

   if the string "the  white  queen"  is  matched  against  the

1462

   if the string "the white queen" is matched against the

1455

1463

   pattern

1456

1464

1457

1465

   <literal>the ((?:red|white) (king|queen))</literal>

1458

1466

1459

   the captured substrings are "white queen" and  "queen",  and

1460

   are  numbered  1  and 2. The maximum number of captured substrings

1467

   the captured substrings are "white queen" and "queen", and

1468

   are numbered 1 and 2. The maximum number of captured substrings

1461

1469

   is 65535. It may not be possible to compile such large patterns,

1462

1470

   however, depending on the configuration options of libpcre.

1463

1471

  </para>
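  <para>
   The difference in numbering can be seen by inspecting the matches array
   filled by <function>preg_match</function>. This is a minimal sketch; the
   variable names are arbitrary.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// Plain parentheses: every group captures and is numbered.
preg_match('/the ((red|white) (king|queen))/', 'the red king', $capturing);
// $capturing is ['the red king', 'red king', 'red', 'king']

// "?:" groups without capturing, so "queen" becomes substring 2.
preg_match('/the ((?:red|white) (king|queen))/', 'the white queen', $grouped);
// $grouped is ['the white queen', 'white queen', 'queen']

print_r($capturing);
print_r($grouped);
?>
]]>
   </programlisting>
  </informalexample>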

1464

1472

  <para>

1465

   As a  convenient  shorthand,  if  any  option  settings  are

1466

   required  at  the  start  of a non-capturing subpattern, the

1467

   option letters may appear between the "?" and the ":".  Thus

1473

   As a convenient shorthand, if any option settings are

1474

   required at the start of a non-capturing subpattern, the

1475

   option letters may appear between the "?" and the ":". Thus

1468

1476

   the two patterns

1469

1477

  </para>

1470

1478

...

@@ -1478,10 +1486,10 @@

1478

1486

  </informalexample>

1479

1487

1480

1488

  <para>

1481

   match exactly the same set of strings.  Because  alternative

1482

   branches  are  tried from left to right, and options are not

1483

   reset until the end of the subpattern is reached, an  option

1484

   setting  in  one  branch does affect subsequent branches, so

1489

   match exactly the same set of strings. Because alternative

1490

   branches are tried from left to right, and options are not

1491

   reset until the end of the subpattern is reached, an option

1492

   setting in one branch does affect subsequent branches, so

1485

1493

   the above patterns match "SUNDAY" as well as "Saturday".

1486

1494

  </para>

1487

1495

...

@@ -1489,7 +1497,7 @@

1489

1497

   It is possible to name a subpattern using the syntax

1490

1498

   <literal>(?P&lt;name&gt;pattern)</literal>. This subpattern will then

1491

1499

   be indexed in the matches array by its normal numeric position and

1492

   also by name. PHP 5.2.2 introduced two alternative syntaxes

1500

   also by name. There are two alternative syntaxes

1493

1501

   <literal>(?&lt;name&gt;pattern)</literal> and <literal>(?'name'pattern)</literal>.

1494

1502

  </para>
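  <para>
   A small sketch of the three spellings follows. The group name
   <literal>year</literal> and the subject string are made up for
   illustration; the third pattern is written in double quotes only to avoid
   escaping the single quotes around the name.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
$patterns = [
    '/(?P<year>\d{4})/',
    '/(?<year>\d{4})/',
    "/(?'year'\\d{4})/",
];

$results = [];
foreach ($patterns as $pattern) {
    preg_match($pattern, 'released in 2024', $m);
    // The same substring is available by name and by numeric position.
    $results[] = [$m['year'], $m[1]];
}

print_r($results);
?>
]]>
   </programlisting>
  </informalexample>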

1495

1503

...

@@ -1510,9 +1518,10 @@

1510

1518

1511

1519

  <para>

1512

1520

   Here <literal>Sun</literal> is stored in backreference 2, while

1513

   backreference 1 is empty. Matching yields <literal>Sat</literal> in

1514

   backreference 1 while backreference 2 does not exist. Changing the pattern

1515

   to use the <literal>(?|</literal> fixes this problem:

1521

   backreference 1 is empty. Matching <literal>Saturday</literal> yields

1522

   <literal>Sat</literal> in backreference 1 while backreference 2 does

1523

   not exist. Changing the pattern to use the <literal>(?|</literal> fixes

1524

   this problem:

1516

1525

  </para>

1517

1526

1518

1527

  <informalexample>

...

@@ -1538,45 +1547,45 @@

1538

1547

    <listitem><simpara>the . metacharacter</simpara></listitem>

1539

1548

    <listitem><simpara>a character class</simpara></listitem>

1540

1549

    <listitem><simpara>a back reference (see next section)</simpara></listitem>

1541

    <listitem><simpara>a parenthesized subpattern (unless it is  an  assertion  -

1550

    <listitem><simpara>a parenthesized subpattern (unless it is an assertion -

1542

1551

     see below)</simpara></listitem>

1543

1552

   </itemizedlist>

1544

1553

  </para>

1545

1554

  <para>

1546

   The general repetition quantifier specifies  a  minimum  and

1547

   maximum  number  of  permitted  matches,  by  giving the two

1548

   numbers in curly brackets (braces), separated  by  a  comma.

1549

   The  numbers  must be less than 65536, and the first must be

1555

   The general repetition quantifier specifies a minimum and

1556

   maximum number of permitted matches, by giving the two

1557

   numbers in curly brackets (braces), separated by a comma.

1558

   The numbers must be less than 65536, and the first must be

1550

1559

   less than or equal to the second. For example:

1551

1560

1552

1561

   <literal>z{2,4}</literal>

1553

1562

1554

   matches "zz", "zzz", or "zzzz". A closing brace on  its  own

1563

   matches "zz", "zzz", or "zzzz". A closing brace on its own

1555

1564

   is not a special character. If the second number is omitted,

1556

   but the comma is present, there is no upper  limit;  if  the

1565

   but the comma is present, there is no upper limit; if the

1557

1566

   second number and the comma are both omitted, the quantifier

1558

1567

   specifies an exact number of required matches. Thus

1559

1568

1560

1569

   <literal>[aeiou]{3,}</literal>

1561

1570

1562

   matches at least 3 successive vowels,  but  may  match  many

1571

   matches at least 3 successive vowels, but may match many

1563

1572

   more, while

1564

1573

1565

1574

   <literal>\d{8}</literal>

1566

1575

1567

   matches exactly 8 digits.  An  opening  curly  bracket  that

1568

   appears  in a position where a quantifier is not allowed, or

1576

   matches exactly 8 digits. An opening curly bracket that

1577

   appears in a position where a quantifier is not allowed, or

1569

1578

   one that does not match the syntax of a quantifier, is taken

1570

   as  a literal character. For example, {,6} is not a quantifier,

1579

   as a literal character. For example, {,6} is not a quantifier,

1571

1580

   but a literal string of four characters.

1572

1581

  </para>
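  <para>
   The three quantifier forms above might be exercised as follows. The
   subject strings are invented, and the anchors are added so that the
   bounds apply to the whole subject.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// {2,4}: between two and four repetitions.
$bounded = [];
foreach (['z', 'zz', 'zzzz', 'zzzzz'] as $subject) {
    $bounded[$subject] = preg_match('/^z{2,4}$/', $subject);
}
// $bounded is ['z' => 0, 'zz' => 1, 'zzzz' => 1, 'zzzzz' => 0]

// {3,}: at least three, with no upper limit.
$atLeast = preg_match('/[aeiou]{3,}/', 'queueing', $vowels); // $vowels[0] is 'ueuei'

// {8}: exactly eight.
$exact = preg_match('/^\d{8}$/', '20240131');                // 1

var_dump($bounded, $atLeast, $vowels, $exact);
?>
]]>
   </programlisting>
  </informalexample>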

1573

1582

  <para>

1574

   The quantifier {0} is permitted, causing the  expression  to

1575

   behave  as  if the previous item and the quantifier were not

1583

   The quantifier {0} is permitted, causing the expression to

1584

   behave as if the previous item and the quantifier were not

1576

1585

   present.

1577

1586

  </para>

1578

1587

  <para>

1579

   For convenience (and  historical  compatibility)  the  three

1588

   For convenience (and historical compatibility) the three

1580

1589

   most common quantifiers have single-character abbreviations:

1581

1590

1582

1591

   <table>

...

@@ -1600,63 +1609,63 @@

1600

1609

   </table>

1601

1610

  </para>

1602

1611

  <para>

1603

   It is possible to construct infinite loops  by  following  a

1604

   subpattern  that  can  match no characters with a quantifier

1612

   It is possible to construct infinite loops by following a

1613

   subpattern that can match no characters with a quantifier

1605

1614

   that has no upper limit, for example:

1606

1615

1607

1616

   <literal>(a?)*</literal>

1608

1617

  </para>

1609

1618

  <para>

1610

   Earlier versions of Perl and PCRE used to give an  error  at

1611

   compile  time  for such patterns. However, because there are

1612

   cases where this  can  be  useful,  such  patterns  are  now

1613

   accepted,  but  if  any repetition of the subpattern does in

1619

   Earlier versions of Perl and PCRE used to give an error at

1620

   compile time for such patterns. However, because there are

1621

   cases where this can be useful, such patterns are now

1622

   accepted, but if any repetition of the subpattern does in

1614

1623

   fact match no characters, the loop is forcibly broken.

1615

1624

  </para>

1616

1625

  <para>

1617

   By default, the quantifiers  are  "greedy",  that  is,  they

1618

   match  as much as possible (up to the maximum number of permitted

1619

   times), without causing the rest of  the  pattern  to

1626

   By default, the quantifiers are "greedy", that is, they

1627

   match as much as possible (up to the maximum number of permitted

1628

   times), without causing the rest of the pattern to

1620

1629

   fail. The classic example of where this gives problems is in

1621

1630

   trying to match comments in C programs. These appear between

1622

   the  sequences /* and */ and within the sequence, individual

1623

   * and / characters may appear. An attempt to  match  C  comments

1631

   the sequences /* and */ and within the sequence, individual

1632

   * and / characters may appear. An attempt to match C comments

1624

1633

   by applying the pattern

1625

1634

1626

1635

   <literal>/\*.*\*/</literal>

1627

1636

1628

1637

   to the string

1629

1638

1630

   <literal>/* first comment */  not comment  /* second comment */</literal>

1639

   <literal>/* first comment */ not comment /* second comment */</literal>

1631

1640

1632

   fails, because it matches  the  entire  string  due  to  the

1633

   greediness of the .*  item.

1641

   fails, because it matches the entire string due to the

1642

   greediness of the .* item.

1634

1643

  </para>

1635

1644

  <para>

1636

   However, if a quantifier is followed  by  a  question  mark,

1645

   However, if a quantifier is followed by a question mark,

1637

1646

   then it becomes lazy, and instead matches the minimum

1638

1647

   number of times possible, so the pattern

1639

1648

1640

1649

   <literal>/\*.*?\*/</literal>

1641

1650

1642

1651

   does the right thing with the C comments. The meaning of the

1643

   various  quantifiers is not otherwise changed, just the preferred

1644

   number of matches.  Do not confuse this use of

1645

   question  mark  with  its  use as a quantifier in its own right.

1652

   various quantifiers is not otherwise changed, just the preferred

1653

   number of matches. Do not confuse this use of

1654

   question mark with its use as a quantifier in its own right.

1646

1655

   Because it has two uses, it can sometimes appear doubled, as

1647

1656

   in

1648

1657

1649

1658

   <literal>\d??\d</literal>

1650

1659

1651

   which matches one digit by preference, but can match two  if

1660

   which matches one digit by preference, but can match two if

1652

1661

   that is the only way the rest of the pattern matches.

1653

1662

  </para>
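  <para>
   The C comment example can be reproduced with a sketch like the one below.
   The <literal>#</literal> delimiter is chosen here only so that the slashes
   in the pattern need no escaping; the variable names are arbitrary.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
$subject = '/* first comment */ not comment /* second comment */';

// Greedy: .* runs to the last "*/", so the whole string is matched.
preg_match('#/\*.*\*/#', $subject, $greedy);

// Lazy: .*? stops at the first "*/".
preg_match('#/\*.*?\*/#', $subject, $lazy);

// A doubled question mark: an optional item that is itself lazy.
preg_match('/\d??\d/', '12', $oneDigit);    // $oneDigit[0] is '1'
preg_match('/^\d??\d$/', '12', $twoDigits); // $twoDigits[0] is '12'

var_dump($greedy[0], $lazy[0], $oneDigit[0], $twoDigits[0]);
?>
]]>
   </programlisting>
  </informalexample>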

1654

1663

  <para>

1655

1664

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>

1656

   option is set (an option which  is  not

1657

   available  in  Perl)  then the quantifiers are not greedy by

1665

   option is set (an option which is not

1666

   available in Perl) then the quantifiers are not greedy by

1658

1667

   default, but individual ones can be made greedy by following

1659

   them  with  a  question mark. In other words, it inverts the

1668

   them with a question mark. In other words, it inverts the

1660

1669

   default behaviour.

1661

1670

  </para>

1662

1671

  <para>

...

@@ -1668,41 +1677,41 @@

1668

1677

  </para>

1669

1678

  <para>

1670

1679

   When a parenthesized subpattern is quantified with a minimum

1671

   repeat  count  that is greater than 1 or with a limited maximum,

1672

   more store is required for the  compiled  pattern,  in

1680

   repeat count that is greater than 1 or with a limited maximum,

1681

   more memory is required for the compiled pattern, in

1673

1682

   proportion to the size of the minimum or maximum.

1674

1683

  </para>

1675

1684

  <para>

1676

   If a pattern starts with .* or  .{0,}  and  the  <link

1685

   If a pattern starts with .* or .{0,} and the <link

1677

1686

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1678

1687

   option (equivalent to Perl's /s) is set, thus allowing the .

1679

   to match newlines, then the pattern is implicitly  anchored,

1688

   to match newlines, then the pattern is implicitly anchored,

1680

1689

   because whatever follows will be tried against every character

1681

   position in the subject string, so there is no point  in

1682

   retrying  the overall match at any position after the first.

1690

   position in the subject string, so there is no point in

1691

   retrying the overall match at any position after the first.

1683

1692

   PCRE treats such a pattern as though it were preceded by \A.

1684

   In  cases where it is known that the subject string contains

1693

   In cases where it is known that the subject string contains

1685

1694

   no newlines, it is worth setting <link

1686

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  when  the

1695

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the

1687

1696

   pattern begins with .* in order to

1688

1697

   obtain this optimization, or

1689

1698

   alternatively using ^ to indicate anchoring explicitly.

1690

1699

  </para>

1691

1700

  <para>

1692

   When a capturing subpattern is repeated, the value  captured

1701

   When a capturing subpattern is repeated, the value captured

1693

1702

   is the substring that matched the final iteration. For example, after

1694

1703

1695

1704

   <literal>(tweedle[dume]{3}\s*)+</literal>

1696

1705

1697

   has matched "tweedledum tweedledee" the value  of  the  captured

1698

   substring  is  "tweedledee".  However,  if  there are

1699

   nested capturing  subpatterns,  the  corresponding  captured

1700

   values  may  have been set in previous iterations. For example,

1706

   has matched "tweedledum tweedledee" the value of the captured

1707

   substring is "tweedledee". However, if there are

1708

   nested capturing subpatterns, the corresponding captured

1709

   values may have been set in previous iterations. For example,

1701

1710

   after

1702

1711

1703

1712

   <literal>/(a|(b))+/</literal>

1704

1713

1705

   matches "aba" the value of the second captured substring  is

1714

   matches "aba" the value of the second captured substring is

1706

1715

   "b".

1707

1716

  </para>
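  <para>
   Both cases can be observed in the matches array. The following is a
   minimal sketch with arbitrary variable names.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// Only the final iteration of the repeated group is kept.
preg_match('/(tweedle[dume]{3}\s*)+/', 'tweedledum tweedledee', $tweedle);
// $tweedle[1] is 'tweedledee'

// The nested group keeps "b" from the second iteration, although the
// outer group ends up holding "a" from the third.
preg_match('/(a|(b))+/', 'aba', $nested);
// $nested is ['aba', 'a', 'b']

var_dump($tweedle[1], $nested);
?>
]]>
   </programlisting>
  </informalexample>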

1708

1717

 </section>

...

@@ -1710,78 +1719,78 @@

1710

1719

 <section xml:id="regexp.reference.back-references">

1711

1720

  <title>Back references</title>

1712

1721

  <para>

1713

   Outside a character class, a backslash followed by  a  digit

1714

   greater  than  0  (and  possibly  further  digits) is a back

1715

   reference to a capturing subpattern  earlier  (i.e.  to  its

1716

   left)  in  the  pattern,  provided there have been that many

1722

   Outside a character class, a backslash followed by a digit

1723

   greater than 0 (and possibly further digits) is a back

1724

   reference to a capturing subpattern earlier (i.e. to its

1725

   left) in the pattern, provided there have been that many

1717

1726

   previous capturing left parentheses.

1718

1727

  </para>

1719

1728

  <para>

1720

   However, if the decimal number following  the  backslash  is

1721

   less  than  10,  it is always taken as a back reference, and

1722

   causes an error only if there are not  that  many  capturing

1723

   left  parentheses in the entire pattern. In other words, the

1724

   parentheses that are referenced need not be to the  left  of

1725

   the  reference  for  numbers  less  than 10.

1729

   However, if the decimal number following the backslash is

1730

   less than 10, it is always taken as a back reference, and

1731

   causes an error only if there are not that many capturing

1732

   left parentheses in the entire pattern. In other words, the

1733

   parentheses that are referenced need not be to the left of

1734

   the reference for numbers less than 10.

1726

1735

   A "forward back reference" can make sense when a repetition

1727

1736

   is involved and the subpattern to the right has participated

1728

1737

   in an earlier iteration. See the section

1729

   entitled "Backslash" above for further details of  the  handling

1738

   <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling

1730

1739

   of digits following a backslash.

1731

1740

  </para>

1732

1741

  <para>

1733

   A back reference matches whatever actually matched the  capturing

1742

   A back reference matches whatever actually matched the capturing

1734

1743

   subpattern in the current subject string, rather than

1735

1744

   anything matching the subpattern itself. So the pattern

1736

1745

1737

1746

   <literal>(sens|respons)e and \1ibility</literal>

1738

1747

1739

   matches "sense and sensibility" and "response and  responsibility",

1740

   but  not  "sense  and  responsibility". If case-sensitive (caseful)

1748

   matches "sense and sensibility" and "response and responsibility",

1749

   but not "sense and responsibility". If case-sensitive (caseful)

1741

1750

   matching is in force at the time of the back reference, then

1742

1751

   the case of letters is relevant. For example,

1743

1752

1744

1753

   <literal>((?i)rah)\s+\1</literal>

1745

1754

1746

   matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even

1747

   though  the  original  capturing subpattern is matched

1755

   matches "rah rah" and "RAH RAH", but not "RAH rah", even

1756

   though the original capturing subpattern is matched

1748

1757

   case-insensitively (caselessly).

1749

1758

  </para>
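  <para>
   The two patterns above could be tried out as shown below. This is only a
   sketch; the array keys repeat the subject strings for readability.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// \1 must repeat the text that group 1 actually matched.
$pattern = '/(sens|respons)e and \1ibility/';
$austen = [
    'sense and sensibility'       => preg_match($pattern, 'sense and sensibility'),
    'response and responsibility' => preg_match($pattern, 'response and responsibility'),
    'sense and responsibility'    => preg_match($pattern, 'sense and responsibility'),
];
// $austen values are 1, 1, 0

// (?i) is confined to the group, so \1 is compared case-sensitively.
$rah = [
    'rah rah' => preg_match('/((?i)rah)\s+\1/', 'rah rah'), // 1
    'RAH RAH' => preg_match('/((?i)rah)\s+\1/', 'RAH RAH'), // 1
    'RAH rah' => preg_match('/((?i)rah)\s+\1/', 'RAH rah'), // 0
];

var_dump($austen, $rah);
?>
]]>
   </programlisting>
  </informalexample>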

1750

1759

  <para>

1751

   There may be more than one back reference to the  same  subpattern.

1752

   If  a  subpattern  has not actually been used in a

1753

   particular match, then any  back  references  to  it  always

1760

   There may be more than one back reference to the same subpattern.

1761

   If a subpattern has not actually been used in a

1762

   particular match, then any back references to it always

1754

1763

   fail. For example, the pattern

1755

1764

1756

1765

   <literal>(a|(bc))\2</literal>

1757

1766

1758

   always fails if it starts to match  "a"  rather  than  "bc".

1759

   Because  there  may  be up to 99 back references, all digits

1760

   following the backslash are taken as  part  of  a  potential

1767

   always fails if it starts to match "a" rather than "bc".

1768

   Because there may be up to 99 back references, all digits

1769

   following the backslash are taken as part of a potential

1761

1770

   back reference number. If the pattern continues with a digit

1762

1771

   character, then some delimiter must be used to terminate the

1763

1772

   back reference. If the <link

1764

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>  option

1765

   is set, this can be whitespace.  Otherwise an empty comment can be used.

1773

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option

1774

   is set, this can be whitespace. Otherwise an empty comment can be used.

1766

1775

  </para>

1767

1776

  <para>

1768

1777

   A back reference that occurs inside the parentheses to which

1769

   it  refers  fails when the subpattern is first used, so, for

1770

   example, (a\1) never matches.  However, such references  can

1778

   it refers fails when the subpattern is first used, so, for

1779

   example, (a\1) never matches. However, such references can

1771

1780

   be useful inside repeated subpatterns. For example, the pattern

1772

1781

1773

1782

   <literal>(a|b\1)+</literal>

1774

1783

1775

   matches any number of "a"s and also "aba", "ababba" etc.  At

1784

   matches any number of "a"s and also "aba", "ababba" etc. At

1776

1785

   each iteration of the subpattern, the back reference matches

1777

   the character string corresponding to  the  previous  iteration.

1786

   the character string corresponding to the previous iteration.

1778

1787

   In order for this to work, the pattern must be such

1779

   that the first iteration does not need  to  match  the  back

1780

   reference.  This  can  be  done using alternation, as in the

1788

   that the first iteration does not need to match the back

1789

   reference. This can be done using alternation, as in the

1781

1790

   example above, or by a quantifier with a minimum of zero.

1782

1791

  </para>

1783

1792

  <para>

1784

   As of PHP 5.2.2, the <literal>\g</literal> escape sequence can be

1793

   The <literal>\g</literal> escape sequence can be

1785

1794

   used for absolute and relative referencing of subpatterns.

1786

1795

   This escape sequence must be followed by an unsigned number or a negative

1787

1796

   number, optionally enclosed in braces. The sequences <literal>\1</literal>,

...

@@ -1802,29 +1811,28 @@

1802

1811

  </para>

1803

1812

  <para>

1804

1813

   Back references to the named subpatterns can be achieved by

1805

   <literal>(?P=name)</literal> or, since PHP 5.2.2, also by

1806

   <literal>\k&lt;name&gt;</literal> or <literal>\k'name'</literal>.

1807

   Additionally PHP 5.2.4 added support for <literal>\k{name}</literal>

1808

   and <literal>\g{name}</literal>, and PHP 5.2.7 for

1809

   <literal>\g&lt;name&gt;</literal> and <literal>\g'name'</literal>.

1814

   <literal>(?P=name)</literal>,

1815

   <literal>\k&lt;name&gt;</literal>, <literal>\k'name'</literal>,

1816

   <literal>\k{name}</literal>, <literal>\g{name}</literal>,

1817

   <literal>\g&lt;name&gt;</literal> or <literal>\g'name'</literal>.

1810

1818

  </para>
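  <para>
   As an illustration, the sketch below uses four of these spellings to find
   a doubled word. The group name <literal>word</literal>, the subject
   strings and the <literal>\b</literal> anchors are assumptions made for
   this example.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
$patterns = [
    '/\b(?<word>\w+) (?P=word)\b/',
    '/\b(?<word>\w+) \k<word>\b/',
    '/\b(?<word>\w+) \k{word}\b/',
    '/\b(?<word>\w+) \g{word}\b/',
];

$hits = [];
$misses = [];
foreach ($patterns as $pattern) {
    // "the the" repeats a word, so every spelling finds a match.
    $hits[] = preg_match($pattern, 'it was the the best');
    // No word is repeated here, so none of them match.
    $misses[] = preg_match($pattern, 'it was the best');
}

var_dump($hits, $misses);
?>
]]>
   </programlisting>
  </informalexample>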

1811

1819

 </section>

1812

1820

1813

1821

 <section xml:id="regexp.reference.assertions">

1814

1822

  <title>Assertions</title>

1815

1823

  <para>

1816

   An assertion is  a  test  on  the  characters  following  or

1817

   preceding  the current matching point that does not actually

1818

   consume any characters. The simple assertions coded  as  \b,

1819

   \B,  \A,  \Z,  \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated

1820

   assertions are coded as  subpatterns.  There  are  two

1821

   kinds:  those that <emphasis>look ahead</emphasis> of the current position in the

1824

   An assertion is a test on the characters following or

1825

   preceding the current matching point that does not actually

1826

   consume any characters. The simple assertions coded as \b,

1827

   \B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated

1828

   assertions are coded as subpatterns. There are two

1829

   kinds: those that <emphasis>look ahead</emphasis> of the current position in the

1822

1830

   subject string, and those that <emphasis>look behind</emphasis> it.

1823

1831

  </para>

1824

1832

  <para>

1825

1833

   An assertion subpattern is matched in the normal way, except

1826

   that  it  does not cause the current matching position to be

1827

   changed. <emphasis>Lookahead</emphasis> assertions start with  (?=  for  positive

1834

   that it does not cause the current matching position to be

1835

   changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive

1828

1836

   assertions and (?! for negative assertions. For example,

1829

1837

1830

1838

   <literal>\w+(?=;)</literal>

...

@@ -1834,27 +1842,27 @@

1834

1842

1835

1843

   <literal>foo(?!bar)</literal>

1836

1844

1837

   matches any occurrence of "foo"  that  is  not  followed  by

1845

   matches any occurrence of "foo" that is not followed by

1838

1846

   "bar". Note that the apparently similar pattern

1839

1847

1840

1848

   <literal>(?!foo)bar</literal>

1841

1849

1842

   does not find an occurrence of "bar"  that  is  preceded  by

1850

   does not find an occurrence of "bar" that is preceded by

1843

1851

   something other than "foo"; it finds any occurrence of "bar"

1844

   whatsoever, because the assertion  (?!foo)  is  always  &true;

1845

   when  the  next  three  characters  are  "bar". A lookbehind

1852

   whatsoever, because the assertion (?!foo) is always &true;

1853

   when the next three characters are "bar". A lookbehind

1846

1854

   assertion is needed to achieve this effect.

1847

1855

  </para>
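  <para>
   The lookahead patterns above can be compared by looking at where each one
   matches. The sketch below uses <constant>PREG_OFFSET_CAPTURE</constant>
   for that purpose; the subject strings are invented.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// The semicolon is tested but is not part of the match.
preg_match('/\w+(?=;)/', 'x = value;', $word);
// $word[0] is 'value'

// Only the "foo" that is not followed by "bar" is found (offset 7).
preg_match_all('/foo(?!bar)/', 'foobar foobaz', $foo, PREG_OFFSET_CAPTURE);
// $foo[0] is [['foo', 7]]

// (?!foo) is tested where "bar" starts, so it never prevents the match.
preg_match('/(?!foo)bar/', 'foobar', $bar, PREG_OFFSET_CAPTURE);
// $bar[0] is ['bar', 3]

var_dump($word, $foo, $bar);
?>
]]>
   </programlisting>
  </informalexample>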

1848

1856

  <para>

1849

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;=  for  positive  assertions

1857

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions

1850

1858

   and (?&lt;! for negative assertions. For example,

1851

1859

1852

1860

   <literal>(?&lt;!foo)bar</literal>

1853

1861

1854

   does find an occurrence of "bar" that  is  not  preceded  by

1862

   does find an occurrence of "bar" that is not preceded by

1855

1863

   "foo". The contents of a lookbehind assertion are restricted

1856

   such that all the strings  it  matches  must  have  a  fixed

1857

   length.  However, if there are several alternatives, they do

1864

   such that all the strings it matches must have a fixed

1865

   length. However, if there are several alternatives, they do

1858

1866

   not all have to have the same fixed length. Thus

1859

1867

1860

1868

   <literal>(?&lt;=bullock|donkey)</literal>

...

@@ -1863,51 +1871,51 @@

1863

1871

1864

1872

   <literal>(?&lt;!dogs?|cats?)</literal>

1865

1873

1866

   causes an error at compile time. Branches  that  match  different

1874

   causes an error at compile time. Branches that match different

1867

1875

   length strings are permitted only at the top level of

1868

   a lookbehind assertion. This is an extension  compared  with

1869

   Perl  5.005,  which  requires all branches to match the same

1876

   a lookbehind assertion. This is an extension compared with

1877

   Perl 5.005, which requires all branches to match the same

1870

1878

   length of string. An assertion such as

1871

1879

1872

1880

   <literal>(?&lt;=ab(c|de))</literal>

1873

1881

1874

   is not permitted, because its single  top-level  branch  can

1882

   is not permitted, because its single top-level branch can

1875

1883

   match two different lengths, but it is acceptable if rewritten

1876

1884

   to use two top-level branches:

1877

1885

1878

1886

   <literal>(?&lt;=abc|abde)</literal>

1879

1887

1880

   The implementation of lookbehind  assertions  is,  for  each

1881

   alternative,  to  temporarily move the current position back

1882

   by the fixed width and then  try  to  match.  If  there  are

1883

   insufficient  characters  before  the  current position, the

1884

   match is deemed to fail.  Lookbehinds  in  conjunction  with

1885

   once-only  subpatterns can be particularly useful for matching

1886

   at the ends of strings; an example is given at  the  end

1888

   The implementation of lookbehind assertions is, for each

1889

   alternative, to temporarily move the current position back

1890

   by the fixed width and then try to match. If there are

1891

   insufficient characters before the current position, the

1892

   match is deemed to fail. Lookbehinds in conjunction with

1893

   once-only subpatterns can be particularly useful for matching

1894

   at the ends of strings; an example is given at the end

1887

1895

   of the section on once-only subpatterns.

1888

1896

  </para>
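The rule about top-level alternatives can be checked with a short script. This is a minimal sketch, not part of the original manual text: the subject strings and variable names are made up for illustration, and only the forms the text describes as permitted are exercised, because the handling of the rejected forms may differ between PCRE versions.

```php
<?php
// Each top-level alternative has its own fixed length (7 and 6 characters),
// so this lookbehind is accepted.
$pattern = '/(?<=bullock|donkey) cart/';

$afterDonkey = preg_match($pattern, 'a donkey cart'); // 1: "donkey" precedes " cart"
$afterHorse  = preg_match($pattern, 'a horse cart');  // 0: neither alternative precedes " cart"

// The rewritten form with two top-level branches (lengths 3 and 4).
$twoBranches = preg_match('/(?<=abc|abde)f/', 'xabdef', $m);

var_dump($afterDonkey, $afterHorse, $twoBranches, $m[0]);
```

For each alternative the matcher steps back by that alternative's own length, which is why "abde" is found even though "abc" is tried first.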

1889

1897

  <para>

1890

   Several assertions (of any sort) may  occur  in  succession.

1898

   Several assertions (of any sort) may occur in succession.

1891

1899

   For example,

1892

1900

1893

1901

   <literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>

1894

1902

1895

   matches "foo" preceded by three digits that are  not  "999".

1896

   Notice  that each of the assertions is applied independently

1897

   at the same point in the subject string. First  there  is  a

1898

   check  that  the  previous  three characters are all digits,

1903

   matches "foo" preceded by three digits that are not "999".

1904

   Notice that each of the assertions is applied independently

1905

   at the same point in the subject string. First there is a

1906

   check that the previous three characters are all digits,

1899

1907

   then there is a check that the same three characters are not

1900

   "999".   This  pattern  does not match "foo" preceded by six

1908

   "999". This pattern does not match "foo" preceded by six

1901

1909

   characters, the first of which are digits and the last three

1902

   of  which  are  not  "999".  For  example,  it doesn't match

1910

   of which are not "999". For example, it doesn't match

1903

1911

   "123abcfoo". A pattern to do that is

1904

1912

1905

1913

   <literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>

1906

1914

  </para>

1907

1915

  <para>

1908

   This time the first assertion looks  at  the  preceding  six

1909

   characters,  checking  that  the first three are digits, and

1910

   then the second assertion checks that  the  preceding  three

1916

   This time the first assertion looks at the preceding six

1917

   characters, checking that the first three are digits, and

1918

   then the second assertion checks that the preceding three

1911

1919

   characters are not "999".

1912

1920

  </para>
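The two patterns above can be compared side by side. The following is an illustrative sketch; the subject strings other than "123abcfoo" and the variable names are assumptions chosen for the demonstration.

```php
<?php
// Both assertions are applied at the same point, just before "foo".
$threeBack = '/(?<=\d{3})(?<!999)foo/';
// Here the first assertion looks back six characters, the second only three.
$sixBack   = '/(?<=\d{3}...)(?<!999)foo/';

$results = [
    'three: 123foo'    => preg_match($threeBack, '123foo'),    // 1
    'three: 999foo'    => preg_match($threeBack, '999foo'),    // 0
    'three: 123abcfoo' => preg_match($threeBack, '123abcfoo'), // 0
    'six: 123abcfoo'   => preg_match($sixBack, '123abcfoo'),   // 1
    'six: 123999foo'   => preg_match($sixBack, '123999foo'),   // 0
];

var_dump($results);
```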

1913

1921

  <para>

...

@@ -1915,26 +1923,26 @@

1915

1923

1916

1924

   <literal>(?&lt;=(?&lt;!foo)bar)baz</literal>

1917

1925

1918

   matches an occurrence of "baz" that  is  preceded  by  "bar"

1926

   matches an occurrence of "baz" that is preceded by "bar"

1919

1927

   which in turn is not preceded by "foo", while

1920

1928

1921

1929

   <literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>

1922

1930

1923

   is another pattern which matches  "foo"  preceded  by  three

1931

   is another pattern which matches "foo" preceded by three

1924

1932

   digits and any three characters that are not "999".

1925

1933

  </para>
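A small sketch of the nested forms, assuming hypothetical subject strings, shows how the inner assertion is evaluated from the position the outer lookbehind has stepped back to:

```php
<?php
// "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
$nested = '/(?<=(?<!foo)bar)baz/';

$plainBar = preg_match($nested, 'barbaz');    // 1: nothing precedes "bar"
$fooBar   = preg_match($nested, 'foobarbaz'); // 0: "bar" is preceded by "foo"

// The inner assertion is checked after the six characters have been consumed,
// so it inspects the three characters immediately before "foo".
$nestedDigits = '/(?<=\d{3}...(?<!999))foo/';

$notNines = preg_match($nestedDigits, '123abcfoo'); // 1
$nines    = preg_match($nestedDigits, '123999foo'); // 0

var_dump($plainBar, $fooBar, $notNines, $nines);
```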

1926

1934

  <para>

1927

1935

   Assertion subpatterns are not capturing subpatterns, and may

1928

   not  be  repeated,  because  it makes no sense to assert the

1929

   same thing several times. If any kind of assertion  contains

1930

   capturing  subpatterns  within it, these are counted for the

1936

   not be repeated, because it makes no sense to assert the

1937

   same thing several times. If any kind of assertion contains

1938

   capturing subpatterns within it, these are counted for the

1931

1939

   purposes of numbering the capturing subpatterns in the whole

1932

   pattern.   However,  substring capturing is carried out only

1933

   for positive assertions, because it does not make sense  for

1940

   pattern. However, substring capturing is carried out only

1941

   for positive assertions, because it does not make sense for

1934

1942

   negative assertions.

1935

1943

  </para>
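The numbering and capturing behaviour can be observed through the matches array of preg_match(). This is a sketch with made-up patterns and subjects; it relies on PHP reporting an unset group that precedes a set one as an empty string.

```php
<?php
// Positive assertion: the group inside the lookahead captures "hello",
// while the match itself consumes only the first character.
preg_match('/(?=(\w+))\w/', 'hello', $positive);

// Negative assertion: group 1 is still counted for numbering, so (b) is
// group 2, but group 1 never holds a captured substring.
preg_match('/(?!(a))(b)/', 'b', $negative);

var_dump($positive, $negative);
```

Here `$positive` is `['h', 'hello']`, and `$negative` is `['b', '', 'b']`: the middle entry is the empty placeholder for the group inside the negative assertion.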

1936

1944

  <para>

1937

   Assertions count towards the maximum  of  200  parenthesized

1945

   Assertions count towards the maximum of 200 parenthesized

1938

1946

   subpatterns.

1939

1947

  </para>

1940

1948

 </section>

...

@@ -1942,17 +1950,17 @@

1942

1950

 <section xml:id="regexp.reference.onlyonce">

1943

1951

  <title>Once-only subpatterns</title>

1944

1952

  <para>

1945

   With both maximizing and minimizing repetition,  failure  of

1946

   what  follows  normally  causes  the repeated item to be

1953

   With both maximizing and minimizing repetition, failure of

1954

   what follows normally causes the repeated item to be

1947

1955

   re-evaluated to see if a different number of repeats allows the

1948

   rest  of  the  pattern  to  match. Sometimes it is useful to

1949

   prevent this, either to change the nature of the  match,  or

1950

   to  cause  it fail earlier than it otherwise might, when the

1951

   author of the pattern knows there is no  point  in  carrying

1956

   rest of the pattern to match. Sometimes it is useful to

1957

   prevent this, either to change the nature of the match, or

1958

   to cause it to fail earlier than it otherwise might, when the

1959

   author of the pattern knows there is no point in carrying

1952

1960

   on.

1953

1961

  </para>

1954

1962

  <para>

1955

   Consider, for example, the pattern \d+foo  when  applied  to

1963

   Consider, for example, the pattern \d+foo when applied to

1956

1964

   the subject line

1957

1965

1958

1966

   <literal>123456bar</literal>

...

@@ -1960,108 +1968,108 @@

1960

1968

  <para>

1961

1969

   After matching all 6 digits and then failing to match "foo",

1962

1970

   the normal action of the matcher is to try again with only 5

1963

   digits matching the \d+ item, and then with 4,  and  so  on,

1971

   digits matching the \d+ item, and then with 4, and so on,

1964

1972

   before ultimately failing. Once-only subpatterns provide the

1965

   means for specifying that once a portion of the pattern  has

1966

   matched,  it  is  not to be re-evaluated in this way, so the

1967

   matcher would give up immediately on failing to match  "foo"

1968

   the  first  time.  The  notation  is another kind of special

1973

   means for specifying that once a portion of the pattern has

1974

   matched, it is not to be re-evaluated in this way, so the

1975

   matcher would give up immediately on failing to match "foo"

1976

   the first time. The notation is another kind of special

1969

1977

   parenthesis, starting with (?&gt; as in this example:

1970

1978

1971

1979

   <literal>(?&gt;\d+)foo</literal>

1972

1980

  </para>

1973

1981

  <para>

1974

   This kind of parenthesis "locks up" the  part of the pattern

1975

   it  contains once it has matched, and a failure further into

1976

   the pattern is prevented from backtracking  into  it.

1977

   Backtracking  past  it to previous items, however, works as normal.

1982

   This kind of parenthesis "locks up" the part of the pattern

1983

   it contains once it has matched, and a failure further into

1984

   the pattern is prevented from backtracking into it.

1985

   Backtracking past it to previous items, however, works as normal.

1978

1986

  </para>

1979

1987

  <para>

1980

1988

   An alternative description is that a subpattern of this type

1981

   matches  the  string  of  characters that an identical standalone

1989

   matches the string of characters that an identical standalone

1982

1990

   pattern would match, if anchored at the current point

1983

1991

   in the subject string.

1984

1992

  </para>

1985

1993

  <para>

1986

   Once-only subpatterns are not capturing subpatterns.  Simple

1987

   cases  such as the above example can be thought of as a maximizing

1988

   repeat that must  swallow  everything  it  can.  So,

1994

   Once-only subpatterns are not capturing subpatterns. Simple

1995

   cases such as the above example can be thought of as a maximizing

1996

   repeat that must swallow everything it can. So,

1989

1997

   while both \d+ and \d+? are prepared to adjust the number of

1990

   digits they match in order to make the rest of  the  pattern

1998

   digits they match in order to make the rest of the pattern

1991

1999

   match, (?&gt;\d+) can only match an entire sequence of digits.

1992

2000

  </para>
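The effect is easiest to see where it changes the outcome rather than only the speed. The sketch below is illustrative; the second pair of patterns and the subject "123" are assumptions added for the demonstration.

```php
<?php
$subject = '123456bar';

// Both fail, but the first retries with 5, 4, ... digits before giving up.
$plain  = preg_match('/\d+foo/', $subject);     // 0
$atomic = preg_match('/(?>\d+)foo/', $subject); // 0

// An ordinary repeat gives back one digit so that the trailing \d can match;
// the once-only group keeps all the digits it matched, so nothing is left.
$greedyGivesBack = preg_match('/\d+\d/', '123');     // 1
$atomicKeepsAll  = preg_match('/(?>\d+)\d/', '123'); // 0

var_dump($plain, $atomic, $greedyGivesBack, $atomicKeepsAll);
```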

1993

2001

  <para>

1994

   This construction can of course contain arbitrarily  complicated

2002

   This construction can of course contain arbitrarily complicated

1995

2003

   subpatterns, and it can be nested.

1996

2004

  </para>

1997

2005

  <para>

1998

2006

   Once-only subpatterns can be used in conjunction with

1999

   lookbehind assertions  to specify efficient matching at the end

2007

   lookbehind assertions to specify efficient matching at the end

2000

2008

   of the subject string. Consider a simple pattern such as

2001

2009

2002

2010

   <literal>abcd$</literal>

2003

2011

2004

   when applied to a long string which does not match.  Because

2005

   matching  proceeds  from  left  to right, PCRE will look for

2012

   when applied to a long string which does not match. Because

2013

   matching proceeds from left to right, PCRE will look for

2006

2014

   each "a" in the subject and then see if what follows matches

2007

2015

   the rest of the pattern. If the pattern is specified as

2008

2016

2009

2017

   <literal>^.*abcd$</literal>

2010

2018

2011

   then the initial .* matches the entire string at first,  but

2012

   when  this  fails  (because  there  is no following "a"), it

2019

   then the initial .* matches the entire string at first, but

2020

   when this fails (because there is no following "a"), it

2013

2021

   backtracks to match all but the last character, then all but

2014

   the  last  two  characters, and so on. Once again the search

2015

   for "a" covers the entire string, from right to left, so  we

2022

   the last two characters, and so on. Once again the search

2023

   for "a" covers the entire string, from right to left, so we

2016

2024

   are no better off. However, if the pattern is written as

2017

2025

2018

2026

   <literal>^(?>.*)(?&lt;=abcd)</literal>

2019

2027

2020

   then there can be no backtracking for the .*  item;  it  can

2021

   match  only  the  entire  string.  The subsequent lookbehind

2028

   then there can be no backtracking for the .* item; it can

2029

   match only the entire string. The subsequent lookbehind

2022

2030

   assertion does a single test on the last four characters. If

2023

   it  fails,  the  match  fails immediately. For long strings,

2031

   it fails, the match fails immediately. For long strings,

2024

2032

   this approach makes a significant difference to the processing time.

2025

2033

  </para>
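A sketch of the end-of-string test follows. The 1000-character filler and the variable names are hypothetical, and the subjects are assumed to contain no newlines, since . does not match a newline by default.

```php
<?php
$endsWithAbcd = '/^(?>.*)(?<=abcd)/';

$hit  = preg_match($endsWithAbcd, str_repeat('x', 1000) . 'abcd'); // 1
$miss = preg_match($endsWithAbcd, str_repeat('x', 1000) . 'abce'); // 0

var_dump($hit, $miss);
```

In the failing case the once-only group has already consumed the whole subject, so the lookbehind is tested once at the end and no shorter match for .* is tried.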

2026

2034

  <para>

2027

2035

   When a pattern contains an unlimited repeat inside a subpattern

2028

2036

   that can itself be repeated an unlimited number of

2029

   times, the use of a once-only subpattern is the only way  to

2030

   avoid  some  failing matches taking a very long time indeed.

2037

   times, the use of a once-only subpattern is the only way to

2038

   avoid some failing matches taking a very long time indeed.

2031

2039

   The pattern

2032

2040

2033

2041

   <literal>(\D+|&lt;\d+>)*[!?]</literal>

2034

2042

2035

   matches an unlimited number of substrings that  either  consist

2036

   of  non-digits,  or digits enclosed in &lt;>, followed by

2043

   matches an unlimited number of substrings that either consist

2044

   of non-digits, or digits enclosed in &lt;>, followed by

2037

2045

   either ! or ?. When it matches, it runs quickly. However, if

2038

2046

   it is applied to

2039

2047

2040

2048

   <literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>

2041

2049

2042

   it takes a long  time  before  reporting  failure.  This  is

2050

   it takes a long time before reporting failure. This is

2043

2051

   because the string can be divided between the two repeats in

2044

2052

   a large number of ways, and all have to be tried. (The example

2045

   used  [!?]  rather  than a single character at the end,

2046

   because both PCRE and Perl have an optimization that  allows

2047

   for  fast  failure  when  a  single  character is used. They

2048

   remember the last single character that is  required  for  a

2049

   match,  and  fail early if it is not present in the string.)

2053

   used [!?] rather than a single character at the end,

2054

   because both PCRE and Perl have an optimization that allows

2055

   for fast failure when a single character is used. They

2056

   remember the last single character that is required for a

2057

   match, and fail early if it is not present in the string.)

2050

2058

   If the pattern is changed to

2051

2059

2052

2060

   <literal>((?>\D+)|&lt;\d+>)*[!?]</literal>

2053

2061

2054

   sequences of non-digits cannot be broken, and  failure  happens quickly.

2062

   sequences of non-digits cannot be broken, and failure happens quickly.

2055

2063

  </para>
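The once-only form can be tried on the failing subject from the text. This is a sketch only: the slow pattern is deliberately not run against the long subject, because how long it takes (or whether a backtracking limit stops it) depends on the PCRE build and configuration. The short subject '<123>!' is an assumption added to show both forms matching.

```php
<?php
$slow   = '/(\D+|<\d+>)*[!?]/';
$atomic = '/((?>\D+)|<\d+>)*[!?]/';

// Both forms accept a subject that ends in "!".
$slowMatches   = preg_match($slow, '<123>!');   // 1
$atomicMatches = preg_match($atomic, '<123>!'); // 1

// 52 "a" characters and no "!" or "?": the once-only form reports failure
// without trying every way of splitting the run of non-digits.
$atomicFails = preg_match($atomic, str_repeat('a', 52)); // 0

var_dump($slowMatches, $atomicMatches, $atomicFails);
```

Note that \D also matches ! and ?, so on some subjects the once-only form may report a different matched substring than the original pattern; only the return values are compared here.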

2056

2064

 </section>

2057

2065

2058

2066

 <section xml:id="regexp.reference.conditional">

2059

2067

  <title>Conditional subpatterns</title>

2060

2068

  <para>

2061

   It is possible to cause the matching process to obey a  subpattern

2062

   conditionally  or to choose between two alternative

2063

   subpatterns, depending on the result  of  an  assertion,  or

2064

   whether  a previous capturing subpattern matched or not. The

2069

   It is possible to cause the matching process to obey a subpattern

2070

   conditionally or to choose between two alternative

2071

   subpatterns, depending on the result of an assertion, or

2072

   whether a previous capturing subpattern matched or not. The

2065

2073

   two possible forms of conditional subpattern are

2066

2074

  </para>

2067

2075

...

@@ -2075,39 +2083,39 @@

2075

2083

  </informalexample>

2076

2084

  <para>

2077

2085

   If the condition is satisfied, the yes-pattern is used; otherwise

2078

   the  no-pattern  (if  present) is used. If there are

2086

   the no-pattern (if present) is used. If there are

2079

2087

   more than two alternatives in the subpattern, a compile-time

2080

2088

   error occurs.

2081

2089

  </para>

2082

2090

  <para>

2083

   There are two kinds of condition. If the  text  between  the

2084

   parentheses  consists  of  a  sequence  of  digits, then the

2085

   condition is satisfied if the capturing subpattern  of  that

2086

   number  has  previously matched. Consider the following pattern,

2087

   which contains non-significant white space to make  it

2088

   more  readable  (assume  the  <link

2091

   There are two kinds of condition. If the text between the

2092

   parentheses consists of a sequence of digits, then the

2093

   condition is satisfied if the capturing subpattern of that

2094

   number has previously matched. Consider the following pattern,

2095

   which contains non-significant white space to make it

2096

   more readable (assume the <link

2089

2097

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2090

   option)  and to divide it into three parts for ease of discussion:

2098

   option) and to divide it into three parts for ease of discussion:

2091

2099

  </para>

2092

2100

  <informalexample>

2093

2101

   <programlisting>

2094

2102

<![CDATA[

2095

( \( )?    [^()]+    (?(1) \) )

2103

( \( )? [^()]+ (?(1) \) )

2096

2104

]]>

2097

2105

   </programlisting>

2098

2106

  </informalexample>

2099

2107

  <para>

2100

   The first part matches an optional opening parenthesis,  and

2101

   if  that character is present, sets it as the first captured

2102

   substring. The second part matches one  or  more  characters

2103

   that  are  not  parentheses. The third part is a conditional

2104

   subpattern that tests whether the first set  of  parentheses

2105

   matched  or  not.  If  they did, that is, if subject started

2106

   with an opening parenthesis, the condition is &true;,  and  so

2107

   the  yes-pattern  is  executed  and a closing parenthesis is

2108

   required. Otherwise, since no-pattern is  not  present,  the

2109

   subpattern  matches  nothing.  In  other words, this pattern

2110

   matches a sequence of non-parentheses,  optionally  enclosed

2108

   The first part matches an optional opening parenthesis, and

2109

   if that character is present, sets it as the first captured

2110

   substring. The second part matches one or more characters

2111

   that are not parentheses. The third part is a conditional

2112

   subpattern that tests whether the first set of parentheses

2113

   matched or not. If they did, that is, if the subject started

2114

   with an opening parenthesis, the condition is &true;, and so

2115

   the yes-pattern is executed and a closing parenthesis is

2116

   required. Otherwise, since no-pattern is not present, the

2117

   subpattern matches nothing. In other words, this pattern

2118

   matches a sequence of non-parentheses, optionally enclosed

2111

2119

   in parentheses.

2112

2120

  </para>
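The behaviour described above can be checked as follows. In this sketch the ^ and $ anchors are an addition that is not part of the pattern in the text; they force the whole subject to be tested. The subjects are made up for illustration.

```php
<?php
// x modifier: white space in the pattern is not significant.
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

$results = [
    '(abc)' => preg_match($pattern, '(abc)'), // 1: opened, so ")" is required and present
    'abc'   => preg_match($pattern, 'abc'),   // 1: not opened, the conditional matches nothing
    '(abc'  => preg_match($pattern, '(abc'),  // 0: opened, but the closing ")" is missing
    'abc)'  => preg_match($pattern, 'abc)'),  // 0: not opened, so ")" is left unmatched before $
];

var_dump($results);
```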

2113

2121

  <para>

...

@@ -2116,10 +2124,10 @@

2116

2124

   level", the condition is false.

2117

2125

  </para>

2118

2126

  <para>

2119

   If the condition is not a sequence of digits or (R), it must be  an

2120

   assertion.  This  may be a positive or negative lookahead or

2121

   lookbehind assertion. Consider this pattern, again  containing

2122

   non-significant  white space, and with the two alternatives on

2127

   If the condition is not a sequence of digits or (R), it must be an

2128

   assertion. This may be a positive or negative lookahead or

2129

   lookbehind assertion. Consider this pattern, again containing

2130

   non-significant white space, and with the two alternatives on

2123

2131

   the second line:

2124

2132

  </para>

2125

2133

...

@@ -2127,18 +2135,18 @@

2127

2135

   <programlisting>

2128

2136

<![CDATA[

2129

2137

(?(?=[^a-z]*[a-z])

2130

\d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

2138

\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

2131

2139

]]>

2132

2140

   </programlisting>

2133

2141

  </informalexample>

2134

2142

  <para>

2135

2143

   The condition is a positive lookahead assertion that matches

2136

2144

   an optional sequence of non-letters followed by a letter. In

2137

   other words, it tests for  the  presence  of  at  least  one

2138

   letter  in the subject. If a letter is found, the subject is

2139

   matched against  the  first  alternative;  otherwise  it  is

2140

   matched  against the second. This pattern matches strings in

2141

   one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are

2145

   other words, it tests for the presence of at least one

2146

   letter in the subject. If a letter is found, the subject is

2147

   matched against the first alternative; otherwise it is

2148

   matched against the second. This pattern matches strings in

2149

   one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are

2142

2150

   letters and dd are digits.

2143

2151

  </para>
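A short sketch of the assertion-based conditional is shown below. As an assumption for the demonstration, the pattern is written on one line and wrapped in ^ and $ so that the entire subject must fit one of the two forms; the subjects are hypothetical.

```php
<?php
$pattern = '/^(?(?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

$results = [
    '12-abc-34' => preg_match($pattern, '12-abc-34'), // 1: letter found, first alternative
    '12-34-56'  => preg_match($pattern, '12-34-56'),  // 1: no letter, second alternative
    '12-ab-34'  => preg_match($pattern, '12-ab-34'),  // 0: letter found, only the first alternative is tried
    '12-345-67' => preg_match($pattern, '12-345-67'), // 0: no letter, second alternative does not fit
];

var_dump($results);
```

The third case shows that the condition selects one alternative; the other is not tried as a fallback.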

2144

2152

 </section>

...

@@ -2146,31 +2154,66 @@

2146

2154

 <section xml:id="regexp.reference.comments">

2147

2155

  <title>Comments</title>

2148

2156

  <para>

2149

   The  sequence  (?#  marks  the  start  of  a  comment  which

2150

   continues   up  to  the  next  closing  parenthesis.  Nested

2157

   The sequence (?# marks the start of a comment which

2158

   continues up to the next closing parenthesis. Nested

2151

2159

   parentheses are not permitted. The characters that make up a

2152

2160

   comment play no part in the pattern matching at all.

2153

2161

  </para>

2154

2162

  <para>

2155

2163

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2156

   option is set, an unescaped # character outside  a character class

2164

   option is set, an unescaped # character outside a character class

2157

2165

   introduces a comment that continues up to the next newline character

2158

2166

   in the pattern.

2159

2167

  </para>

2168

  <para>

2169

   <example>

2170

    <title>Usage of comments in PCRE pattern</title>

2171

    <programlisting role="php">

2172

<![CDATA[

2173

<?php

2174

2175

$subject = 'test';

2176

2177

/* (?# can be used to add comments without enabling PCRE_EXTENDED */

2178

$match = preg_match('/te(?# this is a comment)st/', $subject);

2179

var_dump($match);

2180

2181

/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */

2182

$match = preg_match('/te   #~~~~

2183

st/', $subject);

2184

var_dump($match);

2185

2186

/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything

2187

   that follows an unescaped # on the same line is ignored */

2188

$match = preg_match('/te    #~~~~

2189

st/x', $subject);

2190

var_dump($match);

2191

]]>

2192

    </programlisting>

2193

    &example.outputs;

2194

    <screen>

2195

<![CDATA[

2196

int(1)

2197

int(0)

2198

int(1)

2199

]]>

2200

    </screen>

2201

   </example>

2202

  </para>

2160

2203

 </section>

2161

2204

2162

2205

 <section xml:id="regexp.reference.recursive">

2163

2206

  <title>Recursive patterns</title>

2164

2207

  <para>

2165

   Consider the problem of matching a  string  in  parentheses,

2166

   allowing  for  unlimited nested parentheses. Without the use

2167

   of recursion, the best that can be done is to use a  pattern

2168

   that  matches  up  to some fixed depth of nesting. It is not

2169

   possible to handle an arbitrary nesting depth. Perl 5.6  has

2170

   provided   an  experimental  facility  that  allows  regular

2171

   expressions to recurse (among other things).  The  special

2172

   item (?R) is  provided for  the specific  case of recursion.

2173

   This PCRE  pattern  solves the  parentheses  problem (assume

2208

   Consider the problem of matching a string in parentheses,

2209

   allowing for unlimited nested parentheses. Without the use

2210

   of recursion, the best that can be done is to use a pattern

2211

   that matches up to some fixed depth of nesting. It is not

2212

   possible to handle an arbitrary nesting depth. Perl 5.6 has

2213

   provided an experimental facility that allows regular

2214

   expressions to recurse (among other things). The special

2215

   item (?R) is provided for the specific case of recursion.

2216

   This PCRE pattern solves the parentheses problem (assume

2174

2217

   the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

2175

2218

   option is set so that white space is

2176

2219

   ignored):

...

@@ -2179,45 +2222,45 @@

2179

2222

  </para>

2180

2223

  <para>

2181

2224

   First it matches an opening parenthesis. Then it matches any

2182

   number  of substrings which can either be a sequence of

2183

   non-parentheses, or a recursive  match  of  the  pattern  itself

2225

   number of substrings which can either be a sequence of

2226

   non-parentheses, or a recursive match of the pattern itself

2184

2227

   (i.e. a correctly parenthesized substring). Finally there is

2185

2228

   a closing parenthesis.

2186

2229

  </para>

2187

2230

  <para>

2188

   This particular example pattern  contains  nested  unlimited

2231

   This particular example pattern contains nested unlimited

2189

2232

   repeats, and so the use of a once-only subpattern for matching

2190

   strings of non-parentheses is  important  when  applying

2191

   the  pattern to strings that do not match. For example, when

2233

   strings of non-parentheses is important when applying

2234

   the pattern to strings that do not match. For example, when

2192

2235

   it is applied to

2193

2236

2194

2237

   <literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>

2195

2238

2196

   it yields "no match" quickly. However, if a  once-only  subpattern

2197

   is  not  used,  the match runs for a very long time

2198

   indeed because there are so many different ways the + and  *

2199

   repeats  can carve up the subject, and all have to be tested

2239

   it yields "no match" quickly. However, if a once-only subpattern

2240

   is not used, the match runs for a very long time

2241

   indeed because there are so many different ways the + and *

2242

   repeats can carve up the subject, and all have to be tested

2200

2243

   before failure can be reported.

2201

2244

  </para>

2202

2245

  <para>

2203

   The values set for any capturing subpatterns are those  from

2246

   The values set for any capturing subpatterns are those from

2204

2247

   the outermost level of the recursion at which the subpattern

2205

2248

   value is set. If the pattern above is matched against

2206

2249

2207

2250

   <literal>(ab(cd)ef)</literal>

2208

2251

2209

   the value for the capturing parentheses is  "ef",  which  is

2210

   the  last  value  taken  on  at the top level. If additional

2252

   the value for the capturing parentheses is "ef", which is

2253

   the last value taken on at the top level. If additional

2211

2254

   parentheses are added, giving

2212

2255

2213

2256

   <literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>

2214

2257

   then the string they capture

2215

2258

   is "ab(cd)ef", the contents of the top level parentheses. If

2216

   there are more than 15 capturing parentheses in  a  pattern,

2217

   PCRE  has  to  obtain  extra  memory  to store data during a

2218

   recursion, which it does by using  pcre_malloc,  freeing  it

2219

   via  pcre_free  afterwards. If no memory can be obtained, it

2220

   saves data for the first 15 capturing parentheses  only,  as

2259

   there are more than 15 capturing parentheses in a pattern,

2260

   PCRE has to obtain extra memory to store data during a

2261

   recursion, which it does by using pcre_malloc, freeing it

2262

   via pcre_free afterwards. If no memory can be obtained, it

2263

   saves data for the first 15 capturing parentheses only, as

2221

2264

   there is no way to give an out-of-memory error from within a

2222

2265

   recursion.

2223

2266

  </para>
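The captured values described above can be inspected from PHP. This is a sketch using the pattern with the additional parentheses; the unbalanced subject and the variable names are assumptions added for illustration.

```php
<?php
// x modifier: white space in the pattern is ignored.
$pattern = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';

$balanced = preg_match($pattern, '(ab(cd)ef)', $m);
// $m[0] is the whole match, $m[1] the contents of the top-level parentheses,
// $m[2] the last value the inner group took on at the top level.
var_dump($m);

// No closing parenthesis anywhere, so no balanced substring can be found.
$unbalanced = preg_match($pattern, '(ab(cd');
var_dump($unbalanced);
```

For the first subject this gives `$m[0] === '(ab(cd)ef)'`, `$m[1] === 'ab(cd)ef'` and `$m[2] === 'ef'`.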

...

@@ -2256,75 +2299,75 @@

2256

2299

  <title>Performance</title>

2257

2300

  <para>

2258

2301

   Certain items that may appear in patterns are more efficient

2259

   than  others.  It is more efficient to use a character class

2302

   than others. It is more efficient to use a character class

2260

2303

   like [aeiou] than a set of alternatives such as (a|e|i|o|u).

2261

   In  general,  the  simplest  construction  that provides the

2262

   required behaviour is usually the  most  efficient.  Jeffrey

2263

   Friedl's  book contains a lot of discussion about optimizing

2304

   In general, the simplest construction that provides the

2305

   required behaviour is usually the most efficient. Jeffrey

2306

   Friedl's book contains a lot of discussion about optimizing

2264

2307

   regular expressions for efficient performance.

2265

2308

  </para>

2266

2309

  <para>

2267

2310

   When a pattern begins with .* and the <link

2268

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  option  is

2269

   set,  the  pattern  is implicitly anchored by PCRE, since it

2311

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is

2312

   set, the pattern is implicitly anchored by PCRE, since it

2270

2313

   can match only at the start of a subject string. However, if

2271

2314

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

2272

2315

   is not set, PCRE cannot make this optimization,

2273

   because the . metacharacter does not then match  a  newline,

2316

   because the . metacharacter does not then match a newline,

2274

2317

   and if the subject string contains newlines, the pattern may

2275

   match from the character immediately following one  of  them

2318

   match from the character immediately following one of them

2276

2319

   instead of from the very start. For example, the pattern

2277

2320

2278

2321

   <literal>(.*) second</literal>

2279

2322

2280

2323

   matches the subject "first\nand second" (where \n stands for

2281

2324

   a newline character) with the first captured substring being

2282

   "and". In order to do this, PCRE  has  to  retry  the  match

2325

   "and". In order to do this, PCRE has to retry the match

2283

2326

   starting after every newline in the subject.

2284

2327

  </para>
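The difference PCRE_DOTALL makes to this example can be seen with the s modifier. A minimal sketch, with hypothetical variable names:

```php
<?php
$subject = "first\nand second";

// Without PCRE_DOTALL the match has to restart after the newline.
preg_match('/(.*) second/', $subject, $withoutDotall);

// With PCRE_DOTALL (the s modifier) .* can cross the newline, so the
// match is found from the very start of the subject.
preg_match('/(.*) second/s', $subject, $withDotall);

var_dump($withoutDotall[1], $withDotall[1]);
```

The first capture is "and"; the second is "first\nand", including the newline.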

2285

2328

  <para>

2286

2329

   If you are using such a pattern with subject strings that do

2287

   not  contain  newlines,  the best performance is obtained by

2330

   not contain newlines, the best performance is obtained by

2288

2331

   setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,

2289

   or starting the  pattern  with  ^.*  to

2290

   indicate  explicit anchoring. That saves PCRE from having to

2332

   or starting the pattern with ^.* to

2333

   indicate explicit anchoring. That saves PCRE from having to

2291

2334

   scan along the subject looking for a newline to restart at.

2292

2335

  </para>

2293

2336

  <para>

2294

   Beware of patterns that contain nested  indefinite  repeats.

2295

   These  can  take a long time to run when applied to a string

2337

   Beware of patterns that contain nested indefinite repeats.

2338

   These can take a long time to run when applied to a string

2296

2339

   that does not match. Consider the pattern fragment

2297

2340

2298

2341

   <literal>(a+)*</literal>

2299

2342

  </para>

2300

2343

  <para>

2301

   This can match "aaaa" in 33 different ways, and this  number

2302

   increases  very  rapidly  as  the string gets longer. (The *

2303

   repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of

2304

   those  cases other than 0, the + repeats can match different

2344

   This can match "aaaa" in 16 different ways, and this number

2345

   increases very rapidly as the string gets longer. (The *

2346

   repeat can match 0, 1, 2, 3, or 4 times, and for each of

2347

   those cases other than 0 or 4, the + repeats can match different

2305

2348

   numbers of times.) When the remainder of the pattern is such

2306

   that  the entire match is going to fail, PCRE has in principle

2307

   to try every possible variation, and this  can  take  an

2349

   that the entire match is going to fail, PCRE has in principle

2350

   to try every possible variation, and this can take an

2308

2351

   extremely long time.

2309

2352

  </para>

2310

2353

  <para>

2311

   An optimization catches some of the more simple  cases  such

2354

   An optimization catches some of the more simple cases such

2312

2355

   as

2313

2356

2314

2357

   <literal>(a+)*b</literal>

2315

2358

2316

   where a literal character follows. Before embarking  on  the

2359

   where a literal character follows. Before embarking on the

2317

2360

   standard matching procedure, PCRE checks that there is a "b"

2318

   later in the subject string, and if there is not,  it  fails

2319

   the  match  immediately. However, when there is no following

2320

   literal this optimization cannot be used. You  can  see  the

2361

   later in the subject string, and if there is not, it fails

2362

   the match immediately. However, when there is no following

2363

   literal this optimization cannot be used. You can see the

2321

2364

   difference by comparing the behaviour of

2322

2365

2323

2366

   <literal>(a+)*\d</literal>

2324

2367

2325

   with the pattern above. The former gives  a  failure  almost

2326

   instantly  when  applied  to a whole line of "a" characters,

2327

   whereas the latter takes an appreciable  time  with  strings

2368

   with the pattern above. The former gives a failure almost

2369

   instantly when applied to a whole line of "a" characters,

2370

   whereas the latter takes an appreciable time with strings

2328

2371

   longer than about 20 characters.

2329

2372

  </para>
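One possible mitigation, borrowed from the section on once-only subpatterns rather than stated in this section, is to make the inner repeat once-only. The sketch below assumes hypothetical subjects and does not time the unprotected pattern on a long subject, since that depends on the PCRE build and its limits.

```php
<?php
$nested = '/(a+)*\d/';
$atomic = '/(?>a+)*\d/';

// Both forms accept a subject that does contain a digit.
$nestedMatches = preg_match($nested, 'aaa7'); // 1
$atomicMatches = preg_match($atomic, 'aaa7'); // 1

// A line of 200 "a" characters with no digit: each run of "a" is taken
// as a whole, so the ways of splitting it are never explored.
$atomicFails = preg_match($atomic, str_repeat('a', 200)); // 0

var_dump($nestedMatches, $atomicMatches, $atomicFails);
```

The once-only group does not capture, so this rewrite is only suitable when the captured substring of the original group is not needed.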

2330

2373

 </section>

2331

2374

Generated: 30 Apr 2024 11:18:30
