PHP: Documentation Tools

reference/pcre/pattern.syntax.xml
bb4abab22bf0204b4dba0140ac5fc9daa6888e0f

...

@@ -1,28 +1,28 @@

<?xml version="1.0" encoding="utf-8"?>

<!-- $Revision$ -->

<!-- splitted from ./en/functions/pcre.xml, last change in rev 1.2 -->

<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook">

<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">

 <title>Pattern Syntax</title>

 <titleabbrev>PCRE regex syntax</titleabbrev>

 <section xml:id="regexp.introduction">

  <title>Introduction</title>

  <para>

   The syntax and semantics of  the  regular  expressions

   supported  by PCRE are described below. Regular expressions are

   also described in the Perl documentation and in a number  of

   other  books,  some  of which have copious examples. Jeffrey

   Friedl's  "Mastering  Regular  Expressions",  published   by

   O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.

   The syntax and semantics of the regular expressions

   supported by PCRE are described in this section. Regular expressions are

   also described in the Perl documentation and in a number of

   other books, some of which have copious examples. Jeffrey

   Friedl's "Mastering Regular Expressions", published by

   O'Reilly (ISBN 1-56592-257-3), covers them in great detail.

   The description here is intended as reference documentation.

  </para>

  <para>

   A regular expression is a pattern that is matched against  a

   A regular expression is a pattern that is matched against a

   subject string from left to right. Most characters stand for

   themselves in a pattern, and match the corresponding

   characters in the subject. As a trivial example, the pattern

   <literal>The quick brown fox</literal>

   matches a portion of a subject string that is  identical  to

   matches a portion of a subject string that is identical to

   itself.

  </para>
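  <para>
   The trivial example above can be sketched in PHP as follows. The subject
   string and variable names are assumptions chosen only for illustration.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// Hypothetical subject string; any text containing the phrase would do.
$subject = 'Watch The quick brown fox jump.';

// Ordinary characters stand for themselves, so the pattern matches the
// identical portion of the subject.
$found = preg_match('/The quick brown fox/', $subject, $matches);

var_dump($found);      // int(1)
var_dump($matches[0]); // string(19) "The quick brown fox"
?>
]]>
    </programlisting>
   </informalexample>
  </para>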

 </section>

...

@@ -32,6 +32,7 @@

   When using the PCRE functions, it is required that the pattern is enclosed

   by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,

   non-backslash, non-whitespace character.

   Leading whitespace before a valid delimiter is silently ignored.

  </para>

  <para>

   Often used delimiters are forward slashes (<literal>/</literal>), hash

...

@@ -48,6 +49,26 @@

    </programlisting>

   </informalexample>

  </para>

  <para>

   It is also possible to use

   bracket style delimiters where the opening and closing brackets are the

   starting and ending delimiter, respectively. <literal>()</literal>,

   <literal>{}</literal>, <literal>[]</literal> and <literal>&lt;&gt;</literal>

   are all valid bracket style delimiter pairs.

   <informalexample>

    <programlisting>

<![CDATA[

(this [is] a (pattern))

{this [is] a (pattern)}

[this [is] a (pattern)]

<this [is] a (pattern)>

]]>

    </programlisting>

   </informalexample>

   Bracket style delimiters do not need to be escaped when they are used as

   meta-characters within the pattern, but, as with other delimiters, they must be

   escaped when they are used as literal characters.

  </para>
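  <para>
   A minimal sketch of the delimiter rules described above, assuming an
   arbitrary subject string. The same pattern is written with several
   delimiter styles, followed by a brace used as a quantifier and braces used
   as literals inside brace delimiters.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
$subject = 'foo bar baz'; // hypothetical input

// The same pattern, ba[rz], enclosed in four different delimiter styles.
$patterns = ['/ba[rz]/', '#ba[rz]#', '{ba[rz]}', '(ba[rz])'];
$results = [];
foreach ($patterns as $pattern) {
    $results[$pattern] = preg_match($pattern, $subject, $m) ? $m[0] : null;
}

// A brace acting as a quantifier needs no escaping inside {} delimiters.
$asQuantifier = preg_match('{^a{2}$}', 'aa');

// Braces meant literally must be escaped.
$asLiteral = preg_match('{\{x\}}', 'a{x}b');
?>
]]>
    </programlisting>
   </informalexample>
  </para>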

  <para>

   If the delimiter needs to be matched inside the pattern it must be

   escaped using a backslash. If the delimiter appears often inside the

...

@@ -65,18 +86,6 @@

   for injection into a pattern and its optional second parameter may be used

   to specify the delimiter to be escaped.

  </para>

  <para>

   In addition to the aforementioned delimiters, it is also possible to use

   bracket style delimiters where the opening and closing brackets are the

   starting and ending delimiter, respectively.

   <informalexample>

    <programlisting>

<![CDATA[

{this is a pattern}

]]>

    </programlisting>

   </informalexample>

  </para>

  <para>

   You may add <link linkend="reference.pcre.pattern.modifiers">pattern

   modifiers</link> after the ending delimiter. The following is an example

...

@@ -93,103 +102,100 @@

102

 <section xml:id="regexp.reference.meta">

103

  <title>Meta-characters</title>

104

  <para>

   The  power  of  regular  expressions comes from the

105

   The power of regular expressions comes from the

106

   ability to include alternatives and repetitions in the

   pattern.  These  are encoded in the pattern by the use of 

   <emphasis>meta-characters</emphasis>, which do not stand for  themselves  but  instead

107

   pattern. These are encoded in the pattern by the use of

108

   <emphasis>meta-characters</emphasis>, which do not stand for themselves but instead

100

109

   are interpreted in some special way.

101

110

  </para>

102

111

  <para>

103

   There are two different sets of meta-characters: those  that

104

   are  recognized anywhere in the pattern except within square

112

   There are two different sets of meta-characters: those that

113

   are recognized anywhere in the pattern except within square

105

114

   brackets, and those that are recognized in square brackets.

106

115

   Outside square brackets, the meta-characters are as follows:

107

   <variablelist>

108

    <varlistentry>

109

     <term><emphasis>\</emphasis></term>

110

     <listitem><simpara>general escape character with several uses</simpara></listitem>

111

    </varlistentry>

112

    <varlistentry>

113

     <term><emphasis>^</emphasis></term>

114

     <listitem><simpara>assert start of subject (or line, in multiline mode)</simpara></listitem>

115

    </varlistentry>

116

    <varlistentry>

117

     <term><emphasis>$</emphasis></term>

118

     <listitem><simpara>assert end of subject (or line, in multiline mode)</simpara></listitem>

119

    </varlistentry>

120

    <varlistentry>

121

     <term><emphasis>.</emphasis></term>

122

     <listitem><simpara>match any character except newline (by default)</simpara></listitem>

123

    </varlistentry>

124

    <varlistentry>

125

     <term><emphasis>[</emphasis></term>

126

     <listitem><simpara>start character class definition</simpara></listitem>

127

    </varlistentry>

128

    <varlistentry>

129

     <term><emphasis>]</emphasis></term>

130

     <listitem><simpara>end character class definition</simpara></listitem>

131

    </varlistentry>

132

    <varlistentry>

133

     <term><emphasis>|</emphasis></term>

134

     <listitem><simpara>start of alternative branch</simpara></listitem>

135

    </varlistentry>

136

    <varlistentry>

137

     <term><emphasis>(</emphasis></term>

138

     <listitem><simpara>start subpattern</simpara></listitem>

139

    </varlistentry>

140

    <varlistentry>

141

     <term><emphasis>)</emphasis></term>

142

     <listitem><simpara>end subpattern</simpara></listitem>

143

    </varlistentry>

144

    <varlistentry>

145

     <term><emphasis>?</emphasis></term>

146

     <listitem>

147

      <simpara>

148

       extends the meaning of (, also 0 or 1 quantifier, also makes greedy

149

       quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)

150

      </simpara>

151

     </listitem>

152

    </varlistentry>

153

    <varlistentry>

154

     <term><emphasis>*</emphasis></term>

155

     <listitem><simpara>0 or more quantifier</simpara></listitem>

156

    </varlistentry>

157

    <varlistentry>

158

     <term><emphasis>+</emphasis></term>

159

     <listitem><simpara>1 or more quantifier</simpara></listitem>

160

    </varlistentry>

161

    <varlistentry>

162

     <term><emphasis>{</emphasis></term>

163

     <listitem><simpara>start min/max quantifier</simpara></listitem>

164

    </varlistentry>

165

    <varlistentry>

166

     <term><emphasis>}</emphasis></term>

167

     <listitem><simpara>end min/max quantifier</simpara></listitem>

168

    </varlistentry>

169

   </variablelist>

116

117

   <table>

118

     <title>Meta-characters outside square brackets</title>

119

    <tgroup cols="2">

120

     <thead>

121

      <row>

122

       <entry>Meta-character</entry><entry>Description</entry>

123

      </row>

124

     </thead>

125

     <tbody>

126

      <row>

127

       <entry>\</entry><entry>general escape character with several uses</entry>

128

      </row>

129

      <row>

130

       <entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>

131

      </row>

132

      <row>

133

       <entry>$</entry><entry>assert end of subject or before a terminating newline (or

134

        end of line, in multiline mode)</entry>

135

      </row>

136

      <row>

137

       <entry>.</entry><entry>match any character except newline (by default)</entry>

138

      </row>

139

      <row>

140

       <entry>[</entry><entry>start character class definition</entry>

141

      </row>

142

      <row>

143

       <entry>]</entry><entry>end character class definition</entry>

144

      </row>

145

      <row>

146

       <entry>|</entry><entry>start of alternative branch</entry>

147

      </row>

148

      <row>

149

       <entry>(</entry><entry>start subpattern</entry>

150

      </row>

151

      <row>

152

       <entry>)</entry><entry>end subpattern</entry>

153

      </row>

154

      <row>

155

       <entry>?</entry><entry>extends the meaning of <literal>(</literal>; also the 0 or 1 quantifier; also makes greedy

156

        quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)</entry>

157

      </row>

158

      <row>

159

       <entry>*</entry><entry>0 or more quantifier</entry>

160

      </row>

161

      <row>

162

       <entry>+</entry><entry>1 or more quantifier</entry>

163

      </row>

164

      <row>

165

       <entry>{</entry><entry>start min/max quantifier</entry>

166

      </row>

167

      <row>

168

       <entry>}</entry><entry>end min/max quantifier</entry>

169

      </row>

170

     </tbody>

171

    </tgroup>

172

   </table>
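   As a rough illustration of several of the meta-characters listed above,
   the following sketch uses patterns and subjects invented for this example.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// ^ and $ anchor the match, | separates alternatives, ( ) delimit a
// subpattern, and ? makes the preceding "s" optional.
$pattern  = '/^(cat|dog)s?$/';
$plural   = preg_match($pattern, 'dogs');    // 1
$singular = preg_match($pattern, 'cat');     // 1
$other    = preg_match($pattern, 'catalog'); // 0

// "." matches any character except newline by default.
$dot        = preg_match('/^a.c$/', 'abc');  // 1
$dotNewline = preg_match('/^a.c$/', "a\nc"); // 0

// $ also matches just before a terminating newline.
$beforeNewline = preg_match('/^cat$/', "cat\n"); // 1
?>
]]>
    </programlisting>
   </informalexample>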

170

173

171

174

   Part of a pattern that is in square brackets is called a

172

   "character class". In a character class the only

175

   <link linkend="regexp.reference.character-classes">character class</link>. In a character class the only

173

176

   meta-characters are:

174

177

175

   <variablelist>

176

    <varlistentry>

177

     <term><emphasis>\</emphasis></term>

178

     <listitem><simpara>general escape character</simpara></listitem>

179

    </varlistentry>

180

    <varlistentry>

181

     <term><emphasis>^</emphasis></term>

182

     <listitem><simpara>negate the class, but only if the first character</simpara></listitem>

183

    </varlistentry>

184

    <varlistentry>

185

     <term><emphasis>-</emphasis></term>

186

     <listitem><simpara>indicates character range</simpara></listitem>

187

    </varlistentry>

188

    <varlistentry>

189

     <term><emphasis>]</emphasis></term>

190

     <listitem><simpara>terminates the character class</simpara></listitem>

191

    </varlistentry>

192

   </variablelist>

178

   <table>

179

     <title>Meta-characters inside square brackets (<emphasis>character classes</emphasis>)</title>

180

    <tgroup cols="2">

181

     <thead>

182

      <row>

183

       <entry>Meta-character</entry><entry>Description</entry>

184

      </row>

185

     </thead>

186

     <tbody>

187

      <row>

188

       <entry>\</entry><entry>general escape character</entry>

189

      </row>

190

      <row>

191

       <entry>^</entry><entry>negate the class, but only if it is the first character</entry>

192

      </row>

193

      <row>

194

       <entry>-</entry><entry>indicates character range</entry>

195

      </row>

196

     </tbody>

197

    </tgroup>

198

   </table>
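   The reduced set of meta-characters inside a character class can be
   sketched like this. The patterns are illustrative assumptions, not taken
   from the reference text.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// Inside a class, "." and "+" are ordinary characters.
$literalPlus = preg_match('/^[.+]$/', '+');   // 1

// "^" negates the class only when it is the first character.
$negated     = preg_match('/^[^0-9]$/', 'x'); // 1
$caretInside = preg_match('/^[0-9^]$/', '^'); // 1

// "-" between two characters indicates a range.
$inRange    = preg_match('/^[a-f]$/', 'd');   // 1
$outOfRange = preg_match('/^[a-f]$/', 'g');   // 0
?>
]]>
    </programlisting>
   </informalexample>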

193

199

194

200

   The following sections describe the use of each of the

195

201

   meta-characters.

...

@@ -199,9 +205,9 @@

199

205

 <section xml:id="regexp.reference.escape">

200

206

  <title>Escape sequences</title>

201

207

  <para>

202

   The backslash character has several uses. Firstly, if it  is

208

   The backslash character has several uses. Firstly, if it is

203

209

   followed by a non-alphanumeric character, it takes away any

204

   special  meaning that character may have. This use of

210

   special meaning that character may have. This use of

205

211

   backslash as an escape character applies both inside and

206

212

   outside character classes.

207

213

  </para>

...

@@ -210,7 +216,7 @@

210

216

   "\*" in the pattern. This applies whether or not the

211

217

   following character would otherwise be interpreted as a

212

218

   meta-character, so it is always safe to precede a non-alphanumeric

213

   with "\" to specify that it stands for itself.  In

219

   with "\" to specify that it stands for itself. In

214

220

   particular, if you want to match a backslash, you write "\\".

215

221

  </para>
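  <para>
   A short sketch of escaping with a backslash, using made-up subjects. Note
   that the PHP string literal adds its own layer of quoting on top of the
   pattern syntax.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// "\*" matches a literal asterisk instead of acting as a quantifier.
$star = preg_match('/2\*3/', '2*3=6'); // 1

// Escaping a non-alphanumeric that is not a meta-character is harmless.
$at = preg_match('/user\@example/', 'user@example'); // 1

// To match one backslash the pattern needs "\\". Inside a single-quoted
// PHP string each backslash is doubled again, giving four in the source.
$backslash = preg_match('/C:\\\\temp/', 'C:\\temp'); // 1
?>
]]>
    </programlisting>
   </informalexample>
  </para>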

216

222

  <note>

...

@@ -232,10 +238,10 @@

232

238

  <para>

233

239

   A second use of backslash provides a way of encoding

234

240

   non-printing characters in patterns in a visible manner. There

235

   is no restriction on the appearance of non-printing  characters,

241

   is no restriction on the appearance of non-printing characters,

236

242

   apart from the binary zero that terminates a pattern,

237

243

   but when a pattern is being prepared by text editing, it is

238

   usually  easier to use one of the following escape sequences

244

   usually easier to use one of the following escape sequences

239

245

   than the binary character it represents:

240

246

  </para>

241

247

  <para>

...

@@ -296,6 +302,12 @@

296

302

      <simpara>carriage return (hex 0D)</simpara>

297

303

     </listitem>

298

304

    </varlistentry>

305

    <varlistentry>

306

     <term><emphasis>\R</emphasis></term>

307

     <listitem>

308

      <simpara>line break: matches \n, \r and \r\n</simpara>

309

     </listitem>

310

    </varlistentry>

299

311

    <varlistentry>

300

312

     <term><emphasis>\t</emphasis></term>

301

313

     <listitem>

...

@@ -320,9 +332,9 @@

320

332

  </para>

321

333

  <para>

322

334

   The precise effect of "<literal>\cx</literal>" is as follows:

323

   if "<literal>x</literal>" is a lower case  letter, it is converted

335

   if "<literal>x</literal>" is a lower case letter, it is converted

324

336

   to upper case. Then bit 6 of the character (hex 40) is inverted.

325

   Thus "<literal>\cz</literal>" becomes  hex 1A, but

337

   Thus "<literal>\cz</literal>" becomes hex 1A, but

326

338

   "<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"

327

339

   becomes hex 7B.

328

340

  </para>
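  <para>
   The arithmetic behind <literal>\cx</literal> can be checked with a small
   sketch. The variable names are assumptions made for this example.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// "z" is converted to "Z" (hex 5A); inverting bit 6 (hex 40) gives hex 1A.
$computed = ord('Z') ^ 0x40;                     // 0x1A
$ctrlZ    = preg_match('/^\cz$/', "\x1A");       // 1

// "{" is hex 7B, so "\c{" stands for hex 3B, which is the semicolon.
$semicolon = preg_match('/^\c{$/', ';');         // 1
?>
]]>
    </programlisting>
   </informalexample>
  </para>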

...

@@ -338,7 +350,7 @@

338

350

  </para>

339

351

  <para>

340

352

   After "<literal>\0</literal>" up to two further octal digits are read.

341

   In  both cases,  if  there are fewer than two digits, just those that

353

   In both cases, if there are fewer than two digits, just those that

342

354

   are present are used. Thus the sequence "<literal>\0\x\07</literal>"

343

355

   specifies two binary zeros followed by a BEL character. Make sure you

344

356

   supply two digits after the initial zero if the character

...

@@ -347,20 +359,20 @@

347

359

  <para>

348

360

   The handling of a backslash followed by a digit other than 0

349

361

   is complicated. Outside a character class, PCRE reads it

350

   and any following digits as a decimal number. If the  number

351

   is  less  than  10, or if there have been at least that many

352

   previous capturing left parentheses in the  expression,  the

353

   entire  sequence is taken as a <emphasis>back reference</emphasis>. A description

354

   of how this works is given later, following  the  discussion

362

   and any following digits as a decimal number. If the number

363

   is less than 10, or if there have been at least that many

364

   previous capturing left parentheses in the expression, the

365

   entire sequence is taken as a <emphasis>back reference</emphasis>. A description

366

   of how this works is given later, following the discussion

355

367

   of parenthesized subpatterns.

356

368

  </para>
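  <para>
   As a brief sketch of the back reference case, with a pattern and subjects
   chosen only for illustration:
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// "\1" is less than 10, so it is taken as a back reference to whatever
// the first capturing subpattern matched.
$doubled = preg_match('/(\w)\1/', 'letter', $m); // 1, $m[0] is "tt"
$none    = preg_match('/(\w)\1/', 'abc');        // 0
?>
]]>
    </programlisting>
   </informalexample>
  </para>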

357

369

  <para>

358

   Inside a character  class,  or  if  the  decimal  number  is

370

   Inside a character class, or if the decimal number is

359

371

   greater than 9 and there have not been that many capturing

360

372

   subpatterns, PCRE re-reads up to three octal digits following

361

373

   the backslash, and generates a single byte from the

362

374

   least significant 8 bits of the value. Any subsequent digits

363

   stand for themselves.  For example:

375

   stand for themselves. For example:

364

376

  </para>

365

377

  <para>

366

378

   <variablelist>

...

@@ -428,7 +440,7 @@

428

440

   digits are ever read.

429

441

  </para>

430

442

  <para>

431

   All the sequences that define a single byte value can  be

443

   All the sequences that define a single byte value can be

432

444

   used both inside and outside character classes. In addition,

433

445

   inside a character class, the sequence "<literal>\b</literal>"

434

446

   is interpreted as the backspace character (hex 08). Outside a character

...

@@ -450,11 +462,11 @@

450

462

    </varlistentry>

451

463

    <varlistentry>

452

464

     <term><emphasis>\h</emphasis></term>

453

     <listitem><simpara>any horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>

465

     <listitem><simpara>any horizontal whitespace character</simpara></listitem>

454

466

    </varlistentry>

455

467

    <varlistentry>

456

468

     <term><emphasis>\H</emphasis></term>

457

     <listitem><simpara>any character that is not a horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>

469

     <listitem><simpara>any character that is not a horizontal whitespace character</simpara></listitem>

458

470

    </varlistentry>

459

471

    <varlistentry>

460

472

     <term><emphasis>\s</emphasis></term>

...

@@ -466,11 +478,11 @@

466

478

    </varlistentry>

467

479

    <varlistentry>

468

480

     <term><emphasis>\v</emphasis></term>

469

     <listitem><simpara>any vertical whitespace character (since PHP 5.2.4)</simpara></listitem>

481

     <listitem><simpara>any vertical whitespace character</simpara></listitem>

470

482

    </varlistentry>

471

483

    <varlistentry>

472

484

     <term><emphasis>\V</emphasis></term>

473

     <listitem><simpara>any character that is not a vertical whitespace character (since PHP 5.2.4)</simpara></listitem>

485

     <listitem><simpara>any character that is not a vertical whitespace character</simpara></listitem>

474

486

    </varlistentry>

475

487

    <varlistentry>

476

488

     <term><emphasis>\w</emphasis></term>

...

@@ -487,9 +499,15 @@

487

499

   characters into two disjoint sets. Any given character

488

500

   matches one, and only one, of each pair.

489

501

  </para>

502

  <para>

503

   The "whitespace" characters are HT (9), LF (10), FF (12), CR (13),

504

   and space (32). However, if locale-specific matching is happening,

505

   characters with code points in the range 128-255 may also be considered

506

   as whitespace characters, for instance, NBSP (A0).

507

  </para>
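  <para>
   The list of whitespace characters above can be checked with a sketch such
   as the following. It assumes no locale-specific matching and deliberately
   tests only the five characters named in the text.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// The five characters listed above, keyed by code point.
$whitespace = [9 => 'HT', 10 => 'LF', 12 => 'FF', 13 => 'CR', 32 => 'space'];

$allMatch = true;
foreach ($whitespace as $code => $name) {
    if (preg_match('/^\s$/', chr($code)) !== 1) {
        $allMatch = false;
    }
}

// A letter matches \S, the other half of the pair, and not \s.
$letterIsNonSpace = preg_match('/^\S$/', 'a'); // 1
$letterIsSpace    = preg_match('/^\s$/', 'a'); // 0
?>
]]>
    </programlisting>
   </informalexample>
  </para>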

490

508

  <para>

491

509

   A "word" character is any letter or digit or the underscore

492

   character,  that  is,  any  character which can be part of a

510

   character, that is, any character which can be part of a

493

511

   Perl "<emphasis>word</emphasis>". The definition of letters and digits is

494

512

   controlled by PCRE's character tables, and may vary if locale-specific

495

513

   matching is taking place. For example, in the "fr" (French) locale, some

...

@@ -498,15 +516,15 @@

498

516

  </para>

499

517

  <para>

500

518

   These character type sequences can appear both inside and

501

   outside  character classes. They each match one character of

502

   the appropriate type. If the current matching  point is at

519

   outside character classes. They each match one character of

520

   the appropriate type. If the current matching point is at

503

521

   the end of the subject string, all of them fail, since there

504

522

   is no character to match.

505

523

  </para>

506

524

  <para>

507

   The fourth use of backslash is  for  certain  simple

525

   The fourth use of backslash is for certain simple

508

526

   assertions. An assertion specifies a condition that has to be met

509

   at a particular point in  a match, without consuming any

527

   at a particular point in a match, without consuming any

510

528

   characters from the subject string. The use of subpatterns

511

529

   for more complicated assertions is described below. The

512

530

   backslashed assertions are

...

@@ -545,7 +563,7 @@

545

563

   </variablelist>

546

564

  </para>

547

565

  <para>

548

   These assertions may not appear in  character  classes  (but

566

   These assertions may not appear in character classes (but

549

567

   note that "<literal>\b</literal>" has a different meaning, namely the backspace

550

568

   character, inside a character class).

551

569

  </para>

...

@@ -553,20 +571,20 @@

553

571

   A word boundary is a position in the subject string where

554

572

   the current character and the previous character do not both

555

573

   match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches

556

   <literal>\w</literal> and  the  other  matches

574

   <literal>\w</literal> and the other matches

557

575

   <literal>\W</literal>), or the start or end of the string if the first

558

576

   or last character matches <literal>\w</literal>, respectively.

559

577

  </para>
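  <para>
   A minimal sketch of the word boundary assertion, using invented subject
   strings:
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// Spaces on both sides of "cat" match \W, so both boundaries hold.
$standalone = preg_match('/\bcat\b/', 'the cat sat'); // 1

// Inside "concatenate" the neighbouring characters match \w.
$embedded = preg_match('/\bcat\b/', 'concatenate');   // 0

// The start and end of the string count as boundaries when the first
// and last characters match \w.
$atEdges = preg_match('/\bcat\b/', 'cat');            // 1
?>
]]>
    </programlisting>
   </informalexample>
  </para>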

560

578

  <para>

561

579

   The <literal>\A</literal>, <literal>\Z</literal>, and

562

   <literal>\z</literal> assertions differ  from  the  traditional

563

   circumflex  and  dollar  (described below) in that they only

564

   ever match at the very start and end of the subject  string,

565

   whatever  options  are  set.  They  are  not affected by the

580

   <literal>\z</literal> assertions differ from the traditional

581

   circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)

582

   in that they only ever match at the very start and end of the subject string,

583

   whatever options are set. They are not affected by the

566

584

   <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or

567

585

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>

568

   options. The  difference  between <literal>\Z</literal> and

569

   <literal>\z</literal>  is that <literal>\Z</literal> matches before a

586

   options. The difference between <literal>\Z</literal> and

587

   <literal>\z</literal> is that <literal>\Z</literal> matches before a

570

588

   newline that is the last character of the string as well as at the end of

571

589

   the string, whereas <literal>\z</literal> matches only at the end.

572

590

  </para>
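  <para>
   The differences described above can be sketched with a two-line subject.
   The subject and variable names are assumptions for illustration only.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
$subject = "first\nlast\n";

// In multiline mode, ^ and $ also match at internal line breaks...
$dollarMultiline = preg_match('/first$/m', $subject); // 1
$caretMultiline  = preg_match('/^last/m', $subject);  // 1

// ...but \A and \Z are unaffected by the modifier.
$bigAMultiline = preg_match('/\Alast/m', $subject);   // 0
$bigZMultiline = preg_match('/first\Z/m', $subject);  // 0

// \Z matches before a newline that is the last character; \z does not.
$bigZ   = preg_match('/last\Z/', $subject);           // 1
$smallZ = preg_match('/last\z/', $subject);           // 0
?>
]]>
    </programlisting>
   </informalexample>
  </para>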

...

@@ -583,12 +601,16 @@

583

601

   regexp metacharacters in the pattern. For example:

584

602

   <literal>\w+\Q.$.\E$</literal> will match one or more word characters,

585

603

   followed by literals <literal>.$.</literal> and anchored at the end of

586

   the string.

604

   the string. Note that this quoting does not change the behavior of

605

   delimiters; for instance the pattern <literal>#\Q#\E#$</literal>

606

   is not valid, because the second <literal>#</literal> marks the end

607

   of the pattern, and the <literal>\E#</literal> is interpreted as invalid

608

   modifiers.

587

609

  </para>
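  <para>
   Both cases above can be tried with a sketch like the following. The
   subject strings are invented, and the <literal>@</literal> operator is
   used here only to suppress the warning raised for the invalid pattern.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// Between \Q and \E, ".", and "$" are literals; the final "$" is an anchor.
$quoted   = preg_match('/^\w+\Q.$.\E$/', 'price.$.'); // 1
$noLiteral = preg_match('/^\w+\Q.$.\E$/', 'priceX$Y'); // 0

// The second "#" still ends the pattern, so "\E#$" is read as modifiers
// and the call fails.
$invalid = @preg_match('#\Q#\E#$', 'a#'); // false
?>
]]>
    </programlisting>
   </informalexample>
  </para>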

588

610

589

611

  <para>

590

   <literal>\K</literal> can be used to reset the match start since

591

   PHP 5.2.4. For example, the pattern <literal>foo\Kbar</literal> matches

612

   <literal>\K</literal> can be used to reset the match start. 

613

   For example, the pattern <literal>foo\Kbar</literal> matches

592

614

   "foobar", but reports that it has matched "bar". The use of

593

615

   <literal>\K</literal> does not interfere with the setting of captured

594

616

   substrings. For example, when the pattern <literal>(foo)\Kbar</literal>

...

@@ -818,7 +840,7 @@

818

840

     <row rowsep="1">

819

841

      <entry><literal>So</literal></entry>

820

842

      <entry>Other symbol</entry>

821

      <entry></entry>

843

      <entry>Includes emojis</entry>

822

844

     </row>

823

845

     <row>

824

846

      <entry><literal>Z</literal></entry>

...

@@ -844,7 +866,7 @@

844

866

   </tgroup>

845

867

  </table>

846

868

  <para>

847

   Extended properties such as "Greek" or "InMusicalSymbols" are not

869

   Extended properties such as <literal>InMusicalSymbols</literal> are not

848

870

   supported by PCRE.

849

871

  </para>

850

872

  <para>

...

@@ -852,15 +874,193 @@

852

874

   For example, <literal>\p{Lu}</literal> always matches only upper case letters.

853

875

  </para>

854

876

  <para>

855

   The <literal>\X</literal> escape matches any number of Unicode characters 

856

   that form an extended Unicode sequence. <literal>\X</literal> is equivalent 

857

   to <literal>(?>\PM\pM*)</literal>.

877

   Sets of Unicode characters are defined as belonging to certain scripts. A

878

   character from one of these sets can be matched using a script name. For

879

   example:

858

880

  </para>

881

  <itemizedlist>

882

   <listitem>

883

    <simpara><literal>\p{Greek}</literal></simpara>

884

   </listitem>

885

   <listitem>

886

    <simpara><literal>\P{Han}</literal></simpara>

887

   </listitem>

888

  </itemizedlist>
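  <para>
   A short sketch of matching by script name. It assumes the source file is
   saved as UTF-8 and uses the <literal>u</literal> modifier so that pattern
   and subject are treated as UTF-8. The sample strings are illustrative.
   <informalexample>
    <programlisting role="php">
<![CDATA[
<?php
// \p{Greek} matches characters belonging to the Greek script.
$greek    = preg_match('/^\p{Greek}+$/u', 'αβγ'); // 1
$notGreek = preg_match('/^\p{Greek}+$/u', 'abc'); // 0

// \P{Han} matches any character that is NOT in the Han script.
$onlyHan = preg_match('/\P{Han}/u', '漢字');        // 0
$mixed   = preg_match('/\P{Han}/u', '漢字x', $m);   // 1, $m[0] is "x"
?>
]]>
    </programlisting>
   </informalexample>
  </para>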

859

889

  <para>

860

   That is, it matches a character without the "mark" property, followed

861

   by zero or more characters with the "mark" property, and treats the

862

   sequence as an atomic group (see below). Characters with the "mark"

863

   property are typically accents that affect the preceding character.

890

   Those that are not part of an identified script are lumped together as

891

   <literal>Common</literal>. The current list of scripts is:

892

  </para>

893

  <table>

894

   <title>Supported scripts</title>

895

   <tgroup cols="5">

896

    <tbody>

897

     <row>

898

      <entry><literal>Arabic</literal></entry>

899

      <entry><literal>Armenian</literal></entry>

900

      <entry><literal>Avestan</literal></entry>

901

      <entry><literal>Balinese</literal></entry>

902

      <entry><literal>Bamum</literal></entry>

903

     </row>

904

     <row>

905

      <entry><literal>Batak</literal></entry>

906

      <entry><literal>Bengali</literal></entry>

907

      <entry><literal>Bopomofo</literal></entry>

908

      <entry><literal>Brahmi</literal></entry>

909

      <entry><literal>Braille</literal></entry>

910

     </row>

911

     <row>

912

      <entry><literal>Buginese</literal></entry>

913

      <entry><literal>Buhid</literal></entry>

914

      <entry><literal>Canadian_Aboriginal</literal></entry>

915

      <entry><literal>Carian</literal></entry>

916

      <entry><literal>Chakma</literal></entry>

917

     </row>

918

     <row>

919

      <entry><literal>Cham</literal></entry>

920

      <entry><literal>Cherokee</literal></entry>

921

      <entry><literal>Common</literal></entry>

922

      <entry><literal>Coptic</literal></entry>

923

      <entry><literal>Cuneiform</literal></entry>

924

     </row>

925

     <row>

926

      <entry><literal>Cypriot</literal></entry>

927

      <entry><literal>Cyrillic</literal></entry>

928

      <entry><literal>Deseret</literal></entry>

929

      <entry><literal>Devanagari</literal></entry>

930

      <entry><literal>Egyptian_Hieroglyphs</literal></entry>

931

     </row>

932

     <row>

933

      <entry><literal>Ethiopic</literal></entry>

934

      <entry><literal>Georgian</literal></entry>

935

      <entry><literal>Glagolitic</literal></entry>

936

      <entry><literal>Gothic</literal></entry>

937

      <entry><literal>Greek</literal></entry>

938

     </row>

939

     <row>

940

      <entry><literal>Gujarati</literal></entry>

941

      <entry><literal>Gurmukhi</literal></entry>

942

      <entry><literal>Han</literal></entry>

943

      <entry><literal>Hangul</literal></entry>

944

      <entry><literal>Hanunoo</literal></entry>

945

     </row>

946

     <row>

947

      <entry><literal>Hebrew</literal></entry>

948

      <entry><literal>Hiragana</literal></entry>

949

      <entry><literal>Imperial_Aramaic</literal></entry>

950

      <entry><literal>Inherited</literal></entry>

951

      <entry><literal>Inscriptional_Pahlavi</literal></entry>

952

     </row>

953

     <row>

954

      <entry><literal>Inscriptional_Parthian</literal></entry>

955

      <entry><literal>Javanese</literal></entry>

956

      <entry><literal>Kaithi</literal></entry>

957

      <entry><literal>Kannada</literal></entry>

958

      <entry><literal>Katakana</literal></entry>

959

     </row>

960

     <row>

961

      <entry><literal>Kayah_Li</literal></entry>

962

      <entry><literal>Kharoshthi</literal></entry>

963

      <entry><literal>Khmer</literal></entry>

964

      <entry><literal>Lao</literal></entry>

965

      <entry><literal>Latin</literal></entry>

966

     </row>

967

     <row>

968

      <entry><literal>Lepcha</literal></entry>

969

      <entry><literal>Limbu</literal></entry>

970

      <entry><literal>Linear_B</literal></entry>

971

      <entry><literal>Lisu</literal></entry>

972

      <entry><literal>Lycian</literal></entry>

973

     </row>

974

     <row>

975

      <entry><literal>Lydian</literal></entry>

976

      <entry><literal>Malayalam</literal></entry>

977

      <entry><literal>Mandaic</literal></entry>

978

      <entry><literal>Meetei_Mayek</literal></entry>

979

      <entry><literal>Meroitic_Cursive</literal></entry>

980

     </row>

981

     <row>

982

      <entry><literal>Meroitic_Hieroglyphs</literal></entry>

983

      <entry><literal>Miao</literal></entry>

984

      <entry><literal>Mongolian</literal></entry>

985

      <entry><literal>Myanmar</literal></entry>

986

      <entry><literal>New_Tai_Lue</literal></entry>

987

     </row>

988

     <row>

989

      <entry><literal>Nko</literal></entry>

990

      <entry><literal>Ogham</literal></entry>

991

      <entry><literal>Old_Italic</literal></entry>

992

      <entry><literal>Old_Persian</literal></entry>

993

      <entry><literal>Old_South_Arabian</literal></entry>

994

     </row>

995

     <row>

996

      <entry><literal>Old_Turkic</literal></entry>

997

      <entry><literal>Ol_Chiki</literal></entry>

998

      <entry><literal>Oriya</literal></entry>

999

      <entry><literal>Osmanya</literal></entry>

1000

      <entry><literal>Phags_Pa</literal></entry>

1001

     </row>

1002

     <row>

1003

      <entry><literal>Phoenician</literal></entry>

1004

      <entry><literal>Rejang</literal></entry>

1005

      <entry><literal>Runic</literal></entry>

1006

      <entry><literal>Samaritan</literal></entry>

1007

      <entry><literal>Saurashtra</literal></entry>

1008

     </row>

1009

     <row>

1010

      <entry><literal>Sharada</literal></entry>

1011

      <entry><literal>Shavian</literal></entry>

1012

      <entry><literal>Sinhala</literal></entry>

1013

      <entry><literal>Sora_Sompeng</literal></entry>

1014

      <entry><literal>Sundanese</literal></entry>

1015

     </row>

1016

     <row>

1017

      <entry><literal>Syloti_Nagri</literal></entry>

1018

      <entry><literal>Syriac</literal></entry>

1019

      <entry><literal>Tagalog</literal></entry>

1020

      <entry><literal>Tagbanwa</literal></entry>

1021

      <entry><literal>Tai_Le</literal></entry>

1022

     </row>

1023

     <row>

1024

      <entry><literal>Tai_Tham</literal></entry>

1025

      <entry><literal>Tai_Viet</literal></entry>

1026

      <entry><literal>Takri</literal></entry>

1027

      <entry><literal>Tamil</literal></entry>

1028

      <entry><literal>Telugu</literal></entry>

1029

     </row>

1030

     <row>

1031

      <entry><literal>Thaana</literal></entry>

1032

      <entry><literal>Thai</literal></entry>

1033

      <entry><literal>Tibetan</literal></entry>

1034

      <entry><literal>Tifinagh</literal></entry>

1035

      <entry><literal>Ugaritic</literal></entry>

1036

     </row>

1037

     <row>

1038

      <entry><literal>Vai</literal></entry>

1039

      <entry><literal>Yi</literal></entry>

1040

      <entry />

1041

      <entry />

1042

      <entry />

1043

      <entry />

1044

     </row>

1045

    </tbody>

1046

   </tgroup>

1047

  </table>

1048

  <para>

1049

   The <literal>\X</literal> escape matches a Unicode extended grapheme

1050

   cluster. An extended grapheme cluster is one or more Unicode characters

1051

   that combine to form a single glyph. In effect, this can be thought of as

1052

   the Unicode equivalent of <literal>.</literal> as it will match one

1053

   composed character, regardless of how many individual characters are

1054

   actually used to render it.

1055

  </para>
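  <para>
   As a rough illustration of the difference between <literal>.</literal>
   and <literal>\X</literal>, the sketch below counts matches in a short
   string. The subject string and variable names are made up for this
   example, and it assumes PHP 7.0 or later for the
   <literal>\u{...}</literal> string escape.
  </para>

```php
<?php
// "e" followed by U+0301 (combining acute accent), then a plain "a".
$subject = "e\u{0301}a";

// With the u modifier, a dot matches one code point at a time.
$codePoints = preg_match_all('/./u', $subject);

// \X matches one extended grapheme cluster at a time.
$clusters = preg_match_all('/\X/u', $subject);

var_dump($codePoints, $clusters); // int(3), int(2)
?>
```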

1056

  <para>

1057

   In versions of PCRE older than 8.32 (which corresponds to PHP versions

1058

   before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>

1059

   is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a

1060

   character without the "mark" property, followed by zero or more characters

1061

   with the "mark" property, and treats the sequence as an atomic group (see

1062

   below). Characters with the "mark" property are typically accents that

1063

   affect the preceding character.

864

1064

  </para>

865

1065

  <para>

866

1066

   Matching characters by Unicode property is not fast, because PCRE has

...

@@ -876,8 +1076,8 @@

876

1076

  <para>

877

1077

   Outside a character class, in the default matching mode, the

878

1078

   circumflex character (<literal>^</literal>) is an assertion which

879

   is true only if the current matching point is at the start  of

880

   the  subject string. Inside a character class, circumflex (<literal>^</literal>)

1079

   is true only if the current matching point is at the start of

1080

   the subject string. Inside a character class, circumflex (<literal>^</literal>)

881

1081

   has an entirely different meaning (see below).

882

1082

  </para>

883

1083

  <para>

...

@@ -892,12 +1092,12 @@

892

1092

  </para>

893

1093

  <para>

894

1094

   A dollar character (<literal>$</literal>) is an assertion which is

895

   &true; only if the current  matching point is at the end of the subject

896

   string, or immediately before a newline character that is  the  last

1095

   &true; only if the current matching point is at the end of the subject

1096

   string, or immediately before a newline character that is the last

897

1097

   character in the string (by default). Dollar (<literal>$</literal>)

898

   need not be the last character of the pattern if a  number  of

899

   alternatives are  involved,  but it should be the last item in any branch

900

   in which it appears. Dollar has no  special  meaning  in  a

1098

   need not be the last character of the pattern if a number of

1099

   alternatives are involved, but it should be the last item in any branch

1100

   in which it appears. Dollar has no special meaning in a

901

1101

   character class.

902

1102

  </para>
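  <para>
   The default behaviour of dollar before a trailing newline can be seen in
   a small sketch. The subject string is invented for illustration. The
   <literal>D</literal> modifier (PCRE_DOLLAR_ENDONLY) and the
   <literal>\z</literal> escape are shown only for contrast.
  </para>

```php
<?php
$subject = "abc\n";

// By default, dollar also matches just before a final newline.
$dollar = preg_match('/abc$/', $subject);

// With the D modifier, dollar matches only at the very end of the subject.
$endOnly = preg_match('/abc$/D', $subject);

// \z always means the very end of the subject.
$absolute = preg_match('/abc\z/', $subject);

var_dump($dollar, $endOnly, $absolute); // int(1), int(0), int(0)
?>
```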

903

1103

  <para>

...

@@ -923,9 +1123,9 @@

923

1123

   set.

924

1124

  </para>

925

1125

  <para>

926

   Note that the sequences \A, \Z, and \z can be used to  match

927

   the  start  and end of the subject in both modes, and if all

928

   branches of a pattern start with \A is it  always  anchored,

1126

   Note that the sequences \A, \Z, and \z can be used to match

1127

   the start and end of the subject in both modes, and if all

1128

   branches of a pattern start with \A it is always anchored,

929

1129

   whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

930

1130

   is set or not.

931

1131

  </para>
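  <para>
   A minimal sketch of how <literal>\A</literal> differs from circumflex
   when PCRE_MULTILINE is set. The two-line subject is an assumption made
   for the example.
  </para>

```php
<?php
$subject = "one\ntwo";

// In multiline mode, circumflex matches at the start of every line.
$caretMatches = preg_match_all('/^\w+/m', $subject);

// \A still matches only at the start of the subject.
$anchoredMatches = preg_match_all('/\A\w+/m', $subject);

var_dump($caretMatches, $anchoredMatches); // int(2), int(1)
?>
```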

...

@@ -934,14 +1134,14 @@

934

1134

 <section xml:id="regexp.reference.dot">

935

1135

  <title>Dot</title>

936

1136

  <para>

937

   Outside a character class, a dot in the pattern matches  any

938

   one  character  in  the  subject,  including  a non-printing

939

   character, but not (by default) newline.  If the

1137

   Outside a character class, a dot in the pattern matches any

1138

   one character in the subject, including a non-printing

1139

   character, but not (by default) newline. If the

940

1140

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

941

   option  is  set,  then dots match newlines as well. The

1141

   option is set, then dots match newlines as well. The

942

1142

   handling of dot is entirely independent of the handling of

943

   circumflex  and  dollar,  the only relationship being that they

944

   both involve newline characters.  Dot has no special meaning

1143

   circumflex and dollar, the only relationship being that they

1144

   both involve newline characters. Dot has no special meaning

945

1145

   in a character class.

946

1146

  </para>
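  <para>
   The effect of PCRE_DOTALL (the <literal>s</literal> modifier) on dot can
   be sketched as follows; the subject string is illustrative only.
  </para>

```php
<?php
$subject = "a\nb";

// By default, dot does not match a newline.
$default = preg_match('/a.b/', $subject);

// With the s modifier (PCRE_DOTALL), dot matches newlines as well.
$dotAll = preg_match('/a.b/s', $subject);

var_dump($default, $dotAll); // int(0), int(1)
?>
```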

947

1147

  <para>

...

@@ -955,29 +1155,29 @@

955

1155

  <title>Character classes</title>

956

1156

  <para>

957

1157

   An opening square bracket introduces a character class,

958

   terminated  by  a  closing  square  bracket.  A  closing square

959

   bracket on its own is  not  special.  If  a  closing  square

960

   bracket  is  required as a member of the class, it should be

1158

   terminated by a closing square bracket. A closing square

1159

   bracket on its own is not special. If a closing square

1160

   bracket is required as a member of the class, it should be

961

1161

   the first data character in the class (after an initial

962

1162

   circumflex, if present) or escaped with a backslash.

963

1163

  </para>

964

1164

  <para>

965

1165

   A character class matches a single character in the subject;

966

   the  character  must  be in the set of characters defined by

1166

   the character must be in the set of characters defined by

967

1167

   the class, unless the first character in the class is a

968

   circumflex,  in which case the subject character must not be in

969

   the set defined by the class. If a  circumflex  is  actually

970

   required  as  a  member  of  the class, ensure it is not the

1168

   circumflex, in which case the subject character must not be in

1169

   the set defined by the class. If a circumflex is actually

1170

   required as a member of the class, ensure it is not the

971

1171

   first character, or escape it with a backslash.

972

1172

  </para>

973

1173

  <para>

974

   For example, the character class [aeiou] matches  any  lower

1174

   For example, the character class [aeiou] matches any lower

975

1175

   case vowel, while [^aeiou] matches any character that is not

976

   a lower case vowel. Note that a circumflex is  just  a

977

   convenient  notation for specifying the characters which are in

978

   the class by enumerating those that are not. It  is  not  an

979

   assertion:  it  still  consumes a character from the subject

980

   string, and fails if the current pointer is at  the  end  of

1176

   a lower case vowel. Note that a circumflex is just a

1177

   convenient notation for specifying the characters which are in

1178

   the class by enumerating those that are not. It is not an

1179

   assertion: it still consumes a character from the subject

1180

   string, and fails if the current pointer is at the end of

981

1181

   the string.

982

1182

  </para>
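  <para>
   The following sketch, using made-up subject strings, shows that a
   negated class still has to consume a character.
  </para>

```php
<?php
// Remove every character that is not a lower case vowel.
$vowelsOnly = preg_replace('/[^aeiou]/', '', 'regular expression');

// A negated class is not an assertion: after "x" there is no character
// left to consume, so the match fails.
$atEnd = preg_match('/x[^aeiou]/', 'x');

var_dump($vowelsOnly, $atEnd); // string(7) "euaeeio", int(0)
?>
```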

983

1183

  <para>

...

@@ -989,61 +1189,62 @@

989

1189

  </para>

990

1190

  <para>

991

1191

   The newline character is never treated in any special way in

992

   character  classes,  whatever the setting of the <link

1192

   character classes, whatever the setting of the <link

993

1193

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

994

1194

   or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

995

1195

   options is. A class such as [^a] will always match a newline.

996

1196

  </para>

997

1197

  <para>

998

   The minus (hyphen) character can be used to specify a  range

999

   of  characters  in  a  character  class.  For example, [d-m]

1000

   matches any letter between d and m, inclusive.  If  a  minus

1001

   character  is required in a class, it must be escaped with a

1198

   The minus (hyphen) character can be used to specify a range

1199

   of characters in a character class. For example, [d-m]

1200

   matches any letter between d and m, inclusive. If a minus

1201

   character is required in a class, it must be escaped with a

1002

1202

   backslash or appear in a position where it cannot be

1003

1203

   interpreted as indicating a range, typically as the first or last

1004

1204

   character in the class.

1005

1205

  </para>
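  <para>
   A short sketch of the hyphen rules described above. The test characters
   are chosen arbitrarily.
  </para>

```php
<?php
// A range: any letter from d to m, inclusive.
$inRange  = preg_match('/^[d-m]$/', 'k');
$outRange = preg_match('/^[d-m]$/', 'p');

// An escaped hyphen is a literal member of the class, not a range.
$escaped   = preg_match('/^[a\-z]$/', '-');
$notARange = preg_match('/^[a\-z]$/', 'k');

// A hyphen in first position cannot indicate a range either.
$leading = preg_match('/^[-az]$/', '-');

var_dump($inRange, $outRange, $escaped, $notARange, $leading);
// int(1), int(0), int(1), int(0), int(1)
?>
```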

1006

1206

  <para>

1007

   It is not possible to have the literal character "]" as  the

1008

   end  character  of  a  range.  A  pattern such as [W-]46] is

1207

   It is not possible to have the literal character "]" as the

1208

   end character of a range. A pattern such as [W-]46] is

1009

1209

   interpreted as a class of two characters ("W" and "-")

1010

1210

   followed by a literal string "46]", so it would match "W46]" or

1011

   "-46]". However, if the "]" is escaped with a  backslash  it

1012

   is  interpreted  as  the end of range, so [W-\]46] is

1013

   interpreted as a single class containing a range followed by  two

1211

   "-46]". However, if the "]" is escaped with a backslash it

1212

   is interpreted as the end of range, so [W-\]46] is

1213

   interpreted as a single class containing a range followed by two

1014

1214

   separate characters. The octal or hexadecimal representation

1015

1215

   of "]" can also be used to end a range.

1016

1216

  </para>
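  <para>
   The two patterns discussed above can be compared with a small sketch.
   The anchors and subject strings are added for illustration.
  </para>

```php
<?php
// [W-] is a class of "W" and "-", followed by the literal string "46]".
$literalTailW = preg_match('/^[W-]46]$/', 'W46]');
$literalTailH = preg_match('/^[W-]46]$/', '-46]');

// With the "]" escaped, the class holds the range W to ] plus "4" and "6",
// so it matches exactly one character.
$inEscapedRange = preg_match('/^[W-\]46]$/', 'X');
$singleDigit    = preg_match('/^[W-\]46]$/', '4');
$wholeString    = preg_match('/^[W-\]46]$/', 'W46]');

var_dump($literalTailW, $literalTailH, $inEscapedRange, $singleDigit, $wholeString);
// int(1), int(1), int(1), int(1), int(0)
?>
```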

1017

1217

  <para>

1018

1218

   Ranges operate in ASCII collating sequence. They can also be

1019

   used  for  characters  specified  numerically,  for  example

1020

   [\000-\037]. If a range that includes letters is  used  when

1021

   case-insensitive (caseless)  matching  is set, it matches the

1022

   letters in either case. For example, [W-c] is equivalent  to

1219

   used for characters specified numerically, for example

1220

   [\000-\037]. If a range that includes letters is used when

1221

   case-insensitive (caseless) matching is set, it matches the

1222

   letters in either case. For example, [W-c] is equivalent to

1023

1223

   [][\^_`wxyzabc], matched case-insensitively, and if character

1024

1224

   tables for the "fr" locale are in use, [\xc8-\xcb] matches

1025

1225

   accented E characters in both cases.

1026

1226

  </para>
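  <para>
   A brief sketch of a range combined with caseless matching. The subject
   characters are picked for this example only.
  </para>

```php
<?php
// Without the i modifier, "w" lies outside the ASCII range W to c.
$caseSensitive = preg_match('/^[W-c]$/', 'w');

// With caseless matching, letters in the range match in either case.
$lowerW = preg_match('/^[W-c]$/i', 'w');
$upperC = preg_match('/^[W-c]$/i', 'C');

// "d" is not in the range in either case.
$lowerD = preg_match('/^[W-c]$/i', 'd');

var_dump($caseSensitive, $lowerW, $upperC, $lowerD);
// int(0), int(1), int(1), int(0)
?>
```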

1027

1227

  <para>

1028

   The character types \d, \D, \s, \S,  \w,  and  \W  may  also

1029

   appear  in  a  character  class, and add the characters that

1228

   The character types \d, \D, \s, \S, \w, and \W may also

1229

   appear in a character class, and add the characters that

1030

1230

   they match to the class. For example, [\dABCDEF] matches any

1031

   hexadecimal  digit.  A  circumflex  can conveniently be used

1032

   with the upper case character types to specify a  more

1231

   hexadecimal digit. A circumflex can conveniently be used

1232

   with the upper case character types to specify a more

1033

1233

   restricted set of characters than the matching lower case type.

1034

   For example, the class [^\W_] matches any letter  or  digit,

1234

   For example, the class [^\W_] matches any letter or digit,

1035

1235

   but not underscore.

1036

1236

  </para>
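  <para>
   Both classes mentioned above can be tried out with a short sketch. The
   anchors, the <literal>+</literal> quantifier and the subject strings are
   additions for illustration.
  </para>

```php
<?php
// \d adds the digits to the class, giving the upper case hex digits.
$isHex  = preg_match('/^[\dABCDEF]+$/', '1F3A');
$notHex = preg_match('/^[\dABCDEF]+$/', 'G1');

// [^\W_] is "word character, but not underscore".
$alnum      = preg_match('/^[^\W_]+$/', 'abc1');
$underscore = preg_match('/^[^\W_]+$/', 'abc_1');

var_dump($isHex, $notHex, $alnum, $underscore);
// int(1), int(0), int(1), int(0)
?>
```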

1037

1237

  <para>

1038

   All non-alphanumeric characters other than \,  -,  ^  (at  the

1039

   start)  and  the  terminating ] are non-special in character

1238

   All non-alphanumeric characters other than \, -, ^ (at the

1239

   start) and the terminating ] are non-special in character

1040

1240

   classes, but it does no harm if they are escaped. The pattern

1041

1241

   terminator is always special and must be escaped when used

1042

1242

   within an expression.

1043

1243

  </para>

1044

1244

  <para>

1045

1245

   Perl supports the POSIX notation for character classes. This uses names

1046

   enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also

1246

   enclosed by <literal>[:</literal> and <literal>:]</literal> within

1247

   the enclosing square brackets. PCRE also

1047

1248

   supports this notation. For example, <literal>[01[:alpha:]%]</literal>

1048

1249

   matches "0", "1", any alphabetic character, or "%". The supported class

1049

1250

   names are:

...

@@ -1082,22 +1283,32 @@

1082

1283

  <para>

1083

1284

   In UTF-8 mode, characters with values greater than 128 do not match any

1084

1285

   of the POSIX character classes.

1286

   As of libpcre 8.10 some character classes are changed to use

1287

   Unicode character properties, in which case the mentioned restriction does

1288

   not apply. Refer to the <link xlink:href="&url.pcre.man;">PCRE(3) manual</link>

1289

   for details.

1290

  </para>
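  <para>
   As a sketch of the POSIX notation with plain ASCII input, the class
   <literal>[01[:alpha:]%]</literal> can be tested like this. The anchors
   and subject strings are assumptions of the example.
  </para>

```php
<?php
// "0", "1", any alphabetic character, or "%".
$allowed  = preg_match('/^[01[:alpha:]%]+$/', '0a%Z1');

// "2" is neither alphabetic nor one of the listed characters.
$rejected = preg_match('/^[01[:alpha:]%]+$/', '2');

var_dump($allowed, $rejected); // int(1), int(0)
?>
```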

1291

  <para>

1292

   Unicode character properties can appear inside a character class. They

1293

   cannot be part of a range. The minus (hyphen) character after a Unicode

1294

   character class will match literally. Trying to end a range with a Unicode

1295

   character property will result in a warning.

1085

1296

  </para>
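  <para>
   A minimal sketch of a Unicode property used as a class member, assuming
   UTF-8 mode (the <literal>u</literal> modifier) and PHP 7.0 or later for
   the <literal>\u{...}</literal> string escape. The subject strings are
   invented.
  </para>

```php
<?php
// Upper case letters (property Lu) or digits.
// U+00C0 is LATIN CAPITAL LETTER A WITH GRAVE.
$upperOrDigit = preg_match('/^[\p{Lu}\d]+$/u', "\u{00C0}B9");

// U+00E0 is the lower case form, which does not have the Lu property.
$lowerCase = preg_match('/^[\p{Lu}\d]+$/u', "\u{00E0}");

var_dump($upperOrDigit, $lowerCase); // int(1), int(0)
?>
```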

1086

1297

 </section>

1087

1298

1088

1299

 <section xml:id="regexp.reference.alternation">

1089

1300

  <title>Alternation</title>

1090

1301

  <para>

1091

   Vertical bar characters are  used  to  separate  alternative

1302

   Vertical bar characters are used to separate alternative

1092

1303

   patterns. For example, the pattern

1093

1304

   <literal>gilbert|sullivan</literal>

1094

1305

   matches either "gilbert" or "sullivan". Any number of alternatives

1095

   may  appear,  and an empty alternative is permitted

1096

   (matching the empty string).   The  matching  process  tries

1097

   each  alternative in turn, from left to right, and the first

1098

   one that succeeds is used. If the alternatives are within  a

1099

   subpattern  (defined  below),  "succeeds" means matching the

1100

   rest of the main pattern as well as the alternative  in  the

1306

   may appear, and an empty alternative is permitted

1307

   (matching the empty string). The matching process tries

1308

   each alternative in turn, from left to right, and the first

1309

   one that succeeds is used. If the alternatives are within a

1310

   subpattern (defined below), "succeeds" means matching the

1311

   rest of the main pattern as well as the alternative in the

1101

1312

   subpattern.

1102

1313

  </para>
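  <para>
   The left-to-right order of alternatives can be observed in a small
   sketch. The subject strings and variable names are hypothetical.
  </para>

```php
<?php
preg_match('/gilbert|sullivan/', 'arthur sullivan', $name);

// Alternatives are tried from left to right, so "a" wins even though
// "ab" would match more of the subject.
preg_match('/a|ab/', 'ab', $first);

// An empty alternative matches the empty string.
$emptyBranch = preg_match('/^(?:cat|)s$/', 's');

var_dump($name[0], $first[0], $emptyBranch);
// string(8) "sullivan", string(1) "a", int(1)
?>
```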

1103

1314

 </section>

...

@@ -1112,7 +1323,7 @@

1112

1323

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,

1113

1324

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1114

1325

   and PCRE_DUPNAMES can be changed from within the pattern by

1115

   a sequence of Perl option letters enclosed between "(?"  and

1326

   a sequence of Perl option letters enclosed between "(?" and

1116

1327

   ")". The option letters are:

1117

1328

1118

1329

   <table>

...

@@ -1141,7 +1352,8 @@

1141

1352

      </row>

1142

1353

      <row>

1143

1354

       <entry><literal>X</literal></entry>

1144

       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link></entry>

1355

       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>

1356

        (no longer supported as of PHP 7.3.0)</entry>

1145

1357

      </row>

1146

1358

      <row>

1147

1359

       <entry><literal>J</literal></entry>

...

@@ -1152,16 +1364,16 @@

1152

1364

   </table>

1153

1365

  </para>

1154

1366

  <para>

1155

   For example, (?im) sets case-insensitive (caseless), multiline matching. It  is

1367

   For example, (?im) sets case-insensitive (caseless), multiline matching. It is

1156

1368

   also possible to unset these options by preceding the letter

1157

   with a hyphen, and a combined setting and unsetting such  as

1158

   (?im-sx),  which sets <link

1369

   with a hyphen, and a combined setting and unsetting such as

1370

   (?im-sx), which sets <link

1159

1371

   linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and

1160

1372

   <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

1161

1373

   while unsetting <link

1162

1374

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and

1163

1375

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,

1164

   is also  permitted. If  a  letter  appears both before and after the

1376

   is also permitted. If a letter appears both before and after the

1165

1377

   hyphen, the option is unset.

1166

1378

  </para>
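  <para>
   The sketch below compares an internal option setting with the
   equivalent pattern modifiers. The subject string is an assumption made
   for the example.
  </para>

```php
<?php
$subject = "x\nABC";

// (?im) inside the pattern sets caseless and multiline matching.
$inline = preg_match('/(?im)^abc$/', $subject);

// The same options given as pattern modifiers.
$modifiers = preg_match('/^abc$/im', $subject);

// Without either option the pattern does not match.
$plain = preg_match('/^abc$/', $subject);

var_dump($inline, $modifiers, $plain); // int(1), int(1), int(0)
?>
```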

1167

1379

  <para>

...

@@ -1171,14 +1383,14 @@

1171

1383

   and "abC".

1172

1384

  </para>

1173

1385

  <para>

1174

   If an option change occurs inside a subpattern,  the  effect

1175

   is  different.  This is a change of behaviour in Perl 5.005.

1176

   An option change inside a subpattern affects only that  part

1386

   If an option change occurs inside a subpattern, the effect

1387

   is different. This is a change of behaviour in Perl 5.005.

1388

   An option change inside a subpattern affects only that part

1177

1389

   of the subpattern that follows it, so

1178

1390

1179

1391

   <literal>(a(?i)b)c</literal>

1180

1392

1181

   matches  abc  and  aBc  and  no  other   strings   (assuming <link

1393

   matches "abc" and "aBc" and no other strings (assuming <link

1182

1394

   linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not

1183

1395

   used). By this means, options can be made to have different settings in

1184

1396

   different parts of the pattern. Any changes made in one alternative do

...

@@ -1187,18 +1399,18 @@

1187

1399

1188

1400

   <literal>(a(?i)b|c)</literal>

1189

1401

1190

   matches "ab", "aB", "c", and "C", even though when  matching

1402

   matches "ab", "aB", "c", and "C", even though when matching

1191

1403

   "C" the first branch is abandoned before the option setting.

1192

   This is because the effects of  option  settings  happen  at

1193

   compile  time. There would be some very weird behaviour otherwise.

1404

   This is because the effects of option settings happen at

1405

   compile time. There would be some very weird behaviour otherwise.

1194

1406

  </para>
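  <para>
   The behaviour of <literal>(a(?i)b|c)</literal> can be checked with a
   short loop. The anchors and the extra subject "Ab" are additions for
   illustration.
  </para>

```php
<?php
$results = [];
foreach (['ab', 'aB', 'c', 'C', 'Ab'] as $subject) {
    // The (?i) setting carries into the second alternative as well,
    // but it does not apply to the "a" that precedes it.
    $results[$subject] = preg_match('/^(a(?i)b|c)$/', $subject);
}

print_r($results);
// "ab", "aB", "c" and "C" give 1; "Ab" gives 0
?>
```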

1195

1407

  <para>

1196

1408

   The PCRE-specific options <link

1197

   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>  and  

1198

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>   can

1409

   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and

1410

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can

1199

1411

   be changed in the same way as the Perl-compatible options by

1200

   using the characters U and X  respectively.  The  (?X)  flag

1201

   setting  is  special in that it must always occur earlier in

1412

   using the characters U and X respectively. The (?X) flag

1413

   setting is special in that it must always occur earlier in

1202

1414

   the pattern than any of the additional features it turns on,

1203

1415

   even when it is at top level. It is best put at the start.

1204

1416

  </para>

...

@@ -1207,8 +1419,8 @@

1207

1419

 <section xml:id="regexp.reference.subpatterns">

1208

1420

  <title>Subpatterns</title>

1209

1421

  <para>

1210

   Subpatterns are delimited by parentheses  (round  brackets),

1211

   which can be nested.  Marking part of a pattern as a subpattern

1422

   Subpatterns are delimited by parentheses (round brackets),

1423

   which can be nested. Marking part of a pattern as a subpattern

1212

1424

   does two things:

1213

1425

  </para>

1214

1426

  <orderedlist>

...

@@ -1237,30 +1449,30 @@

1237

1449

1238

1450

   <literal>the ((red|white) (king|queen))</literal>

1239

1451

1240

   the captured substrings are "red king", "red",  and  "king",

1452

   the captured substrings are "red king", "red", and "king",

1241

1453

   and are numbered 1, 2, and 3.

1242

1454

  </para>

1243

1455

  <para>

1244

   The fact that plain parentheses fulfill two functions is  not

1245

   always  helpful.  There are often times when a grouping subpattern

1246

   is required without a capturing requirement.  If  an

1456

   The fact that plain parentheses fulfill two functions is not

1457

   always helpful. There are often times when a grouping subpattern

1458

   is required without a capturing requirement. If an

1247

1459

   opening parenthesis is followed by "?:", the subpattern does

1248

   not do any capturing, and is not counted when computing  the

1460

   not do any capturing, and is not counted when computing the

1249

1461

   number of any subsequent capturing subpatterns. For example,

1250

   if the string "the  white  queen"  is  matched  against  the

1462

   if the string "the white queen" is matched against the

1251

1463

   pattern

1252

1464

1253

1465

   <literal>the ((?:red|white) (king|queen))</literal>

1254

1466

1255

   the captured substrings are "white queen" and  "queen",  and

1256

   are  numbered  1  and 2. The maximum number of captured substrings

1257

   is 99, and the maximum number  of  all  subpatterns,

1258

   both capturing and non-capturing, is 200.

1467

   the captured substrings are "white queen" and "queen", and

1468

   are numbered 1 and 2. The maximum number of captured substrings

1469

   is 65535. It may not be possible to compile such large patterns,

1470

   however, depending on the configuration options of libpcre.

1259

1471

  </para>
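  <para>
   The numbering of captured substrings, with and without a non-capturing
   group, can be sketched as follows. Variable names are hypothetical.
  </para>

```php
<?php
// Three capturing subpatterns, numbered by their opening parentheses.
preg_match('/the ((red|white) (king|queen))/', 'the red king', $capturing);

// (?:...) groups without capturing, so only two substrings are captured.
preg_match('/the ((?:red|white) (king|queen))/', 'the white queen', $nonCapturing);

print_r($capturing);    // [0] the red king, [1] red king, [2] red, [3] king
print_r($nonCapturing); // [0] the white queen, [1] white queen, [2] queen
?>
```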

1260

1472

  <para>

1261

   As a  convenient  shorthand,  if  any  option  settings  are

1262

   required  at  the  start  of a non-capturing subpattern, the

1263

   option letters may appear between the "?" and the ":".  Thus

1473

   As a convenient shorthand, if any option settings are

1474

   required at the start of a non-capturing subpattern, the

1475

   option letters may appear between the "?" and the ":". Thus

1264

1476

   the two patterns

1265

1477

  </para>

1266

1478

...

@@ -1274,10 +1486,10 @@

1274

1486

  </informalexample>

1275

1487

1276

1488

  <para>

1277

   match exactly the same set of strings.  Because  alternative

1278

   branches  are  tried from left to right, and options are not

1279

   reset until the end of the subpattern is reached, an  option

1280

   setting  in  one  branch does affect subsequent branches, so

1489

   match exactly the same set of strings. Because alternative

1490

   branches are tried from left to right, and options are not

1491

   reset until the end of the subpattern is reached, an option

1492

   setting in one branch does affect subsequent branches, so

1281

1493

   the above patterns match "SUNDAY" as well as "Saturday".

1282

1494

  </para>
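  <para>
   A sketch of the shorthand form next to its longer equivalent. The
   anchors and the subject string are assumptions of the example.
  </para>

```php
<?php
// Option letters between "?" and ":" apply to the whole group.
$shorthand = preg_match('/^(?i:saturday|sunday)$/', 'SUNDAY');

// Setting the option in the first branch also affects the second branch.
$longForm = preg_match('/^(?:(?i)saturday|sunday)$/', 'SUNDAY');

var_dump($shorthand, $longForm); // int(1), int(1)
?>
```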

1283

1495

...

@@ -1285,7 +1497,7 @@

1285

1497

   It is possible to name a subpattern using the syntax

1286

1498

   <literal>(?P&lt;name&gt;pattern)</literal>. This subpattern will then

1287

1499

   be indexed in the matches array by its normal numeric position and

1288

   also by name. PHP 5.2.2 introduced two alternative syntaxes 

1500

   also by name. There are two alternative syntaxes

1289

1501

   <literal>(?&lt;name&gt;pattern)</literal> and <literal>(?'name'pattern)</literal>.

1290

1502

  </para>
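  <para>
   The three naming syntaxes can be combined in one pattern, as in the
   sketch below. The group names and the date string are invented for
   illustration.
  </para>

```php
<?php
// One group per naming syntax. The pattern is double-quoted so that the
// single quotes around "day" need no escaping.
$pattern = "/(?P<year>\d{4})-(?<month>\d{2})-(?'day'\d{2})/";

preg_match($pattern, '2024-05-17', $matches);

// Each named group is available by name and by numeric position.
var_dump($matches['year'], $matches[1]);  // string(4) "2024" twice
var_dump($matches['month'], $matches[2]); // string(2) "05" twice
var_dump($matches['day'], $matches[3]);   // string(2) "17" twice
?>
```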

1291

1503

...

@@ -1306,9 +1518,10 @@

1306

1518

1307

1519

  <para>

1308

1520

   Here <literal>Sun</literal> is stored in backreference 2, while

1309

   backreference 1 is empty. Matching yields <literal>Sat</literal> in

1310

   backreference 1 while backreference 2 does not exist. Changing the pattern

1311

   to use the <literal>(?|</literal> fixes this problem:

1521

   backreference 1 is empty. Matching <literal>Saturday</literal> yields

1522

   <literal>Sat</literal> in backreference 1 while backreference 2 does

1523

   not exist. Changing the pattern to use the <literal>(?|</literal> fixes

1524

   this problem:

1312

1525

  </para>

1313

1526

1314

1527

  <informalexample>

...

@@ -1334,45 +1547,56 @@

1334

1547

    <listitem><simpara>the . metacharacter</simpara></listitem>

1335

1548

    <listitem><simpara>a character class</simpara></listitem>

1336

1549

    <listitem><simpara>a back reference (see next section)</simpara></listitem>

1337

    <listitem><simpara>a parenthesized subpattern (unless it is  an  assertion  -

1550

    <listitem><simpara>a parenthesized subpattern (unless it is an assertion -

1338

1551

     see below)</simpara></listitem>

1339

1552

   </itemizedlist>

1340

1553

  </para>

1341

1554

  <para>

1342

   The general repetition quantifier specifies  a  minimum  and

1343

   maximum  number  of  permitted  matches,  by  giving the two

1344

   numbers in curly brackets (braces), separated  by  a  comma.

1345

   The  numbers  must be less than 65536, and the first must be

1555

   The general repetition quantifier specifies a minimum and

1556

   maximum number of permitted matches, by giving the two

1557

   numbers in curly brackets (braces), separated by a comma.

1558

   The numbers must be less than 65536, and the first must be

1346

1559

   less than or equal to the second. For example:

1347

1560

1348

1561

   <literal>z{2,4}</literal>

1349

1562

1350

   matches "zz", "zzz", or "zzzz". A closing brace on  its  own

1563

   matches "zz", "zzz", or "zzzz". A closing brace on its own

1351

1564

   is not a special character. If the second number is omitted,

1352

   but the comma is present, there is no upper  limit;  if  the

1565

   but the comma is present, there is no upper limit; if the

1353

1566

   second number and the comma are both omitted, the quantifier

1354

1567

   specifies an exact number of required matches. Thus

1355

1568

1356

1569

   <literal>[aeiou]{3,}</literal>

1357

1570

1358

   matches at least 3 successive vowels,  but  may  match  many

1571

   matches at least 3 successive vowels, but may match many

1359

1572

   more, while

1360

1573

1361

1574

   <literal>\d{8}</literal>

1362

1575

1363

   matches exactly 8 digits.  An  opening  curly  bracket  that

1364

   appears  in a position where a quantifier is not allowed, or

1365

   one that does not match the syntax of a quantifier, is taken

1366

   as  a literal character. For example, {,6} is not a quantifier,

1367

   but a literal string of four characters.

1576

   matches exactly 8 digits.

1577

1368

1578

  </para>
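  <para>
   The three quantifier forms above can be sketched with a few calls. The
   anchors and subject strings are added for the example.
  </para>

```php
<?php
// Between two and four repetitions.
$threeZ = preg_match('/^z{2,4}$/', 'zzz');
$fiveZ  = preg_match('/^z{2,4}$/', 'zzzzz');

// At least three vowels, with no upper limit.
preg_match('/[aeiou]{3,}/', 'queue', $vowels);

// Exactly eight digits.
$eightDigits = preg_match('/^\d{8}$/', '20240131');

var_dump($threeZ, $fiveZ, $vowels[0], $eightDigits);
// int(1), int(0), string(4) "ueue", int(1)
?>
```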

1579

  <simpara>

1580

   Prior to PHP 8.4.0, an opening curly bracket that

1581

   appears in a position where a quantifier is not allowed, or

1582

   one that does not match the syntax of a quantifier, is taken

1583

   as a literal character. For example, <literal>{,6}</literal>

1584

   is not a quantifier, but a literal string of four characters.

1585

1586

   As of PHP 8.4.0, the PCRE extension is bundled with PCRE2 version 10.44,

1587

   which allows patterns such as <literal>\d{,8}</literal>; these are

1588

   interpreted as <literal>\d{0,8}</literal>.

1589

1590

   Further, as of PHP 8.4.0, space characters around quantifiers such as

1591

   <literal>\d{0 , 8}</literal> and <literal>\d{ 0 , 8 }</literal> are allowed.

1592

  </simpara>

1369

1593

  <para>

1370

   The quantifier {0} is permitted, causing the  expression  to

1371

   behave  as  if the previous item and the quantifier were not

1594

   The quantifier {0} is permitted, causing the expression to

1595

   behave as if the previous item and the quantifier were not

1372

1596

   present.

1373

1597

  </para>

1374

1598

  <para>

1375

   For convenience (and  historical  compatibility)  the  three

1599

   For convenience (and historical compatibility) the three

1376

1600

   most common quantifiers have single-character abbreviations:

1377

1601

1378

1602

   <table>

...

@@ -1396,63 +1620,63 @@

1396

1620

   </table>

1397

1621

  </para>

1398

1622

  <para>

1399

   It is possible to construct infinite loops  by  following  a

1400

   subpattern  that  can  match no characters with a quantifier

1623

   It is possible to construct infinite loops by following a

1624

   subpattern that can match no characters with a quantifier

1401

1625

   that has no upper limit, for example:

1402

1626

1403

1627

   <literal>(a?)*</literal>

1404

1628

  </para>

1405

1629

  <para>

1406

   Earlier versions of Perl and PCRE used to give an  error  at

1407

   compile  time  for such patterns. However, because there are

1408

   cases where this  can  be  useful,  such  patterns  are  now

1409

   accepted,  but  if  any repetition of the subpattern does in

1630

   Earlier versions of Perl and PCRE used to give an error at

1631

   compile time for such patterns. However, because there are

1632

   cases where this can be useful, such patterns are now

1633

   accepted, but if any repetition of the subpattern does in

1410

1634

   fact match no characters, the loop is forcibly broken.

1411

1635

  </para>

1412

1636

  <para>

1413

   By default, the quantifiers  are  "greedy",  that  is,  they

1414

   match  as much as possible (up to the maximum number of permitted

1415

   times), without causing the rest of  the  pattern  to

1637

   By default, the quantifiers are "greedy", that is, they

1638

   match as much as possible (up to the maximum number of permitted

1639

   times), without causing the rest of the pattern to

1416

1640

   fail. The classic example of where this gives problems is in

1417

1641

   trying to match comments in C programs. These appear between

1418

   the  sequences /* and */ and within the sequence, individual

1419

   * and / characters may appear. An attempt to  match  C  comments

1642

   the sequences /* and */ and within the sequence, individual

1643

   * and / characters may appear. An attempt to match C comments

1420

1644

   by applying the pattern

1421

1645

1422

1646

   <literal>/\*.*\*/</literal>

1423

1647

1424

1648

   to the string

1425

1649

1426

   <literal>/* first comment */  not comment  /* second comment */</literal>

1650

   <literal>/* first comment */ not comment /* second comment */</literal>

1427

1651

1428

   fails, because it matches  the  entire  string  due  to  the

1429

   greediness of the .*  item.

1652

   fails, because it matches the entire string due to the

1653

   greediness of the .* item.

1430

1654

  </para>

1431

1655

  <para>

1432

   However, if a quantifier is followed  by  a  question  mark,

1656

   However, if a quantifier is followed by a question mark,

1433

1657

   then it becomes lazy, and instead matches the minimum

1434

1658

   number of times possible, so the pattern

1435

1659

1436

1660

   <literal>/\*.*?\*/</literal>

1437

1661

1438

1662

   does the right thing with the C comments. The meaning of the

1439

   various  quantifiers is not otherwise changed, just the preferred

1440

   number of matches.  Do not confuse this use of

1441

   question  mark  with  its  use as a quantifier in its own right.

1663

   various quantifiers is not otherwise changed, just the preferred

1664

   number of matches. Do not confuse this use of

1665

   question mark with its use as a quantifier in its own right.

1442

1666

   Because it has two uses, it can sometimes appear doubled, as

1443

1667

   in

1444

1668

1445

1669

   <literal>\d??\d</literal>

1446

1670

1447

   which matches one digit by preference, but can match two  if

1671

   which matches one digit by preference, but can match two if

1448

1672

   that is the only way the rest of the pattern matches.

1449

1673

  </para>
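
The greedy and lazy behaviour described above can be sketched with a short script. This is only an illustration: it uses Python's re module as a stand-in, on the assumption that these particular patterns behave the same way under PCRE. The variable names are hypothetical.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* first runs to the end of the subject, then backs off only as far
# as the last "*/", so the whole string is returned as a single match.
greedy = re.findall(r"/\*.*\*/", subject)

# Lazy: .*? extends one character at a time and stops at the first "*/".
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers a single digit, but takes two when the rest of the
# pattern (here the $ anchor) cannot match otherwise.
one_digit = re.match(r"\d??\d", "12").group(0)
two_digits = re.match(r"\d??\d$", "12").group(0)

print(greedy)
print(lazy)
print(one_digit, two_digits)
```

The lazy form returns the two comments separately, while the greedy form swallows the text between them.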

1450

1674

  <para>

1451

1675

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>

1452

   option is set (an option which  is  not

1453

   available  in  Perl)  then the quantifiers are not greedy by

1676

   option is set (an option which is not

1677

   available in Perl) then the quantifiers are not greedy by

1454

1678

   default, but individual ones can be made greedy by following

1455

   them  with  a  question mark. In other words, it inverts the

1679

   them with a question mark. In other words, it inverts the

1456

1680

   default behaviour.

1457

1681

  </para>

1458

1682

  <para>

...

@@ -1464,41 +1688,41 @@

1464

1688

  </para>

1465

1689

  <para>

1466

1690

   When a parenthesized subpattern is quantified with a minimum

1467

   repeat  count  that is greater than 1 or with a limited maximum,

1468

   more store is required for the  compiled  pattern,  in

1691

   repeat count that is greater than 1 or with a limited maximum,

1692

   more memory is required for the compiled pattern, in

1469

1693

   proportion to the size of the minimum or maximum.

1470

1694

  </para>

1471

1695

  <para>

1472

   If a pattern starts with .* or  .{0,}  and  the  <link 

1696

   If a pattern starts with .* or .{0,} and the <link

1473

1697

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1474

1698

   option (equivalent to Perl's /s) is set, thus allowing the .

1475

   to match newlines, then the pattern is implicitly  anchored,

1699

   to match newlines, then the pattern is implicitly anchored,

1476

1700

   because whatever follows will be tried against every character

1477

   position in the subject string, so there is no point  in

1478

   retrying  the overall match at any position after the first.

1701

   position in the subject string, so there is no point in

1702

   retrying the overall match at any position after the first.

1479

1703

   PCRE treats such a pattern as though it were preceded by \A.

1480

   In  cases where it is known that the subject string contains

1704

   In cases where it is known that the subject string contains

1481

1705

   no newlines, it is worth setting <link

1482

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  when  the  

1706

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the

1483

1707

   pattern begins with .* in order to

1484

1708

   obtain this optimization, or

1485

1709

   alternatively using ^ to indicate anchoring explicitly.

1486

1710

  </para>

1487

1711

  <para>

1488

   When a capturing subpattern is repeated, the value  captured

1712

   When a capturing subpattern is repeated, the value captured

1489

1713

   is the substring that matched the final iteration. For example, after

1490

1714

1491

1715

   <literal>(tweedle[dume]{3}\s*)+</literal>

1492

1716

1493

   has matched "tweedledum tweedledee" the value  of  the  captured

1494

   substring  is  "tweedledee".  However,  if  there are

1495

   nested capturing  subpatterns,  the  corresponding  captured

1496

   values  may  have been set in previous iterations. For example,

1717

   has matched "tweedledum tweedledee" the value of the captured

1718

   substring is "tweedledee". However, if there are

1719

   nested capturing subpatterns, the corresponding captured

1720

   values may have been set in previous iterations. For example,

1497

1721

   after

1498

1722

1499

1723

   <literal>/(a|(b))+/</literal>

1500

1724

1501

   matches "aba" the value of the second captured substring  is

1725

   matches "aba" the value of the second captured substring is

1502

1726

   "b".

1503

1727

  </para>
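
The two capture rules above can be checked with a small sketch. It uses Python's re module as a stand-in and assumes that, for these two patterns, it keeps captured values the same way PCRE is described as doing here. The variable names are hypothetical.

```python
import re

# The group is repeated twice; only the final iteration is kept.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m.group(1)

# The outer group ends on "a", but the nested group still holds the "b"
# that was set during the second iteration.
nested = re.fullmatch(r"(a|(b))+", "aba")
outer = nested.group(1)
inner = nested.group(2)

print(last_iteration, outer, inner)
```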

1504

1728

 </section>

...

@@ -1506,78 +1730,78 @@

1506

1730

 <section xml:id="regexp.reference.back-references">

1507

1731

  <title>Back references</title>

1508

1732

  <para>

1509

   Outside a character class, a backslash followed by  a  digit

1510

   greater  than  0  (and  possibly  further  digits) is a back

1511

   reference to a capturing subpattern  earlier  (i.e.  to  its

1512

   left)  in  the  pattern,  provided there have been that many

1733

   Outside a character class, a backslash followed by a digit

1734

   greater than 0 (and possibly further digits) is a back

1735

   reference to a capturing subpattern earlier (i.e. to its

1736

   left) in the pattern, provided there have been that many

1513

1737

   previous capturing left parentheses.

1514

1738

  </para>

1515

1739

  <para>

1516

   However, if the decimal number following  the  backslash  is

1517

   less  than  10,  it is always taken as a back reference, and

1518

   causes an error only if there are not  that  many  capturing

1519

   left  parentheses in the entire pattern. In other words, the

1520

   parentheses that are referenced need not be to the  left  of

1521

   the  reference  for  numbers  less  than 10. 

1740

   However, if the decimal number following the backslash is

1741

   less than 10, it is always taken as a back reference, and

1742

   causes an error only if there are not that many capturing

1743

   left parentheses in the entire pattern. In other words, the

1744

   parentheses that are referenced need not be to the left of

1745

   the reference for numbers less than 10.

1522

1746

   A "forward back reference" can make sense when a repetition

1523

1747

   is involved and the subpattern to the right has participated

1524

1748

   in an earlier iteration. See the section

1525

   entitled "Backslash" above for further details of  the  handling

1749

   on <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling

1526

1750

   of digits following a backslash.

1527

1751

  </para>

1528

1752

  <para>

1529

   A back reference matches whatever actually matched the  capturing

1753

   A back reference matches whatever actually matched the capturing

1530

1754

   subpattern in the current subject string, rather than

1531

1755

   anything matching the subpattern itself. So the pattern

1532

1756

1533

1757

   <literal>(sens|respons)e and \1ibility</literal>

1534

1758

1535

   matches "sense and sensibility" and "response and  responsibility",

1536

   but  not  "sense  and  responsibility". If case-sensitive (caseful)

1759

   matches "sense and sensibility" and "response and responsibility",

1760

   but not "sense and responsibility". If case-sensitive (caseful)

1537

1761

   matching is in force at the time of the back reference, then

1538

1762

   the case of letters is relevant. For example,

1539

1763

1540

1764

   <literal>((?i)rah)\s+\1</literal>

1541

1765

1542

   matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even

1543

   though  the  original  capturing subpattern is matched

1766

   matches "rah rah" and "RAH RAH", but not "RAH rah", even

1767

   though the original capturing subpattern is matched

1544

1768

   case-insensitively (caselessly).

1545

1769

  </para>
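
A minimal sketch of the first example above, showing that a back reference matches the text that was actually captured rather than the subpattern. Python's re module is used as a stand-in, assuming it treats this pattern as PCRE does. The case-sensitivity example is not reproduced because inline option handling differs between the two engines.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",  # \1 holds "sens", so "responsibility" fails
]

# True where the whole subject matches, False otherwise.
matches = [bool(pattern.fullmatch(s)) for s in subjects]

print(matches)
```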

1546

1770

  <para>

1547

   There may be more than one back reference to the  same  subpattern.

1548

   If  a  subpattern  has not actually been used in a

1549

   particular match, then any  back  references  to  it  always

1771

   There may be more than one back reference to the same subpattern.

1772

   If a subpattern has not actually been used in a

1773

   particular match, then any back references to it always

1550

1774

   fail. For example, the pattern

1551

1775

1552

1776

   <literal>(a|(bc))\2</literal>

1553

1777

1554

   always fails if it starts to match  "a"  rather  than  "bc".

1555

   Because  there  may  be up to 99 back references, all digits

1556

   following the backslash are taken as  part  of  a  potential

1778

   always fails if it starts to match "a" rather than "bc".

1779

   Because there may be up to 99 back references, all digits

1780

   following the backslash are taken as part of a potential

1557

1781

   back reference number. If the pattern continues with a digit

1558

1782

   character, then some delimiter must be used to terminate the

1559

1783

   back reference. If the <link

1560

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>  option 

1561

   is set, this can be whitespace.  Otherwise an empty comment can be used.

1784

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option

1785

   is set, this can be whitespace. Otherwise an empty comment can be used.

1562

1786

  </para>
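
The failure of a back reference to an unused subpattern can be sketched as follows. Python's re module is used as a stand-in, on the assumption that it treats a reference to an unset group as a failure in the same way. The variable names are hypothetical.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# The "bc" branch sets group 2, so \2 can match a second "bc".
took_bc_branch = pattern.fullmatch("bcbc")

# The "a" branch leaves group 2 unset, so \2 fails whatever follows.
took_a_branch = pattern.fullmatch("aa")
only_a = pattern.search("a")

print(took_bc_branch is not None, took_a_branch, only_a)
```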

1563

1787

  <para>

1564

1788

   A back reference that occurs inside the parentheses to which

1565

   it  refers  fails when the subpattern is first used, so, for

1566

   example, (a\1) never matches.  However, such references  can

1789

   it refers fails when the subpattern is first used, so, for

1790

   example, (a\1) never matches. However, such references can

1567

1791

   be useful inside repeated subpatterns. For example, the pattern

1568

1792

1569

1793

   <literal>(a|b\1)+</literal>

1570

1794

1571

   matches any number of "a"s and also "aba", "ababba" etc.  At

1795

   matches any number of "a"s and also "aba", "ababba" etc. At

1572

1796

   each iteration of the subpattern, the back reference matches

1573

   the character string corresponding to  the  previous  iteration.

1797

   the character string corresponding to the previous iteration.

1574

1798

   In order for this to work, the pattern must be such

1575

   that the first iteration does not need  to  match  the  back

1576

   reference.  This  can  be  done using alternation, as in the

1799

   that the first iteration does not need to match the back

1800

   reference. This can be done using alternation, as in the

1577

1801

   example above, or by a quantifier with a minimum of zero.

1578

1802

  </para>

1579

1803

  <para>

1580

   As of PHP 5.2.2, the <literal>\g</literal> escape sequence can be 

1804

   The <literal>\g</literal> escape sequence can be

1581

1805

   used for absolute and relative referencing of subpatterns.

1582

1806

   This escape sequence must be followed by an unsigned number or a negative

1583

1807

   number, optionally enclosed in braces. The sequences <literal>\1</literal>,

...

@@ -1598,28 +1822,28 @@

1598

1822

  </para>

1599

1823

  <para>

1600

1824

   Back references to the named subpatterns can be achieved by

1601

   <literal>(?P=name)</literal> or, since PHP 5.2.2, also by

1602

   <literal>\k&lt;name&gt;</literal> or <literal>\k'name'</literal>. 

1603

   Additionally PHP 5.2.4 added support for <literal>\k{name}</literal> 

1604

   and <literal>\g{name}</literal>.

1825

   <literal>(?P=name)</literal>,

1826

   <literal>\k&lt;name&gt;</literal>, <literal>\k'name'</literal>,

1827

   <literal>\k{name}</literal>, <literal>\g{name}</literal>,

1828

   <literal>\g&lt;name&gt;</literal> or <literal>\g'name'</literal>.

1605

1829

  </para>

1606

1830

 </section>

1607

1831

1608

1832

 <section xml:id="regexp.reference.assertions">

1609

1833

  <title>Assertions</title>

1610

1834

  <para>

1611

   An assertion is  a  test  on  the  characters  following  or

1612

   preceding  the current matching point that does not actually

1613

   consume any characters. The simple assertions coded  as  \b,

1614

   \B,  \A,  \Z,  \z, ^ and $ are described above. More complicated

1615

   assertions are coded as  subpatterns.  There  are  two

1616

   kinds:  those that <emphasis>look ahead</emphasis> of the current position in the

1835

   An assertion is a test on the characters following or

1836

   preceding the current matching point that does not actually

1837

   consume any characters. The simple assertions coded as \b,

1838

   \B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated

1839

   assertions are coded as subpatterns. There are two

1840

   kinds: those that <emphasis>look ahead</emphasis> of the current position in the

1617

1841

   subject string, and those that <emphasis>look behind</emphasis> it.

1618

1842

  </para>

1619

1843

  <para>

1620

1844

   An assertion subpattern is matched in the normal way, except

1621

   that  it  does not cause the current matching position to be

1622

   changed. <emphasis>Lookahead</emphasis> assertions start with  (?=  for  positive

1845

   that it does not cause the current matching position to be

1846

   changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive

1623

1847

   assertions and (?! for negative assertions. For example,

1624

1848

1625

1849

   <literal>\w+(?=;)</literal>

...

@@ -1629,27 +1853,27 @@

1629

1853

1630

1854

   <literal>foo(?!bar)</literal>

1631

1855

1632

   matches any occurrence of "foo"  that  is  not  followed  by

1856

   matches any occurrence of "foo" that is not followed by

1633

1857

   "bar". Note that the apparently similar pattern

1634

1858

1635

1859

   <literal>(?!foo)bar</literal>

1636

1860

1637

   does not find an occurrence of "bar"  that  is  preceded  by

1861

   does not find an occurrence of "bar" that is preceded by

1638

1862

   something other than "foo"; it finds any occurrence of "bar"

1639

   whatsoever, because the assertion  (?!foo)  is  always  &true;

1640

   when  the  next  three  characters  are  "bar". A lookbehind

1863

   whatsoever, because the assertion (?!foo) is always &true;

1864

   when the next three characters are "bar". A lookbehind

1641

1865

   assertion is needed to achieve this effect.

1642

1866

  </para>
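
The difference between the two negative lookahead patterns above can be seen in a short sketch. It uses Python's re module as a stand-in, assuming the lookahead syntax behaves as in PCRE. The subject string is made up for illustration.

```python
import re

subject = "foobar foobaz"

# foo(?!bar): only the "foo" at offset 7 qualifies, because the one at
# offset 0 is followed by "bar".
negative_lookahead = [m.start() for m in re.finditer(r"foo(?!bar)", subject)]

# (?!foo)bar: the assertion inspects the same characters that "bar" is about
# to match, so it always succeeds and the "bar" right after "foo" is found.
misplaced_lookahead = [m.start() for m in re.finditer(r"(?!foo)bar", subject)]

print(negative_lookahead, misplaced_lookahead)
```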

1643

1867

  <para>

1644

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;=  for  positive  assertions

1868

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions

1645

1869

   and (?&lt;! for negative assertions. For example,

1646

1870

1647

1871

   <literal>(?&lt;!foo)bar</literal>

1648

1872

1649

   does find an occurrence of "bar" that  is  not  preceded  by

1873

   does find an occurrence of "bar" that is not preceded by

1650

1874

   "foo". The contents of a lookbehind assertion are restricted

1651

   such that all the strings  it  matches  must  have  a  fixed

1652

   length.  However, if there are several alternatives, they do

1875

   such that all the strings it matches must have a fixed

1876

   length. However, if there are several alternatives, they do

1653

1877

   not all have to have the same fixed length. Thus

1654

1878

1655

1879

   <literal>(?&lt;=bullock|donkey)</literal>

...

@@ -1658,51 +1882,51 @@

1658

1882

1659

1883

   <literal>(?&lt;!dogs?|cats?)</literal>

1660

1884

1661

   causes an error at compile time. Branches  that  match  different

1885

   causes an error at compile time. Branches that match different

1662

1886

   length strings are permitted only at the top level of

1663

   a lookbehind assertion. This is an extension  compared  with

1664

   Perl  5.005,  which  requires all branches to match the same

1887

   a lookbehind assertion. This is an extension compared with

1888

   Perl 5.005, which requires all branches to match the same

1665

1889

   length of string. An assertion such as

1666

1890

1667

1891

   <literal>(?&lt;=ab(c|de))</literal>

1668

1892

1669

   is not permitted, because its single  top-level  branch  can

1893

   is not permitted, because its single top-level branch can

1670

1894

   match two different lengths, but it is acceptable if rewritten

1671

1895

   to use two top-level branches:

1672

1896

1673

1897

   <literal>(?&lt;=abc|abde)</literal>

1674

1898

1675

   The implementation of lookbehind  assertions  is,  for  each

1676

   alternative,  to  temporarily move the current position back

1677

   by the fixed width and then  try  to  match.  If  there  are

1678

   insufficient  characters  before  the  current position, the

1679

   match is deemed to fail.  Lookbehinds  in  conjunction  with

1680

   once-only  subpatterns can be particularly useful for matching

1681

   at the ends of strings; an example is given at  the  end

1899

   The implementation of lookbehind assertions is, for each

1900

   alternative, to temporarily move the current position back

1901

   by the fixed width and then try to match. If there are

1902

   insufficient characters before the current position, the

1903

   match is deemed to fail. Lookbehinds in conjunction with

1904

   once-only subpatterns can be particularly useful for matching

1905

   at the ends of strings; an example is given at the end

1682

1906

   of the section on once-only subpatterns.

1683

1907

  </para>

1684

1908

  <para>

1685

   Several assertions (of any sort) may  occur  in  succession.

1909

   Several assertions (of any sort) may occur in succession.

1686

1910

   For example,

1687

1911

1688

1912

   <literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>

1689

1913

1690

   matches "foo" preceded by three digits that are  not  "999".

1691

   Notice  that each of the assertions is applied independently

1692

   at the same point in the subject string. First  there  is  a

1693

   check  that  the  previous  three characters are all digits,

1914

   matches "foo" preceded by three digits that are not "999".

1915

   Notice that each of the assertions is applied independently

1916

   at the same point in the subject string. First there is a

1917

   check that the previous three characters are all digits,

1694

1918

   then there is a check that the same three characters are not

1695

   "999".   This  pattern  does not match "foo" preceded by six

1919

   "999". This pattern does not match "foo" preceded by six

1696

1920

   characters, the first three of which are digits and the last three

1697

   of  which  are  not  "999".  For  example,  it doesn't match

1921

   of which are not "999". For example, it doesn't match

1698

1922

   "123abcfoo". A pattern to do that is

1699

1923

1700

1924

   <literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>

1701

1925

  </para>

1702

1926

  <para>

1703

   This time the first assertion looks  at  the  preceding  six

1704

   characters,  checking  that  the first three are digits, and

1705

   then the second assertion checks that  the  preceding  three

1927

   This time the first assertion looks at the preceding six

1928

   characters, checking that the first three are digits, and

1929

   then the second assertion checks that the preceding three

1706

1930

   characters are not "999".

1707

1931

  </para>
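
The two patterns with successive lookbehind assertions can be compared in a short sketch. Python's re module is used as a stand-in, assuming these fixed-length lookbehinds behave as in PCRE. The names and the extra subject strings are hypothetical.

```python
import re

# Both assertions look at the same three characters before "foo".
three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six_chars = re.compile(r"(?<=\d{3}...)(?<!999)foo")

results = {
    "three_digits/123foo": bool(three_digits.search("123foo")),
    "three_digits/999foo": bool(three_digits.search("999foo")),
    "three_digits/123abcfoo": bool(three_digits.search("123abcfoo")),
    "six_chars/123abcfoo": bool(six_chars.search("123abcfoo")),
    "six_chars/123999foo": bool(six_chars.search("123999foo")),
    "six_chars/123foo": bool(six_chars.search("123foo")),
}

for name, matched in results.items():
    print(name, matched)
```

The last case fails because fewer than six characters precede "foo", so the first lookbehind cannot move back far enough.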

1708

1932

  <para>

...

@@ -1710,26 +1934,26 @@

1710

1934

1711

1935

   <literal>(?&lt;=(?&lt;!foo)bar)baz</literal>

1712

1936

1713

   matches an occurrence of "baz" that  is  preceded  by  "bar"

1937

   matches an occurrence of "baz" that is preceded by "bar"

1714

1938

   which in turn is not preceded by "foo", while

1715

1939

1716

1940

   <literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>

1717

1941

1718

   is another pattern which matches  "foo"  preceded  by  three

1942

   is another pattern which matches "foo" preceded by three

1719

1943

   digits and any three characters that are not "999".

1720

1944

  </para>

1721

1945

  <para>

1722

1946

   Assertion subpatterns are not capturing subpatterns, and may

1723

   not  be  repeated,  because  it makes no sense to assert the

1724

   same thing several times. If any kind of assertion  contains

1725

   capturing  subpatterns  within it, these are counted for the

1947

   not be repeated, because it makes no sense to assert the

1948

   same thing several times. If any kind of assertion contains

1949

   capturing subpatterns within it, these are counted for the

1726

1950

   purposes of numbering the capturing subpatterns in the whole

1727

   pattern.   However,  substring capturing is carried out only

1728

   for positive assertions, because it does not make sense  for

1951

   pattern. However, substring capturing is carried out only

1952

   for positive assertions, because it does not make sense for

1729

1953

   negative assertions.

1730

1954

  </para>

1731

1955

  <para>

1732

   Assertions count towards the maximum  of  200  parenthesized

1956

   Assertions count towards the maximum of 200 parenthesized

1733

1957

   subpatterns.

1734

1958

  </para>

1735

1959

 </section>

...

@@ -1737,17 +1961,17 @@

1737

1961

 <section xml:id="regexp.reference.onlyonce">

1738

1962

  <title>Once-only subpatterns</title>

1739

1963

  <para>

1740

   With both maximizing and minimizing repetition,  failure  of

1741

   what  follows  normally  causes  the repeated item to be

1964

   With both maximizing and minimizing repetition, failure of

1965

   what follows normally causes the repeated item to be

1742

1966

   re-evaluated to see if a different number of repeats allows the

1743

   rest  of  the  pattern  to  match. Sometimes it is useful to

1744

   prevent this, either to change the nature of the  match,  or

1745

   to  cause  it fail earlier than it otherwise might, when the

1746

   author of the pattern knows there is no  point  in  carrying

1967

   rest of the pattern to match. Sometimes it is useful to

1968

   prevent this, either to change the nature of the match, or

1969

   to cause it to fail earlier than it otherwise might, when the

1970

   author of the pattern knows there is no point in carrying

1747

1971

   on.

1748

1972

  </para>

1749

1973

  <para>

1750

   Consider, for example, the pattern \d+foo  when  applied  to

1974

   Consider, for example, the pattern \d+foo when applied to

1751

1975

   the subject line

1752

1976

1753

1977

   <literal>123456bar</literal>

...

@@ -1755,108 +1979,108 @@

1755

1979

  <para>

1756

1980

   After matching all 6 digits and then failing to match "foo",

1757

1981

   the normal action of the matcher is to try again with only 5

1758

   digits matching the \d+ item, and then with 4,  and  so  on,

1982

   digits matching the \d+ item, and then with 4, and so on,

1759

1983

   before ultimately failing. Once-only subpatterns provide the

1760

   means for specifying that once a portion of the pattern  has

1761

   matched,  it  is  not to be re-evaluated in this way, so the

1762

   matcher would give up immediately on failing to match  "foo"

1763

   the  first  time.  The  notation  is another kind of special

1984

   means for specifying that once a portion of the pattern has

1985

   matched, it is not to be re-evaluated in this way, so the

1986

   matcher would give up immediately on failing to match "foo"

1987

   the first time. The notation is another kind of special

1764

1988

   parenthesis, starting with (?&gt; as in this example:

1765

1989

1766

1990

   <literal>(?&gt;\d+)foo</literal>

1767

1991

  </para>

1768

1992

  <para>

1769

   This kind of parenthesis "locks up" the  part of the pattern

1770

   it  contains once it has matched, and a failure further into

1771

   the pattern is prevented from backtracking  into  it.

1772

   Backtracking  past  it to previous items, however, works as normal.

1993

   This kind of parenthesis "locks up" the part of the pattern

1994

   it contains once it has matched, and a failure further into

1995

   the pattern is prevented from backtracking into it.

1996

   Backtracking past it to previous items, however, works as normal.

1773

1997

  </para>

1774

1998

  <para>

1775

1999

   An alternative description is that a subpattern of this type

1776

   matches  the  string  of  characters that an identical standalone

2000

   matches the string of characters that an identical standalone

1777

2001

   pattern would match, if anchored at the current point

1778

2002

   in the subject string.

1779

2003

  </para>

1780

2004

  <para>

1781

   Once-only subpatterns are not capturing subpatterns.  Simple

1782

   cases  such as the above example can be thought of as a maximizing

1783

   repeat that must  swallow  everything  it  can.  So,

2005

   Once-only subpatterns are not capturing subpatterns. Simple

2006

   cases such as the above example can be thought of as a maximizing

2007

   repeat that must swallow everything it can. So,

1784

2008

   while both \d+ and \d+? are prepared to adjust the number of

1785

   digits they match in order to make the rest of  the  pattern

2009

   digits they match in order to make the rest of the pattern

1786

2010

   match, (?&gt;\d+) can only match an entire sequence of digits.

1787

2011

  </para>

1788

2012

  <para>

1789

   This construction can of course contain arbitrarily  complicated

2013

   This construction can of course contain arbitrarily complicated

1790

2014

   subpatterns, and it can be nested.

1791

2015

  </para>

1792

2016

  <para>

1793

2017

   Once-only subpatterns can be used in conjunction with

1794

   lookbehind assertions  to specify efficient matching at the end

2018

   lookbehind assertions to specify efficient matching at the end

1795

2019

   of the subject string. Consider a simple pattern such as

1796

2020

1797

2021

   <literal>abcd$</literal>

1798

2022

1799

   when applied to a long string which does not match.  Because

1800

   matching  proceeds  from  left  to right, PCRE will look for

2023

   when applied to a long string which does not match. Because

2024

   matching proceeds from left to right, PCRE will look for

1801

2025

   each "a" in the subject and then see if what follows matches

1802

2026

   the rest of the pattern. If the pattern is specified as

1803

2027

1804

2028

   <literal>^.*abcd$</literal>

1805

2029

1806

   then the initial .* matches the entire string at first,  but

1807

   when  this  fails  (because  there  is no following "a"), it

2030

   then the initial .* matches the entire string at first, but

2031

   when this fails (because there is no following "a"), it

1808

2032

   backtracks to match all but the last character, then all but

1809

   the  last  two  characters, and so on. Once again the search

1810

   for "a" covers the entire string, from right to left, so  we

2033

   the last two characters, and so on. Once again the search

2034

   for "a" covers the entire string, from right to left, so we

1811

2035

   are no better off. However, if the pattern is written as

1812

2036

1813

2037

   <literal>^(?>.*)(?&lt;=abcd)</literal>

1814

2038

1815

   then there can be no backtracking for the .*  item;  it  can

1816

   match  only  the  entire  string.  The subsequent lookbehind

2039

   then there can be no backtracking for the .* item; it can

2040

   match only the entire string. The subsequent lookbehind

1817

2041

   assertion does a single test on the last four characters. If

1818

   it  fails,  the  match  fails immediately. For long strings,

2042

   it fails, the match fails immediately. For long strings,

1819

2043

   this approach makes a significant difference to the processing time.

1820

2044

  </para>

1821

2045

  <para>

1822

2046

   When a pattern contains an unlimited repeat inside a subpattern

1823

2047

   that can itself be repeated an unlimited number of

1824

   times, the use of a once-only subpattern is the only way  to

1825

   avoid  some  failing matches taking a very long time indeed.

2048

   times, the use of a once-only subpattern is the only way to

2049

   avoid some failing matches taking a very long time indeed.

1826

2050

   The pattern

1827

2051

1828

2052

   <literal>(\D+|&lt;\d+>)*[!?]</literal>

1829

2053

1830

   matches an unlimited number of substrings that  either  consist

1831

   of  non-digits,  or digits enclosed in &lt;>, followed by

2054

   matches an unlimited number of substrings that either consist

2055

   of non-digits, or digits enclosed in &lt;>, followed by

1832

2056

   either ! or ?. When it matches, it runs quickly. However, if

1833

2057

   it is applied to

1834

2058

1835

2059

   <literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>

1836

2060

1837

   it takes a long  time  before  reporting  failure.  This  is

2061

   it takes a long time before reporting failure. This is

1838

2062

   because the string can be divided between the two repeats in

1839

2063

   a large number of ways, and all have to be tried. (The example

1840

   used  [!?]  rather  than a single character at the end,

1841

   because both PCRE and Perl have an optimization that  allows

1842

   for  fast  failure  when  a  single  character is used. They

1843

   remember the last single character that is  required  for  a

1844

   match,  and  fail early if it is not present in the string.)

2064

   used [!?] rather than a single character at the end,

2065

   because both PCRE and Perl have an optimization that allows

2066

   for fast failure when a single character is used. They

2067

   remember the last single character that is required for a

2068

   match, and fail early if it is not present in the string.)

1845

2069

   If the pattern is changed to

1846

2070

1847

2071

   <literal>((?>\D+)|&lt;\d+>)*[!?]</literal>

1848

2072

1849

   sequences of non-digits cannot be broken, and  failure  happens quickly.

2073

   sequences of non-digits cannot be broken, and failure happens quickly.

1850

2074

  </para>

1851

2075

 </section>

1852

2076

1853

2077

 <section xml:id="regexp.reference.conditional">

1854

2078

  <title>Conditional subpatterns</title>

1855

2079

  <para>

1856

   It is possible to cause the matching process to obey a  subpattern 

1857

   conditionally  or to choose between two alternative

1858

   subpatterns, depending on the result  of  an  assertion,  or

1859

   whether  a previous capturing subpattern matched or not. The

2080

   It is possible to cause the matching process to obey a subpattern

2081

   conditionally or to choose between two alternative

2082

   subpatterns, depending on the result of an assertion, or

2083

   whether a previous capturing subpattern matched or not. The

1860

2084

   two possible forms of conditional subpattern are

1861

2085

  </para>

1862

2086

...

@@ -1870,34 +2094,39 @@

1870

2094

  </informalexample>

1871

2095

  <para>

1872

2096

   If the condition is satisfied, the yes-pattern is used; otherwise

1873

   the  no-pattern  (if  present) is used. If there are

2097

   the no-pattern (if present) is used. If there are

1874

2098

   more than two alternatives in the subpattern, a compile-time

1875

2099

   error occurs.

1876

2100

  </para>

1877

2101

  <para>

1878

   There are two kinds of condition. If the  text  between  the

1879

   parentheses  consists  of  a  sequence  of  digits, then the

1880

   condition is satisfied if the capturing subpattern  of  that

1881

   number  has  previously matched. Consider the following pattern,

1882

   which contains non-significant white space to make  it

1883

   more  readable  (assume  the  <link 

2102

   There are two kinds of condition. If the text between the

2103

   parentheses consists of a sequence of digits, then the

2104

   condition is satisfied if the capturing subpattern of that

2105

   number has previously matched. Consider the following pattern,

2106

   which contains non-significant white space to make it

2107

   more readable (assume the <link

1884

2108

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1885

   option)  and to divide it into three parts for ease of discussion:

1886

1887

   <literal>( \( )?    [^()]+    (?(1) \) )</literal>

1888

  </para>

1889

  <para>

1890

   The first part matches an optional opening parenthesis,  and

1891

   if  that character is present, sets it as the first captured

1892

   substring. The second part matches one  or  more  characters

1893

   that  are  not  parentheses. The third part is a conditional

1894

   subpattern that tests whether the first set  of  parentheses

1895

   matched  or  not.  If  they did, that is, if subject started

1896

   with an opening parenthesis, the condition is &true;,  and  so

1897

   the  yes-pattern  is  executed  and a closing parenthesis is

1898

   required. Otherwise, since no-pattern is  not  present,  the

1899

   subpattern  matches  nothing.  In  other words, this pattern

1900

   matches a sequence of non-parentheses,  optionally  enclosed

2109

   option) and to divide it into three parts for ease of discussion:

2110

  </para>

2111

  <informalexample>

2112

   <programlisting>

2113

<![CDATA[

2114

( \( )? [^()]+ (?(1) \) )

2115

]]>

2116

   </programlisting>

2117

  </informalexample>

2118

  <para>

2119

   The first part matches an optional opening parenthesis, and

2120

   if that character is present, sets it as the first captured

2121

   substring. The second part matches one or more characters

2122

   that are not parentheses. The third part is a conditional

2123

   subpattern that tests whether the first set of parentheses

2124

   matched or not. If they did, that is, if the subject started

2125

   with an opening parenthesis, the condition is &true;, and so

2126

   the yes-pattern is executed and a closing parenthesis is

2127

   required. Otherwise, since no-pattern is not present, the

2128

   subpattern matches nothing. In other words, this pattern

2129

   matches a sequence of non-parentheses, optionally enclosed

1901

2130

   in parentheses.

1902

2131

  </para>
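  <para>
   A short PHP sketch of this conditional follows. The helper name and the
   ^ and $ anchors are assumptions added for the example, so that the whole
   subject has to match rather than just a portion of it.
  </para>

```php
<?php

// Hypothetical helper: true when the subject is a run of
// non-parentheses, optionally wrapped in one pair of parentheses.
// The x modifier makes the white space in the pattern insignificant.
function is_optionally_parenthesized($subject)
{
    return preg_match('/^ ( \( )? [^()]+ (?(1) \) ) $/x', $subject) === 1;
}

var_dump(is_optionally_parenthesized('abc'));   // bool(true)
var_dump(is_optionally_parenthesized('(abc)')); // bool(true)
var_dump(is_optionally_parenthesized('(abc'));  // bool(false)
var_dump(is_optionally_parenthesized('abc)'));  // bool(false)
```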

1903

2132

  <para>

...

@@ -1906,10 +2135,10 @@

1906

2135

   level", the condition is false.

1907

2136

  </para>

1908

2137

  <para>

1909

   If the condition is not a sequence of digits or (R), it must be  an

1910

   assertion.  This  may be a positive or negative lookahead or

1911

   lookbehind assertion. Consider this pattern, again  containing

1912

   non-significant  white space, and with the two alternatives on

2138

   If the condition is not a sequence of digits or (R), it must be an

2139

   assertion. This may be a positive or negative lookahead or

2140

   lookbehind assertion. Consider this pattern, again containing

2141

   non-significant white space, and with the two alternatives on

1913

2142

   the second line:

1914

2143

  </para>

1915

2144

...

@@ -1917,18 +2146,18 @@

1917

2146

   <programlisting>

1918

2147

<![CDATA[

1919

2148

(?(?=[^a-z]*[a-z])

1920

\d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

2149

\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

1921

2150

]]>

1922

2151

   </programlisting>

1923

2152

  </informalexample>

1924

2153

  <para>

1925

2154

   The condition is a positive lookahead assertion that matches

1926

2155

   an optional sequence of non-letters followed by a letter. In

1927

   other words, it tests for  the  presence  of  at  least  one

1928

   letter  in the subject. If a letter is found, the subject is

1929

   matched against  the  first  alternative;  otherwise  it  is

1930

   matched  against the second. This pattern matches strings in

1931

   one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are

2156

   other words, it tests for the presence of at least one

2157

   letter in the subject. If a letter is found, the subject is

2158

   matched against the first alternative; otherwise it is

2159

   matched against the second. This pattern matches strings in

2160

   one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are

1932

2161

   letters and dd are digits.

1933

2162

  </para>
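  <para>
   The assertion-based condition can be tried out with a sketch like the
   one below. The helper name, the anchors and the sample subjects are
   assumptions made for the example.
  </para>

```php
<?php

// Hypothetical helper: the lookahead decides which alternative is tried.
// If the subject contains a lowercase letter, only dd-aaa-dd is
// attempted; otherwise only dd-dd-dd is attempted.
function looks_like_date($subject)
{
    $pattern = '/^(?(?=[^a-z]*[a-z])
        \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

    return preg_match($pattern, $subject) === 1;
}

var_dump(looks_like_date('12-abc-34')); // bool(true)
var_dump(looks_like_date('12-34-56'));  // bool(true)
var_dump(looks_like_date('12-ab-34'));  // bool(false)
```

  <para>
   In the last call the lookahead finds a letter, so the second
   alternative is never tried even though the subject contains digits
   and hyphens in the right places.
  </para>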

1934

2163

 </section>

...

@@ -1936,31 +2165,66 @@

1936

2165

 <section xml:id="regexp.reference.comments">

1937

2166

  <title>Comments</title>

1938

2167

  <para>

1939

   The  sequence  (?#  marks  the  start  of  a  comment  which

1940

   continues   up  to  the  next  closing  parenthesis.  Nested

2168

   The sequence (?# marks the start of a comment which

2169

   continues up to the next closing parenthesis. Nested

1941

2170

   parentheses are not permitted. The characters that make up a

1942

2171

   comment play no part in the pattern matching at all.

1943

2172

  </para>

1944

2173

  <para>

1945

2174

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1946

   option is set, an unescaped # character outside  a character class 

2175

   option is set, an unescaped # character outside a character class

1947

2176

   introduces a comment that continues up to the next newline character

1948

2177

   in the pattern.

1949

2178

  </para>

2179

  <para>

2180

   <example>

2181

    <title>Usage of comments in PCRE pattern</title>

2182

    <programlisting role="php">

2183

<![CDATA[

2184

<?php

2185

2186

$subject = 'test';

2187

2188

/* (?# can be used to add comments without enabling PCRE_EXTENDED */

2189

$match = preg_match('/te(?# this is a comment)st/', $subject);

2190

var_dump($match);

2191

2192

/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */

2193

$match = preg_match('/te   #~~~~

2194

st/', $subject);

2195

var_dump($match);

2196

2197

/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything

2198

   that follows an unescaped # on the same line is ignored */

2199

$match = preg_match('/te    #~~~~

2200

st/x', $subject);

2201

var_dump($match);

2202

]]>

2203

    </programlisting>

2204

    &example.outputs;

2205

    <screen>

2206

<![CDATA[

2207

int(1)

2208

int(0)

2209

int(1)

2210

]]>

2211

    </screen>

2212

   </example>

2213

  </para>

1950

2214

 </section>

1951

2215

1952

2216

 <section xml:id="regexp.reference.recursive">

1953

2217

  <title>Recursive patterns</title>

1954

2218

  <para>

1955

   Consider the problem of matching a  string  in  parentheses,

1956

   allowing  for  unlimited nested parentheses. Without the use

1957

   of recursion, the best that can be done is to use a  pattern

1958

   that  matches  up  to some fixed depth of nesting. It is not

1959

   possible to handle an arbitrary nesting depth. Perl 5.6  has

1960

   provided   an  experimental  facility  that  allows  regular

1961

   expressions to recurse (among other things).  The  special 

1962

   item (?R) is  provided for  the specific  case of recursion. 

1963

   This PCRE  pattern  solves the  parentheses  problem (assume 

2219

   Consider the problem of matching a string in parentheses,

2220

   allowing for unlimited nested parentheses. Without the use

2221

   of recursion, the best that can be done is to use a pattern

2222

   that matches up to some fixed depth of nesting. It is not

2223

   possible to handle an arbitrary nesting depth. Perl 5.6 has

2224

   provided an experimental facility that allows regular

2225

   expressions to recurse (among other things). The special

2226

   item (?R) is provided for the specific case of recursion.

2227

   This PCRE pattern solves the parentheses problem (assume

1964

2228

   the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1965

2229

   option is set so that white space is

1966

2230

   ignored):

...

@@ -1969,45 +2233,45 @@

1969

2233

  </para>

1970

2234

  <para>

1971

2235

   First it matches an opening parenthesis. Then it matches any

1972

   number  of substrings which can either be a sequence of

1973

   non-parentheses, or a recursive  match  of  the  pattern  itself

2236

   number of substrings which can either be a sequence of

2237

   non-parentheses, or a recursive match of the pattern itself

1974

2238

   (i.e. a correctly parenthesized substring). Finally there is

1975

2239

   a closing parenthesis.

1976

2240

  </para>

1977

2241

  <para>

1978

   This particular example pattern  contains  nested  unlimited

2242

   This particular example pattern contains nested unlimited

1979

2243

   repeats, and so the use of a once-only subpattern for matching

1980

   strings of non-parentheses is  important  when  applying

1981

   the  pattern to strings that do not match. For example, when

2244

   strings of non-parentheses is important when applying

2245

   the pattern to strings that do not match. For example, when

1982

2246

   it is applied to

1983

2247

1984

2248

   <literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>

1985

2249

1986

   it yields "no match" quickly. However, if a  once-only  subpattern

1987

   is  not  used,  the match runs for a very long time

1988

   indeed because there are so many different ways the + and  *

1989

   repeats  can carve up the subject, and all have to be tested

2250

   it yields "no match" quickly. However, if a once-only subpattern

2251

   is not used, the match runs for a very long time

2252

   indeed because there are so many different ways the + and *

2253

   repeats can carve up the subject, and all have to be tested

1990

2254

   before failure can be reported.

1991

2255

  </para>

1992

2256

  <para>

1993

   The values set for any capturing subpatterns are those  from

2257

   The values set for any capturing subpatterns are those from

1994

2258

   the outermost level of the recursion at which the subpattern

1995

2259

   value is set. If the pattern above is matched against

1996

2260

1997

2261

   <literal>(ab(cd)ef)</literal>

1998

2262

1999

   the value for the capturing parentheses is  "ef",  which  is

2000

   the  last  value  taken  on  at the top level. If additional

2263

   the value for the capturing parentheses is "ef", which is

2264

   the last value taken on at the top level. If additional

2001

2265

   parentheses are added, giving

2002

2266

2003

2267

   <literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>

2004

2268

   then the string they capture

2005

2269

   is "ab(cd)ef", the contents of the top level parentheses. If

2006

   there are more than 15 capturing parentheses in  a  pattern,

2007

   PCRE  has  to  obtain  extra  memory  to store data during a

2008

   recursion, which it does by using  pcre_malloc,  freeing  it

2009

   via  pcre_free  afterwards. If no memory can be obtained, it

2010

   saves data for the first 15 capturing parentheses  only,  as

2270

   there are more than 15 capturing parentheses in a pattern,

2271

   PCRE has to obtain extra memory to store data during a

2272

   recursion, which it does by using pcre_malloc, freeing it

2273

   via pcre_free afterwards. If no memory can be obtained, it

2274

   saves data for the first 15 capturing parentheses only, as

2011

2275

   there is no way to give an out-of-memory error from within a

2012

2276

   recursion.

2013

2277

  </para>
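  <para>
   The captured values described above can be inspected with a sketch such
   as the following. The helper name is an assumption made for the example;
   the pattern is the one with the additional parentheses.
  </para>

```php
<?php

// Hypothetical helper: returns the match array for the recursive
// pattern, or null when the subject does not match.
function capture_outer($subject)
{
    $pattern = '/\( ( ( (?>[^()]+) | (?R) )* ) \)/x';

    return preg_match($pattern, $subject, $m) === 1 ? $m : null;
}

$m = capture_outer('(ab(cd)ef)');

var_dump($m[0]); // string(10) "(ab(cd)ef)"
var_dump($m[1]); // string(8) "ab(cd)ef"
var_dump($m[2]); // string(2) "ef"
```

  <para>
   Values captured inside the recursive call are not visible afterwards,
   which is why the innermost group reports "ef" rather than "cd".
  </para>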

...

@@ -2016,7 +2280,7 @@

2016

2280

   <literal>(?1)</literal>, <literal>(?2)</literal> and so on

2017

2281

   can be used for recursive subpatterns too. It is also possible to use named

2018

2282

   subpatterns: <literal>(?P&gt;name)</literal> or

2019

   <literal>(?P&amp;name)</literal>.

2283

   <literal>(?&amp;name)</literal>.

2020

2284

  </para>
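  <para>
   As an illustration, a named subpattern can be called recursively as in
   the sketch below. The helper name, the subpattern name "group" and the
   anchors are assumptions made for the example.
  </para>

```php
<?php

// Hypothetical helper: checks that the whole subject is one correctly
// parenthesized group. Only the named subpattern is recursed into via
// (?&group), so the ^ and $ anchors are not repeated at each level.
function is_balanced($subject)
{
    $pattern = '/^ (?<group> \( (?: (?>[^()]+) | (?&group) )* \) ) $/x';

    return preg_match($pattern, $subject) === 1;
}

var_dump(is_balanced('(a(b)c)')); // bool(true)
var_dump(is_balanced('()'));      // bool(true)
var_dump(is_balanced('(a(b)c'));  // bool(false)
```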

2021

2285

  <para>

2022

2286

   If the syntax for a recursive subpattern reference (either by number or

...

@@ -2046,75 +2310,75 @@

2046

2310

  <title>Performance</title>

2047

2311

  <para>

2048

2312

   Certain items that may appear in patterns are more efficient

2049

   than  others.  It is more efficient to use a character class

2313

   than others. It is more efficient to use a character class

2050

2314

   like [aeiou] than a set of alternatives such as (a|e|i|o|u).

2051

   In  general,  the  simplest  construction  that provides the

2052

   required behaviour is usually the  most  efficient.  Jeffrey

2053

   Friedl's  book contains a lot of discussion about optimizing

2315

   In general, the simplest construction that provides the

2316

   required behaviour is usually the most efficient. Jeffrey

2317

   Friedl's book contains a lot of discussion about optimizing

2054

2318

   regular expressions for efficient performance.

2055

2319

  </para>

2056

2320

  <para>

2057

2321

   When a pattern begins with .* and the <link

2058

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  option  is

2059

   set,  the  pattern  is implicitly anchored by PCRE, since it

2322

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is

2323

   set, the pattern is implicitly anchored by PCRE, since it

2060

2324

   can match only at the start of a subject string. However, if

2061

2325

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

2062

2326

   is not set, PCRE cannot make this optimization,

2063

   because the . metacharacter does not then match  a  newline,

2327

   because the . metacharacter does not then match a newline,

2064

2328

   and if the subject string contains newlines, the pattern may

2065

   match from the character immediately following one  of  them

2329

   match from the character immediately following one of them

2066

2330

   instead of from the very start. For example, the pattern

2067

2331

2068

2332

   <literal>(.*) second</literal>

2069

2333

2070

2334

   matches the subject "first\nand second" (where \n stands for

2071

2335

   a newline character) with the first captured substring being

2072

   "and". In order to do this, PCRE  has  to  retry  the  match

2336

   "and". In order to do this, PCRE has to retry the match

2073

2337

   starting after every newline in the subject.

2074

2338

  </para>
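  <para>
   The difference the option makes to the result can be seen in a short
   sketch. The helper name is an assumption made for the example; the s
   modifier is the PHP spelling of PCRE_DOTALL.
  </para>

```php
<?php

// Hypothetical helper: returns the first captured substring,
// or null when the pattern does not match.
function first_capture($pattern, $subject)
{
    return preg_match($pattern, $subject, $m) === 1 ? $m[1] : null;
}

$subject = "first\nand second";

// Without the s modifier the dot stops at the newline, so the match
// starts after it.
var_dump(first_capture('/(.*) second/', $subject) === 'and');         // bool(true)

// With the s modifier the dot also matches the newline, so the match
// starts at the very beginning of the subject.
var_dump(first_capture('/(.*) second/s', $subject) === "first\nand"); // bool(true)
```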

2075

2339

  <para>

2076

2340

   If you are using such a pattern with subject strings that do

2077

   not  contain  newlines,  the best performance is obtained by

2341

   not contain newlines, the best performance is obtained by

2078

2342

   setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,

2079

   or starting the  pattern  with  ^.*  to

2080

   indicate  explicit anchoring. That saves PCRE from having to

2343

   or starting the pattern with ^.* to

2344

   indicate explicit anchoring. That saves PCRE from having to

2081

2345

   scan along the subject looking for a newline to restart at.

2082

2346

  </para>

2083

2347

  <para>

2084

   Beware of patterns that contain nested  indefinite  repeats.

2085

   These  can  take a long time to run when applied to a string

2348

   Beware of patterns that contain nested indefinite repeats.

2349

   These can take a long time to run when applied to a string

2086

2350

   that does not match. Consider the pattern fragment

2087

2351

2088

2352

   <literal>(a+)*</literal>

2089

2353

  </para>

2090

2354

  <para>

2091

   This can match "aaaa" in 33 different ways, and this  number

2092

   increases  very  rapidly  as  the string gets longer. (The *

2093

   repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of

2094

   those  cases other than 0, the + repeats can match different

2355

   This can match "aaaa" in 16 different ways, and this number

2356

   increases very rapidly as the string gets longer. (The *

2357

   repeat can match 0, 1, 2, 3, or 4 times, and for each of

2358

   those cases other than 0, the + repeats can match different

2095

2359

   numbers of times.) When the remainder of the pattern is such

2096

   that  the entire match is going to fail, PCRE has in principle

2097

   to try every possible variation, and this  can  take  an

2360

   that the entire match is going to fail, PCRE has in principle

2361

   to try every possible variation, and this can take an

2098

2362

   extremely long time.

2099

2363

  </para>

2100

2364

  <para>

2101

   An optimization catches some of the more simple  cases  such

2365

   An optimization catches some of the simpler cases such

2102

2366

as

2103

2367

2104

2368

   <literal>(a+)*b</literal>

2105

2369

2106

   where a literal character follows. Before embarking  on  the

2370

   where a literal character follows. Before embarking on the

2107

2371

   standard matching procedure, PCRE checks that there is a "b"

2108

   later in the subject string, and if there is not,  it  fails

2109

   the  match  immediately. However, when there is no following

2110

   literal this optimization cannot be used. You  can  see  the

2372

   later in the subject string, and if there is not, it fails

2373

   the match immediately. However, when there is no following

2374

   literal this optimization cannot be used. You can see the

2111

2375

   difference by comparing the behaviour of

2112

2376

2113

2377

   <literal>(a+)*\d</literal>

2114

2378

2115

   with the pattern above. The former gives  a  failure  almost

2116

   instantly  when  applied  to a whole line of "a" characters,

2117

   whereas the latter takes an appreciable  time  with  strings

2379

   with the pattern above. The earlier pattern, ending in a literal "b", fails almost

2380

   instantly when applied to a whole line of "a" characters,

2381

   whereas the \d variant takes an appreciable time with strings

2118

2382

   longer than about 20 characters.

2119

2383

  </para>
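  <para>
   One way to avoid the slow failure, sketched below, is to make the inner
   repeat a once-only subpattern. This rewrite is a suggestion made for
   the example rather than something prescribed above, and the helper name
   is hypothetical.
  </para>

```php
<?php

// Hypothetical helper: the atomic group (?>a+) keeps everything it has
// matched, so a subject without a digit is rejected without trying
// every way of dividing up the run of "a" characters.
function match_without_nested_backtracking($subject)
{
    return preg_match('/(?>a+)*\d/', $subject);
}

var_dump(match_without_nested_backtracking(str_repeat('a', 1000))); // int(0)
var_dump(match_without_nested_backtracking('aaa1'));                // int(1)
```

  <para>
   With the unprotected pattern, preg_match may instead return &false; on
   long subjects; preg_last_error can then be used to check whether a
   matching limit was reached.
  </para>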

2120

2384

 </section>

2121

2385

Generated: 14 Jul 2025 21:02:32
