PHP: Documentation Tools

reference/pcre/pattern.syntax.xml
77fe733a1ba9c961424adcb7c9af00c1f5443a77

...

@@ -1,28 +1,28 @@

<?xml version="1.0" encoding="utf-8"?>

<!-- $Revision$ -->

<!-- split from ./en/functions/pcre.xml, last change in rev 1.2 -->

<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook">

<chapter xml:id="reference.pcre.pattern.syntax" xmlns="http://docbook.org/ns/docbook" xmlns:xlink="http://www.w3.org/1999/xlink">

 <title>Pattern Syntax</title>

 <titleabbrev>PCRE regex syntax</titleabbrev>

 <section xml:id="regexp.introduction">

  <title>Introduction</title>

  <para>

   The syntax and semantics of  the  regular  expressions

   supported  by PCRE are described below. Regular expressions are

   also described in the Perl documentation and in a number  of

   other  books,  some  of which have copious examples. Jeffrey

   Friedl's  "Mastering  Regular  Expressions",  published   by

   O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.

   The syntax and semantics of the regular expressions

   supported by PCRE are described below. Regular expressions are

   also described in the Perl documentation and in a number of

   other books, some of which have copious examples. Jeffrey

   Friedl's "Mastering Regular Expressions", published by

   O'Reilly (ISBN 1-56592-257-3), covers them in great detail.

   The description here is intended as reference documentation.

  </para>

  <para>

   A regular expression is a pattern that is matched against  a

   A regular expression is a pattern that is matched against a

   subject string from left to right. Most characters stand for

   themselves in a pattern, and match the corresponding

   characters in the subject. As a trivial example, the pattern

   <literal>The quick brown fox</literal>

   matches a portion of a subject string that is  identical  to

   matches a portion of a subject string that is identical to

   itself.

  </para>

 </section>

...

@@ -32,6 +32,7 @@

   When using the PCRE functions, the pattern must be enclosed

   by <emphasis>delimiters</emphasis>. A delimiter can be any non-alphanumeric,

   non-backslash, non-whitespace character.

   Leading whitespace before a valid delimiter is silently ignored.

  </para>

  <para>

   Commonly used delimiters are forward slashes (<literal>/</literal>), hash

...

@@ -49,6 +50,26 @@

   </informalexample>

  </para>

  <para>

   It is also possible to use

   bracket style delimiters where the opening and closing brackets are the

   starting and ending delimiter, respectively. <literal>()</literal>,

   <literal>{}</literal>, <literal>[]</literal> and <literal>&lt;&gt;</literal>

   are all valid bracket style delimiter pairs.

   <informalexample>

    <programlisting>

<![CDATA[

(this [is] a (pattern))

{this [is] a (pattern)}

[this [is] a (pattern)]

<this [is] a (pattern)>

]]>

    </programlisting>

   </informalexample>

   Bracket style delimiters do not need to be escaped when they are used as

   meta-characters within the pattern, but as with other delimiters they must be

   escaped when they are used as literal characters.

  </para>
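  <para>
   The delimiter rules above can be sketched as a small scanner. This is an
   illustration written in Python, not PHP's actual implementation: the
   function name <literal>split_delimited</literal> is hypothetical, and
   counting nested brackets to find the matching closing bracket is an
   assumption made for this sketch.
  </para>

```python
# Sketch only: mirrors the delimiter rules described in the text.
BRACKET_PAIRS = {"(": ")", "{": "}", "[": "]", "<": ">"}


def split_delimited(regex):
    """Split a delimited regex into (pattern, modifiers)."""
    regex = regex.lstrip()  # leading whitespace before the delimiter is ignored
    if not regex:
        raise ValueError("empty regular expression")
    start = regex[0]
    if start.isalnum() or start == "\\":
        raise ValueError("delimiter must not be alphanumeric or a backslash")
    end = BRACKET_PAIRS.get(start, start)  # same character unless bracket style

    depth = 0  # nesting level, only changes for bracket style delimiters
    i = 1
    while i < len(regex):
        ch = regex[i]
        if ch == "\\":
            i += 2  # an escaped character never ends the pattern
            continue
        if ch == end and depth == 0:
            return regex[1:i], regex[i + 1:]
        if start != end:
            if ch == start:
                depth += 1
            elif ch == end:
                depth -= 1
        i += 1
    raise ValueError("no ending delimiter found")


for regex in ("(this [is] a (pattern))", "{this [is] a (pattern)}", "/foo bar/i"):
    print(split_delimited(regex))
```

  <para>
   Everything after the ending delimiter is returned as the second element,
   which is where pattern modifiers would appear.
  </para>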

  <para>

   If the delimiter needs to be matched inside the pattern it must be

   escaped using a backslash. If the delimiter appears often inside the

   pattern, it is a good idea to choose another delimiter in order to increase

...

@@ -66,18 +87,6 @@

   to specify the delimiter to be escaped.

  </para>

  <para>

   In addition to the aforementioned delimiters, it is also possible to use

   bracket style delimiters where the opening and closing brackets are the

   starting and ending delimiter, respectively.

   <informalexample>

    <programlisting>

<![CDATA[

{this is a pattern}

]]>

    </programlisting>

   </informalexample>

  </para>

  <para>

   You may add <link linkend="reference.pcre.pattern.modifiers">pattern

   modifiers</link> after the ending delimiter. The following is an example

   of case-insensitive matching:

...

@@ -93,103 +102,100 @@

102

 <section xml:id="regexp.reference.meta">

103

  <title>Meta-characters</title>

104

  <para>

   The  power  of  regular  expressions comes from the

105

   The power of regular expressions comes from the

106

   ability to include alternatives and repetitions in the

   pattern.  These  are encoded in the pattern by the use of 

   <emphasis>meta-characters</emphasis>, which do not stand for  themselves  but  instead

107

   pattern. These are encoded in the pattern by the use of

108

   <emphasis>meta-characters</emphasis>, which do not stand for themselves but instead

100

109

   are interpreted in some special way.

101

110

  </para>

102

111

  <para>

103

   There are two different sets of meta-characters: those  that

104

   are  recognized anywhere in the pattern except within square

112

   There are two different sets of meta-characters: those that

113

   are recognized anywhere in the pattern except within square

105

114

   brackets, and those that are recognized in square brackets.

106

115

   Outside square brackets, the meta-characters are as follows:

107

   <variablelist>

108

    <varlistentry>

109

     <term><emphasis>\</emphasis></term>

110

     <listitem><simpara>general escape character with several uses</simpara></listitem>

111

    </varlistentry>

112

    <varlistentry>

113

     <term><emphasis>^</emphasis></term>

114

     <listitem><simpara>assert start of subject (or line, in multiline mode)</simpara></listitem>

115

    </varlistentry>

116

    <varlistentry>

117

     <term><emphasis>$</emphasis></term>

118

     <listitem><simpara>assert end of subject (or line, in multiline mode)</simpara></listitem>

119

    </varlistentry>

120

    <varlistentry>

121

     <term><emphasis>.</emphasis></term>

122

     <listitem><simpara>match any character except newline (by default)</simpara></listitem>

123

    </varlistentry>

124

    <varlistentry>

125

     <term><emphasis>[</emphasis></term>

126

     <listitem><simpara>start character class definition</simpara></listitem>

127

    </varlistentry>

128

    <varlistentry>

129

     <term><emphasis>]</emphasis></term>

130

     <listitem><simpara>end character class definition</simpara></listitem>

131

    </varlistentry>

132

    <varlistentry>

133

     <term><emphasis>|</emphasis></term>

134

     <listitem><simpara>start of alternative branch</simpara></listitem>

135

    </varlistentry>

136

    <varlistentry>

137

     <term><emphasis>(</emphasis></term>

138

     <listitem><simpara>start subpattern</simpara></listitem>

139

    </varlistentry>

140

    <varlistentry>

141

     <term><emphasis>)</emphasis></term>

142

     <listitem><simpara>end subpattern</simpara></listitem>

143

    </varlistentry>

144

    <varlistentry>

145

     <term><emphasis>?</emphasis></term>

146

     <listitem>

147

      <simpara>

148

       extends the meaning of (, also 0 or 1 quantifier, also makes greedy

149

       quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)

150

      </simpara>

151

     </listitem>

152

    </varlistentry>

153

    <varlistentry>

154

     <term><emphasis>*</emphasis></term>

155

     <listitem><simpara>0 or more quantifier</simpara></listitem>

156

    </varlistentry>

157

    <varlistentry>

158

     <term><emphasis>+</emphasis></term>

159

     <listitem><simpara>1 or more quantifier</simpara></listitem>

160

    </varlistentry>

161

    <varlistentry>

162

     <term><emphasis>{</emphasis></term>

163

     <listitem><simpara>start min/max quantifier</simpara></listitem>

164

    </varlistentry>

165

    <varlistentry>

166

     <term><emphasis>}</emphasis></term>

167

     <listitem><simpara>end min/max quantifier</simpara></listitem>

168

    </varlistentry>

169

   </variablelist>

116

117

   <table>

118

     <title>Meta-characters outside square brackets</title>

119

    <tgroup cols="2">

120

     <thead>

121

      <row>

122

       <entry>Meta-character</entry><entry>Description</entry>

123

      </row>

124

     </thead>

125

     <tbody>

126

      <row>

127

       <entry>\</entry><entry>general escape character with several uses</entry>

128

      </row>

129

      <row>

130

       <entry>^</entry><entry>assert start of subject (or line, in multiline mode)</entry>

131

      </row>

132

      <row>

133

       <entry>$</entry><entry>assert end of subject or before a terminating newline (or

134

        end of line, in multiline mode)</entry>

135

      </row>

136

      <row>

137

       <entry>.</entry><entry>match any character except newline (by default)</entry>

138

      </row>

139

      <row>

140

       <entry>[</entry><entry>start character class definition</entry>

141

      </row>

142

      <row>

143

       <entry>]</entry><entry>end character class definition</entry>

144

      </row>

145

      <row>

146

       <entry>|</entry><entry>start of alternative branch</entry>

147

      </row>

148

      <row>

149

       <entry>(</entry><entry>start subpattern</entry>

150

      </row>

151

      <row>

152

       <entry>)</entry><entry>end subpattern</entry>

153

      </row>

154

      <row>

155

       <entry>?</entry><entry>extends the meaning of (, also 0 or 1 quantifier, also makes greedy

156

        quantifiers lazy (see <link linkend="regexp.reference.repetition">repetition</link>)</entry>

157

      </row>

158

      <row>

159

       <entry>*</entry><entry>0 or more quantifier</entry>

160

      </row>

161

      <row>

162

       <entry>+</entry><entry>1 or more quantifier</entry>

163

      </row>

164

      <row>

165

       <entry>{</entry><entry>start min/max quantifier</entry>

166

      </row>

167

      <row>

168

       <entry>}</entry><entry>end min/max quantifier</entry>

169

      </row>

170

     </tbody>

171

    </tgroup>

172

   </table>

170

173

171

174

   Part of a pattern that is in square brackets is called a

172

   "character class". In a character class the only

175

   <link linkend="regexp.reference.character-classes">character class</link>. In a character class the only

173

176

   meta-characters are:

174

177

175

   <variablelist>

176

    <varlistentry>

177

     <term><emphasis>\</emphasis></term>

178

     <listitem><simpara>general escape character</simpara></listitem>

179

    </varlistentry>

180

    <varlistentry>

181

     <term><emphasis>^</emphasis></term>

182

     <listitem><simpara>negate the class, but only if the first character</simpara></listitem>

183

    </varlistentry>

184

    <varlistentry>

185

     <term><emphasis>-</emphasis></term>

186

     <listitem><simpara>indicates character range</simpara></listitem>

187

    </varlistentry>

188

    <varlistentry>

189

     <term><emphasis>]</emphasis></term>

190

     <listitem><simpara>terminates the character class</simpara></listitem>

191

    </varlistentry>

192

   </variablelist>

178

   <table>

179

     <title>Meta-characters inside square brackets (<emphasis>character classes</emphasis>)</title>

180

    <tgroup cols="2">

181

     <thead>

182

      <row>

183

       <entry>Meta-character</entry><entry>Description</entry>

184

      </row>

185

     </thead>

186

     <tbody>

187

      <row>

188

       <entry>\</entry><entry>general escape character</entry>

189

      </row>

190

      <row>

191

       <entry>^</entry><entry>negate the class, but only if the first character</entry>

192

      </row>

193

      <row>

194

       <entry>-</entry><entry>indicates character range</entry>

195

      </row>

196

     </tbody>

197

    </tgroup>

198

   </table>

193

199

194

200

   The following sections describe the use of each of the

195

201

   meta-characters.

...

@@ -199,9 +205,9 @@

199

205

 <section xml:id="regexp.reference.escape">

200

206

  <title>Escape sequences</title>

201

207

  <para>

202

   The backslash character has several uses. Firstly, if it  is

208

   The backslash character has several uses. Firstly, if it is

203

209

   followed by a non-alphanumeric character, it takes away any

204

   special  meaning that character may have. This use of

210

   special meaning that character may have. This use of

205

211

   backslash as an escape character applies both inside and

206

212

   outside character classes.

207

213

  </para>

...

@@ -210,7 +216,7 @@

210

216

   "\*" in the pattern. This applies whether or not the

211

217

   following character would otherwise be interpreted as a

212

218

   meta-character, so it is always safe to precede a non-alphanumeric

213

   with "\" to specify that it stands for itself.  In

219

   with "\" to specify that it stands for itself. In

214

220

   particular, if you want to match a backslash, you write "\\".

215

221

  </para>

216

222

  <note>

...

@@ -232,10 +238,10 @@

232

238

  <para>

233

239

   A second use of backslash provides a way of encoding

234

240

   non-printing characters in patterns in a visible manner. There

235

   is no restriction on the appearance of non-printing  characters,

241

   is no restriction on the appearance of non-printing characters,

236

242

   apart from the binary zero that terminates a pattern,

237

243

   but when a pattern is being prepared by text editing, it is

238

   usually  easier to use one of the following escape sequences

244

   usually easier to use one of the following escape sequences

239

245

   than the binary character it represents:

240

246

  </para>

241

247

  <para>

...

@@ -297,6 +303,12 @@

297

303

     </listitem>

298

304

    </varlistentry>

299

305

    <varlistentry>

306

     <term><emphasis>\R</emphasis></term>

307

     <listitem>

308

      <simpara>line break: matches \n, \r and \r\n</simpara>

309

     </listitem>

310

    </varlistentry>

311

    <varlistentry>

300

312

     <term><emphasis>\t</emphasis></term>

301

313

     <listitem>

302

314

      <simpara>tab (hex 09)</simpara>

...

@@ -320,9 +332,9 @@

320

332

  </para>

321

333

  <para>

322

334

   The precise effect of "<literal>\cx</literal>" is as follows:

323

   if "<literal>x</literal>" is a lower case  letter, it is converted

335

   if "<literal>x</literal>" is a lower case letter, it is converted

324

336

   to upper case. Then bit 6 of the character (hex 40) is inverted.

325

   Thus "<literal>\cz</literal>" becomes  hex 1A, but

337

   Thus "<literal>\cz</literal>" becomes hex 1A, but

326

338

   "<literal>\c{</literal>" becomes hex 3B, while "<literal>\c;</literal>"

327

339

   becomes hex 7B.

328

340

  </para>
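  <para>
   The rule for "<literal>\cx</literal>" is plain bit arithmetic, so it can
   be checked with a few lines of Python. The helper name
   <literal>control_char</literal> is hypothetical and the sketch only
   covers single ASCII characters.
  </para>

```python
def control_char(x):
    """Return the character that a backslash-c escape followed by x stands for."""
    code = ord(x.upper())    # a lower case letter is converted to upper case
    return chr(code ^ 0x40)  # then bit 6 (hex 40) is inverted


for x in ("z", "{", ";"):
    print(repr(x), "->", hex(ord(control_char(x))))
```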

...

@@ -338,7 +350,7 @@

338

350

  </para>

339

351

  <para>

340

352

   After "<literal>\0</literal>" up to two further octal digits are read.

341

   In  both cases,  if  there are fewer than two digits, just those that

353

   In both cases, if there are fewer than two digits, just those that

342

354

   are present are used. Thus the sequence "<literal>\0\x\07</literal>"

343

355

   specifies two binary zeros followed by a BEL character. Make sure you

344

356

   supply two digits after the initial zero if the character

...

@@ -347,20 +359,20 @@

347

359

  <para>

348

360

   The handling of a backslash followed by a digit other than 0

349

361

   is complicated. Outside a character class, PCRE reads it

350

   and any following digits as a decimal number. If the  number

351

   is  less  than  10, or if there have been at least that many

352

   previous capturing left parentheses in the  expression,  the

353

   entire  sequence is taken as a <emphasis>back reference</emphasis>. A description

354

   of how this works is given later, following  the  discussion

362

   and any following digits as a decimal number. If the number

363

   is less than 10, or if there have been at least that many

364

   previous capturing left parentheses in the expression, the

365

   entire sequence is taken as a <emphasis>back reference</emphasis>. A description

366

   of how this works is given later, following the discussion

355

367

   of parenthesized subpatterns.

356

368

  </para>

357

369

  <para>

358

   Inside a character  class,  or  if  the  decimal  number  is

370

   Inside a character class, or if the decimal number is

359

371

   greater than 9 and there have not been that many capturing

360

372

   subpatterns, PCRE re-reads up to three octal digits following

361

373

   the backslash, and generates a single byte from the

362

374

   least significant 8 bits of the value. Any subsequent digits

363

   stand for themselves.  For example:

375

   stand for themselves. For example:

364

376

  </para>

365

377

  <para>

366

378

   <variablelist>

...

@@ -428,7 +440,7 @@

428

440

   digits are ever read.

429

441

  </para>

430

442

  <para>

431

   All the sequences that define a single byte value can  be

443

   All the sequences that define a single byte value can be

432

444

   used both inside and outside character classes. In addition,

433

445

   inside a character class, the sequence "<literal>\b</literal>"

434

446

   is interpreted as the backspace character (hex 08). Outside a character

...

@@ -450,11 +462,11 @@

450

462

    </varlistentry>

451

463

    <varlistentry>

452

464

     <term><emphasis>\h</emphasis></term>

453

     <listitem><simpara>any horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>

465

     <listitem><simpara>any horizontal whitespace character</simpara></listitem>

454

466

    </varlistentry>

455

467

    <varlistentry>

456

468

     <term><emphasis>\H</emphasis></term>

457

     <listitem><simpara>any character that is not a horizontal whitespace character (since PHP 5.2.4)</simpara></listitem>

469

     <listitem><simpara>any character that is not a horizontal whitespace character</simpara></listitem>

458

470

    </varlistentry>

459

471

    <varlistentry>

460

472

     <term><emphasis>\s</emphasis></term>

...

@@ -466,11 +478,11 @@

466

478

    </varlistentry>

467

479

    <varlistentry>

468

480

     <term><emphasis>\v</emphasis></term>

469

     <listitem><simpara>any vertical whitespace character (since PHP 5.2.4)</simpara></listitem>

481

     <listitem><simpara>any vertical whitespace character</simpara></listitem>

470

482

    </varlistentry>

471

483

    <varlistentry>

472

484

     <term><emphasis>\V</emphasis></term>

473

     <listitem><simpara>any character that is not a vertical whitespace character (since PHP 5.2.4)</simpara></listitem>

485

     <listitem><simpara>any character that is not a vertical whitespace character</simpara></listitem>

474

486

    </varlistentry>

475

487

    <varlistentry>

476

488

     <term><emphasis>\w</emphasis></term>

...

@@ -488,8 +500,14 @@

488

500

   matches one, and only one, of each pair.

489

501

  </para>

490

502

  <para>

503

   The "whitespace" characters are HT (9), LF (10), FF (12), CR (13),

504

   and space (32). However, if locale-specific matching is taking place,

505

   characters with code points in the range 128-255 may also be considered

506

   as whitespace characters, for instance NBSP (hex A0).

507

  </para>

508

  <para>

491

509

   A "word" character is any letter or digit or the underscore

492

   character,  that  is,  any  character which can be part of a

510

   character, that is, any character which can be part of a

493

511

   Perl "<emphasis>word</emphasis>". The definition of letters and digits is

494

512

   controlled by PCRE's character tables, and may vary if locale-specific

495

513

   matching is taking place. For example, in the "fr" (French) locale, some

...

@@ -498,15 +516,15 @@

498

516

  </para>

499

517

  <para>

500

518

   These character type sequences can appear both inside and

501

   outside  character classes. They each match one character of

502

   the appropriate type. If the current matching  point is at

519

   outside character classes. They each match one character of

520

   the appropriate type. If the current matching point is at

503

521

   the end of the subject string, all of them fail, since there

504

522

   is no character to match.

505

523

  </para>

506

524

  <para>

507

   The fourth use of backslash is  for  certain  simple

525

   The fourth use of backslash is for certain simple

508

526

   assertions. An assertion specifies a condition that has to be met

509

   at a particular point in  a match, without consuming any

527

   at a particular point in a match, without consuming any

510

528

   characters from the subject string. The use of subpatterns

511

529

   for more complicated assertions is described below. The

512

530

   backslashed assertions are

...

@@ -545,7 +563,7 @@

545

563

   </variablelist>

546

564

  </para>

547

565

  <para>

548

   These assertions may not appear in  character  classes  (but

566

   These assertions may not appear in character classes (but

549

567

   note that "<literal>\b</literal>" has a different meaning, namely the backspace

550

568

   character, inside a character class).

551

569

  </para>

...

@@ -553,20 +571,20 @@

553

571

   A word boundary is a position in the subject string where

554

572

   the current character and the previous character do not both

555

573

   match <literal>\w</literal> or <literal>\W</literal> (i.e. one matches

556

   <literal>\w</literal> and  the  other  matches

574

   <literal>\w</literal> and the other matches

557

575

   <literal>\W</literal>), or the start or end of the string if the first

558

576

   or last character matches <literal>\w</literal>, respectively.

559

577

  </para>
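  <para>
   The definition of a word boundary can be made concrete with a short
   Python sketch that lists every boundary position in a subject string. It
   assumes the default ASCII notion of a "word" character (letters, digits
   and underscore) and ignores locale-specific tables; the function names
   are hypothetical.
  </para>

```python
import string

# Assumption: ASCII-only word characters, as without locale-specific matching.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_word(ch):
    return ch in WORD_CHARS


def word_boundaries(subject):
    """Return the offsets at which a word boundary assertion would succeed."""
    positions = []
    for i in range(len(subject) + 1):
        # Outside the string counts as "not a word character".
        before = i > 0 and is_word(subject[i - 1])
        after = i != len(subject) and is_word(subject[i])
        if before != after:  # exactly one side is a word character
            positions.append(i)
    return positions


print(word_boundaries("ab cd"))  # [0, 2, 3, 5]
```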

560

578

  <para>

561

579

   The <literal>\A</literal>, <literal>\Z</literal>, and

562

   <literal>\z</literal> assertions differ  from  the  traditional

563

   circumflex  and  dollar  (described below) in that they only

564

   ever match at the very start and end of the subject  string,

565

   whatever  options  are  set.  They  are  not affected by the

580

   <literal>\z</literal> assertions differ from the traditional

581

   circumflex and dollar (described in <link linkend="regexp.reference.anchors">anchors</link>)

582

   in that they only ever match at the very start and end of the subject string,

583

   whatever options are set. They are not affected by the

566

584

   <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link> or

567

585

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOLLAR_ENDONLY</link>

568

   options. The  difference  between <literal>\Z</literal> and

569

   <literal>\z</literal>  is that <literal>\Z</literal> matches before a

586

   options. The difference between <literal>\Z</literal> and

587

   <literal>\z</literal> is that <literal>\Z</literal> matches before a

570

588

   newline that is the last character of the string as well as at the end of

571

589

   the string, whereas <literal>\z</literal> matches only at the end.

572

590

  </para>
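  <para>
   The difference between <literal>\Z</literal> and <literal>\z</literal>
   comes down to which offsets each one accepts. The following Python sketch
   lists those offsets for a given subject; it assumes that "newline" means
   a single LF character, and the function name is hypothetical.
  </para>

```python
def end_assertion_positions(subject):
    """Return the offsets at which each end-of-subject assertion can match."""
    very_end = len(subject)
    lower_z = [very_end]  # \z: only at the very end of the subject
    upper_z = [very_end]  # \Z: the very end ...
    if subject.endswith("\n"):
        upper_z.insert(0, very_end - 1)  # ... or just before a final newline
    return {"\\Z": upper_z, "\\z": lower_z}


print(end_assertion_positions("abc\n"))
print(end_assertion_positions("abc"))
```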

...

@@ -583,12 +601,16 @@

583

601

   regexp metacharacters in the pattern. For example:

584

602

   <literal>\w+\Q.$.\E$</literal> will match one or more word characters,

585

603

   followed by literals <literal>.$.</literal> and anchored at the end of

586

   the string.

604

   the string. Note that this does not change the behavior of 

605

   delimiters; for instance, the pattern <literal>#\Q#\E#$</literal>

606

   is not valid, because the second <literal>#</literal> marks the end

607

   of the pattern, and the <literal>\E#</literal> is interpreted as invalid

608

   modifiers.

587

609

  </para>
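  <para>
   Why <literal>#\Q#\E#$</literal> fails can be seen by scanning for the
   ending delimiter by hand. The Python sketch below is an illustration
   under the assumption that a backslash protects only the single character
   that follows it during this scan; the function name is hypothetical and
   bracket style delimiters are not handled.
  </para>

```python
def split_at_closing_delimiter(regex):
    """Return (pattern, rest) for a regex that uses a non-bracket delimiter."""
    delimiter = regex[0]
    i = 1
    while i < len(regex):
        if regex[i] == "\\":
            i += 2  # the backslash only protects the next character
            continue
        if regex[i] == delimiter:
            return regex[1:i], regex[i + 1:]
        i += 1
    raise ValueError("no ending delimiter found")


pattern, rest = split_at_closing_delimiter(r"#\Q#\E#$")
print(pattern)  # \Q
print(rest)     # \E#$
```

  <para>
   The scan stops at the second <literal>#</literal>, so the remainder is
   what would be read as modifiers.
  </para>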

588

610

589

611

  <para>

590

   <literal>\K</literal> can be used to reset the match start since

591

   PHP 5.2.4. For example, the pattern <literal>foo\Kbar</literal> matches

612

   <literal>\K</literal> can be used to reset the match start. 

613

   For example, the pattern <literal>foo\Kbar</literal> matches

592

614

   "foobar", but reports that it has matched "bar". The use of

593

615

   <literal>\K</literal> does not interfere with the setting of captured

594

616

   substrings. For example, when the pattern <literal>(foo)\Kbar</literal>

...

@@ -844,7 +866,7 @@

844

866

   </tgroup>

845

867

  </table>

846

868

  <para>

847

   Extended properties such as "Greek" or "InMusicalSymbols" are not

869

   Extended properties such as <literal>InMusicalSymbols</literal> are not

848

870

   supported by PCRE.

849

871

  </para>

850

872

  <para>

...

@@ -852,15 +874,193 @@

852

874

   For example, <literal>\p{Lu}</literal> always matches only upper case letters.

853

875

  </para>

854

876

  <para>

855

   The <literal>\X</literal> escape matches any number of Unicode characters 

856

   that form an extended Unicode sequence. <literal>\X</literal> is equivalent 

857

   to <literal>(?>\PM\pM*)</literal>.

877

   Sets of Unicode characters are defined as belonging to certain scripts. A

878

   character from one of these sets can be matched using a script name. For

879

   example:

880

  </para>

881

  <itemizedlist>

882

   <listitem>

883

    <simpara><literal>\p{Greek}</literal></simpara>

884

   </listitem>

885

   <listitem>

886

    <simpara><literal>\P{Han}</literal></simpara>

887

   </listitem>

888

  </itemizedlist>

889

  <para>

890

   Characters that are not part of an identified script are lumped together as

891

   <literal>Common</literal>. The current list of scripts is:

892

  </para>

893

  <table>

894

   <title>Supported scripts</title>

895

   <tgroup cols="5">

896

    <tbody>

897

     <row>

898

      <entry><literal>Arabic</literal></entry>

899

      <entry><literal>Armenian</literal></entry>

900

      <entry><literal>Avestan</literal></entry>

901

      <entry><literal>Balinese</literal></entry>

902

      <entry><literal>Bamum</literal></entry>

903

     </row>

904

     <row>

905

      <entry><literal>Batak</literal></entry>

906

      <entry><literal>Bengali</literal></entry>

907

      <entry><literal>Bopomofo</literal></entry>

908

      <entry><literal>Brahmi</literal></entry>

909

      <entry><literal>Braille</literal></entry>

910

     </row>

911

     <row>

912

      <entry><literal>Buginese</literal></entry>

913

      <entry><literal>Buhid</literal></entry>

914

      <entry><literal>Canadian_Aboriginal</literal></entry>

915

      <entry><literal>Carian</literal></entry>

916

      <entry><literal>Chakma</literal></entry>

917

     </row>

918

     <row>

919

      <entry><literal>Cham</literal></entry>

920

      <entry><literal>Cherokee</literal></entry>

921

      <entry><literal>Common</literal></entry>

922

      <entry><literal>Coptic</literal></entry>

923

      <entry><literal>Cuneiform</literal></entry>

924

     </row>

925

     <row>

926

      <entry><literal>Cypriot</literal></entry>

927

      <entry><literal>Cyrillic</literal></entry>

928

      <entry><literal>Deseret</literal></entry>

929

      <entry><literal>Devanagari</literal></entry>

930

      <entry><literal>Egyptian_Hieroglyphs</literal></entry>

931

     </row>

932

     <row>

933

      <entry><literal>Ethiopic</literal></entry>

934

      <entry><literal>Georgian</literal></entry>

935

      <entry><literal>Glagolitic</literal></entry>

936

      <entry><literal>Gothic</literal></entry>

937

      <entry><literal>Greek</literal></entry>

938

     </row>

939

     <row>

940

      <entry><literal>Gujarati</literal></entry>

941

      <entry><literal>Gurmukhi</literal></entry>

942

      <entry><literal>Han</literal></entry>

943

      <entry><literal>Hangul</literal></entry>

944

      <entry><literal>Hanunoo</literal></entry>

945

     </row>

946

     <row>

947

      <entry><literal>Hebrew</literal></entry>

948

      <entry><literal>Hiragana</literal></entry>

949

      <entry><literal>Imperial_Aramaic</literal></entry>

950

      <entry><literal>Inherited</literal></entry>

951

      <entry><literal>Inscriptional_Pahlavi</literal></entry>

952

     </row>

953

     <row>

954

      <entry><literal>Inscriptional_Parthian</literal></entry>

955

      <entry><literal>Javanese</literal></entry>

956

      <entry><literal>Kaithi</literal></entry>

957

      <entry><literal>Kannada</literal></entry>

958

      <entry><literal>Katakana</literal></entry>

959

     </row>

960

     <row>

961

      <entry><literal>Kayah_Li</literal></entry>

962

      <entry><literal>Kharoshthi</literal></entry>

963

      <entry><literal>Khmer</literal></entry>

964

      <entry><literal>Lao</literal></entry>

965

      <entry><literal>Latin</literal></entry>

966

     </row>

967

     <row>

968

      <entry><literal>Lepcha</literal></entry>

969

      <entry><literal>Limbu</literal></entry>

970

      <entry><literal>Linear_B</literal></entry>

971

      <entry><literal>Lisu</literal></entry>

972

      <entry><literal>Lycian</literal></entry>

973

     </row>

974

     <row>

975

      <entry><literal>Lydian</literal></entry>

976

      <entry><literal>Malayalam</literal></entry>

977

      <entry><literal>Mandaic</literal></entry>

978

      <entry><literal>Meetei_Mayek</literal></entry>

979

      <entry><literal>Meroitic_Cursive</literal></entry>

980

     </row>

981

     <row>

982

      <entry><literal>Meroitic_Hieroglyphs</literal></entry>

983

      <entry><literal>Miao</literal></entry>

984

      <entry><literal>Mongolian</literal></entry>

985

      <entry><literal>Myanmar</literal></entry>

986

      <entry><literal>New_Tai_Lue</literal></entry>

987

     </row>

988

     <row>

989

      <entry><literal>Nko</literal></entry>

990

      <entry><literal>Ogham</literal></entry>

991

      <entry><literal>Old_Italic</literal></entry>

992

      <entry><literal>Old_Persian</literal></entry>

993

      <entry><literal>Old_South_Arabian</literal></entry>

994

     </row>

995

     <row>

996

      <entry><literal>Old_Turkic</literal></entry>

997

      <entry><literal>Ol_Chiki</literal></entry>

998

      <entry><literal>Oriya</literal></entry>

999

      <entry><literal>Osmanya</literal></entry>

1000

      <entry><literal>Phags_Pa</literal></entry>

1001

     </row>

1002

     <row>

1003

      <entry><literal>Phoenician</literal></entry>

1004

      <entry><literal>Rejang</literal></entry>

1005

      <entry><literal>Runic</literal></entry>

1006

      <entry><literal>Samaritan</literal></entry>

1007

      <entry><literal>Saurashtra</literal></entry>

1008

     </row>

1009

     <row>

1010

      <entry><literal>Sharada</literal></entry>

1011

      <entry><literal>Shavian</literal></entry>

1012

      <entry><literal>Sinhala</literal></entry>

1013

      <entry><literal>Sora_Sompeng</literal></entry>

1014

      <entry><literal>Sundanese</literal></entry>

1015

     </row>

1016

     <row>

1017

      <entry><literal>Syloti_Nagri</literal></entry>

1018

      <entry><literal>Syriac</literal></entry>

1019

      <entry><literal>Tagalog</literal></entry>

1020

      <entry><literal>Tagbanwa</literal></entry>

1021

      <entry><literal>Tai_Le</literal></entry>

1022

     </row>

1023

     <row>

1024

      <entry><literal>Tai_Tham</literal></entry>

1025

      <entry><literal>Tai_Viet</literal></entry>

1026

      <entry><literal>Takri</literal></entry>

1027

      <entry><literal>Tamil</literal></entry>

1028

      <entry><literal>Telugu</literal></entry>

1029

     </row>

1030

     <row>

1031

      <entry><literal>Thaana</literal></entry>

1032

      <entry><literal>Thai</literal></entry>

1033

      <entry><literal>Tibetan</literal></entry>

1034

      <entry><literal>Tifinagh</literal></entry>

1035

      <entry><literal>Ugaritic</literal></entry>

1036

     </row>

1037

     <row>

1038

      <entry><literal>Vai</literal></entry>

1039

      <entry><literal>Yi</literal></entry>

1040

      <entry />

1041

      <entry />

1042

      <entry />

1043

      <entry />

1044

     </row>

1045

    </tbody>

1046

   </tgroup>

1047

  </table>

1048

  <para>

1049

   The <literal>\X</literal> escape matches a Unicode extended grapheme

1050

   cluster. An extended grapheme cluster is one or more Unicode characters

1051

   that combine to form a single glyph. In effect, this can be thought of as

1052

   the Unicode equivalent of <literal>.</literal> as it will match one

1053

   composed character, regardless of how many individual characters are

1054

   actually used to render it.

858

1055

  </para>

859

1056

  <para>

860

   That is, it matches a character without the "mark" property, followed

861

   by zero or more characters with the "mark" property, and treats the

862

   sequence as an atomic group (see below). Characters with the "mark"

863

   property are typically accents that affect the preceding character.

1057

   In versions of PCRE older than 8.32 (which corresponds to PHP versions

1058

   before 5.4.14 when using the bundled PCRE library), <literal>\X</literal>

1059

   is equivalent to <literal>(?>\PM\pM*)</literal>. That is, it matches a

1060

   character without the "mark" property, followed by zero or more characters

1061

   with the "mark" property, and treats the sequence as an atomic group (see

1062

   below). Characters with the "mark" property are typically accents that

1063

   affect the preceding character.

864

1064

  </para>

865

1065

  <para>

866

1066

   Matching characters by Unicode property is not fast, because PCRE has

...

@@ -876,8 +1076,8 @@

876

1076

  <para>

877

1077

   Outside a character class, in the default matching mode, the

878

1078

   circumflex character (<literal>^</literal>) is an assertion which

879

   is true only if the current matching point is at the start  of

880

   the  subject string. Inside a character class, circumflex (<literal>^</literal>)

1079

   is true only if the current matching point is at the start of

1080

   the subject string. Inside a character class, circumflex (<literal>^</literal>)

881

1081

   has an entirely different meaning (see below).

882

1082

  </para>

883

1083

  <para>

...

@@ -892,12 +1092,12 @@

892

1092

  </para>

893

1093

  <para>

894

1094

   A dollar character (<literal>$</literal>) is an assertion which is

895

   &true; only if the current  matching point is at the end of the subject

896

   string, or immediately before a newline character that is  the  last

1095

   &true; only if the current matching point is at the end of the subject

1096

   string, or immediately before a newline character that is the last

897

1097

   character in the string (by default). Dollar (<literal>$</literal>)

898

   need not be the last character of the pattern if a  number  of

899

   alternatives are  involved,  but it should be the last item in any branch

900

   in which it appears. Dollar has no  special  meaning  in  a

1098

   need not be the last character of the pattern if a number of

1099

   alternatives are involved, but it should be the last item in any branch

1100

   in which it appears. Dollar has no special meaning in a

901

1101

   character class.

902

1102

  </para>
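The default behaviour of the dollar assertion can be sketched as follows. The sketch uses Python's `re` module as a stand-in, on the assumption that it treats `$` the same way as PCRE in the default (non-multiline) mode; the subject strings and variable names are made up for illustration.

```python
import re

# "$" is true at the very end of the subject string...
at_end = re.search(r"fox$", "the quick brown fox")

# ...and also immediately before a newline that is the last character.
before_final_newline = re.search(r"fox$", "the quick brown fox\n")

# A newline in the middle of the subject does not satisfy "$" by default.
mid_subject = re.search(r"fox$", "fox\njumps")
```

Only the third search fails, because the newline after "fox" is not the last character of the subject.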

903

1103

  <para>

...

@@ -923,9 +1123,9 @@

923

1123

   set.

924

1124

  </para>

925

1125

  <para>

926

   Note that the sequences \A, \Z, and \z can be used to  match

927

   the  start  and end of the subject in both modes, and if all

928

   branches of a pattern start with \A is it  always  anchored,

1126

   Note that the sequences \A, \Z, and \z can be used to match

1127

   the start and end of the subject in both modes, and if all

1128

   branches of a pattern start with \A it is always anchored,

929

1129

   whether <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

930

1130

   is set or not.

931

1131

  </para>

...

@@ -934,14 +1134,14 @@

934

1134

 <section xml:id="regexp.reference.dot">

935

1135

  <title>Dot</title>

936

1136

  <para>

937

   Outside a character class, a dot in the pattern matches  any

938

   one  character  in  the  subject,  including  a non-printing

939

   character, but not (by default) newline.  If the

1137

   Outside a character class, a dot in the pattern matches any

1138

   one character in the subject, including a non-printing

1139

   character, but not (by default) newline. If the

940

1140

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

941

   option  is  set,  then dots match newlines as well. The

1141

   option is set, then dots match newlines as well. The

942

1142

   handling of dot is entirely independent of the handling of

943

   circumflex  and  dollar,  the only relationship being that they

944

   both involve newline characters.  Dot has no special meaning

1143

   circumflex and dollar, the only relationship being that they

1144

   both involve newline characters. Dot has no special meaning

945

1145

   in a character class.

946

1146

  </para>
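A minimal sketch of the dot's default behaviour and of the effect of the dot-all option. It is written with Python's `re` module, assuming that `re.DOTALL` plays the same role as PCRE_DOTALL; the subject strings are illustrative only.

```python
import re

# By default the dot does not match a newline.
default_mode = re.search(r"a.b", "a\nb")

# With the dot-all option set, the dot matches a newline as well.
dotall_mode = re.search(r"a.b", "a\nb", re.DOTALL)

# Non-printing characters other than newline are matched in either mode.
nonprinting = re.search(r"a.b", "a\x00b")
```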

947

1147

  <para>

...

@@ -955,29 +1155,29 @@

955

1155

  <title>Character classes</title>

956

1156

  <para>

957

1157

   An opening square bracket introduces a character class,

958

   terminated  by  a  closing  square  bracket.  A  closing square

959

   bracket on its own is  not  special.  If  a  closing  square

960

   bracket  is  required as a member of the class, it should be

1158

   terminated by a closing square bracket. A closing square

1159

   bracket on its own is not special. If a closing square

1160

   bracket is required as a member of the class, it should be

961

1161

   the first data character in the class (after an initial

962

1162

   circumflex, if present) or escaped with a backslash.

963

1163

  </para>

964

1164

  <para>

965

1165

   A character class matches a single character in the subject;

966

   the  character  must  be in the set of characters defined by

1166

   the character must be in the set of characters defined by

967

1167

   the class, unless the first character in the class is a

968

   circumflex,  in which case the subject character must not be in

969

   the set defined by the class. If a  circumflex  is  actually

970

   required  as  a  member  of  the class, ensure it is not the

1168

   circumflex, in which case the subject character must not be in

1169

   the set defined by the class. If a circumflex is actually

1170

   required as a member of the class, ensure it is not the

971

1171

   first character, or escape it with a backslash.

972

1172

  </para>

973

1173

  <para>

974

   For example, the character class [aeiou] matches  any  lower

1174

   For example, the character class [aeiou] matches any lower

975

1175

   case vowel, while [^aeiou] matches any character that is not

976

   a lower case vowel. Note that a circumflex is  just  a

977

   convenient  notation for specifying the characters which are in

978

   the class by enumerating those that are not. It  is  not  an

979

   assertion:  it  still  consumes a character from the subject

980

   string, and fails if the current pointer is at  the  end  of

1176

   a lower case vowel. Note that a circumflex is just a

1177

   convenient notation for specifying the characters which are in

1178

   the class by enumerating those that are not. It is not an

1179

   assertion: it still consumes a character from the subject

1180

   string, and fails if the current pointer is at the end of

981

1181

   the string.

982

1182

  </para>
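The point that a negated class still consumes a character can be illustrated with a short sketch. Python's `re` module is used here on the assumption that it handles these simple classes the same way as PCRE; the subjects are made up.

```python
import re

# [^aeiou] matches the first character that is not a lower case vowel.
consonant = re.search(r"[^aeiou]", "aex")

# It is not an assertion: after "x" there is no character left to
# consume, so the match fails at the end of the string.
at_end = re.search(r"x[^aeiou]", "x")
```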

983

1183

  <para>

...

@@ -989,61 +1189,62 @@

989

1189

  </para>

990

1190

  <para>

991

1191

   The newline character is never treated in any special way in

992

   character  classes,  whatever the setting of the <link

1192

   character classes, whatever the setting of the <link

993

1193

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

994

1194

   or <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

995

1195

   options is. A class such as [^a] will always match a newline.

996

1196

  </para>

997

1197

  <para>

998

   The minus (hyphen) character can be used to specify a  range

999

   of  characters  in  a  character  class.  For example, [d-m]

1000

   matches any letter between d and m, inclusive.  If  a  minus

1001

   character  is required in a class, it must be escaped with a

1198

   The minus (hyphen) character can be used to specify a range

1199

   of characters in a character class. For example, [d-m]

1200

   matches any letter between d and m, inclusive. If a minus

1201

   character is required in a class, it must be escaped with a

1002

1202

   backslash or appear in a position where it cannot be

1003

1203

   interpreted as indicating a range, typically as the first or last

1004

1204

   character in the class.

1005

1205

  </para>
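As a rough illustration of ranges and literal hyphens, the following sketch uses Python's `re` module, which is assumed to agree with PCRE for these classes. The variable names are hypothetical.

```python
import re

# [d-m] matches every letter from d to m, inclusive.
in_range = [c for c in "abcdefghijklmnop" if re.fullmatch(r"[d-m]", c)]

# A hyphen placed first in the class cannot indicate a range, so it is literal.
hyphen_first = re.fullmatch(r"[-dm]", "-")

# An escaped hyphen is literal too; "g" is no longer matched,
# because d\-m is three separate characters rather than a range.
hyphen_escaped = re.fullmatch(r"[d\-m]", "-")
no_longer_range = re.fullmatch(r"[d\-m]", "g")
```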

1006

1206

  <para>

1007

   It is not possible to have the literal character "]" as  the

1008

   end  character  of  a  range.  A  pattern such as [W-]46] is

1207

   It is not possible to have the literal character "]" as the

1208

   end character of a range. A pattern such as [W-]46] is

1009

1209

   interpreted as a class of two characters ("W" and "-")

1010

1210

   followed by a literal string "46]", so it would match "W46]" or

1011

   "-46]". However, if the "]" is escaped with a  backslash  it

1012

   is  interpreted  as  the end of range, so [W-\]46] is

1013

   interpreted as a single class containing a range followed by  two

1211

   "-46]". However, if the "]" is escaped with a backslash it

1212

   is interpreted as the end of range, so [W-\]46] is

1213

   interpreted as a single class containing a range followed by two

1014

1214

   separate characters. The octal or hexadecimal representation

1015

1215

   of "]" can also be used to end a range.

1016

1216

  </para>
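The two readings of the closing square bracket can be checked with a small sketch. It relies on Python's `re` module parsing both patterns the same way as described above, which is an assumption about that module rather than a statement about PCRE itself.

```python
import re

unescaped = r"[W-]46]"   # class of "W" and "-", then the literal string "46]"
escaped = r"[W-\]46]"    # one class: the range W to ], plus "4" and "6"

# The unescaped form matches exactly these two four-character strings.
matches_w = re.fullmatch(unescaped, "W46]")
matches_hyphen = re.fullmatch(unescaped, "-46]")

# The escaped form matches single characters from the range or "4" / "6".
single_chars = [c for c in "VWXYZ[\\]-456" if re.fullmatch(escaped, c)]
```

In ASCII order the range from "W" to "]" covers W, X, Y, Z, "[", backslash and "]", so the hyphen itself is not a member of the escaped class.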

1017

1217

  <para>

1018

1218

   Ranges operate in ASCII collating sequence. They can also be

1019

   used  for  characters  specified  numerically,  for  example

1020

   [\000-\037]. If a range that includes letters is  used  when

1021

   case-insensitive (caseless)  matching  is set, it matches the

1022

   letters in either case. For example, [W-c] is equivalent  to

1219

   used for characters specified numerically, for example

1220

   [\000-\037]. If a range that includes letters is used when

1221

   case-insensitive (caseless) matching is set, it matches the

1222

   letters in either case. For example, [W-c] is equivalent to

1023

1223

   [][\^_`wxyzabc], matched case-insensitively, and if character

1024

1224

   tables for the "fr" locale are in use, [\xc8-\xcb] matches

1025

1225

   accented E characters in both cases.

1026

1226

  </para>

1027

1227

  <para>

1028

   The character types \d, \D, \s, \S,  \w,  and  \W  may  also

1029

   appear  in  a  character  class, and add the characters that

1228

   The character types \d, \D, \s, \S, \w, and \W may also

1229

   appear in a character class, and add the characters that

1030

1230

   they match to the class. For example, [\dABCDEF] matches any

1031

   hexadecimal  digit.  A  circumflex  can conveniently be used

1032

   with the upper case character types to specify a  more

1231

   hexadecimal digit. A circumflex can conveniently be used

1232

   with the upper case character types to specify a more

1033

1233

   restricted set of characters than the matching lower case type.

1034

   For example, the class [^\W_] matches any letter  or  digit,

1234

   For example, the class [^\W_] matches any letter or digit,

1035

1235

   but not underscore.

1036

1236

  </para>
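A short sketch of character types inside a class, using Python's `re` module as a stand-in (assumed to behave like PCRE for these two classes on ASCII input). The compiled pattern names are illustrative.

```python
import re

# \d adds the digits to the class, giving the upper case hexadecimal digits.
hex_digit = re.compile(r"[\dABCDEF]")

# [^\W_] is "not a non-word character and not underscore":
# any letter or digit, but not underscore.
letter_or_digit = re.compile(r"[^\W_]")

hex_results = [bool(hex_digit.fullmatch(c)) for c in "7CGc"]
word_results = [bool(letter_or_digit.fullmatch(c)) for c in "a5_-"]
```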

1037

1237

  <para>

1038

   All non-alphanumeric characters other than \,  -,  ^  (at  the

1039

   start)  and  the  terminating ] are non-special in character

1238

   All non-alphanumeric characters other than \, -, ^ (at the

1239

   start) and the terminating ] are non-special in character

1040

1240

   classes, but it does no harm if they are escaped. The pattern

1041

1241

   terminator is always special and must be escaped when used

1042

1242

   within an expression.

1043

1243

  </para>

1044

1244

  <para>

1045

1245

   Perl supports the POSIX notation for character classes. This uses names

1046

   enclosed by <literal>[:</literal> and <literal>:]</literal> within the enclosing square brackets. PCRE also

1246

   enclosed by <literal>[:</literal> and <literal>:]</literal> within

1247

   the enclosing square brackets. PCRE also

1047

1248

   supports this notation. For example, <literal>[01[:alpha:]%]</literal>

1048

1249

   matches "0", "1", any alphabetic character, or "%". The supported class

1049

1250

   names are:

...

@@ -1082,22 +1283,32 @@

1082

1283

  <para>

1083

1284

   In UTF-8 mode, characters with values greater than 128 do not match any

1084

1285

   of the POSIX character classes.

1286

   As of libpcre 8.10 some character classes are changed to use

1287

   Unicode character properties, in which case the mentioned restriction does

1288

   not apply. Refer to the <link xlink:href="&url.pcre.man;">PCRE(3) manual</link>

1289

   for details.

1290

  </para>

1291

  <para>

1292

   Unicode character properties can appear inside a character class. They

1293

   cannot be part of a range. The minus (hyphen) character after a Unicode

1294

   character class will match literally. Trying to end a range with a Unicode

1295

   character property will result in a warning.

1085

1296

  </para>

1086

1297

 </section>

1087

1298

1088

1299

 <section xml:id="regexp.reference.alternation">

1089

1300

  <title>Alternation</title>

1090

1301

  <para>

1091

   Vertical bar characters are  used  to  separate  alternative

1302

   Vertical bar characters are used to separate alternative

1092

1303

   patterns. For example, the pattern

1093

1304

   <literal>gilbert|sullivan</literal>

1094

1305

   matches either "gilbert" or "sullivan". Any number of alternatives

1095

   may  appear,  and an empty alternative is permitted

1096

   (matching the empty string).   The  matching  process  tries

1097

   each  alternative in turn, from left to right, and the first

1098

   one that succeeds is used. If the alternatives are within  a

1099

   subpattern  (defined  below),  "succeeds" means matching the

1100

   rest of the main pattern as well as the alternative  in  the

1306

   may appear, and an empty alternative is permitted

1307

   (matching the empty string). The matching process tries

1308

   each alternative in turn, from left to right, and the first

1309

   one that succeeds is used. If the alternatives are within a

1310

   subpattern (defined below), "succeeds" means matching the

1311

   rest of the main pattern as well as the alternative in the

1101

1312

   subpattern.

1102

1313

  </para>
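The left-to-right rule for alternatives can be sketched as follows. Python's `re` module is used on the assumption that it tries alternatives in the same order as PCRE; the patterns other than `gilbert|sullivan` are made up for illustration.

```python
import re

# Either alternative may match.
either = re.search(r"gilbert|sullivan", "arthur sullivan").group()

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" alone cannot be followed by "c" here, so "ab" is chosen.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"abc|", "")
```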

1103

1314

 </section>

...

@@ -1112,7 +1323,7 @@

1112

1323

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>,

1113

1324

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1114

1325

   and PCRE_DUPNAMES can be changed from within the pattern by

1115

   a sequence of Perl option letters enclosed between "(?"  and

1326

   a sequence of Perl option letters enclosed between "(?" and

1116

1327

   ")". The option letters are:

1117

1328

1118

1329

   <table>

...

@@ -1141,7 +1352,8 @@

1141

1352

      </row>

1142

1353

      <row>

1143

1354

       <entry><literal>X</literal></entry>

1144

       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link></entry>

1355

       <entry>for <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>

1356

        (no longer supported as of PHP 7.3.0)</entry>

1145

1357

      </row>

1146

1358

      <row>

1147

1359

       <entry><literal>J</literal></entry>

...

@@ -1152,16 +1364,16 @@

1152

1364

   </table>

1153

1365

  </para>

1154

1366

  <para>

1155

   For example, (?im) sets case-insensitive (caseless), multiline matching. It  is

1367

   For example, (?im) sets case-insensitive (caseless), multiline matching. It is

1156

1368

   also possible to unset these options by preceding the letter

1157

   with a hyphen, and a combined setting and unsetting such  as

1158

   (?im-sx),  which sets <link

1369

   with a hyphen, and a combined setting and unsetting such as

1370

   (?im-sx), which sets <link

1159

1371

   linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> and

1160

1372

   <link linkend="reference.pcre.pattern.modifiers">PCRE_MULTILINE</link>

1161

1373

   while unsetting <link

1162

1374

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> and

1163

1375

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>,

1164

   is also  permitted. If  a  letter  appears both before and after the

1376

   is also permitted. If a letter appears both before and after the

1165

1377

   hyphen, the option is unset.

1166

1378

  </para>

1167

1379

  <para>

...

@@ -1171,14 +1383,14 @@

1171

1383

   and "abC".

1172

1384

  </para>

1173

1385

  <para>

1174

   If an option change occurs inside a subpattern,  the  effect

1175

   is  different.  This is a change of behaviour in Perl 5.005.

1176

   An option change inside a subpattern affects only that  part

1386

   If an option change occurs inside a subpattern, the effect

1387

   is different. This is a change of behaviour in Perl 5.005.

1388

   An option change inside a subpattern affects only that part

1177

1389

   of the subpattern that follows it, so

1178

1390

1179

1391

   <literal>(a(?i)b)c</literal>

1180

1392

1181

   matches  abc  and  aBc  and  no  other   strings   (assuming <link

1393

   matches "abc" and "aBc" and no other strings (assuming <link

1182

1394

   linkend="reference.pcre.pattern.modifiers">PCRE_CASELESS</link> is not

1183

1395

   used). By this means, options can be made to have different settings in

1184

1396

   different parts of the pattern. Any changes made in one alternative do

...

@@ -1187,18 +1399,18 @@

1187

1399

1188

1400

   <literal>(a(?i)b|c)</literal>

1189

1401

1190

   matches "ab", "aB", "c", and "C", even though when  matching

1402

   matches "ab", "aB", "c", and "C", even though when matching

1191

1403

   "C" the first branch is abandoned before the option setting.

1192

   This is because the effects of  option  settings  happen  at

1193

   compile  time. There would be some very weird behaviour otherwise.

1404

   This is because the effects of option settings happen at

1405

   compile time. There would be some very weird behaviour otherwise.

1194

1406

  </para>

1195

1407

  <para>

1196

1408

   The PCRE-specific options <link

1197

   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>  and  

1198

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link>   can

1409

   linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link> and

1410

   <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTRA</link> can

1199

1411

   be changed in the same way as the Perl-compatible options by

1200

   using the characters U and X  respectively.  The  (?X)  flag

1201

   setting  is  special in that it must always occur earlier in

1412

   using the characters U and X respectively. The (?X) flag

1413

   setting is special in that it must always occur earlier in

1202

1414

   the pattern than any of the additional features it turns on,

1203

1415

   even when it is at top level. It is best put at the start.

1204

1416

  </para>

...

@@ -1207,8 +1419,8 @@

1207

1419

 <section xml:id="regexp.reference.subpatterns">

1208

1420

  <title>Subpatterns</title>

1209

1421

  <para>

1210

   Subpatterns are delimited by parentheses  (round  brackets),

1211

   which can be nested.  Marking part of a pattern as a subpattern

1422

   Subpatterns are delimited by parentheses (round brackets),

1423

   which can be nested. Marking part of a pattern as a subpattern

1212

1424

   does two things:

1213

1425

  </para>

1214

1426

  <orderedlist>

...

@@ -1237,30 +1449,30 @@

1237

1449

1238

1450

   <literal>the ((red|white) (king|queen))</literal>

1239

1451

1240

   the captured substrings are "red king", "red",  and  "king",

1452

   the captured substrings are "red king", "red", and "king",

1241

1453

   and are numbered 1, 2, and 3.

1242

1454

  </para>

1243

1455

  <para>

1244

   The fact that plain parentheses fulfill two functions is  not

1245

   always  helpful.  There are often times when a grouping subpattern

1246

   is required without a capturing requirement.  If  an

1456

   The fact that plain parentheses fulfill two functions is not

1457

   always helpful. There are often times when a grouping subpattern

1458

   is required without a capturing requirement. If an

1247

1459

   opening parenthesis is followed by "?:", the subpattern does

1248

   not do any capturing, and is not counted when computing  the

1460

   not do any capturing, and is not counted when computing the

1249

1461

   number of any subsequent capturing subpatterns. For example,

1250

   if the string "the  white  queen"  is  matched  against  the

1462

   if the string "the white queen" is matched against the

1251

1463

   pattern

1252

1464

1253

1465

   <literal>the ((?:red|white) (king|queen))</literal>

1254

1466

1255

   the captured substrings are "white queen" and  "queen",  and

1256

   are  numbered  1  and 2. The maximum number of captured substrings

1257

   is 99, and the maximum number  of  all  subpatterns,

1258

   both capturing and non-capturing, is 200.

1467

   the captured substrings are "white queen" and "queen", and

1468

   are numbered 1 and 2. The maximum number of captured substrings

1469

   is 65535. It may not be possible to compile such large patterns,

1470

   however, depending on the configuration options of libpcre.

1259

1471

  </para>
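The numbering of capturing and non-capturing subpatterns can be checked with the two patterns from the text. The sketch below uses Python's `re` module, assuming it numbers groups by opening parenthesis in the same way as PCRE.

```python
import re

# Plain parentheses both group and capture, numbered by opening parenthesis.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")
capturing_groups = capturing.groups()

# "?:" makes the inner group non-capturing, so it is skipped when numbering.
non_capturing = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
non_capturing_groups = non_capturing.groups()
```

The first tuple holds substrings 1 to 3 and the second holds substrings 1 and 2, in the order described above.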

1260

1472

  <para>

1261

   As a  convenient  shorthand,  if  any  option  settings  are

1262

   required  at  the  start  of a non-capturing subpattern, the

1263

   option letters may appear between the "?" and the ":".  Thus

1473

   As a convenient shorthand, if any option settings are

1474

   required at the start of a non-capturing subpattern, the

1475

   option letters may appear between the "?" and the ":". Thus

1264

1476

   the two patterns

1265

1477

  </para>

1266

1478

...

@@ -1274,10 +1486,10 @@

1274

1486

  </informalexample>

1275

1487

1276

1488

  <para>

1277

   match exactly the same set of strings.  Because  alternative

1278

   branches  are  tried from left to right, and options are not

1279

   reset until the end of the subpattern is reached, an  option

1280

   setting  in  one  branch does affect subsequent branches, so

1489

   match exactly the same set of strings. Because alternative

1490

   branches are tried from left to right, and options are not

1491

   reset until the end of the subpattern is reached, an option

1492

   setting in one branch does affect subsequent branches, so

1281

1493

   the above patterns match "SUNDAY" as well as "Saturday".

1282

1494

  </para>

1283

1495

...

@@ -1285,7 +1497,7 @@

1285

1497

   It is possible to name a subpattern using the syntax

1286

1498

   <literal>(?P&lt;name&gt;pattern)</literal>. This subpattern will then

1287

1499

   be indexed in the matches array by its normal numeric position and

1288

   also by name. PHP 5.2.2 introduced two alternative syntaxes 

1500

   also by name. There are two alternative syntaxes

1289

1501

   <literal>(?&lt;name&gt;pattern)</literal> and <literal>(?'name'pattern)</literal>.

1290

1502

  </para>

1291

1503

...

@@ -1306,9 +1518,10 @@

1306

1518

1307

1519

  <para>

1308

1520

   Here <literal>Sun</literal> is stored in backreference 2, while

1309

   backreference 1 is empty. Matching yields <literal>Sat</literal> in

1310

   backreference 1 while backreference 2 does not exist. Changing the pattern

1311

   to use the <literal>(?|</literal> fixes this problem:

1521

   backreference 1 is empty. Matching <literal>Saturday</literal> yields

1522

   <literal>Sat</literal> in backreference 1 while backreference 2 does

1523

   not exist. Changing the pattern to use the <literal>(?|</literal> fixes

1524

   this problem:

1312

1525

  </para>

1313

1526

1314

1527

  <informalexample>

...

@@ -1334,45 +1547,45 @@

1334

1547

    <listitem><simpara>the . metacharacter</simpara></listitem>

1335

1548

    <listitem><simpara>a character class</simpara></listitem>

1336

1549

    <listitem><simpara>a back reference (see next section)</simpara></listitem>

1337

    <listitem><simpara>a parenthesized subpattern (unless it is  an  assertion  -

1550

    <listitem><simpara>a parenthesized subpattern (unless it is an assertion -

1338

1551

     see below)</simpara></listitem>

1339

1552

   </itemizedlist>

1340

1553

  </para>

1341

1554

  <para>

1342

   The general repetition quantifier specifies  a  minimum  and

1343

   maximum  number  of  permitted  matches,  by  giving the two

1344

   numbers in curly brackets (braces), separated  by  a  comma.

1345

   The  numbers  must be less than 65536, and the first must be

1555

   The general repetition quantifier specifies a minimum and

1556

   maximum number of permitted matches, by giving the two

1557

   numbers in curly brackets (braces), separated by a comma.

1558

   The numbers must be less than 65536, and the first must be

1346

1559

   less than or equal to the second. For example:

1347

1560

1348

1561

   <literal>z{2,4}</literal>

1349

1562

1350

   matches "zz", "zzz", or "zzzz". A closing brace on  its  own

1563

   matches "zz", "zzz", or "zzzz". A closing brace on its own

1351

1564

   is not a special character. If the second number is omitted,

1352

   but the comma is present, there is no upper  limit;  if  the

1565

   but the comma is present, there is no upper limit; if the

1353

1566

   second number and the comma are both omitted, the quantifier

1354

1567

   specifies an exact number of required matches. Thus

1355

1568

1356

1569

   <literal>[aeiou]{3,}</literal>

1357

1570

1358

   matches at least 3 successive vowels,  but  may  match  many

1571

   matches at least 3 successive vowels, but may match many

1359

1572

   more, while

1360

1573

1361

1574

   <literal>\d{8}</literal>

1362

1575

1363

   matches exactly 8 digits.  An  opening  curly  bracket  that

1364

   appears  in a position where a quantifier is not allowed, or

1576

   matches exactly 8 digits. An opening curly bracket that

1577

   appears in a position where a quantifier is not allowed, or

1365

1578

   one that does not match the syntax of a quantifier, is taken

1366

   as  a literal character. For example, {,6} is not a quantifier,

1579

   as a literal character. For example, {,6} is not a quantifier,

1367

1580

   but a literal string of four characters.

1368

1581

  </para>
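The three forms of the general quantifier can be sketched as follows, using Python's `re` module as a stand-in. The `{,6}` case is deliberately left out, because Python's `re` parses it differently from the behaviour described here; the subject strings are illustrative.

```python
import re

# {2,4}: between two and four repetitions.
two_to_four = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
               if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three repetitions, with no upper limit.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight repetitions.
eight_digits = re.fullmatch(r"\d{8}", "20240131")
seven_digits = re.fullmatch(r"\d{8}", "2024013")
```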

1369

1582

  <para>

1370

   The quantifier {0} is permitted, causing the  expression  to

1371

   behave  as  if the previous item and the quantifier were not

1583

   The quantifier {0} is permitted, causing the expression to

1584

   behave as if the previous item and the quantifier were not

1372

1585

   present.

1373

1586

  </para>

1374

1587

  <para>

1375

   For convenience (and  historical  compatibility)  the  three

1588

   For convenience (and historical compatibility) the three

1376

1589

   most common quantifiers have single-character abbreviations:

1377

1590

1378

1591

   <table>

...

@@ -1396,63 +1609,63 @@

1396

1609

   </table>

1397

1610

  </para>

1398

1611

  <para>

1399

   It is possible to construct infinite loops  by  following  a

1400

   subpattern  that  can  match no characters with a quantifier

1612

   It is possible to construct infinite loops by following a

1613

   subpattern that can match no characters with a quantifier

1401

1614

   that has no upper limit, for example:

1402

1615

1403

1616

   <literal>(a?)*</literal>

1404

1617

  </para>

1405

1618

  <para>

1406

   Earlier versions of Perl and PCRE used to give an  error  at

1407

   compile  time  for such patterns. However, because there are

1408

   cases where this  can  be  useful,  such  patterns  are  now

1409

   accepted,  but  if  any repetition of the subpattern does in

1619

   Earlier versions of Perl and PCRE used to give an error at

1620

   compile time for such patterns. However, because there are

1621

   cases where this can be useful, such patterns are now

1622

   accepted, but if any repetition of the subpattern does in

1410

1623

   fact match no characters, the loop is forcibly broken.

1411

1624

  </para>

1412

1625

  <para>

1413

   By default, the quantifiers  are  "greedy",  that  is,  they

1414

   match  as much as possible (up to the maximum number of permitted

1415

   times), without causing the rest of  the  pattern  to

1626

   By default, the quantifiers are "greedy", that is, they

1627

   match as much as possible (up to the maximum number of permitted

1628

   times), without causing the rest of the pattern to

1416

1629

   fail. The classic example of where this gives problems is in

1417

1630

   trying to match comments in C programs. These appear between

1418

   the  sequences /* and */ and within the sequence, individual

1419

   * and / characters may appear. An attempt to  match  C  comments

1631

   the sequences /* and */ and within the sequence, individual

1632

   * and / characters may appear. An attempt to match C comments

1420

1633

   by applying the pattern

1421

1634

1422

1635

   <literal>/\*.*\*/</literal>

1423

1636

1424

1637

   to the string

1425

1638

1426

   <literal>/* first comment */  not comment  /* second comment */</literal>

1639

   <literal>/* first comment */ not comment /* second comment */</literal>

1427

1640

1428

   fails, because it matches  the  entire  string  due  to  the

1429

   greediness of the .*  item.

1641

   fails, because it matches the entire string due to the

1642

   greediness of the .* item.

1430

1643

  </para>

1431

1644

  <para>

1432

   However, if a quantifier is followed  by  a  question  mark,

1645

   However, if a quantifier is followed by a question mark,

1433

1646

   then it becomes lazy, and instead matches the minimum

1434

1647

   number of times possible, so the pattern

1435

1648

1436

1649

   <literal>/\*.*?\*/</literal>

1437

1650

1438

1651

   does the right thing with the C comments. The meaning of the

1439

   various  quantifiers is not otherwise changed, just the preferred

1440

   number of matches.  Do not confuse this use of

1441

   question  mark  with  its  use as a quantifier in its own right.

1652

   various quantifiers is not otherwise changed, just the preferred

1653

   number of matches. Do not confuse this use of

1654

   question mark with its use as a quantifier in its own right.

1442

1655

   Because it has two uses, it can sometimes appear doubled, as

1443

1656

   in

1444

1657

1445

1658

   <literal>\d??\d</literal>

1446

1659

1447

   which matches one digit by preference, but can match two  if

1660

   which matches one digit by preference, but can match two if

1448

1661

   that is the only way the rest of the pattern matches.

1449

1662

  </para>
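  <para>
   The difference between the greedy and lazy forms can be sketched as
   follows. This is a minimal illustration, assuming PHP's
   <function>preg_match</function>; the <literal>~</literal> delimiter and
   the variable names are choices made here only to keep the pattern readable.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// Subject from the text: two C comments separated by ordinary text.
$subject = '/* first comment */ not comment /* second comment */';

// Greedy: .* runs on to the last */ in the subject.
preg_match('~/\*.*\*/~', $subject, $greedy);

// Lazy: .*? stops at the first */ it can reach.
preg_match('~/\*.*?\*/~', $subject, $lazy);

var_dump($greedy[0] === $subject); // bool(true)
echo $lazy[0], "\n";               // prints the first comment only

// Doubled question mark: an optional digit that prefers to be absent.
preg_match('/\d??\d/', '12', $one);
preg_match('/^\d??\d$/', '12', $two);
echo $one[0], "\n"; // 1
echo $two[0], "\n"; // 12
?>
]]>
   </programlisting>
  </informalexample>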

1450

1663

  <para>

1451

1664

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_UNGREEDY</link>

1452

   option is set (an option which  is  not

1453

   available  in  Perl)  then the quantifiers are not greedy by

1665

   option is set (an option which is not

1666

   available in Perl), then the quantifiers are not greedy by

1454

1667

   default, but individual ones can be made greedy by following

1455

   them  with  a  question mark. In other words, it inverts the

1668

   them with a question mark. In other words, it inverts the

1456

1669

   default behaviour.

1457

1670

  </para>
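  <para>
   In PHP this option is selected with the <literal>U</literal> pattern
   modifier. The following sketch, with an arbitrary subject string chosen
   for illustration, shows the default being inverted.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
$subject = 'aXbYb';

preg_match('/a.*b/',   $subject, $default);  // greedy by default
preg_match('/a.*b/U',  $subject, $ungreedy); // U: quantifiers become lazy
preg_match('/a.*?b/U', $subject, $inverted); // U: the ? now means greedy

echo $default[0], "\n";  // aXbYb
echo $ungreedy[0], "\n"; // aXb
echo $inverted[0], "\n"; // aXbYb
?>
]]>
   </programlisting>
  </informalexample>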

1458

1671

  <para>

...

@@ -1464,41 +1677,41 @@

1464

1677

  </para>

1465

1678

  <para>

1466

1679

   When a parenthesized subpattern is quantified with a minimum

1467

   repeat  count  that is greater than 1 or with a limited maximum,

1468

   more store is required for the  compiled  pattern,  in

1680

   repeat count that is greater than 1 or with a limited maximum,

1681

   more memory is required for the compiled pattern, in

1469

1682

   proportion to the size of the minimum or maximum.

1470

1683

  </para>

1471

1684

  <para>

1472

   If a pattern starts with .* or  .{0,}  and  the  <link 

1685

   If a pattern starts with .* or .{0,} and the <link

1473

1686

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

1474

1687

   option (equivalent to Perl's /s) is set, thus allowing the .

1475

   to match newlines, then the pattern is implicitly  anchored,

1688

   to match newlines, then the pattern is implicitly anchored,

1476

1689

   because whatever follows will be tried against every character

1477

   position in the subject string, so there is no point  in

1478

   retrying  the overall match at any position after the first.

1690

   position in the subject string, so there is no point in

1691

   retrying the overall match at any position after the first.

1479

1692

   PCRE treats such a pattern as though it were preceded by \A.

1480

   In  cases where it is known that the subject string contains

1693

   In cases where it is known that the subject string contains

1481

1694

   no newlines, it is worth setting <link

1482

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  when  the  

1695

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> when the

1483

1696

   pattern begins with .* in order to

1484

1697

   obtain this optimization, or

1485

1698

   alternatively using ^ to indicate anchoring explicitly.

1486

1699

  </para>

1487

1700

  <para>

1488

   When a capturing subpattern is repeated, the value  captured

1701

   When a capturing subpattern is repeated, the value captured

1489

1702

   is the substring that matched the final iteration. For example, after

1490

1703

1491

1704

   <literal>(tweedle[dume]{3}\s*)+</literal>

1492

1705

1493

   has matched "tweedledum tweedledee" the value  of  the  captured

1494

   substring  is  "tweedledee".  However,  if  there are

1495

   nested capturing  subpatterns,  the  corresponding  captured

1496

   values  may  have been set in previous iterations. For example,

1706

   has matched "tweedledum tweedledee", the value of the captured

1707

   substring is "tweedledee". However, if there are

1708

   nested capturing subpatterns, the corresponding captured

1709

   values may have been set in previous iterations. For example,

1497

1710

   after

1498

1711

1499

1712

   <literal>/(a|(b))+/</literal>

1500

1713

1501

   matches "aba" the value of the second captured substring  is

1714

   matches "aba", the value of the second captured substring is

1502

1715

   "b".

1503

1716

  </para>
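  <para>
   Both cases can be checked with a short script. This is only a sketch
   using <function>preg_match</function>; the array names are illustrative.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// The group is repeated twice; only the final iteration is kept.
preg_match('/(tweedle[dume]{3}\s*)+/', 'tweedledum tweedledee', $m);
echo $m[1], "\n"; // tweedledee

// The nested group (b) took part in the second iteration only,
// but its value is still set after the third iteration matches "a".
preg_match('/(a|(b))+/', 'aba', $n);
echo $n[1], "\n"; // a
echo $n[2], "\n"; // b
?>
]]>
   </programlisting>
  </informalexample>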

1504

1717

 </section>

...

@@ -1506,78 +1719,78 @@

1506

1719

 <section xml:id="regexp.reference.back-references">

1507

1720

  <title>Back references</title>

1508

1721

  <para>

1509

   Outside a character class, a backslash followed by  a  digit

1510

   greater  than  0  (and  possibly  further  digits) is a back

1511

   reference to a capturing subpattern  earlier  (i.e.  to  its

1512

   left)  in  the  pattern,  provided there have been that many

1722

   Outside a character class, a backslash followed by a digit

1723

   greater than 0 (and possibly further digits) is a back

1724

   reference to a capturing subpattern earlier (i.e. to its

1725

   left) in the pattern, provided there have been that many

1513

1726

   previous capturing left parentheses.

1514

1727

  </para>

1515

1728

  <para>

1516

   However, if the decimal number following  the  backslash  is

1517

   less  than  10,  it is always taken as a back reference, and

1518

   causes an error only if there are not  that  many  capturing

1519

   left  parentheses in the entire pattern. In other words, the

1520

   parentheses that are referenced need not be to the  left  of

1521

   the  reference  for  numbers  less  than 10. 

1729

   However, if the decimal number following the backslash is

1730

   less than 10, it is always taken as a back reference, and

1731

   causes an error only if there are not that many capturing

1732

   left parentheses in the entire pattern. In other words, the

1733

   parentheses that are referenced need not be to the left of

1734

   the reference for numbers less than 10.

1522

1735

   A "forward back reference" can make sense when a repetition

1523

1736

   is involved and the subpattern to the right has participated

1524

1737

   in an earlier iteration. See the section on

1525

   entitled "Backslash" above for further details of  the  handling

1738

   <link linkend="regexp.reference.escape">escape sequences</link> for further details of the handling

1526

1739

   of digits following a backslash.

1527

1740

  </para>

1528

1741

  <para>

1529

   A back reference matches whatever actually matched the  capturing

1742

   A back reference matches whatever actually matched the capturing

1530

1743

   subpattern in the current subject string, rather than

1531

1744

   anything matching the subpattern itself. So the pattern

1532

1745

1533

1746

   <literal>(sens|respons)e and \1ibility</literal>

1534

1747

1535

   matches "sense and sensibility" and "response and  responsibility",

1536

   but  not  "sense  and  responsibility". If case-sensitive (caseful)

1748

   matches "sense and sensibility" and "response and responsibility",

1749

   but not "sense and responsibility". If case-sensitive (caseful)

1537

1750

   matching is in force at the time of the back reference, then

1538

1751

   the case of letters is relevant. For example,

1539

1752

1540

1753

   <literal>((?i)rah)\s+\1</literal>

1541

1754

1542

   matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even

1543

   though  the  original  capturing subpattern is matched

1755

   matches "rah rah" and "RAH RAH", but not "RAH rah", even

1756

   though the original capturing subpattern is matched

1544

1757

   case-insensitively (caselessly).

1545

1758

  </para>
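  <para>
   The two patterns above can be exercised as follows. This is a minimal
   sketch that assumes <function>preg_match</function>, which returns
   <literal>1</literal> on a match and <literal>0</literal> otherwise.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// \1 must repeat the text that group 1 actually matched.
$pattern = '/(sens|respons)e and \1ibility/';
var_dump(preg_match($pattern, 'sense and sensibility'));       // int(1)
var_dump(preg_match($pattern, 'response and responsibility')); // int(1)
var_dump(preg_match($pattern, 'sense and responsibility'));    // int(0)

// (?i) applies inside the group only; \1 is compared case-sensitively.
$rah = '/((?i)rah)\s+\1/';
var_dump(preg_match($rah, 'rah rah')); // int(1)
var_dump(preg_match($rah, 'RAH RAH')); // int(1)
var_dump(preg_match($rah, 'RAH rah')); // int(0)
?>
]]>
   </programlisting>
  </informalexample>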

1546

1759

  <para>

1547

   There may be more than one back reference to the  same  subpattern.

1548

   If  a  subpattern  has not actually been used in a

1549

   particular match, then any  back  references  to  it  always

1760

   There may be more than one back reference to the same subpattern.

1761

   If a subpattern has not actually been used in a

1762

   particular match, then any back references to it always

1550

1763

   fail. For example, the pattern

1551

1764

1552

1765

   <literal>(a|(bc))\2</literal>

1553

1766

1554

   always fails if it starts to match  "a"  rather  than  "bc".

1555

   Because  there  may  be up to 99 back references, all digits

1556

   following the backslash are taken as  part  of  a  potential

1767

   always fails if it starts to match "a" rather than "bc".

1768

   Because there may be up to 99 back references, all digits

1769

   following the backslash are taken as part of a potential

1557

1770

   back reference number. If the pattern continues with a digit

1558

1771

   character, then some delimiter must be used to terminate the

1559

1772

   back reference. If the <link

1560

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>  option 

1561

   is set, this can be whitespace.  Otherwise an empty comment can be used.

1773

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link> option

1774

   is set, this can be whitespace. Otherwise, an empty comment
   (<literal>(?#)</literal>) can be used.

1562

1775

  </para>

1563

1776

  <para>

1564

1777

   A back reference that occurs inside the parentheses to which

1565

   it  refers  fails when the subpattern is first used, so, for

1566

   example, (a\1) never matches.  However, such references  can

1778

   it refers fails when the subpattern is first used, so, for

1779

   example, (a\1) never matches. However, such references can

1567

1780

   be useful inside repeated subpatterns. For example, the pattern

1568

1781

1569

1782

   <literal>(a|b\1)+</literal>

1570

1783

1571

   matches any number of "a"s and also "aba", "ababba" etc.  At

1784

   matches any number of "a"s and also "aba", "ababba" etc. At

1572

1785

   each iteration of the subpattern, the back reference matches

1573

   the character string corresponding to  the  previous  iteration.

1786

   the character string corresponding to the previous iteration.

1574

1787

   In order for this to work, the pattern must be such

1575

   that the first iteration does not need  to  match  the  back

1576

   reference.  This  can  be  done using alternation, as in the

1788

   that the first iteration does not need to match the back

1789

   reference. This can be done using alternation, as in the

1577

1790

   example above, or by a quantifier with a minimum of zero.

1578

1791

  </para>
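  <para>
   A short sketch of this behaviour follows. The <literal>^</literal> and
   <literal>$</literal> anchors are added here, as an assumption for the
   illustration, so that the whole subject has to be consumed.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
$pattern = '/^(a|b\1)+$/';

var_dump(preg_match($pattern, 'aaa'));    // int(1)
var_dump(preg_match($pattern, 'aba'));    // int(1): "a", then "b" plus the previous "a"
var_dump(preg_match($pattern, 'ababba')); // int(1): "a", "ba", "bba"
var_dump(preg_match($pattern, 'ba'));     // int(0): \1 is unset on the first iteration
?>
]]>
   </programlisting>
  </informalexample>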

1579

1792

  <para>

1580

   As of PHP 5.2.2, the <literal>\g</literal> escape sequence can be 

1793

   The <literal>\g</literal> escape sequence can be

1581

1794

   used for absolute and relative referencing of subpatterns.

1582

1795

   This escape sequence must be followed by an unsigned number or a negative

1583

1796

   number, optionally enclosed in braces. The sequences <literal>\1</literal>,

...

@@ -1598,28 +1811,28 @@

1598

1811

  </para>

1599

1812

  <para>

1600

1813

   Back references to the named subpatterns can be achieved by

1601

   <literal>(?P=name)</literal> or, since PHP 5.2.2, also by

1602

   <literal>\k&lt;name&gt;</literal> or <literal>\k'name'</literal>. 

1603

   Additionally PHP 5.2.4 added support for <literal>\k{name}</literal> 

1604

   and <literal>\g{name}</literal>.

1814

   <literal>(?P=name)</literal>,

1815

   <literal>\k&lt;name&gt;</literal>, <literal>\k'name'</literal>,

1816

   <literal>\k{name}</literal>, <literal>\g{name}</literal>,

1817

   <literal>\g&lt;name&gt;</literal> or <literal>\g'name'</literal>.

1605

1818

  </para>
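  <para>
   As an illustration, the sketch below tries several of these forms against
   the same subject. The group name <literal>word</literal> and the subject
   strings are hypothetical; only a subset of the listed syntaxes is shown.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// Each entry is a different way of referring back to the group "word".
$references = ['(?P=word)', '\k<word>', '\k{word}', '\g{word}'];

foreach ($references as $ref) {
    $pattern = '/^(?P<word>\w+)\s+' . $ref . '$/';
    // A repeated word matches; two different words do not.
    var_dump(preg_match($pattern, 'hello hello')); // int(1)
    var_dump(preg_match($pattern, 'hello world')); // int(0)
}
?>
]]>
   </programlisting>
  </informalexample>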

1606

1819

 </section>

1607

1820

1608

1821

 <section xml:id="regexp.reference.assertions">

1609

1822

  <title>Assertions</title>

1610

1823

  <para>

1611

   An assertion is  a  test  on  the  characters  following  or

1612

   preceding  the current matching point that does not actually

1613

   consume any characters. The simple assertions coded  as  \b,

1614

   \B,  \A,  \Z,  \z, ^ and $ are described above. More complicated

1615

   assertions are coded as  subpatterns.  There  are  two

1616

   kinds:  those that <emphasis>look ahead</emphasis> of the current position in the

1824

   An assertion is a test on the characters following or

1825

   preceding the current matching point that does not actually

1826

   consume any characters. The simple assertions coded as \b,

1827

   \B, \A, \Z, \z, ^ and $ are described in <link linkend="regexp.reference.escape">escape sequences</link>. More complicated

1828

   assertions are coded as subpatterns. There are two

1829

   kinds: those that <emphasis>look ahead</emphasis> of the current position in the

1617

1830

   subject string, and those that <emphasis>look behind</emphasis> it.

1618

1831

  </para>

1619

1832

  <para>

1620

1833

   An assertion subpattern is matched in the normal way, except

1621

   that  it  does not cause the current matching position to be

1622

   changed. <emphasis>Lookahead</emphasis> assertions start with  (?=  for  positive

1834

   that it does not cause the current matching position to be

1835

   changed. <emphasis>Lookahead</emphasis> assertions start with (?= for positive

1623

1836

   assertions and (?! for negative assertions. For example,

1624

1837

1625

1838

   <literal>\w+(?=;)</literal>

...

@@ -1629,27 +1842,27 @@

1629

1842

1630

1843

   <literal>foo(?!bar)</literal>

1631

1844

1632

   matches any occurrence of "foo"  that  is  not  followed  by

1845

   matches any occurrence of "foo" that is not followed by

1633

1846

   "bar". Note that the apparently similar pattern

1634

1847

1635

1848

   <literal>(?!foo)bar</literal>

1636

1849

1637

   does not find an occurrence of "bar"  that  is  preceded  by

1850

   does not find an occurrence of "bar" that is preceded by

1638

1851

   something other than "foo"; it finds any occurrence of "bar"

1639

   whatsoever, because the assertion  (?!foo)  is  always  &true;

1640

   when  the  next  three  characters  are  "bar". A lookbehind

1852

   whatsoever, because the assertion (?!foo) is always &true;

1853

   when the next three characters are "bar". A lookbehind

1641

1854

   assertion is needed to achieve this effect.

1642

1855

  </para>
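  <para>
   The lookahead examples can be tried out as sketched below, assuming
   <function>preg_match</function>; the subject strings are chosen for
   illustration only.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// Positive lookahead: the semicolon is required but not consumed.
preg_match('/\w+(?=;)/', 'foo;', $m);
echo $m[0], "\n"; // foo

// Negative lookahead: "foo" not followed by "bar".
var_dump(preg_match('/foo(?!bar)/', 'foobar')); // int(0)
var_dump(preg_match('/foo(?!bar)/', 'foobaz')); // int(1)

// (?!foo)bar still finds "bar" after "foo": at the point where "bar"
// starts, the next three characters are "bar", so they cannot be "foo".
var_dump(preg_match('/(?!foo)bar/', 'foobar')); // int(1)
?>
]]>
   </programlisting>
  </informalexample>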

1643

1856

  <para>

1644

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;=  for  positive  assertions

1857

   <emphasis>Lookbehind</emphasis> assertions start with (?&lt;= for positive assertions

1645

1858

   and (?&lt;! for negative assertions. For example,

1646

1859

1647

1860

   <literal>(?&lt;!foo)bar</literal>

1648

1861

1649

   does find an occurrence of "bar" that  is  not  preceded  by

1862

   does find an occurrence of "bar" that is not preceded by

1650

1863

   "foo". The contents of a lookbehind assertion are restricted

1651

   such that all the strings  it  matches  must  have  a  fixed

1652

   length.  However, if there are several alternatives, they do

1864

   such that all the strings it matches must have a fixed

1865

   length. However, if there are several alternatives, they do

1653

1866

   not all have to have the same fixed length. Thus

1654

1867

1655

1868

   <literal>(?&lt;=bullock|donkey)</literal>

...

@@ -1658,51 +1871,51 @@

1658

1871

1659

1872

   <literal>(?&lt;!dogs?|cats?)</literal>

1660

1873

1661

   causes an error at compile time. Branches  that  match  different

1874

   causes an error at compile time. Branches that match different

1662

1875

   length strings are permitted only at the top level of

1663

   a lookbehind assertion. This is an extension  compared  with

1664

   Perl  5.005,  which  requires all branches to match the same

1876

   a lookbehind assertion. This is an extension compared with

1877

   Perl 5.005, which requires all branches to match the same

1665

1878

   length of string. An assertion such as

1666

1879

1667

1880

   <literal>(?&lt;=ab(c|de))</literal>

1668

1881

1669

   is not permitted, because its single  top-level  branch  can

1882

   is not permitted, because its single top-level branch can

1670

1883

   match two different lengths, but it is acceptable if rewritten

1671

1884

   to use two top-level branches:

1672

1885

1673

1886

   <literal>(?&lt;=abc|abde)</literal>

1674

1887

1675

   The implementation of lookbehind  assertions  is,  for  each

1676

   alternative,  to  temporarily move the current position back

1677

   by the fixed width and then  try  to  match.  If  there  are

1678

   insufficient  characters  before  the  current position, the

1679

   match is deemed to fail.  Lookbehinds  in  conjunction  with

1680

   once-only  subpatterns can be particularly useful for matching

1681

   at the ends of strings; an example is given at  the  end

1888

   The implementation of lookbehind assertions is, for each

1889

   alternative, to temporarily move the current position back

1890

   by the fixed width and then try to match. If there are

1891

   insufficient characters before the current position, the

1892

   match is deemed to fail. Lookbehinds in conjunction with

1893

   once-only subpatterns can be particularly useful for matching

1894

   at the ends of strings; an example is given at the end

1682

1895

   of the section on once-only subpatterns.

1683

1896

  </para>
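  <para>
   A brief sketch of lookbehind assertions follows. The word
   <literal>cart</literal> appended to the second pattern and the subject
   strings are assumptions made for the example.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// Negative lookbehind: "bar" not preceded by "foo".
var_dump(preg_match('/(?<!foo)bar/', 'foobar')); // int(0)
var_dump(preg_match('/(?<!foo)bar/', 'bazbar')); // int(1)

// Top-level alternatives may have different fixed lengths.
$pattern = '/(?<=bullock|donkey) cart/';
var_dump(preg_match($pattern, 'bullock cart')); // int(1)
var_dump(preg_match($pattern, 'donkey cart'));  // int(1)
var_dump(preg_match($pattern, 'horse cart'));   // int(0)
?>
]]>
   </programlisting>
  </informalexample>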

1684

1897

  <para>

1685

   Several assertions (of any sort) may  occur  in  succession.

1898

   Several assertions (of any sort) may occur in succession.

1686

1899

   For example,

1687

1900

1688

1901

   <literal>(?&lt;=\d{3})(?&lt;!999)foo</literal>

1689

1902

1690

   matches "foo" preceded by three digits that are  not  "999".

1691

   Notice  that each of the assertions is applied independently

1692

   at the same point in the subject string. First  there  is  a

1693

   check  that  the  previous  three characters are all digits,

1903

   matches "foo" preceded by three digits that are not "999".

1904

   Notice that each of the assertions is applied independently

1905

   at the same point in the subject string. First there is a

1906

   check that the previous three characters are all digits,

1694

1907

   then there is a check that the same three characters are not

1695

   "999".   This  pattern  does not match "foo" preceded by six

1908

   "999". This pattern does not match "foo" preceded by six

1696

1909

   characters, the first three of which are digits and the last three

1697

   of  which  are  not  "999".  For  example,  it doesn't match

1910

   of which are not "999". For example, it doesn't match

1698

1911

   "123abcfoo". A pattern to do that is

1699

1912

1700

1913

   <literal>(?&lt;=\d{3}...)(?&lt;!999)foo</literal>

1701

1914

  </para>

1702

1915

  <para>

1703

   This time the first assertion looks  at  the  preceding  six

1704

   characters,  checking  that  the first three are digits, and

1705

   then the second assertion checks that  the  preceding  three

1916

   This time the first assertion looks at the preceding six

1917

   characters, checking that the first three are digits, and

1918

   then the second assertion checks that the preceding three

1706

1919

   characters are not "999".

1707

1920

  </para>
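  <para>
   The two patterns can be compared side by side. This is a minimal sketch
   with illustrative subject strings.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// Both assertions look at the same three characters before "foo".
$three = '/(?<=\d{3})(?<!999)foo/';
var_dump(preg_match($three, '123foo'));    // int(1)
var_dump(preg_match($three, '999foo'));    // int(0)
var_dump(preg_match($three, '123abcfoo')); // int(0)

// The first assertion now looks back six characters, the second three.
$six = '/(?<=\d{3}...)(?<!999)foo/';
var_dump(preg_match($six, '123abcfoo')); // int(1)
var_dump(preg_match($six, '123999foo')); // int(0)
?>
]]>
   </programlisting>
  </informalexample>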

1708

1921

  <para>

...

@@ -1710,26 +1923,26 @@

1710

1923

1711

1924

   <literal>(?&lt;=(?&lt;!foo)bar)baz</literal>

1712

1925

1713

   matches an occurrence of "baz" that  is  preceded  by  "bar"

1926

   matches an occurrence of "baz" that is preceded by "bar"

1714

1927

   which in turn is not preceded by "foo", while

1715

1928

1716

1929

   <literal>(?&lt;=\d{3}...(?&lt;!999))foo</literal>

1717

1930

1718

   is another pattern which matches  "foo"  preceded  by  three

1931

   is another pattern which matches "foo" preceded by three

1719

1932

   digits and any three characters that are not "999".

1720

1933

  </para>
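  <para>
   Nested assertions behave as sketched here; the subjects are hypothetical
   and chosen so that each inner assertion decides the outcome.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// "baz" preceded by "bar", which itself must not be preceded by "foo".
$nested = '/(?<=(?<!foo)bar)baz/';
var_dump(preg_match($nested, 'barbaz'));    // int(1)
var_dump(preg_match($nested, 'foobarbaz')); // int(0)

// The inner (?<!999) is tested at the end of the six characters.
$digits = '/(?<=\d{3}...(?<!999))foo/';
var_dump(preg_match($digits, '123abcfoo')); // int(1)
var_dump(preg_match($digits, '123999foo')); // int(0)
?>
]]>
   </programlisting>
  </informalexample>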

1721

1934

  <para>

1722

1935

   Assertion subpatterns are not capturing subpatterns, and may

1723

   not  be  repeated,  because  it makes no sense to assert the

1724

   same thing several times. If any kind of assertion  contains

1725

   capturing  subpatterns  within it, these are counted for the

1936

   not be repeated, because it makes no sense to assert the

1937

   same thing several times. If any kind of assertion contains

1938

   capturing subpatterns within it, these are counted for the

1726

1939

   purposes of numbering the capturing subpatterns in the whole

1727

   pattern.   However,  substring capturing is carried out only

1728

   for positive assertions, because it does not make sense  for

1940

   pattern. However, substring capturing is carried out only

1941

   for positive assertions, because it does not make sense for

1729

1942

   negative assertions.

1730

1943

  </para>

1731

1944

  <para>

1732

   Assertions count towards the maximum  of  200  parenthesized

1945

   Assertions count towards the maximum of 200 parenthesized

1733

1946

   subpatterns.

1734

1947

  </para>

1735

1948

 </section>

...

@@ -1737,17 +1950,17 @@

1737

1950

 <section xml:id="regexp.reference.onlyonce">

1738

1951

  <title>Once-only subpatterns</title>

1739

1952

  <para>

1740

   With both maximizing and minimizing repetition,  failure  of

1741

   what  follows  normally  causes  the repeated item to be

1953

   With both maximizing and minimizing repetition, failure of

1954

   what follows normally causes the repeated item to be

1742

1955

   re-evaluated to see if a different number of repeats allows the

1743

   rest  of  the  pattern  to  match. Sometimes it is useful to

1744

   prevent this, either to change the nature of the  match,  or

1745

   to  cause  it fail earlier than it otherwise might, when the

1746

   author of the pattern knows there is no  point  in  carrying

1956

   rest of the pattern to match. Sometimes it is useful to

1957

   prevent this, either to change the nature of the match, or

1958

   to cause it to fail earlier than it otherwise might, when the

1959

   author of the pattern knows there is no point in carrying

1747

1960

   on.

1748

1961

  </para>

1749

1962

  <para>

1750

   Consider, for example, the pattern \d+foo  when  applied  to

1963

   Consider, for example, the pattern \d+foo when applied to

1751

1964

   the subject line

1752

1965

1753

1966

   <literal>123456bar</literal>

...

@@ -1755,108 +1968,108 @@

1755

1968

  <para>

1756

1969

   After matching all 6 digits and then failing to match "foo",

1757

1970

   the normal action of the matcher is to try again with only 5

1758

   digits matching the \d+ item, and then with 4,  and  so  on,

1971

   digits matching the \d+ item, and then with 4, and so on,

1759

1972

   before ultimately failing. Once-only subpatterns provide the

1760

   means for specifying that once a portion of the pattern  has

1761

   matched,  it  is  not to be re-evaluated in this way, so the

1762

   matcher would give up immediately on failing to match  "foo"

1763

   the  first  time.  The  notation  is another kind of special

1973

   means for specifying that once a portion of the pattern has

1974

   matched, it is not to be re-evaluated in this way, so the

1975

   matcher would give up immediately on failing to match "foo"

1976

   the first time. The notation is another kind of special

1764

1977

   parenthesis, starting with (?&gt; as in this example:

1765

1978

1766

1979

   <literal>(?&gt;\d+)foo</literal>

1767

1980

  </para>

1768

1981

  <para>

1769

   This kind of parenthesis "locks up" the  part of the pattern

1770

   it  contains once it has matched, and a failure further into

1771

   the pattern is prevented from backtracking  into  it.

1772

   Backtracking  past  it to previous items, however, works as normal.

1982

   This kind of parenthesis "locks up" the part of the pattern

1983

   it contains once it has matched, and a failure further into

1984

   the pattern is prevented from backtracking into it.

1985

   Backtracking past it to previous items, however, works as normal.

1773

1986

  </para>

1774

1987

  <para>

1775

1988

   An alternative description is that a subpattern of this type

1776

   matches  the  string  of  characters that an identical standalone

1989

   matches the string of characters that an identical standalone

1777

1990

   pattern would match, if anchored at the current point

1778

1991

   in the subject string.

1779

1992

  </para>

1780

1993

  <para>

1781

   Once-only subpatterns are not capturing subpatterns.  Simple

1782

   cases  such as the above example can be thought of as a maximizing

1783

   repeat that must  swallow  everything  it  can.  So,

1994

   Once-only subpatterns are not capturing subpatterns. Simple

1995

   cases such as the above example can be thought of as a maximizing

1996

   repeat that must swallow everything it can. So,

1784

1997

   while both \d+ and \d+? are prepared to adjust the number of

1785

   digits they match in order to make the rest of  the  pattern

1998

   digits they match in order to make the rest of the pattern

1786

1999

   match, (?&gt;\d+) can only match an entire sequence of digits.

1787

2000

  </para>
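  <para>
   The effect is easiest to see when the item after the repeat could itself
   be matched by the repeat. The following sketch uses a hypothetical
   subject of six digits.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// Ordinary repeat: \d+ gives back the final digit so that "6" can match.
var_dump(preg_match('/\d+6/', '123456'));     // int(1)

// Once-only: (?>\d+) keeps every digit it matched, so "6" never matches.
var_dump(preg_match('/(?>\d+)6/', '123456')); // int(0)
?>
]]>
   </programlisting>
  </informalexample>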

1788

2001

  <para>

1789

   This construction can of course contain arbitrarily  complicated

2002

   This construction can of course contain arbitrarily complicated

1790

2003

   subpatterns, and it can be nested.

1791

2004

  </para>

1792

2005

  <para>

1793

2006

   Once-only subpatterns can be used in conjunction with

1794

   lookbehind assertions  to specify efficient matching at the end

2007

   lookbehind assertions to specify efficient matching at the end

1795

2008

   of the subject string. Consider a simple pattern such as

1796

2009

1797

2010

   <literal>abcd$</literal>

1798

2011

1799

   when applied to a long string which does not match.  Because

1800

   matching  proceeds  from  left  to right, PCRE will look for

2012

   when applied to a long string which does not match. Because

2013

   matching proceeds from left to right, PCRE will look for

1801

2014

   each "a" in the subject and then see if what follows matches

1802

2015

   the rest of the pattern. If the pattern is specified as

1803

2016

1804

2017

   <literal>^.*abcd$</literal>

1805

2018

1806

   then the initial .* matches the entire string at first,  but

1807

   when  this  fails  (because  there  is no following "a"), it

2019

   then the initial .* matches the entire string at first, but

2020

   when this fails (because there is no following "a"), it

1808

2021

   backtracks to match all but the last character, then all but

1809

   the  last  two  characters, and so on. Once again the search

1810

   for "a" covers the entire string, from right to left, so  we

2022

   the last two characters, and so on. Once again the search

2023

   for "a" covers the entire string, from right to left, so we

1811

2024

   are no better off. However, if the pattern is written as

1812

2025

1813

2026

   <literal>^(?>.*)(?&lt;=abcd)</literal>

1814

2027

1815

   then there can be no backtracking for the .*  item;  it  can

1816

   match  only  the  entire  string.  The subsequent lookbehind

2028

   then there can be no backtracking for the .* item; it can

2029

   match only the entire string. The subsequent lookbehind

1817

2030

   assertion does a single test on the last four characters. If

1818

   it  fails,  the  match  fails immediately. For long strings,

2031

   it fails, the match fails immediately. For long strings,

1819

2032

   this approach makes a significant difference to the processing time.

1820

2033

  </para>
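  <para>
   A sketch of this technique is shown below. The subject strings are built
   with <function>str_repeat</function> purely for illustration and contain
   no newlines; no timings are claimed, only that both patterns agree on
   whether the subject ends in "abcd".
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
$atomic = '/^(?>.*)(?<=abcd)/';
$plain  = '/abcd$/';

$hit  = str_repeat('x', 1000) . 'abcd';
$miss = str_repeat('x', 1000) . 'abce';

var_dump(preg_match($atomic, $hit));  // int(1)
var_dump(preg_match($atomic, $miss)); // int(0)

// The simple pattern gives the same answers.
var_dump(preg_match($plain, $hit));   // int(1)
var_dump(preg_match($plain, $miss));  // int(0)
?>
]]>
   </programlisting>
  </informalexample>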

1821

2034

  <para>

1822

2035

   When a pattern contains an unlimited repeat inside a subpattern

1823

2036

   that can itself be repeated an unlimited number of

1824

   times, the use of a once-only subpattern is the only way  to

1825

   avoid  some  failing matches taking a very long time indeed.

2037

   times, the use of a once-only subpattern is the only way to

2038

   avoid some failing matches taking a very long time indeed.

1826

2039

   The pattern

1827

2040

1828

2041

   <literal>(\D+|&lt;\d+>)*[!?]</literal>

1829

2042

1830

   matches an unlimited number of substrings that  either  consist

1831

   of  non-digits,  or digits enclosed in &lt;>, followed by

2043

   matches an unlimited number of substrings that either consist

2044

   of non-digits, or digits enclosed in &lt;>, followed by

1832

2045

   either ! or ?. When it matches, it runs quickly. However, if

1833

2046

   it is applied to

1834

2047

1835

2048

   <literal>aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa</literal>

1836

2049

1837

   it takes a long  time  before  reporting  failure.  This  is

2050

   it takes a long time before reporting failure. This is

1838

2051

   because the string can be divided between the two repeats in

1839

2052

   a large number of ways, and all have to be tried. (The example

1840

   used  [!?]  rather  than a single character at the end,

1841

   because both PCRE and Perl have an optimization that  allows

1842

   for  fast  failure  when  a  single  character is used. They

1843

   remember the last single character that is  required  for  a

1844

   match,  and  fail early if it is not present in the string.)

2053

   uses [!?] rather than a single character at the end,

2054

   because both PCRE and Perl have an optimization that allows

2055

   for fast failure when a single character is used. They

2056

   remember the last single character that is required for a

2057

   match, and fail early if it is not present in the string.)

1845

2058

   If the pattern is changed to

1846

2059

1847

2060

   <literal>((?>\D+)|&lt;\d+>)*[!?]</literal>

1848

2061

1849

   sequences of non-digits cannot be broken, and  failure  happens quickly.

2062

   sequences of non-digits cannot be broken, and failure happens quickly.

1850

2063

  </para>
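  <para>
   The once-only form can be tried on the failing subject as sketched below.
   The unprotected pattern is deliberately not run here, because its running
   time on this subject depends on the PCRE version and on the configured
   limits.
  </para>
  <informalexample>
   <programlisting role="php">
<![CDATA[
<?php
// 52 letters with no "!" or "?" anywhere, as in the text.
$subject = str_repeat('a', 52);

// Each run of non-digits is matched once and never split up again,
// so the matcher has very few combinations to try before giving up.
$atomic = '/((?>\D+)|<\d+>)*[!?]/';
var_dump(preg_match($atomic, $subject)); // int(0)
?>
]]>
   </programlisting>
  </informalexample>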

1851

2064

 </section>

1852

2065

1853

2066

 <section xml:id="regexp.reference.conditional">

1854

2067

  <title>Conditional subpatterns</title>

1855

2068

  <para>

1856

   It is possible to cause the matching process to obey a  subpattern 

1857

   conditionally  or to choose between two alternative

1858

   subpatterns, depending on the result  of  an  assertion,  or

1859

   whether  a previous capturing subpattern matched or not. The

2069

   It is possible to cause the matching process to obey a subpattern

2070

   conditionally or to choose between two alternative

2071

   subpatterns, depending on the result of an assertion, or

2072

   whether a previous capturing subpattern matched or not. The

1860

2073

   two possible forms of conditional subpattern are

1861

2074

  </para>

1862

2075

...

@@ -1870,34 +2083,39 @@

1870

2083

  </informalexample>

1871

2084

  <para>

1872

2085

   If the condition is satisfied, the yes-pattern is used; otherwise

1873

   the  no-pattern  (if  present) is used. If there are

2086

   the no-pattern (if present) is used. If there are

1874

2087

   more than two alternatives in the subpattern, a compile-time

1875

2088

   error occurs.

1876

2089

  </para>

1877

2090

  <para>

1878

   There are two kinds of condition. If the  text  between  the

1879

   parentheses  consists  of  a  sequence  of  digits, then the

1880

   condition is satisfied if the capturing subpattern  of  that

1881

   number  has  previously matched. Consider the following pattern,

1882

   which contains non-significant white space to make  it

1883

   more  readable  (assume  the  <link 

2091

   There are two kinds of condition. If the text between the

2092

   parentheses consists of a sequence of digits, then the

2093

   condition is satisfied if the capturing subpattern of that

2094

   number has previously matched. Consider the following pattern,

2095

   which contains non-significant white space to make it

2096

   more readable (assume the <link

1884

2097

   linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1885

   option)  and to divide it into three parts for ease of discussion:

1886

1887

   <literal>( \( )?    [^()]+    (?(1) \) )</literal>

1888

  </para>

1889

  <para>

1890

   The first part matches an optional opening parenthesis,  and

1891

   if  that character is present, sets it as the first captured

1892

   substring. The second part matches one  or  more  characters

1893

   that  are  not  parentheses. The third part is a conditional

1894

   subpattern that tests whether the first set  of  parentheses

1895

   matched  or  not.  If  they did, that is, if subject started

1896

   with an opening parenthesis, the condition is &true;,  and  so

1897

   the  yes-pattern  is  executed  and a closing parenthesis is

1898

   required. Otherwise, since no-pattern is  not  present,  the

1899

   subpattern  matches  nothing.  In  other words, this pattern

1900

   matches a sequence of non-parentheses,  optionally  enclosed

2098

   option) and to divide it into three parts for ease of discussion:

2099

  </para>

2100

  <informalexample>

2101

   <programlisting>

2102

<![CDATA[

2103

( \( )? [^()]+ (?(1) \) )

2104

]]>

2105

   </programlisting>

2106

  </informalexample>

2107

  <para>

2108

   The first part matches an optional opening parenthesis, and

2109

   if that character is present, sets it as the first captured

2110

   substring. The second part matches one or more characters

2111

   that are not parentheses. The third part is a conditional

2112

   subpattern that tests whether the first set of parentheses

2113

   matched or not. If they did, that is, if the subject started

2114

   with an opening parenthesis, the condition is &true;, and so

2115

   the yes-pattern is executed and a closing parenthesis is

2116

   required. Otherwise, since no-pattern is not present, the

2117

   subpattern matches nothing. In other words, this pattern

2118

   matches a sequence of non-parentheses, optionally enclosed

1901

2119

   in parentheses.

1902

2120

  </para>
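  <para>
   The behaviour described above can be checked with a short script. This
   is a minimal sketch: the <literal>^</literal> and <literal>$</literal>
   anchors and the variable name <literal>$pattern</literal> are additions
   for illustration and are not part of the pattern discussed in the text.
  </para>
  <para>
   <example>
    <title>Conditional subpattern keyed on a capturing group (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

/* The x modifier enables PCRE_EXTENDED, so the spaces are ignored.
   The anchors are an assumption added to force a whole-string match. */
$pattern = '/^ ( \( )? [^()]+ (?(1) \) ) $/x';

var_dump(preg_match($pattern, '(abc)')); // int(1): group 1 matched, so ")" is required
var_dump(preg_match($pattern, 'abc'));   // int(1): group 1 unset, the conditional matches nothing
var_dump(preg_match($pattern, '(abc'));  // int(0): opening parenthesis without a closing one
var_dump(preg_match($pattern, 'abc)'));  // int(0): closing parenthesis is not allowed here
?>
]]>
    </programlisting>
   </example>
  </para>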

1903

2121

  <para>

...

@@ -1906,10 +2124,10 @@

1906

2124

   level", the condition is false.

1907

2125

  </para>

1908

2126

  <para>

1909

   If the condition is not a sequence of digits or (R), it must be  an

1910

   assertion.  This  may be a positive or negative lookahead or

1911

   lookbehind assertion. Consider this pattern, again  containing

1912

   non-significant  white space, and with the two alternatives on

2127

   If the condition is not a sequence of digits or (R), it must be an

2128

   assertion. This may be a positive or negative lookahead or

2129

   lookbehind assertion. Consider this pattern, again containing

2130

   non-significant white space, and with the two alternatives on

1913

2131

   the second line:

1914

2132

  </para>

1915

2133

...

@@ -1917,18 +2135,18 @@

1917

2135

   <programlisting>

1918

2136

<![CDATA[

1919

2137

(?(?=[^a-z]*[a-z])

1920

\d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

2138

\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )

1921

2139

]]>

1922

2140

   </programlisting>

1923

2141

  </informalexample>

1924

2142

  <para>

1925

2143

   The condition is a positive lookahead assertion that matches

1926

2144

   an optional sequence of non-letters followed by a letter. In

1927

   other words, it tests for  the  presence  of  at  least  one

1928

   letter  in the subject. If a letter is found, the subject is

1929

   matched against  the  first  alternative;  otherwise  it  is

1930

   matched  against the second. This pattern matches strings in

1931

   one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are

2145

   other words, it tests for the presence of at least one

2146

   letter in the subject. If a letter is found, the subject is

2147

   matched against the first alternative; otherwise it is

2148

   matched against the second. This pattern matches strings in

2149

   one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are

1932

2150

   letters and dd are digits.

1933

2151

  </para>
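  <para>
   As a rough illustration of the assertion form, the sketch below applies
   the pattern to a few subjects. The anchors and the sample subjects are
   assumptions added here so that only whole strings are accepted.
  </para>
  <para>
   <example>
    <title>Conditional subpattern keyed on a lookahead (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

/* PCRE_EXTENDED (x) lets the pattern span two lines, as in the text.
   The ^ and $ anchors are an assumption added for this example. */
$pattern = '/^(?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )$/x';

var_dump(preg_match($pattern, '12-abc-34')); // int(1): a letter is present, first alternative
var_dump(preg_match($pattern, '12-34-56'));  // int(1): no letter, second alternative
var_dump(preg_match($pattern, '12-ab-34'));  // int(0): a letter is present, so only dd-aaa-dd is tried
var_dump(preg_match($pattern, '12-345-67')); // int(0): no letter, and dd-dd-dd does not fit
?>
]]>
    </programlisting>
   </example>
  </para>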

1934

2152

 </section>

...

@@ -1936,31 +2154,66 @@

1936

2154

 <section xml:id="regexp.reference.comments">

1937

2155

  <title>Comments</title>

1938

2156

  <para>

1939

   The  sequence  (?#  marks  the  start  of  a  comment  which

1940

   continues   up  to  the  next  closing  parenthesis.  Nested

2157

   The sequence (?# marks the start of a comment which

2158

   continues up to the next closing parenthesis. Nested

1941

2159

   parentheses are not permitted. The characters that make up a

1942

2160

   comment play no part in the pattern matching at all.

1943

2161

  </para>

1944

2162

  <para>

1945

2163

   If the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1946

   option is set, an unescaped # character outside  a character class 

2164

   option is set, an unescaped # character outside a character class

1947

2165

   introduces a comment that continues up to the next newline character

1948

2166

   in the pattern.

1949

2167

  </para>

2168

  <para>

2169

   <example>

2170

    <title>Usage of comments in PCRE pattern</title>

2171

    <programlisting role="php">

2172

<![CDATA[

2173

<?php

2174

2175

$subject = 'test';

2176

2177

/* (?# can be used to add comments without enabling PCRE_EXTENDED */

2178

$match = preg_match('/te(?# this is a comment)st/', $subject);

2179

var_dump($match);

2180

2181

/* Whitespace and # are treated as part of the pattern unless PCRE_EXTENDED is enabled */

2182

$match = preg_match('/te   #~~~~

2183

st/', $subject);

2184

var_dump($match);

2185

2186

/* When PCRE_EXTENDED is enabled, all whitespace data characters and anything

2187

   that follows an unescaped # on the same line is ignored */

2188

$match = preg_match('/te    #~~~~

2189

st/x', $subject);

2190

var_dump($match);

2191

]]>

2192

    </programlisting>

2193

    &example.outputs;

2194

    <screen>

2195

<![CDATA[

2196

int(1)

2197

int(0)

2198

int(1)

2199

]]>

2200

    </screen>

2201

   </example>

2202

  </para>

1950

2203

 </section>

1951

2204

1952

2205

 <section xml:id="regexp.reference.recursive">

1953

2206

  <title>Recursive patterns</title>

1954

2207

  <para>

1955

   Consider the problem of matching a  string  in  parentheses,

1956

   allowing  for  unlimited nested parentheses. Without the use

1957

   of recursion, the best that can be done is to use a  pattern

1958

   that  matches  up  to some fixed depth of nesting. It is not

1959

   possible to handle an arbitrary nesting depth. Perl 5.6  has

1960

   provided   an  experimental  facility  that  allows  regular

1961

   expressions to recurse (among other things).  The  special 

1962

   item (?R) is  provided for  the specific  case of recursion. 

1963

   This PCRE  pattern  solves the  parentheses  problem (assume 

2208

   Consider the problem of matching a string in parentheses,

2209

   allowing for unlimited nested parentheses. Without the use

2210

   of recursion, the best that can be done is to use a pattern

2211

   that matches up to some fixed depth of nesting. It is not

2212

   possible to handle an arbitrary nesting depth. Perl 5.6 has

2213

   provided an experimental facility that allows regular

2214

   expressions to recurse (among other things). The special

2215

   item (?R) is provided for the specific case of recursion.

2216

   This PCRE pattern solves the parentheses problem (assume

1964

2217

   the <link linkend="reference.pcre.pattern.modifiers">PCRE_EXTENDED</link>

1965

2218

   option is set so that white space is

1966

2219

   ignored):

...

@@ -1969,45 +2222,45 @@

1969

2222

  </para>

1970

2223

  <para>

1971

2224

   First it matches an opening parenthesis. Then it matches any

1972

   number  of substrings which can either be a sequence of

1973

   non-parentheses, or a recursive  match  of  the  pattern  itself

2225

   number of substrings which can either be a sequence of

2226

   non-parentheses, or a recursive match of the pattern itself

1974

2227

   (i.e. a correctly parenthesized substring). Finally there is

1975

2228

   a closing parenthesis.

1976

2229

  </para>
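  <para>
   The following sketch shows one way to try the recursive pattern from
   PHP. The sample subjects and variable names are assumptions chosen for
   illustration. The pattern is left unanchored because
   <literal>(?R)</literal> re-enters the whole pattern, including any
   anchors it contains.
  </para>
  <para>
   <example>
    <title>Matching nested parentheses with (?R) (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

/* x enables PCRE_EXTENDED so the white space in the pattern is ignored */
$pattern = '/\( ( (?>[^()]+) | (?R) )* \)/x';

/* The first balanced group in the subject is matched as a whole */
preg_match($pattern, 'x(a(b)c)y', $matches);
var_dump($matches[0]); // string(7) "(a(b)c)"

/* Unbalanced input: no position yields a complete match */
var_dump(preg_match($pattern, '(a(b')); // int(0)
?>
]]>
    </programlisting>
   </example>
  </para>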

1977

2230

  <para>

1978

   This particular example pattern  contains  nested  unlimited

2231

   This particular example pattern contains nested unlimited

1979

2232

   repeats, and so the use of a once-only subpattern for matching

1980

   strings of non-parentheses is  important  when  applying

1981

   the  pattern to strings that do not match. For example, when

2233

   strings of non-parentheses is important when applying

2234

   the pattern to strings that do not match. For example, when

1982

2235

   it is applied to

1983

2236

1984

2237

   <literal>(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()</literal>

1985

2238

1986

   it yields "no match" quickly. However, if a  once-only  subpattern

1987

   is  not  used,  the match runs for a very long time

1988

   indeed because there are so many different ways the + and  *

1989

   repeats  can carve up the subject, and all have to be tested

2239

   it yields "no match" quickly. However, if a once-only subpattern

2240

   is not used, the match runs for a very long time

2241

   indeed because there are so many different ways the + and *

2242

   repeats can carve up the subject, and all have to be tested

1990

2243

   before failure can be reported.

1991

2244

  </para>

1992

2245

  <para>

1993

   The values set for any capturing subpatterns are those  from

2246

   The values set for any capturing subpatterns are those from

1994

2247

   the outermost level of the recursion at which the subpattern

1995

2248

   value is set. If the pattern above is matched against

1996

2249

1997

2250

   <literal>(ab(cd)ef)</literal>

1998

2251

1999

   the value for the capturing parentheses is  "ef",  which  is

2000

   the  last  value  taken  on  at the top level. If additional

2252

   the value for the capturing parentheses is "ef", which is

2253

   the last value taken on at the top level. If additional

2001

2254

   parentheses are added, giving

2002

2255

2003

2256

   <literal>\( ( ( (?>[^()]+) | (?R) )* ) \)</literal>

2004

2257

   then the string they capture

2005

2258

   is "ab(cd)ef", the contents of the top level parentheses. If

2006

   there are more than 15 capturing parentheses in  a  pattern,

2007

   PCRE  has  to  obtain  extra  memory  to store data during a

2008

   recursion, which it does by using  pcre_malloc,  freeing  it

2009

   via  pcre_free  afterwards. If no memory can be obtained, it

2010

   saves data for the first 15 capturing parentheses  only,  as

2259

   there are more than 15 capturing parentheses in a pattern,

2260

   PCRE has to obtain extra memory to store data during a

2261

   recursion, which it does by using pcre_malloc, freeing it

2262

   via pcre_free afterwards. If no memory can be obtained, it

2263

   saves data for the first 15 capturing parentheses only, as

2011

2264

   there is no way to give an out-of-memory error from within a

2012

2265

   recursion.

2013

2266

  </para>
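  <para>
   The captured values described above can be inspected directly. In this
   sketch the variable names <literal>$single</literal> and
   <literal>$nested</literal> are hypothetical; the two patterns are the
   ones discussed in the text.
  </para>
  <para>
   <example>
    <title>Captured values after a recursive match (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

$subject = '(ab(cd)ef)';

/* One capturing group: it keeps the last value set at the top level */
preg_match('/\( ( (?>[^()]+) | (?R) )* \)/x', $subject, $single);
var_dump($single[1]); // string(2) "ef"

/* Extra parentheses around the repeat capture everything between the outer pair */
preg_match('/\( ( ( (?>[^()]+) | (?R) )* ) \)/x', $subject, $nested);
var_dump($nested[1]); // string(8) "ab(cd)ef"
?>
]]>
    </programlisting>
   </example>
  </para>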

...

@@ -2016,7 +2269,7 @@

2016

2269

   <literal>(?1)</literal>, <literal>(?2)</literal> and so on

2017

2270

   can be used for recursive subpatterns too. It is also possible to use named

2018

2271

   subpatterns: <literal>(?P&gt;name)</literal> or

2019

   <literal>(?P&amp;name)</literal>.

2272

   <literal>(?&amp;name)</literal>.

2020

2273

  </para>
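  <para>
   A named subpattern can be called recursively in the same way. The
   sketch below assumes a subpattern name of <literal>group</literal>,
   which is chosen only for illustration, and shows that both spellings
   give the same result on a sample subject.
  </para>
  <para>
   <example>
    <title>Recursion into a named subpattern (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

/* "group" is a hypothetical name; x enables PCRE_EXTENDED */
$withAmpersand = '/(?<group> \( (?: (?>[^()]+) | (?&group) )* \) )/x';
$withPGreater  = '/(?<group> \( (?: (?>[^()]+) | (?P>group) )* \) )/x';

preg_match($withAmpersand, 'f(a(b)c) g', $first);
preg_match($withPGreater, 'f(a(b)c) g', $second);

var_dump($first['group']);  // string(7) "(a(b)c)"
var_dump($second['group']); // string(7) "(a(b)c)"
?>
]]>
    </programlisting>
   </example>
  </para>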

2021

2274

  <para>

2022

2275

   If the syntax for a recursive subpattern reference (either by number or

...

@@ -2046,75 +2299,75 @@

2046

2299

  <title>Performance</title>

2047

2300

  <para>

2048

2301

   Certain items that may appear in patterns are more efficient

2049

   than  others.  It is more efficient to use a character class

2302

   than others. It is more efficient to use a character class

2050

2303

   like [aeiou] than a set of alternatives such as (a|e|i|o|u).

2051

   In  general,  the  simplest  construction  that provides the

2052

   required behaviour is usually the  most  efficient.  Jeffrey

2053

   Friedl's  book contains a lot of discussion about optimizing

2304

   In general, the simplest construction that provides the

2305

   required behaviour is usually the most efficient. Jeffrey

2306

   Friedl's book contains a lot of discussion about optimizing

2054

2307

   regular expressions for efficient performance.

2055

2308

  </para>

2056

2309

  <para>

2057

2310

   When a pattern begins with .* and the <link

2058

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>  option  is

2059

   set,  the  pattern  is implicitly anchored by PCRE, since it

2311

   linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link> option is

2312

   set, the pattern is implicitly anchored by PCRE, since it

2060

2313

   can match only at the start of a subject string. However, if

2061

2314

   <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>

2062

2315

   is not set, PCRE cannot make this optimization,

2063

   because the . metacharacter does not then match  a  newline,

2316

   because the . metacharacter does not then match a newline,

2064

2317

   and if the subject string contains newlines, the pattern may

2065

   match from the character immediately following one  of  them

2318

   match from the character immediately following one of them

2066

2319

   instead of from the very start. For example, the pattern

2067

2320

2068

2321

   <literal>(.*) second</literal>

2069

2322

2070

2323

   matches the subject "first\nand second" (where \n stands for

2071

2324

   a newline character) with the first captured substring being

2072

   "and". In order to do this, PCRE  has  to  retry  the  match

2325

   "and". In order to do this, PCRE has to retry the match

2073

2326

   starting after every newline in the subject.

2074

2327

  </para>

2075

2328

  <para>

2076

2329

   If you are using such a pattern with subject strings that do

2077

   not  contain  newlines,  the best performance is obtained by

2330

   not contain newlines, the best performance is obtained by

2078

2331

   setting <link linkend="reference.pcre.pattern.modifiers">PCRE_DOTALL</link>,

2079

   or starting the  pattern  with  ^.*  to

2080

   indicate  explicit anchoring. That saves PCRE from having to

2332

   or starting the pattern with ^.* to

2333

   indicate explicit anchoring. That saves PCRE from having to

2081

2334

   scan along the subject looking for a newline to restart at.

2082

2335

  </para>
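  <para>
   The effect of <literal>PCRE_DOTALL</literal> and of explicit anchoring
   on the example above can be observed as follows. This is a sketch only;
   the variable names are assumptions, and the <literal>s</literal>
   modifier is the PHP spelling of <literal>PCRE_DOTALL</literal>.
  </para>
  <para>
   <example>
    <title>Leading .* with and without PCRE_DOTALL (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

$subject = "first\nand second";

/* Without s, the dot stops at the newline, so the match starts after it */
preg_match('/(.*) second/', $subject, $plain);
var_dump($plain[1]); // string(3) "and"

/* With s, the dot also matches the newline and the match starts at offset 0 */
preg_match('/(.*) second/s', $subject, $dotall);
var_dump($dotall[1] === "first\nand"); // bool(true)

/* With an explicit ^ and no s, the only attempt is at the very start, which fails */
var_dump(preg_match('/^(.*) second/', $subject)); // int(0)
?>
]]>
    </programlisting>
   </example>
  </para>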

2083

2336

  <para>

2084

   Beware of patterns that contain nested  indefinite  repeats.

2085

   These  can  take a long time to run when applied to a string

2337

   Beware of patterns that contain nested indefinite repeats.

2338

   These can take a long time to run when applied to a string

2086

2339

   that does not match. Consider the pattern fragment

2087

2340

2088

2341

   <literal>(a+)*</literal>

2089

2342

  </para>

2090

2343

  <para>

2091

   This can match "aaaa" in 33 different ways, and this  number

2092

   increases  very  rapidly  as  the string gets longer. (The *

2093

   repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of

2094

   those  cases other than 0, the + repeats can match different

2344

   This can match "aaaa" in 33 different ways, and this number

2345

   increases very rapidly as the string gets longer. (The *

2346

   repeat can match 0, 1, 2, 3, or 4 times, and for each of

2347

   those cases other than 0, the + repeats can match different

2095

2348

   numbers of times.) When the remainder of the pattern is such

2096

   that  the entire match is going to fail, PCRE has in principle

2097

   to try every possible variation, and this  can  take  an

2349

   that the entire match is going to fail, PCRE has in principle

2350

   to try every possible variation, and this can take an

2098

2351

   extremely long time.

2099

2352

  </para>

2100

2353

  <para>

2101

   An optimization catches some of the more simple  cases  such

2354

   An optimization catches some of the simpler cases such

2102

2355

   as

2103

2356

2104

2357

   <literal>(a+)*b</literal>

2105

2358

2106

   where a literal character follows. Before embarking  on  the

2359

   where a literal character follows. Before embarking on the

2107

2360

   standard matching procedure, PCRE checks that there is a "b"

2108

   later in the subject string, and if there is not,  it  fails

2109

   the  match  immediately. However, when there is no following

2110

   literal this optimization cannot be used. You  can  see  the

2361

   later in the subject string, and if there is not, it fails

2362

   the match immediately. However, when there is no following

2363

   literal this optimization cannot be used. You can see the

2111

2364

   difference by comparing the behaviour of

2112

2365

2113

2366

   <literal>(a+)*\d</literal>

2114

2367

2115

   with the pattern above. The former gives  a  failure  almost

2116

   instantly  when  applied  to a whole line of "a" characters,

2117

   whereas the latter takes an appreciable  time  with  strings

2368

   with the pattern above. The former gives a failure almost

2369

   instantly when applied to a whole line of "a" characters,

2370

   whereas the latter takes an appreciable time with strings

2118

2371

   longer than about 20 characters.

2119

2372

  </para>
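  <para>
   One way to compare the two patterns is to time them on the same
   subject, as in the sketch below. The helper
   <literal>timeMatch()</literal> and the subject length of 30 are
   assumptions made for this example. No timings are shown because they
   depend on the PCRE version, on whether JIT is enabled, and on the
   <literal>pcre.backtrack_limit</literal> setting; when that limit is
   reached <function>preg_match</function> returns &false; rather than 0.
  </para>
  <para>
   <example>
    <title>Timing nested indefinite repeats (illustrative sketch)</title>
    <programlisting role="php">
<![CDATA[
<?php

/* Hypothetical helper: returns the preg_match() result and the elapsed milliseconds */
function timeMatch(string $pattern, string $subject): array
{
    $start = hrtime(true);
    $result = preg_match($pattern, $subject);
    $elapsedMs = (hrtime(true) - $start) / 1e6;

    return [$result, $elapsedMs];
}

$subject = str_repeat('a', 30);

foreach (['/(a+)*b/', '/(a+)*\d/'] as $pattern) {
    [$result, $elapsedMs] = timeMatch($pattern, $subject);

    if ($result === false) {
        /* The engine gave up, for example because a backtracking limit was hit */
        printf("%s: error code %d after %.3f ms\n", $pattern, preg_last_error(), $elapsedMs);
    } else {
        printf("%s: result %d after %.3f ms\n", $pattern, $result, $elapsedMs);
    }
}
?>
]]>
    </programlisting>
   </example>
  </para>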

2120

2373

 </section>

2121

2374

Generated: 25 Apr 2024 11:20:47
