=head1 NAME

perlfaq6 - Regular Expressions ($Revision: 10126 $)

=head1 DESCRIPTION

This section is surprisingly small because the rest of the FAQ is
littered with answers involving regular expressions. For example,
decoding a URL and checking whether something is a number are handled
with regular expressions, but those answers are found elsewhere in
this document (in L<perlfaq9>: "How do I decode or create those %-encodings
on the web" and L<perlfaq4>: "How do I determine whether a scalar is
a number/whole/integer/float", to be precise).

=head2 How can I hope to use regular expressions without creating illegible and unmaintainable code?
X<regex, legibility> X<regexp, legibility>
X<regular expression, legibility> X</x>

Three techniques can make regular expressions maintainable and
understandable.

=over 4

=item Comments Outside the Regex

Describe what you're doing and how you're doing it, using normal Perl
comments.

    # turn the line into the first word, a colon, and the
    # number of characters on the rest of the line
    s/^(\w+)(.*)/ lc($1) . ":" . length($2) /meg;

=item Comments Inside the Regex

The C</x> modifier causes whitespace to be ignored in a regex pattern
(except in a character class), and also allows you to use normal
comments there, too. As you can imagine, whitespace and comments help
a lot.

C</x> lets you turn this:

    s{<(?:[^>'"]*|".*?"|'.*?')+>}{}gs;

into this:

    s{ <                    # opening angle bracket
        (?:                 # Non-backreffing grouping paren
            [^>'"] *        # 0 or more things that are neither > nor ' nor "
          |                 #    or else
            ".*?"           # a section between double quotes (stingy match)
          |                 #    or else
            '.*?'           # a section between single quotes (stingy match)
        ) +                 #   all occurring one or more times
      >                     # closing angle bracket
    }{}gsx;                 # replace with nothing, i.e. delete

It's still not quite as clear as prose, but it is very useful for
describing the meaning of each part of the pattern.

=item Different Delimiters

While we normally think of patterns as being delimited with C</>
characters, they can be delimited by almost any character. L<perlre>
describes this. For example, the C<s///> above uses braces as
delimiters. Selecting another delimiter can avoid quoting the
delimiter within the pattern:

    s/\/usr\/local/\/usr\/share/g;  # bad delimiter choice
    s#/usr/local#/usr/share#g;      # better

=back

=head2 I'm having trouble matching over more than one line. What's wrong?
X<regex, multiline> X<regexp, multiline> X<regular expression, multiline>

Either you don't have more than one line in the string you're looking
at (probably), or else you aren't using the correct modifier(s) on
your pattern (possibly).

There are many ways to get multiline data into a string. If you want
it to happen automatically while reading input, you'll want to set $/
(probably to '' for paragraphs or C<undef> for the whole file) to
allow you to read more than one line at a time.

Read L<perlre> to help you decide which of C</s> and C</m> (or both)
you might want to use: C</s> allows dot to include newline, and C</m>
allows caret and dollar to match next to a newline, not just at the
end of the string. You do need to make sure that you've actually
got a multiline string in there.

For example, this program detects duplicate words, even when they span
line breaks (but not paragraph ones). For this example, we don't need
C</s> because we aren't using dot in a regular expression that we want
to cross line boundaries. Neither do we need C</m> because we don't
want caret or dollar to match at any point inside the record next
to newlines. But it's imperative that $/ be set to something other
than the default, or else we won't actually ever have a multiline
record read in.

    $/ = '';        # read in a whole paragraph, not just one line
    while ( <> ) {
        while ( /\b([\w'-]+)(\s+\1)+\b/gi ) {   # word starts alpha
            print "Duplicate $1 at paragraph $.\n";
        }
    }

Here's code that finds sentences that begin with "From " (which would
be mangled by many mailers):

    $/ = '';        # read in a whole paragraph, not just one line
    while ( <> ) {
        while ( /^From /gm ) {  # /m makes ^ match next to \n
            print "leading from in paragraph $.\n";
        }
    }

Here's code that finds everything between START and END in a paragraph:

    undef $/;       # read in whole file, not just one line or paragraph
    while ( <> ) {
        while ( /START(.*?)END/sgm ) {  # /s makes . cross line boundaries
            print "$1\n";
        }
    }

=head2 How can I pull out lines between two patterns that are themselves on different lines?
X<..>

You can use Perl's somewhat exotic C<..> operator (documented in
L<perlop>):

    perl -ne 'print if /START/ .. /END/' file1 file2 ...

If you wanted text and not lines, you would use

    perl -0777 -ne 'print "$1\n" while /START(.*?)END/gs' file1 file2 ...

But if you want nested occurrences of C<START> through C<END>, you'll
run up against the problem described in the question in this section
on matching balanced text.

Here's another example of using C<..>:

    while (<>) {
        $in_header =   1  .. /^$/;
        $in_body   = /^$/ .. eof;
        # now choose between them
    } continue {
        $. = 0 if eof;  # fix $.
    }

=head2 I put a regular expression into $/ but it didn't work. What's wrong?
X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
X<$RS, regexes in>

$/ has to be a string. You can use these examples if you really need to
do this.

If you have File::Stream, this is easy.

    use File::Stream;

    my $stream = File::Stream->new(
        $filehandle,
        separator => qr/\s*,\s*/,
    );

    print "$_\n" while <$stream>;

If you don't have File::Stream, you have to do a little more work.

You can use the four argument form of sysread to continually add to
a buffer. After you add to the buffer, you check if you have a
complete line (using your regular expression).

    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        # remove each complete record from the front of the buffer
        while( s/^((?s).*?)your_pattern// ) {
            my $record = $1;
            # do stuff here.
        }
    }

You can do the same thing with foreach and a match using the
c flag and the \G anchor, if you do not mind your entire file
being in memory at the end.

    local $_ = "";
    while( sysread FH, $_, 8192, length ) {
        foreach my $record ( m/\G((?s).*?)your_pattern/gc ) {
            # do stuff here.
        }
        substr( $_, 0, pos ) = "" if pos;
    }


=head2 How do I substitute case insensitively on the LHS while preserving case on the RHS?
X<replace, case preserving> X<substitute, case preserving>
X<substitution, case preserving> X<s, case preserving>

Here's a lovely Perlish solution by Larry Rosler. It exploits
properties of bitwise xor on ASCII strings.

    $_= "this is a TEsT case";

    $old = 'test';
    $new = 'success';

    s{(\Q$old\E)}
     { uc $new | (uc $1 ^ $1) .
        (uc(substr $1, -1) ^ substr $1, -1) x
            (length($new) - length $1)
     }egi;

    print;

And here it is as a subroutine, modeled after the above:

    sub preserve_case($$) {
        my ($old, $new) = @_;
        my $mask = uc $old ^ $old;

        uc $new | $mask .
            substr($mask, -1) x (length($new) - length($old))
    }

    $a = "this is a TEsT case";
    $a =~ s/(test)/preserve_case($1, "success")/egi;
    print "$a\n";

This prints:

    this is a SUcCESS case

As an alternative, to keep the case of the replacement word if it is
longer than the original, you can use this code, by Jeff Pinyan:

    sub preserve_case {
        my ($from, $to) = @_;
        my ($lf, $lt) = map length, @_;

        if ($lt < $lf) { $from = substr $from, 0, $lt }
        else { $from .= substr $to, $lf }

        return uc $to | ($from ^ uc $from);
    }

This changes the sentence to "this is a SUcCess case."

Just to show that C programmers can write C in any programming language,
if you prefer a more C-like solution, the following script makes the
substitution have the same case, letter by letter, as the original.
(It also happens to run about 240% slower than the Perlish solution runs.)
If the substitution has more characters than the string being substituted,
the case of the last character is used for the rest of the substitution.

    # Original by Nathan Torkington, massaged by Jeffrey Friedl
    #
    sub preserve_case($$)
    {
        my ($old, $new) = @_;
        my ($state) = 0; # 0 = no change; 1 = lc; 2 = uc
        my ($i, $oldlen, $newlen, $c) = (0, length($old), length($new));
        my ($len) = $oldlen < $newlen ? $oldlen : $newlen;

        for ($i = 0; $i < $len; $i++) {
            if ($c = substr($old, $i, 1), $c =~ /[\W\d_]/) {
                $state = 0;
            } elsif (lc $c eq $c) {
                substr($new, $i, 1) = lc(substr($new, $i, 1));
                $state = 1;
            } else {
                substr($new, $i, 1) = uc(substr($new, $i, 1));
                $state = 2;
            }
        }
        # finish up with any remaining new (for when new is longer than old)
        if ($newlen > $oldlen) {
            if ($state == 1) {
                substr($new, $oldlen) = lc(substr($new, $oldlen));
            } elsif ($state == 2) {
                substr($new, $oldlen) = uc(substr($new, $oldlen));
            }
        }
        return $new;
    }

=head2 How can I make C<\w> match national character sets?
X<\w>

Put C<use locale;> in your script. The \w character class is taken
from the current locale.

See L<perllocale> for details.

=head2 How can I match a locale-smart version of C</[a-zA-Z]/>?
X<alpha>

You can use the POSIX character class syntax C</[[:alpha:]]/>
documented in L<perlre>.

No matter which locale you are in, the alphabetic characters are
the characters in \w without the digits and the underscore.
As a regex, that looks like C</[^\W\d_]/>. Its complement,
the non-alphabetics, is then everything in \W along with
the digits and the underscore, or C</[\W\d_]/>.

=head2 How can I quote a variable to use in a regex?
X<regex, escaping> X<regexp, escaping> X<regular expression, escaping>

The Perl parser will expand $variable and @variable references in
regular expressions unless the delimiter is a single quote. Remember,
too, that the right-hand side of a C<s///> substitution is considered
a double-quoted string (see L<perlop> for more details). Remember
also that any regex special characters will be acted on unless you
precede the interpolated variable with \Q. Here's an example:

    $string = "Placido P. Octopus";
    $regex  = "P.";

    $string =~ s/$regex/Polyp/;
    # $string is now "Polypacido P. Octopus"

Because C<.> is special in regular expressions, and can match any
single character, the regex C<P.> here has matched the C<Pl> in the
original string.

To escape the special meaning of C<.>, we use C<\Q>:

    $string = "Placido P. Octopus";
    $regex  = "P.";

    $string =~ s/\Q$regex/Polyp/;
    # $string is now "Placido Polyp Octopus"

The use of C<\Q> causes the C<.> in the regex to be treated as a
regular character, so that C<P.> matches a C<P> followed by a dot.

=head2 What is C</o> really for?
X</o, regular expressions> X<compile, regular expressions>

(contributed by brian d foy)

The C</o> option for regular expressions (documented in L<perlop> and
L<perlreref>) tells Perl to compile the regular expression only once.
This is only useful when the pattern contains a variable. Perls 5.6
and later handle this automatically if the pattern does not change.

Since the match operator C<m//>, the substitution operator C<s///>,
and the regular expression quoting operator C<qr//> are double-quotish
constructs, you can interpolate variables into the pattern. See the
answer to "How can I quote a variable to use in a regex?" for more
details.

This example takes a regular expression from the argument list and
prints the lines of input that match it:

    my $pattern = shift @ARGV;

    while( <> ) {
        print if m/$pattern/;
    }

Versions of Perl prior to 5.6 would recompile the regular expression
for each iteration, even if C<$pattern> had not changed. The C</o>
would prevent this by telling Perl to compile the pattern the first
time, then reuse that for subsequent iterations:

    my $pattern = shift @ARGV;

    while( <> ) {
        print if m/$pattern/o; # useful for Perl < 5.6
    }

In versions 5.6 and later, Perl won't recompile the regular expression
if the variable hasn't changed, so you probably don't need the C</o>
option. It doesn't hurt, but it doesn't help either. If you want any
version of Perl to compile the regular expression only once even if
the variable changes (thus, only using its initial value), you still
need the C</o>.

You can watch Perl's regular expression engine at work to verify for
yourself if Perl is recompiling a regular expression. The C<use re
'debug'> pragma (comes with Perl 5.005 and later) shows the details.
With Perls before 5.6, you should see C<re> reporting that it's
compiling the regular expression on each iteration. With Perl 5.6 or
later, you should only see C<re> report that for the first iteration.

    use re 'debug';

    $regex = 'Perl';
    foreach ( qw(Perl Java Ruby Python) ) {
        print STDERR "-" x 73, "\n";
        print STDERR "Trying $_...\n";
        print STDERR "\t$_ is good!\n" if m/$regex/;
    }

=head2 How do I use a regular expression to strip C style comments from a file?

While this actually can be done, it's much harder than you'd think.
For example, this one-liner

    perl -0777 -pe 's{/\*.*?\*/}{}gs' foo.c

will work in many but not all cases. You see, it's too simple-minded for
certain kinds of C programs, in particular, those with what appear to be
comments in quoted strings. For that, you'd need something like this,
created by Jeffrey Friedl and later modified by Fred Curtis.

    $/ = undef;
    $_ = <>;
    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;
    print;

This could, of course, be more legibly written with the C</x> modifier, adding
whitespace and comments. Here it is expanded, courtesy of Fred Curtis.

    s{
       /\*         ##  Start of /* ... */ comment
       [^*]*\*+    ##  Non-* followed by 1-or-more *'s
       (
         [^/*][^*]*\*+
       )*          ##  0-or-more things which don't start with /
                   ##    but do end with '*'
       /           ##  End of /* ... */ comment

     |         ##     OR  various things which aren't comments:

       (
         "           ##  Start of " ... " string
         (
           \\.           ##  Escaped char
         |               ##    OR
           [^"\\]        ##  Non "\
         )*
         "           ##  End of " ... " string

       |         ##     OR

         '           ##  Start of ' ... ' string
         (
           \\.           ##  Escaped char
         |               ##    OR
           [^'\\]        ##  Non '\
         )*
         '           ##  End of ' ... ' string

       |         ##     OR

         .           ##  Any other char
         [^/"'\\]*   ##  Chars which don't start a comment, string or escape
       )
     }{defined $2 ? $2 : ""}gxse;

A slight modification also removes C++ comments, as long as they are not
spread over multiple lines using a continuation character:

    s#/\*[^*]*\*+([^/*][^*]*\*+)*/|//[^\n]*|("(\\.|[^"\\])*"|'(\\.|[^'\\])*'|.[^/"'\\]*)#defined $2 ? $2 : ""#gse;

=head2 Can I use Perl regular expressions to match balanced text?
X<regex, matching balanced text> X<regexp, matching balanced text>
X<regular expression, matching balanced text>

Historically, Perl regular expressions were not capable of matching
balanced text. As of more recent versions of perl, including 5.6.1,
experimental features have been added that make it possible to do this.
Look at the documentation for the (??{ }) construct in recent perlre manual
pages to see an example of matching balanced parentheses. Be sure to take
special notice of the warnings present in the manual before making use
of this feature.

CPAN contains many modules that can be useful for matching text
depending on the context. Damian Conway provides some useful
patterns in Regexp::Common. The module Text::Balanced provides a
general solution to this problem.

One of the common applications of balanced text matching is working
with XML and HTML. There are many modules available that support
these needs. Two examples are HTML::Parser and XML::Parser. There
are many others.

An elaborate subroutine (for 7-bit ASCII only) to pull out balanced
and possibly nested single chars, like C<`> and C<'>, C<{> and C<}>,
or C<(> and C<)> can be found in
http://www.cpan.org/authors/id/TOMC/scripts/pull_quotes.gz .

The C::Scan module from CPAN also contains such subs for internal use,
but they are undocumented.

=head2 What does it mean that regexes are greedy? How can I get around it?
X<greedy> X<greediness>

Most people mean that greedy regexes match as much as they can.
Technically speaking, it's actually the quantifiers (C<?>, C<*>, C<+>,
C<{}>) that are greedy rather than the whole pattern; Perl prefers local
greed and immediate gratification to overall greed. To get non-greedy
versions of the same quantifiers, use (C<??>, C<*?>, C<+?>, C<{}?>).

An example:

    $s1 = $s2 = "I am very very cold";
    $s1 =~ s/ve.*y //;      # I am cold
    $s2 =~ s/ve.*?y //;     # I am very cold

Notice how the second substitution stopped matching as soon as it
encountered "y ". The C<*?> quantifier effectively tells the regular
expression engine to find a match as quickly as possible and pass
control on to whatever is next in line, like you would if you were
playing hot potato.

=head2 How do I process each word on each line?
X<word>

Use the split function:

    while (<>) {
        foreach $word ( split ) {
            # do something with $word here
        }
    }

Note that this isn't really a word in the English sense; it's just
chunks of consecutive non-whitespace characters.

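To see what "chunks of consecutive non-whitespace characters" means in
practice, here is a minimal sketch. The sample line and the C<@chunks>
array are made up for this illustration; they are not part of the FAQ's
own examples.

```perl
use strict;
use warnings;

# A sample line, invented for this illustration.
my $line = qq{Don't panic, it's "just" text!\n};

# split with no arguments splits $_ on runs of whitespace and
# drops leading whitespace, so alias $_ to the sample line first.
my @chunks;
for ($line) {
    @chunks = split;
}

# Each chunk keeps its punctuation: "panic," and "text!" are
# single chunks because they contain no whitespace.
print "$_\n" for @chunks;
```

Punctuation stays attached to its neighbor, which is why a pattern-based
approach is often a better fit when you want something closer to real words.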
To work with only alphanumeric sequences (including underscores), you
might consider

    while (<>) {
        foreach $word (m/(\w+)/g) {
            # do something with $word here
        }
    }

=head2 How can I print out a word-frequency or line-frequency summary?

To do this, you have to parse out each word in the input stream. We'll
pretend that by word you mean chunk of alphabetics, hyphens, or
apostrophes, rather than the non-whitespace chunk idea of a word given
in the previous question:

    while (<>) {
        while ( /(\b[^\W_\d][\w'-]+\b)/g ) {   # misses "`sheep'"
            $seen{$1}++;
        }
    }

    while ( ($word, $count) = each %seen ) {
        print "$count $word\n";
    }

If you wanted to do the same thing for lines, you wouldn't need a
regular expression:

    while (<>) {
        $seen{$_}++;
    }

    while ( ($line, $count) = each %seen ) {
        print "$count $line";
    }

If you want these output in a sorted order, see L<perlfaq4>: "How do I
sort a hash (optionally by value instead of key)?".

=head2 How can I do approximate matching?
X<match, approximate> X<matching, approximate>

See the module String::Approx available from CPAN.

=head2 How do I efficiently match many regular expressions at once?
X<regex, efficiency> X<regexp, efficiency>
X<regular expression, efficiency>

( contributed by brian d foy )

Avoid asking Perl to compile a regular expression every time
you want to match it. In this example, perl must recompile
the regular expression for every iteration of the foreach()
loop since it has no way to know what $pattern will be.

    @patterns = qw( foo bar baz );

    LINE: while( <DATA> )
        {
        foreach $pattern ( @patterns )
            {
            if( /\b$pattern\b/i )
                {
                print;
                next LINE;
                }
            }
        }

The qr// operator showed up in perl 5.005. It compiles a
regular expression, but doesn't apply it. When you use the
pre-compiled version of the regex, perl does less work. In
this example, I inserted a map() to turn each pattern into
its pre-compiled form. The rest of the script is the same,
but faster.

    @patterns = map { qr/\b$_\b/i } qw( foo bar baz );

    LINE: while( <> )
        {
        foreach $pattern ( @patterns )
            {
            if( /$pattern/ )    # flags are already part of the qr// object
                {
                print;
                next LINE;
                }
            }
        }

In some cases, you may be able to make several patterns into
a single regular expression. Beware of situations that require
backtracking though.

    $regex = join '|', qw( foo bar baz );

    LINE: while( <> )
        {
        print if /\b(?:$regex)\b/i;
        }

For more details on regular expression efficiency, see Mastering
Regular Expressions by Jeffrey Friedl. He explains how regular
expression engines work and why some patterns are surprisingly
inefficient. Once you understand how perl applies regular
expressions, you can tune them for individual situations.

=head2 Why don't word-boundary searches with C<\b> work for me?
X<\b>

(contributed by brian d foy)

Ensure that you know what \b really does: it's the boundary between a
word character, \w, and something that isn't a word character. That
thing that isn't a word character might be \W, but it can also be the
start or end of the string.

It's not (not!) the boundary between whitespace and non-whitespace,
and it's not the stuff between words we use to create sentences.

In regex speak, a word boundary (\b) is a "zero width assertion",
meaning that it doesn't represent a character in the string, but a
condition at a certain position.

For the regular expression, /\bPerl\b/, there has to be a word
boundary before the "P" and after the "l". As long as something other
than a word character precedes the "P" and follows the "l", the
pattern will match. These strings match /\bPerl\b/.

    "Perl"    # no word char before P or after l
    "Perl "   # same as previous (space is not a word char)
    "'Perl'"  # the ' char is not a word char
    "Perl's"  # no word char before P, non-word char after "l"

These strings do not match /\bPerl\b/.

    "Perl_"   # _ is a word char!
    "Perler"  # no word char before P, but one after l

You don't have to use \b to match words though. You can look for
non-word characters surrounded by word characters. These strings
match the pattern /\b'\b/.

    "don't"   # the ' char is surrounded by "n" and "t"
    "qep'a'"  # the ' char is surrounded by "p" and "a"

These strings do not match /\b'\b/.

    "foo'"    # there is no word char after non-word '

You can also use the complement of \b, \B, to specify that there
should not be a word boundary.

In the pattern /\Bam\B/, there must be a word character before the "a"
and after the "m". These strings match /\Bam\B/:

    "llama"   # "am" surrounded by word chars
    "Samuel"  # same

These strings do not match /\Bam\B/:

    "Sam"      # no word boundary before "a", but one after "m"
    "I am Sam" # "am" surrounded by non-word chars


=head2 Why does using $&, $`, or $' slow my program down?
X<$MATCH> X<$&> X<$POSTMATCH> X<$'> X<$PREMATCH> X<$`>

(contributed by Anno Siegel)

Once Perl sees that you need one of these variables anywhere in the
program, it provides them on each and every pattern match. That means
that on every pattern match the entire string will be copied, part of it
to $`, part to $&, and part to $'. Thus the penalty is most severe with
long strings and patterns that match often. Avoid $&, $', and $` if you
can, but if you can't, once you've used them at all, use them at will
because you've already paid the price. Remember that some algorithms
really appreciate them. As of the 5.005 release, the $& variable is no
longer "expensive" the way the other two are.

Since Perl 5.6.1 the special variables @- and @+ can functionally replace
$`, $& and $'. These arrays contain pointers to the beginning and end
of each match (see perlvar for the full story), so they give you
essentially the same information, but without the risk of excessive
string copying.

=head2 What good is C<\G> in a regular expression?
X<\G>

You use the C<\G> anchor to start the next match on the same
string where the last match left off. The regular
expression engine cannot skip over any characters to find
the next match with this anchor, so C<\G> is similar to the
beginning of string anchor, C<^>. The C<\G> anchor is typically
used with the C<g> flag. It uses the value of C<pos()>
as the position to start the next match. As the match
operator makes successive matches, it updates C<pos()> with the
position of the next character past the last match (or the
first character of the next match, depending on how you like
to look at it). Each string has its own C<pos()> value.

Suppose you want to match all of the consecutive pairs of digits
in a string like "1122a44" and stop matching when you
encounter non-digits. You want to match C<11> and C<22> but
the letter C<a> shows up between C<22> and C<44> and you want
to stop at C<a>. Simply matching pairs of digits skips over
the C<a> and still matches C<44>.

    $_ = "1122a44";
    my @pairs = m/(\d\d)/g;   # qw( 11 22 44 )

If you use the C<\G> anchor, you force the match after C<22> to
start with the C<a>. The regular expression cannot match
there since it does not find a digit, so the next match
fails and the match operator returns the pairs it already
found.

    $_ = "1122a44";
    my @pairs = m/\G(\d\d)/g; # qw( 11 22 )

You can also use the C<\G> anchor in scalar context. You
still need the C<g> flag.

    $_ = "1122a44";
    while( m/\G(\d\d)/g )
        {
        print "Found $1\n";
        }

After the match fails at the letter C<a>, perl resets C<pos()>
and the next match on the same string starts at the beginning.

    $_ = "1122a44";
    while( m/\G(\d\d)/g )
        {
        print "Found $1\n";
        }

    print "Found $1 after while" if m/(\d\d)/g; # finds "11"

You can disable C<pos()> resets on fail with the C<c> flag, documented
in L<perlop> and L<perlreref>. Subsequent matches start where the last
successful match ended (the value of C<pos()>) even if a match on the
same string has failed in the meantime. In this case, the match after
the C<while()> loop starts at the C<a> (where the last match stopped),
and since it does not use any anchor it can skip over the C<a> to find
C<44>.

    $_ = "1122a44";
    while( m/\G(\d\d)/gc )
        {
        print "Found $1\n";
        }

    print "Found $1 after while" if m/(\d\d)/g; # finds "44"

Typically you use the C<\G> anchor with the C<c> flag
when you want to try a different match if one fails,
such as in a tokenizer. Jeffrey Friedl offers this example
which works in 5.004 or later.

    while (<>) {
        chomp;
        PARSER: {
            m/ \G( \d+\b    )/gcx   && do { print "number: $1\n";  redo; };
            m/ \G( \w+      )/gcx   && do { print "word:   $1\n";  redo; };
            m/ \G( \s+      )/gcx   && do { print "space:  $1\n";  redo; };
            m/ \G( [^\w\d]+ )/gcx   && do { print "other:  $1\n";  redo; };
        }
    }

For each line, the C<PARSER> loop first tries to match a series
of digits followed by a word boundary. This match has to
start at the place the last match left off (or the beginning
of the string on the first match). Since C<m/ \G( \d+\b
)/gcx> uses the C<c> flag, if the string does not match that
regular expression, perl does not reset pos() and the next
match starts at the same position to try a different
pattern.

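The same C<\G> plus C</gc> technique can collect tokens into a list rather
than print them. The following is only a sketch modeled on the tokenizer
above: the sample string, the C<TOKEN> label, and the C<@tokens> array are
made up for this illustration, and whitespace is skipped rather than
reported.

```perl
use strict;
use warnings;

my $text = "abc 42 + x9";   # sample input, invented for this sketch
my @tokens;

for ($text) {               # alias $_ to $text so pos() tracks $text
    TOKEN: {
        # Each attempt starts at pos(); /c keeps pos() when a match fails,
        # so the next pattern is tried at the same position.
        /\G(\d+)\b/gc    && do { push @tokens, "number:$1"; redo TOKEN };
        /\G(\w+)/gc      && do { push @tokens, "word:$1";   redo TOKEN };
        /\G\s+/gc        && do { redo TOKEN };              # skip whitespace
        /\G([^\w\s]+)/gc && do { push @tokens, "other:$1";  redo TOKEN };
    }
}

print join(", ", @tokens), "\n";
# prints: word:abc, number:42, other:+, word:x9
```

The block ends when none of the patterns matches at the current position,
which here happens only at the end of the string.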
=head2 Are Perl regexes DFAs or NFAs? Are they POSIX compliant?
X<DFA> X<NFA> X<POSIX>

While it's true that Perl's regular expressions resemble the DFAs
(deterministic finite automata) of the egrep(1) program, they are in
fact implemented as NFAs (non-deterministic finite automata) to allow
backtracking and backreferencing. And they aren't POSIX-style either,
because those guarantee worst-case behavior for all cases. (It seems
that some people prefer guarantees of consistency, even when what's
guaranteed is slowness.) See the book "Mastering Regular Expressions"
(from O'Reilly) by Jeffrey Friedl for all the details you could ever
hope to know on these matters (a full citation appears in
L<perlfaq2>).

=head2 What's wrong with using grep in a void context?
X<grep>

The problem is that grep builds a return list, regardless of the context.
This means you're making Perl go to the trouble of building a list that
you then just throw away. If the list is large, you waste both time and space.
If your intent is to iterate over the list, then use a for loop for this
purpose.

In perls older than 5.8.1, map suffers from this problem as well.
But since 5.8.1, this has been fixed, and map is context aware: in void
context, no lists are constructed.

=head2 How can I match strings with multibyte characters?
X<regex, and multibyte characters> X<regexp, and multibyte characters>
X<regular expression, and multibyte characters> X<martian> X<encoding, Martian>

Starting with Perl 5.6, Perl has had some level of multibyte character
support. Perl 5.8 or later is recommended. Supported multibyte
character repertoires include Unicode, and legacy encodings
through the Encode module. See L<perluniintro>, L<perlunicode>,
and L<Encode>.

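As a rough illustration of why decoding matters for matching, here is a
minimal sketch that assumes Perl 5.8 or later with the core Encode module.
The sample string and the variable names C<$bytes> and C<$chars> are made
up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes for "cafe" with an e-acute as the last letter
# (the e-acute is encoded as the two bytes C3 A9).
my $bytes = "caf\xC3\xA9";

# Turn the byte string into a string of characters.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# On the decoded string, "." matches one character, so four dots
# cover the whole word. On the raw bytes there are five elements.
print "decoded string has four characters\n" if     $chars =~ /^.{4}$/;
print "raw string does not\n"                unless $bytes =~ /^.{4}$/;
```

Decoding input before matching, and encoding again on output, keeps the
regex engine working on characters rather than on the bytes of an encoding.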
If you are stuck with older Perls, you can do Unicode with the
C<Unicode::String> module, and character conversions using the
C<Unicode::Map8> and C<Unicode::Map> modules. If you are using
Japanese encodings, you might try using the jperl 5.005_03.

Finally, the following set of approaches was offered by Jeffrey
Friedl, whose article in issue #5 of The Perl Journal talks about
this very matter.

Let's suppose you have some weird Martian encoding where pairs of
ASCII uppercase letters encode single Martian letters (i.e. the two
bytes "CV" make a single Martian letter, as do the two bytes "SG",
"VS", "XX", etc.). Other bytes represent single characters, just like
ASCII.

So, the string of Martian "I am CVSGXX!" uses 12 bytes to encode the
nine characters 'I', ' ', 'a', 'm', ' ', 'CV', 'SG', 'XX', '!'.

Now, say you want to search for the single character C</GX/>. Perl
doesn't know about Martian, so it'll find the two bytes "GX" in the "I
am CVSGXX!" string, even though that character isn't there: it just
looks like it is because "SG" is next to "XX", but there's no real
"GX". This is a big problem.

Here are a few ways, all painful, to deal with it:

    # Make sure adjacent "martian" bytes are no longer adjacent.
    $martian =~ s/([A-Z][A-Z])/ $1 /g;

    print "found GX!\n" if $martian =~ /GX/;

Or like this:

    @chars = $martian =~ m/([A-Z][A-Z]|[^A-Z])/g;
    # above is conceptually similar to:     @chars = $text =~ m/(.)/g;
    #
    foreach $char (@chars) {
        print "found GX!\n", last if $char eq 'GX';
    }

Or like this:

    while ($martian =~ m/\G([A-Z][A-Z]|.)/gs) {  # \G probably unneeded
        print "found GX!\n", last if $1 eq 'GX';
    }

Here's another, slightly less painful, way to do it from Benjamin
Goldberg, who uses a zero-width negative look-behind assertion.


    print "found GX!\n" if $martian =~ m/
        (?<![A-Z])
        (?:[A-Z][A-Z])*?
        GX
        /x;

This succeeds if the "martian" character GX is in the string, and fails
otherwise. If you don't like using (?<!), a zero-width negative
look-behind assertion, you can replace (?<![A-Z]) with (?:^|[^A-Z]).

It does have the drawback of putting the wrong thing in $-[0] and $+[0],
but this usually can be worked around.

=head2 How do I match a regular expression that's in a variable?
X<regex, in variable> X<eval> X<regex> X<quotemeta> X<\Q, regex>
X<\E, regex> X<qr//>

(contributed by brian d foy)

We don't have to hard-code patterns into the match operator (or
anything else that works with regular expressions). We can put the
pattern in a variable for later use.

The match operator is a double quote context, so you can interpolate
your variable just like a double quoted string. In this case, you
read the regular expression as user input and store it in C<$regex>.
Once you have the pattern in C<$regex>, you use that variable in the
match operator.

    chomp( my $regex = <STDIN> );

    if( $string =~ m/$regex/ ) { ... }

Any regular expression special characters in C<$regex> are still
special, and the pattern still has to be valid or Perl will complain.
For instance, in this pattern there is an unpaired parenthesis.

    my $regex = "Unmatched ( paren";

    "Two parens to bind them all" =~ m/$regex/;

When Perl compiles the regular expression, it treats the parenthesis
as the start of a capture group. When it doesn't find the closing
parenthesis, it complains:

    Unmatched ( in regex; marked by <-- HERE in m/Unmatched ( <-- HERE paren/ at script line 3.

You can get around this in several ways depending on your situation.
First, if you don't want any of the characters in the string to be
special, you can escape them with C<quotemeta> before you use the string.

    chomp( my $regex = <STDIN> );
    $regex = quotemeta( $regex );

    if( $string =~ m/$regex/ ) { ... }

You can also do this directly in the match operator using the C<\Q>
and C<\E> sequences. The C<\Q> tells Perl where to start escaping
special characters, and the C<\E> tells it where to stop (see L<perlop>
for more details).

    chomp( my $regex = <STDIN> );

    if( $string =~ m/\Q$regex\E/ ) { ... }

Alternately, you can use C<qr//>, the regular expression quote operator (see
L<perlop> for more details). It quotes and perhaps compiles the pattern,
and you can apply regular expression flags to the pattern.

    chomp( my $input = <STDIN> );

    my $regex = qr/$input/is;

    $string =~ m/$regex/;  # same as m/$input/is

You might also want to trap any errors by wrapping an C<eval> block
around the whole thing.

    chomp( my $input = <STDIN> );

    eval {
        if( $string =~ m/\Q$input\E/ ) { ... }
    };
    warn $@ if $@;

Or...

    my $regex = eval { qr/$input/is };
    if( defined $regex ) {
        $string =~ m/$regex/;
    }
    else {
        warn $@;
    }

=head1 REVISION

Revision: $Revision: 10126 $

Date: $Date: 2007-10-27 21:29:20 +0200 (Sat, 27 Oct 2007) $

See L<perlfaq> for source control details and availability.

=head1 AUTHOR AND COPYRIGHT

Copyright (c) 1997-2007 Tom Christiansen, Nathan Torkington, and
other authors as noted. All rights reserved.

This documentation is free; you can redistribute it and/or modify it
under the same terms as Perl itself.

Irrespective of its distribution, all code examples in this file
are hereby placed into the public domain. You are permitted and
encouraged to use this code in your own programs for fun
or for profit as you see fit. A simple comment in the code giving
credit would be courteous but is not required.