
Perl Refreshments ⌘

Regular Expressions #5 - Pattern Quantifiers

Searching for a pattern can be quite complex as you've no doubt discovered. But language itself is complex. To accommodate that, the Perl pattern matching engine makes use of Quantifiers.

These tell the regex engine how many times a character must appear in order to be considered a match. Now that's getting pretty finicky.

You may have read that Perl was used heavily in the Human Genome Project, completed in 2003. This is because DNA (the double helix) is made up of only 4 chemical bases, each signified by a letter: G, A, C, T. But there is an enormous number of ways they may be combined.

Because of Perl's strength in dealing with large quantities of text, it was used for "slicing, dicing, twisting, wringing, smoothing, summarizing and otherwise mangling" all the text.

Meet the quantifiers:

Maximal     Minimal      Allowed Range
{m,n}       {m,n}?       Must occur at least m times but no more than n times
{m,}        {m,}?        Must occur at least m times
{count}     {count}?     Must occur exactly count times
*           *?           0 or more times (same as {0,})
+           +?           1 or more times (same as {1,})
?           ??           0 or 1 time (same as {0,1})
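
A quick way to get a feel for the quantifiers is to run each one against a short string. The sketch below is only an illustration: the sample strings are made up, and the ^ and $ anchors are added here (they are not part of the quantifiers) so that each pattern has to account for the whole string. Note that in Perl source the counted forms are written with curly braces.

```perl
use strict;
use warnings;

# Each row: [ quantifier label, pattern, sample string ]. Samples are made up.
my @cases = (
    [ '{2,4}', qr/^a{2,4}$/, 'aaa'   ],   # 3 a's is inside 2..4
    [ '{2,4}', qr/^a{2,4}$/, 'aaaaa' ],   # 5 a's is more than 4
    [ '{2,}',  qr/^a{2,}$/,  'aaaaa' ],   # no upper limit
    [ '{3}',   qr/^a{3}$/,   'aaa'   ],   # exactly 3
    [ '*',     qr/^ba*$/,    'b'     ],   # zero a's is fine for *
    [ '+',     qr/^ba+$/,    'b'     ],   # + needs at least one a
    [ '?',     qr/^ba?$/,    'ba'    ],   # zero or one a
);

my @matched;
for my $case (@cases) {
    my ($quantifier, $pattern, $string) = @$case;
    my $hit = $string =~ $pattern ? 1 : 0;
    push @matched, $hit;
    printf "%-6s on '%s': %s\n", $quantifier, $string, $hit ? 'match' : 'no match';
}
```

Only the second and sixth rows should fail to match: five a's is too many for {2,4}, and + insists on at least one a.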

The term exactly in the above table refers to the repeat count, not the overall string.

$text =~ /\d{3}/ does not ask "is this string exactly 3 digits long?".

It does ask "is there any point in the string where 3 digits occur in a row?".

So strings like ABC 123, 101 Perl Street, 1-800-555-1212 would all return positive for a match.
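
To see the difference between the two questions, here is a small sketch. The first three samples come from the text above; '1234', 'Route 66' and '123' are extra samples added for contrast. The anchored pattern /^\d{3}$/ is an assumption on my part about how you might ask the "exactly 3 digits long" question; anchors are not quantifiers.

```perl
use strict;
use warnings;

my @samples = ('ABC 123', '101 Perl Street', '1-800-555-1212',
               '1234', 'Route 66', '123');

my (%anywhere, %whole);
for my $text (@samples) {
    $anywhere{$text} = $text =~ /\d{3}/   ? 1 : 0;  # 3 digits in a row somewhere
    $whole{$text}    = $text =~ /^\d{3}$/ ? 1 : 0;  # the string is nothing but 3 digits
    printf "%-16s anywhere: %d  whole string: %d\n",
        $text, $anywhere{$text}, $whole{$text};
}
```

Note that '1234' still matches /\d{3}/, because three digits in a row do occur inside it, while 'Route 66' does not. Only '123' passes the anchored test.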

The question mark ? after a quantifier changes the behavior of the quantifier from maximal to minimal.

Quantifiers are greedy by default, meaning they match as many characters as they can while still letting the rest of the regex match from the current search position in the string.
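
The same idea applies to the counted quantifiers. A minimal sketch, using a made-up string of five a's and capturing parentheses to show what each form actually grabbed:

```perl
use strict;
use warnings;

my $text = 'aaaaa';   # made-up sample: five a's in a row

# In list context a match returns its captures.
my ($greedy) = $text =~ /(a{2,4})/;    # maximal: takes as many as it is allowed
my ($lazy)   = $text =~ /(a{2,4}?)/;   # minimal: stops as soon as the minimum is met

print "greedy: $greedy\n";   # aaaa
print "lazy:   $lazy\n";     # aa
```

Both patterns succeed on the same string. The only difference is how much of it each one consumes.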

Example: change 'That is' to 'That’s' ...

$text = "That is some text, isn't it?";
$text =~ s/.*is/That's/;
print "\n\$text: $text\n\n";

regex5.1

Hmmm. Not quite: the result is That'sn't it? Here's what is going on.

The .*is is matching from the beginning of the string all the way to the end of the second is (the one in isn't), not the first, because .* is a greedy thing.

To fix it:

$text = "That is some text, isn't it?";
$text =~ s/.*?is/That's/;
print "\n\$text: $text\n\n";

regex5.2
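
If you ever wonder what a pattern actually grabbed, one approach is to wrap it in capturing parentheses and print the capture instead of substituting. A small sketch using the same sentence (the variable names are just for illustration):

```perl
use strict;
use warnings;

my $text = "That is some text, isn't it?";

my ($greedy) = $text =~ /(.*is)/;    # runs on to the last "is" it can reach
my ($lazy)   = $text =~ /(.*?is)/;   # stops at the first "is"

printf "greedy matched: '%s'\n", $greedy;   # 'That is some text, is'
printf "lazy matched:   '%s'\n", $lazy;     # 'That is'
```

Whatever shows up in the capture is exactly what s/// would have replaced.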

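Another. Change 'these' to 'those' ...

$text = "Aren't these the documents you want?";
$text =~ s/the(.*)/those/;
print "\n\$text: $text\n\n";

regex5.3

The greedy .* swallowed everything after the, so all that is left is Aren't those. Fixed, by making the quantifier minimal and stopping at the next e:

$text = "Aren't these the documents you want?";
$text =~ s/the(.*?)e/those/;
print "\n\$text: $text\n\n";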

regex5.4

Regex 6 - Assertions / Positions


-30-