Strings #5

Regular Expressions

While we are on the topic of strings ...

One of the most powerful features of Perl is its built-in regular expression engine. You may have heard of regular expressions if you have some programming experience. Sometimes described as 'filters' or 'pattern matching', a regular expression (aka 'regex') is a pattern built from ordinary and special characters. It matches any string that contains the described characters in the described order.

A typical programming task is to look for a string (some text characters) in a whole haystack full of strings. You may have used the 'eq' operator to determine whether one string is equal to another, or the '==' operator when dealing with numbers. Regexes are like those operators, but on steroids: instead of testing for one exact value, they test whether a string fits a pattern.
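As a small preview of the difference, here is a sketch that searches a haystack twice, once with 'eq' and once with a regex. The list contents and variable names are made up for illustration.

```perl
use strict;
use warnings;

# A made-up haystack of strings.
my @haystack = ('hay', 'straw', 'needle', 'needles', 'grass');

# eq keeps only the strings that are identical to 'needle'.
my @exact = grep { $_ eq 'needle' } @haystack;

# The regex /needle/ keeps every string that contains 'needle' anywhere.
my @loose = grep { /needle/ } @haystack;

print "eq found:    @exact\n";
print "regex found: @loose\n";
```

The 'eq' test finds only 'needle', while the regex also finds 'needles' because that string contains the same characters in the same order.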

Perhaps an example will give you a better idea.

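First, here are some fundamentals. In Perl a regex is normally written between two forward slashes, /like this/, and is tested against a string with the binding operator =~.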

A simple example follows. We have a string containing only 3 digits. We need to create a regex that will EXACTLY match that condition.

Here is the code to play with. Enter it into a Perl script and run it from a shell window.

$text = "123";
print "--->$text<---\n";      # the arrows make stray spaces visible
if ($text =~ /^\d{3}$/) {     # 3 digits between start and end of the string
     print "match\n";
}
else {
     print "no match\n";
}

As mentioned, the regex begins with a forward slash '/'. Immediately following that is a caret '^'. In this particular case it is a 'metacharacter' - it has a special meaning. It is an anchor and stands for 'start-of-a-string'. An anchor does not match a character; it matches a position. Every string has such a zero-width position at its beginning and at its end.

Next we have another 'metacharacter', \d. Recall that a character preceded by '\' means something else (not a 'd'). In this case it stands for a single digit (0,1,2,3,4,5,6,7,8,9).
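If you want to see the difference between d and \d for yourself, a quick sketch like this one may help. The helper name yes_no is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: 'yes' if $text matches $pattern, otherwise 'no'.
sub yes_no {
    my ($text, $pattern) = @_;
    return $text =~ $pattern ? 'yes' : 'no';
}

print "'d' against d  : ", yes_no('d', qr/d/),  "\n";   # plain d is the letter d
print "'d' against \\d : ", yes_no('d', qr/\d/), "\n";   # \d wants a digit
print "'7' against \\d : ", yes_no('7', qr/\d/), "\n";   # 7 is a digit
```

The letter 'd' matches the plain d but not \d, while '7' matches \d.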

Then we have the digit 3 surrounded by braces, {3}, so we know that means something special as well. The braces form an interval quantifier, which says how many times the element immediately before it must occur. That element is \d, our symbol for a single digit, so \d{3} means we are looking for exactly 3 digits!
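The interval quantifier comes in a few forms. The sketch below tries three of them on \d; the sub names and sample strings are made up for illustration.

```perl
use strict;
use warnings;

# Three forms of the interval quantifier, each applied to \d.
# The sub names are made up for this sketch.
sub exactly_three { return $_[0] =~ /^\d{3}$/   ? 1 : 0 }   # {3}   exactly 3
sub three_or_more { return $_[0] =~ /^\d{3,}$/  ? 1 : 0 }   # {3,}  3 or more
sub two_to_four   { return $_[0] =~ /^\d{2,4}$/ ? 1 : 0 }   # {2,4} from 2 to 4

for my $text ('12', '123', '12345') {
    printf "%-5s  {3}=%d  {3,}=%d  {2,4}=%d\n",
        $text, exactly_three($text), three_or_more($text), two_to_four($text);
}
```

Here '12' fits only the {2,4} form, '123' fits all three, and '12345' fits only {3,}.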

Our regex then holds a dollar sign $, which is another line anchor. Can you guess what it stands for?

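If you guessed 'end-of-a-string' you win the prize. (Strictly speaking, '$' also matches just before a newline that ends the string; for our example the difference does not matter.) These 2 anchors constrain the elements next to them: the element immediately after '^' must sit at the start of the string, and the element immediately before '$' must sit at the end. '$' is also the sigil for scalar variables in Perl, which are interpolated inside a regex, but the two uses are rarely ambiguous.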
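To get a feel for what each anchor contributes, here is a sketch that tries the same \d{3} with no anchor, one anchor, and both. The helper name anchor_report and its hash keys are made up for illustration. The 'strict' entry uses \z, which matches only at the very end of the string, whereas '$' also accepts a position just before a final newline.

```perl
use strict;
use warnings;

# Hypothetical helper: shows which anchored variants of \d{3} match $text.
sub anchor_report {
    my ($text) = @_;
    my %report = (
        none   => ($text =~ /\d{3}/    ? 1 : 0),   # 3 digits anywhere
        start  => ($text =~ /^\d{3}/   ? 1 : 0),   # 3 digits at the start
        end    => ($text =~ /\d{3}$/   ? 1 : 0),   # 3 digits at the end
        both   => ($text =~ /^\d{3}$/  ? 1 : 0),   # the regex from the text
        strict => ($text =~ /^\d{3}\z/ ? 1 : 0),   # no trailing newline allowed
    );
    return \%report;
}

for my $text ('123', ' 123', '12345') {
    my $r = anchor_report($text);
    print "--->$text<--- none=$r->{none} start=$r->{start} end=$r->{end} both=$r->{both}\n";
}
```

Only '123' passes with both anchors in place. Calling anchor_report("123\n") gives 1 for 'both' but 0 for 'strict', which is the newline subtlety of '$' in action.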

Finally we end our regex with another forward slash /.

If you paste the above few lines into a Perl script and run it from a shell window, here is what you might see:

--->123<---
match

Changing the string will result in different outcomes. Adding a blank space at the beginning ...

$text = " 123";               # note the leading space
print "--->$text<---\n";
if ($text =~ /^\d{3}$/) {
     print "match\n";
}
else {
     print "no match\n";
}

... should give you this result:

---> 123<---
no match

Try changing the $text string to something else and see what happens (say '471', '000', or '1b3'). This should give you a better understanding of what is happening.
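Rather than editing the script for every string, you could loop over a list of candidates. This is only a sketch; the helper name is_three_digits is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the test used in the text.
sub is_three_digits {
    my ($text) = @_;
    return $text =~ /^\d{3}$/ ? 'match' : 'no match';
}

# Try each candidate string in turn.
for my $text ('123', ' 123', '471', '000', '1b3') {
    print "--->$text<--- ", is_three_digits($text), "\n";
}
```

With these five strings, '123', '471' and '000' match, while ' 123' and '1b3' do not.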

A good piece of advice when working with regexes is to verbalize each element.

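So in our case we wouldn't say 'We are looking for 3 digits in a string.' That description also fits strings such as 'abc12345', which contain 3 digits among other characters.

A better way is 'Look for a digit at the beginning of the string, followed by exactly 2 more digits, followed by the end of the string.'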

This will help in creating a regex that is closer to what is needed.
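One way to keep the spoken version next to the pattern is Perl's /x modifier, which lets you spread a regex over several lines and add comments. The sketch below spells the pattern as \d followed by \d{2} to mirror the sentence above; it matches the same strings as \d{3}. The variable name is made up for illustration.

```perl
use strict;
use warnings;

# With x, whitespace and # comments inside the pattern are ignored.
my $three_digits = qr/
    ^        # at the beginning of the string ...
    \d       # ... look for a digit ...
    \d{2}    # ... followed by exactly 2 more digits ...
    $        # ... followed by the end of the string
/x;

for my $text ('123', ' 123', '1b3') {
    print "--->$text<--- ", ($text =~ $three_digits ? 'match' : 'no match'), "\n";
}
```

Reading the comments from top to bottom gives you the verbalized regex.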

Next we will look at capturing regexes!