Sorting #1

#0 | #1 | #2

We've seen some examples of how to sort an array by one element, but of course, real life has a habit of throwing stuff at us we don't expect or are unprepared for.

With the basics of sorting covered already, we now cover the situation where we need to sort an array (list) by more than 1 column or element.

For an excellent in-depth look at sorting in general, and the Perl sort in particular, visit this page by Uri Guttman and Larry Rosler. There are other excellent papers here as well.

Sorting data involves comparing items in a list with other items in the list - several times. In previous pages we used sort to organize 2 arrays, then print both lists in different ways.

In those cases, we only sorted 1 item per list (cities or countries). And since each list contained unique items, we didn't have any problems.

However, many lists contain some items of identical value. How do you sort a list on 1 column, but if duplicates are found, sort on a different column?

A typical situation might be having to sort an array of branch offices. The data includes city, province, year it started, number of employees.

Some provinces have more than 1 branch; several branches started in the same year; others have the same number of employees.

We want to sort the array first by province, then city, then start date, and finally by number of staff. Oh, and while we're at capitalize the cities and province abbreviations.

Our array might look like this:

my @stats = (
"ottawa:on:1987:43",
"vancouver:bc:1990:33",
"brandon:mb:1990:33",
"st. john's:nl:1999:29",
"victoria:bc:1992:35",
"london:on:2000:44",
"calgary:al:2000:45",
"montreal:qc:2000:38",
"toronto:on:1999:50",
"london:on:2000:39",
"st. john:nb:2002:12",
"saskatoon:sk:2012:28",
"whitehorse:yt:2010:15",
"iqaluit:nu:2015:3",
"charlottetown:pe:2002:13",
"halifax:ns:2001:13",
"yellowknife:nt:2005:23",
);

First we build arrays for each of our 'columns'.

There is an odd situation with this data that requires some special work. Two of the cities have more than 1 word - "st. john" and "st. john's". Why is that a concern?

We know how to use ucfirst to make the first letter of a string upper-case. But in this case $city contains 2 words. Using ucfirst on the string only works with the first word: "st.".

We have to ensure $city has upper-case letters for ALL words in it.

my ($city,$prov,$started,$staff);
my (@cities,@prov,@started,@staff);
foreach(@stats) {
    ($city,$prov,$started,$staff) = split ":";
    $city = join ' ', map { ucfirst } split / /, $city;
    push @cities,$city;
    push @prov, uc($prov);
    push @started, $started;
    push @staff, $staff;
}

The 2nd line in the foreach loop does this magick:

$city = join ' ', map { ucfirst } split / /, $city;

Reading from right to left:

  1. split $city into separate strings terminated by a space
  2. capitalize each element (implicit $_) using map and ucfirst
  3. join the strings together with another space
  4. assign $city back to itself

Observant readers will see we've introduced a new Perl term as well: push

This is how we can build an array while going through a loop - we push the current item onto the end of the array.

After our arrays are properly populated, we need to perform our various sorting routines.

For this we use the || operator to string our comparisons together. This operator executes a logical OR. It is also known as a 'short-circuit operator' because if the previous condition returns TRUE, the current condition is not checked.


my @Prov = sort {
    $prov[$a] cmp $prov[$b] 
    || $cities[$a] cmp $cities[$b] 
    || $started[$a] <=> $started[$b] 
    || $staff[$a] <=> $staff[$b] } (0 .. $#stats);

print "Sorted by province, city, start date, staff:\n";
print "province\t city\t start\t staff\n";
foreach (@Prov) {
    print "$prov[$_]\t $cities[$_]\t $started[$_]\t $staff[$_]\n";
}

As you see, first we sort by province, using cmp since we are comparing strings.

If both provinces are the same, we fall through to compare cities. If they are the same we next compare the start date. Note we use <=> here because we are comparing numbers. Finally if the branch start dates are equal, we compare number of staff, again using <=>.

The whole 'sort' routine is a loop indexed from 0 to the last index in the array @Prov.

Notice that we capitalized the provinces while building the arrays earlier.

Here we sort the same data by start date, then number of staff, then city:

my @Starts = sort { 
            $started[$a] <=> $started[$b] 
            || $staff[$a] <=> $staff[$b] 
            || $cities[$a] cmp $cities[$b]  } (0 .. $#stats);
print "Sorted by start date, staff, city:\n";
print "start\t staff\t city\n";
foreach (@Starts) {
    print "$started[$_]\t $staff[$_]\t $cities[$_],$prov[$_]\n";
}

One more example - sorted by staff number, then city & province, and year.

print "Sorted by staff, city, province, start date:\n";
print "staff\t city\t province\t start\n";
@Prov = sort {
    $staff[$a] <=> $staff[$b]
    || $cities[$a] cmp $cities[$b] 
    || $prov[$a] cmp $prov[$b] 
    || $started[$a] <=> $started[$b] } (0 .. $#stats);

foreach (@Prov) {
    print "$staff[$_]\t $cities[$_], $prov[$_]\t $started[$_]\n";
}

Note that 4 cities (Charlottetown & Halifax; Brandon & Vancouver) have the same number of staff each, so they are then sorted by city.

To get a solid understanding of how this works you might want to try sorting different combinations of columns.

Just when you think you have a handle on sorting, we will show you some Perl magick that should impress you.