Digital Media

Moshell - Spring 99

Lecture 6: String Manipulation in Perl; Dynamic HTML

This lecture concerns techniques for working with strings, which is one of Perl's main strengths. We'll learn how to match and transform almost anything into something else. Then we'll look at ways of producing Dynamic HTML - that is, HTML that changes every time you call for it. One of the best ways to do this is with the CGI.pm library.

Major Hint: www.perl.org. The definitive home-base for the Perl Movement.
Another Major Hint, for CGI.pm: http://stein.cshl.org/WWW/software/CGI/
And Yet Another Related One: http://www.wiley.com/compbooks/stein/source.html

The first part of this material comes rather directly from Chapter 11 of the Castro book. The second part is lifted from the on-line tutorial about CGI.pm that is provided by its author, Lincoln Stein.

String Matching with Regular Expressions

The Match Operator is our first power tool. It looks like =~ m/. There are two distinct parts. The =~ means "apply the following operator to the string in this variable." The m/.../ part is the match itself. The entire expression is True or False, depending on whether the match succeeds. Thus, if we assign the following value,

$workstring="Mohandovich";

then the following expression

    ($workstring =~ m/Mike/)

would be False. When the m operator scanned through $workstring, it didn't find the four characters "Mike" anywhere. Note that the / can be replaced by almost any other punctuation character, such as * or |. Whatever comes right after m is recognized as the delimiter, and Perl goes looking for the next one to close the pattern.

If you were looking for two / characters, though, you'd have to "escape" them with the backslash, like this:

    ($workstring =~ m/\/\//)

That is about as confusing as it can be, so we prefer to use some other delimiter in that case, like

    ($workstring =~ m*//*)

When you use the slash as the delimiter, in fact, you can omit the m. (I don't recommend omitting it, for clarity's sake.) But if you encounter something like ($workstring =~ /Search Stuff/) you'll know what it means.
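Here is a small sketch you can paste into a test file to see the delimiter rules in action. The sample URL and the variable names are my own, and the m{...} form with paired braces is an extra that the text above doesn't mention but that Perl also accepts.

```perl
use strict;
use warnings;

my $workstring = "http://www.perl.org";

# The same search for two slashes, written three ways.
my $escaped = ($workstring =~ m/\/\//) ? 1 : 0;   # slash delimiter, slashes escaped
my $star    = ($workstring =~ m*//*)   ? 1 : 0;   # asterisk as the delimiter
my $braces  = ($workstring =~ m{//})   ? 1 : 0;   # paired braces as delimiters

# With the slash delimiter, the m may be omitted.
my $no_m    = ($workstring =~ /perl/)  ? 1 : 0;

# A search that fails gives False.
my $miss    = ($workstring =~ m/Mike/) ? 1 : 0;

print "escaped=$escaped star=$star braces=$braces no_m=$no_m miss=$miss\n";
```

All three spellings of the double-slash search should agree with each other; only the readability differs.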

The Substitution Operator. This operator replaces the first occurrence of its search string with its replacement string. For instance

$workstring = "Jesus' mother's name was Mike.\n";
$workstring =~ s/Mike/Mary/;
print $workstring;

would display

    Jesus' mother's name was Mary.

Normally the s/ operator only swaps out the first instance of a matched string; to make it go ahead and do it over and over, use the g parameter:

$workstring = "The dog that bit my dog was the son of a dog.\n";
$workstring =~ s/dog/cat/g;
print $workstring;

would display
    The cat that bit my cat was the son of a cat.
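To see the difference between the plain and the g forms side by side, here is a minimal sketch. It also uses the fact that s/// hands back the number of substitutions it made; the variable names are just for illustration.

```perl
use strict;
use warnings;

my $first_only = "The dog that bit my dog was the son of a dog.";
my $all        = $first_only;    # work on a copy for the second test

# Without g, only the first match is replaced.
my $n_first = ($first_only =~ s/dog/cat/);

# With g, every match is replaced.
my $n_all   = ($all =~ s/dog/cat/g);

print "$n_first change:  $first_only\n";
print "$n_all changes: $all\n";
```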

Special Variables. Perl sets three special variables to contain parts of the string it just matched against. Two of these are spelled with quote marks: $` (with the backquote) holds whatever came before the match, and $' (with the ordinary single quote) holds whatever came after it. (The backquote is hiding on my tilde key. But let's not fool with it.)

The most useful one is

$& - which reports the part that exactly matched your test string.

Now, you may say 'why would I want to know that? I just TOLD THEM my test string!' but you would be betraying your ignorance if you said that. Because most of the useful test strings are not just literal search targets, they are

Regular Expressions!

A Regular Expression (RE) is a precise and formal way of describing any class of strings that can be recognized by a finite automaton. That's cool if you know what an FA is; otherwise let's just learn about RE from scratch. The following symbols have special meanings. Explanations and examples follow.

{3,6} means at least 3 and at most 6 of the preceding element
{3} means exactly 3 of the preceding element
[] defines a class of characters.
() groups elements for control by a quantifier. Also it "snapshots" what it matches, into $1, $2, $3 etc.
* means zero or more of the preceding element
? means zero or one of the preceding element
+ means one or more of the preceding element

$ means "anchor the search pattern at the end of the string."

. matches any single character


^ Has several meanings. At the start of a [] class definition, it negates the class.
    It can also mean "anchor the search pattern at the beginning of the string."
- is used in a range definition in a class, like a-z.
| means "or": it separates alternatives, usually inside a () group.
\  makes the next character "just be itself" literally, and not be interpreted.

Examples.

[a-z] matches any single lowercase alphabetic character.
[aeiou] matches any vowel.
[a-zA-Z] matches any alphabetic character.
[a-ev-z] matches the first five and the last five lowercase letters of the English alphabet.
[\$\@\\\]] matches the dollar sign, at sign, backslash and right square bracket. Ugly, huh?

if ($zip =~ /[^0-9\-]/)    # pattern says "if anything in $zip is not a numeral 0-9 or a dash"
    {print "bogus Zipcode\n";}

/p.p/ matches pip and pap and pop and pope and p p and zipup and ppp, but NOT paap.

/ice (tea|coffee)/ matches 'ice tea' and also 'ice coffee'.
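The examples above can be tried out with a short test program like the following sketch. The sample strings and the zip_ok helper are made up for illustration; the patterns are the ones from the text. Note how the parentheses around (tea|coffee) snapshot the matched alternative into $1.

```perl
use strict;
use warnings;

# Alternation plus the () snapshot into $1.
my $order = "one ice coffee, please";
my $drink = '';
if ($order =~ /ice (tea|coffee)/) {
    $drink = $1;                  # whichever alternative matched
}

# The dot matches any single character between the two p's.
my @words = qw(pip pope paap zipup);
my @hits  = grep { /p.p/ } @words;

# Hypothetical helper: reject anything that is not a digit or a dash.
sub zip_ok {
    my ($zip) = @_;
    return ($zip =~ /[^0-9\-]/) ? 0 : 1;
}

print "drink: $drink\n";
print "p.p hits: @hits\n";
print "32751-1004 ok? ", zip_ok("32751-1004"), "\n";
print "3275l ok? ", zip_ok("3275l"), "\n";   # that last character is a letter
```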
 

Special class abbreviations. These DON'T need square brackets. Each one stands in for a whole bracketed class. (Note that if you just wanted to match a d, you don't need to backslash it!)

\t matches tab characters.
\n matches newlines.
\r matches carriage returns.
\f matches formfeeds.

\d matches any digit, like [0-9]
\D matches any nondigit, like [^0-9]
\w matches any upper or lowercase letter, digit or underscore, like [a-zA-Z0-9\_] ("alphanumeric")
\W matches any character that doesn't match \w
\s matches any space, tab, newline, return or formfeed, like [ \t\n\r\f] ("whitespace characters")
\S matches anything that doesn't match \s.
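A quick sketch of the abbreviations at work, on a made-up line of text. The list-context match with the g modifier (which collects every match into an array) goes a little beyond what we've covered so far, so treat that part as a preview.

```perl
use strict;
use warnings;

my $line = "Room\t42  is_open\n";

# \d: collect every digit in the line.
my @digits = ($line =~ /\d/g);

# \w+: the first run of letters, digits or underscores.
my ($first_word) = ($line =~ /(\w+)/);

# \s+: squeeze each run of whitespace down to one space.
(my $squeezed = $line) =~ s/\s+/ /g;

# \W: is there any non-alphanumeric character at all?
my $has_nonword = ($line =~ /\W/) ? 1 : 0;

print "digits: @digits\n";
print "first word: $first_word\n";
print "squeezed: [$squeezed]\n";
print "has nonword: $has_nonword\n";
```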

Example: Phone Number. Here's a search for a telephone number, fully spelled out.

($maybephone =~ /\(\d\d\d\) \d\d\d-\d\d\d\d/)

This would match a phone number like (407) 823-5341. And now you can see why $& would be handy, because otherwise how would you find out WHAT phone number the code had dug up in the string $maybephone?
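Here is a minimal sketch of pulling the phone number back out with $&, along with its two quote-mark cousins. The sentence being searched is invented for the example.

```perl
use strict;
use warnings;

my $maybephone = "Call the lab at (407) 823-5341 before noon.";
my ($before, $found, $after) = ('', '', '');

if ($maybephone =~ /\(\d\d\d\) \d\d\d-\d\d\d\d/) {
    # Copy the special variables right away; the next match overwrites them.
    ($before, $found, $after) = ($`, $&, $');
}

print "before: [$before]\n";
print "found:  [$found]\n";
print "after:  [$after]\n";
```

Copying the values into ordinary variables immediately is a cautious habit, since every later successful match resets them.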

Anchoring the search. The following example is self-explanatory, more or less.

if (!($testring =~ /^\w/))    # ^ anchors at beginning of the string
    { print "Your string must begin with an alphabetic, numeric or underscore character.\n";}

if ($testring =~ /[aA]$/)    # $ anchors at the end of the string
    { print "Ah, you did it right. Your string ended in the letter a or A.\n";}

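The \b and \B tags test for word boundaries. \b matches at the beginning or end of a word (where a word is a string of alphanumeric characters set off by non-alphanumerics), so it can trap a pattern there. \B is the opposite: it matches anywhere that is NOT a word boundary.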
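A small sketch of \b and \B, counting how often "cat" shows up as a whole word, at the start of a word, and buried inside a word. The test sentence is my own invention.

```perl
use strict;
use warnings;

my $text = "The cat sat on the catalog; concatenate it.";

# \b on both sides: "cat" standing alone as a word.
my @whole  = ($text =~ /\bcat\b/g);

# \b on the left only: any word that begins with "cat".
my @starts = ($text =~ /\bcat/g);

# \B on both sides: "cat" with word characters on either side.
my @inside = ($text =~ /\Bcat\B/g);

print "whole word: ", scalar(@whole),  "\n";
print "word start: ", scalar(@starts), "\n";
print "inside:     ", scalar(@inside), "\n";
```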

Full Steam Ahead Examples.

Here's a monster:

if ($phone =~ /^((\(\d{3}\))? *\d{3}-\d{4},? *)+$/)
{print "Wow, that was a hard one.\n";}

From left to right:

An anchored expression that MIGHT begin with a left paren, 3 digits, a right paren (if it's a long distance number.)

The "MIGHT" comes from the ?. Count the parentheses backwards from the ? to see what it controls.

.... and then contains zero or more spaces, a 3 digit number, a dash, a 4 digit number, perhaps a comma and
zero or more spaces ... and WHOA here's a plus. The plus means that we would like one or maybe more phone numbers as we just described.

Finally the $ means "anchor this at the right", or "there better be nothing after it."
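The best way to believe a monster is to feed it. The sketch below runs the same pattern against a handful of made-up candidate strings. It stores the pattern with qr//, which compiles a regular expression into a variable; that operator isn't covered above, so consider it a convenience for the test.

```perl
use strict;
use warnings;

# The monster pattern, compiled once and kept in a variable.
my $phone_re = qr/^((\(\d{3}\))? *\d{3}-\d{4},? *)+$/;

my @candidates = (
    "823-5341",                    # local number
    "(407) 823-5341",              # with area code
    "(407) 823-5341, 823-2341",    # two numbers, comma separated
    "823-5341 ext 12",             # junk after the number
    "8235341",                     # no dash
);

my %verdict;
foreach my $candidate (@candidates) {
    $verdict{$candidate} = ($candidate =~ $phone_re) ? "match" : "no match";
    printf "%-28s %s\n", $candidate, $verdict{$candidate};
}
```

The fourth candidate fails because of the $ anchor: once the pattern has eaten the number and the trailing space, the leftover "ext 12" has nowhere to go.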

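Warnings. A frequent trap is to match partial stuff. An unanchored pattern succeeds as soon as ANY part of the string fits, so the test below succeeds even though the string holds five digits, not three. If you mean "exactly three digits and nothing else", anchor the pattern at both ends with ^ and $.

$searchstring = "90451";

if ($searchstring =~ /\d{3}/)
    {print "Gotcha! There are 3 consecutive digits in this string, fer shure.\n";}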
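To see the trap spring, compare a loose three-digit test with an anchored one on the same five-digit string. This is just a sketch; the variable names are mine.

```perl
use strict;
use warnings;

my $searchstring = "90451";

# Unanchored: succeeds because SOME three digits in a row exist.
my $loose   = ($searchstring =~ /\d{3}/) ? 1 : 0;
my $grabbed = $loose ? $& : '';        # the part that actually matched

# Anchored at both ends: the whole string must be exactly three digits.
my $exact   = ($searchstring =~ /^\d{3}$/) ? 1 : 0;

print "loose=$loose (grabbed '$grabbed') exact=$exact\n";
```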

Some Practice Queries:

Query 6.1: Write regular expressions to recognize all these things:

six letter alphabetic words with a space on either side
at the start of a line, 8 letter words ending in 'ing'
text filenames starting with precisely 3 numbers

lines with 'either/or' or 'and/or' in them
a URL bounded by whitespace
a nine digit zipcode in the form 32751-1004 (but any digits, not just these!)
 

Now for the GOOD STUFF: CGI.pm

Preliminary Remark #1: The CGI.pm library is a perfect example of how the Internet community has created an alternative economy. This library, basic to almost everyone's Web use of Perl, was developed and given away freely by Lincoln Stein of the Cold Spring Harbor Laboratory in New York. (Well, the Perl language itself was freely given away by its author, Larry Wall.) The currency in which these folks were paid is fame and respect - which is, after all, what we're really after (along with love, Twinkies and a warm place to sleep.)

And once you get famous and respected, people tend to give you money. So it all works out.

For instance, Stein has written a book about the CGI.pm library. You can get that book from John Wiley & Sons. Not surprisingly, there is a link to it on some of Stein's free web pages. I plan to order it, so here comes the money, Mr. Stein. I told you it would come.....

Preliminary Remark #2: Stein not only developed CGI.pm, he also provides a 71 page explanation of how it works, online at  http://stein.cshl.org/WWW/software/CGI/ You may find this one more convenient to read on-line, unless you have a free printer somewhere, because it really is that long. You can cut the examples and paste them into your test programs, with the necessary changes I describe below.

Preliminary Remark #3: There are two ways of using CGI.pm - the functional style and the object oriented style. The functional style looks cleaner on paper and is the preferred mode for simple problems. The object oriented style allows you to handle different parts of your form, or even different forms within a single HTML document, and is what's "really" going on behind the scenes.

Here's a hello-world example in functional style, from Stein. Note that the syntax qw/:standard/; doesn't seem to work on our system. I used ":standard"; and it worked just fine.

#!/usr/local/bin/perl
#
# The functional get-started program example, from Stein:
#
#   use CGI qw/:standard/; # Non working version for some reason
    use CGI ":standard";    # This works.

     print header(),
           start_html(-title=>'Wow!'),
           h1('Wow!'),
           'Look Ma, no hands!',
           end_html();

# # # # # End of Functional example

We can see that the print statement simply invokes a series of functions beginning with header(). In the start_html function, there is a parameter. But instead of just writing

start_html('Wow!')

we tell the system WHICH parameter of start_html is being set. (The short positional form does work for the title alone, and a later example uses it.) start_html (and many other functions) actually has lots of parameters, most of which you don't need at a given time. So these others default to some nice value, like "" the empty string. You name the parameter you're talking to with -title=> (for a parameter named 'title'.) Another advantage of this system is that you don't have to remember the order of a function's parameters because you can put 'em in any order, as long as you name them.
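If you're curious how a function can accept named parameters in any order, here is a minimal sketch in plain Perl. Be warned that my_start_html is a hypothetical stand-in I made up for this illustration, with only two parameters; it is NOT CGI.pm's actual code. The trick is to pour the argument list into a hash on top of a set of defaults.

```perl
use strict;
use warnings;

# Hypothetical imitation of a CGI.pm-style function (not the library's code).
sub my_start_html {
    # Defaults come first; whatever the caller names overrides them.
    my %arg = (-title => 'Untitled', -bgcolor => '', @_);

    my $title   = $arg{-title};
    my $bgcolor = $arg{-bgcolor};
    my $body    = $bgcolor ne '' ? "<BODY BGCOLOR=\"$bgcolor\">" : "<BODY>";

    return "<HTML><HEAD><TITLE>$title</TITLE></HEAD>$body";
}

# Same parameters, different order: the result is identical.
my $one   = my_start_html(-title => 'Wow!', -bgcolor => 'white');
my $two   = my_start_html(-bgcolor => 'white', -title => 'Wow!');

# No parameters at all: the defaults kick in.
my $three = my_start_html();

print "$one\n$two\n$three\n";
```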

Here's the same example in object style:

#!/usr/local/bin/perl
#
# The object oriented get-started program example, from Stein.
# We omit the "standard" request because we aren't using the functional form here.
#
     use CGI;
     $q = new CGI;
     print $q->header(),
           $q->start_html(-title=>'Wow!'),
           $q->h1('Wow!'),
           'Look Ma, no hands!',
           $q->end_html();
# End of Object example

I hope that the similarities and differences with the previous example are obvious. Now, onward to something Really Useful.

#!/usr/local/bin/perl

#################################################################
# Lincoln Stein's first live example, modified and commented by Moshell - 1/25/99
#################################################################
#
# This script demonstrates how CGI.pm captures your data into the default object's
# fields, and replays them when the script cycles. This script uses the functional
# rather than the object oriented option. The functional mode is selected by
# the command
#
use CGI ":standard";
#
# Note that Stein's syntax in his on-line tutorial, which looks like
# - use CGI qw(:standard) - doesn't work on our system,
# And so I used the above quoted form  instead.
#
########################## How It Works ##########################
#
# The CGI library establishes a set of functions which can be called without the
# usual &functionname() syntax. These functions return string values, and so they
# normally are embedded in a print statement. But you'll see other uses later.
#
# Before this code begins, CGI has already read the standard input stream; if it
# found anything in the input, it split out the (key=value) and parsed their
# contents into the default param object. If it encounters named fields (like 'yourname')
# as parameters of functions like textfield below, it checks to see if a matching initial value
# is stored in the param object, and plugs the value in as the textfield's initial value.
#
# You should walk through this CGI example, and compare it line-by-line with the
# resulting HTML code which is below. This will help you to figure out what the
# various functions like h1 and p and start_form are emitting.
#
print header;
print start_html('A Simple Example'),
    h1('A Simple Example'),
    start_form,
    "What's your name? ",textfield('yourname'),
    p,
    "What's the combination?",    # Note that the print statement has simple text, too.
    p,
    checkbox_group(-name=>'words',
                   -values=>['eenie','meenie','minie','moe'],
                   -defaults=>['eenie','minie']),
    p,
    "What's your favorite color? ",
    popup_menu(-name=>'color',
               -values=>['red','green','blue','chartreuse']),
    p,
    submit,
    end_form,
    hr; # Here's the sammy colon that ends the long print command.

# This is diagnostic code, to show you how to get at the contents of
# param.
#
if (param()) {    # Namely, if it's non-null
    print
        "Your name is",em(param('yourname')),
        p,
        "The keywords are: ",em(join(", ",param('words'))),
        p,
        "Your favorite color is ",em(param('color')),
        hr;
}
print end_html;
#
# End of CGI Example 1
#####################################################

This is a multi-pass script.

PASS 1 is when you invoke the script from a direct link or from your browser. param comes in empty, so only the first part is used. It emits the form.

PASS 2 is when it is invoked by the form which it produced in Pass 1. In this case it sees the param values and so prints them out for us all to see.
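To take some of the mystery out of the two passes, here is a rough sketch of the kind of work that has to happen to the submitted data: split it at the & signs, split each piece at the = sign, and undo the URL encoding. The parse_pairs function and the sample input string are my own inventions for illustration; this is NOT how CGI.pm is actually written, and it ignores many details the real library handles.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "key=value&key=value" into a hash of lists.
sub parse_pairs {
    my ($raw) = @_;
    my %param;
    foreach my $pair (split /&/, $raw) {
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        foreach ($key, $value) {
            tr/+/ /;                                # a + stands for a space
            s/%([0-9a-fA-F]{2})/chr(hex($1))/ge;    # %XX stands for one character
        }
        push @{ $param{$key} }, $value;   # a key may repeat (the checkboxes)
    }
    return %param;
}

# PASS 1: nothing was submitted, so there are no parameters.
my %pass1 = parse_pairs("");

# PASS 2: the form came back with data.
my %pass2 = parse_pairs("yourname=Whompus+Jones&words=eenie&words=minie&color=blue");

print "pass 1 has ", scalar(keys %pass1), " parameters\n";
print "name: $pass2{yourname}[0]\n";
print "words: ", join(", ", @{ $pass2{words} }), "\n";
print "color: $pass2{color}[0]\n";
```

Storing a list under each key is what lets a checkbox group such as 'words' come back with several values at once.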

Here's the HTML which comes out on the second pass. (You DO know how to capture the source code for a Web page don't you?)

<HTML><HEAD><TITLE>A Simple Example</TITLE>
</HEAD><BODY><h1>A Simple Example</h1>
<FORM METHOD="POST"  ENCTYPE="application/x-www-form-urlencoded">
What's your name? <INPUT TYPE="text" NAME="yourname" VALUE="Whompus">
<p>
What's the combination?<p>
<INPUT TYPE="checkbox" NAME="words" VALUE="eenie" CHECKED>eenie
<INPUT TYPE="checkbox" NAME="words" VALUE="meenie">meenie
<INPUT TYPE="checkbox" NAME="words" VALUE="minie" CHECKED>minie
<INPUT TYPE="checkbox" NAME="words" VALUE="moe">moe
<p>
What's your favorite color?
<SELECT NAME="color">
<OPTION  VALUE="red">red
<OPTION  VALUE="green">green
<OPTION SELECTED VALUE="blue">blue
<OPTION  VALUE="chartreuse">chartreuse
</SELECT>
<p><INPUT TYPE="submit" NAME=".submit"></FORM>

<!-- The rest of this stuff is the diagnostic output which only occurs on the second pass. -->
<!-- These comments were added by Moshell -->

<hr>Your name is<em>Whompus</em>
<p>
The keywords are: <em>eenie, minie</em>
<p>
Your favorite color is <em>blue</em>
<hr>
</BODY></HTML>

Now, the how. This form has one strange feature: the FORM tag has no ACTION attribute, so it doesn't tell the browser where to go to submit the form. There's only one possible source of that information, and that's the address the browser used when we fetched the form in the first place. So the browser submits the form back to that same address, which is the script itself.

Remark: You don't need to use CGI.pm for Lab 1, but you might as well start getting used to it, no? So it's up to you. By the time you get to Lab 2, this property of recycling a form will become very useful. (However, I wonder how it works with respect to A-B cycling, which I've always solved with hidden variables. Let's find out! Next lecture, folks.)

Query 6.2: Modify the above example, so that when you submit the query, if your name doesn't contain the string "Stein", then the system just puts up a screen that says "Sorry, you aren't Stein." This would be a "dead end" because you won't give it a submit button. But you can give it a message saying "use the BACK button to back up and try again."

But if you submit a string with the proper name in it, you get the entire output including a refreshed form in which the name Stein has been REPLACED by the name Stone, and a message saying "Immigration officers often changed German names to English ones."

Well ... that's enough for one lecture, methinks.
