Scoring a Web Page with Perl and LWP
by Brent Michalski
Oct. 30, 1998

Not only am I a Perl fanatic, I am also a hockey fanatic. If I am not hacking Perl, chances are I am at the rink, either playing hockey or watching my kids play. I like to keep up with the NHL scores, and there are many places to find them on the Internet, but I wanted something really simple to get the scores for me, without the ads or frames.

There are tons of modules available for Perl. If you want to do a specific task with Perl, chances are pretty good that you can find a module that does either what you want or something pretty close. When it comes to tasks on the Web, there is a complete library of modules that can help you out.

In the article this week, I'll show you how to use the LWP module to fetch a Web page. Once the page is retrieved, we will remove everything except the data we are after -- in this case, recent scores from games in the NHL.

Diving in

The program requires that you have the LWP module installed. LWP is distributed on CPAN as the libwww-perl bundle. Some Perl distributions ship with it, but it is not part of the core Perl distribution, so you may need to install it yourself.

The program is run from this form interface. I have numbered the lines of code; as always, the line numbers are not part of the program. You can also view the program without the line numbers. The line numbers make it easier for me to talk about the program.
1: #!/usr/bin/perl
2: ## Program Name: fetch.cgi
3: use CGI;
4: use CGI::Carp qw(fatalsToBrowser);
5: use LWP::Simple;
6: $q = new CGI;
7: print $q->header;
8: $year = $q->param('year');
9: $month = $q->param('month');
10: $day = $q->param('day');
11: $date = $year.$month.$day;
12: $URL = "http://tsn.sportingnews.com/nhl/scoreboard/$date/";
13: unless (defined ($page = get($URL))) {
14:     die "There was an error getting URL: $URL\n";
15: }
16: @page = split(/\n/,$page);
17: &page_header;
18: $start=0;
19: foreach $line (@page){
20:     print "$line\n" if($start == 1);
21:     if($line =~ /Past Scores/){
22:         $start = 1;
23:     }
24:     last if($line =~ /--\/Body--/);
25: }
26: &page_footer;
27: ## End of program.
28: sub page_header{
29: print<<HTML;
30: <HTML><HEAD><TITLE>NHL Scores</TITLE></HEAD>
31: <BODY BGCOLOR="#FFFFFF">
32: <CENTER><FONT SIZE=5 FACE=ARIAL> NHL Scores for $month/$day/$year</FONT>
33: </CENTER><P><HR WIDTH=85%><P><CENTER>
34: HTML
35: }
36: sub page_footer{
37: print<<HTML;
38: </CENTER><P><HR WIDTH=85%><P>
39: </BODY></HTML>
40: HTML
41: }

Line-by-line explanation

Line 1: Tells the Web server where to find the Perl interpreter. This line will vary depending on where Perl is installed on your server, so make any necessary changes. On a UNIX server, this line is required. If you are running this program on an NT server, this line is not required but won't hurt anything if included.

Line 2: A comment that tells us what the program name is. It is a good idea to place the name of the program at the top so when you print it, you can easily tell what program you are looking at.

Line 3: Loads the CGI.pm module into the program.

Line 4: Loads the CGI::Carp package. CGI::Carp is part of the standard CGI.pm distribution, and it gives you more graceful error messages. By importing fatalsToBrowser, we get most fatal error messages sent to the browser rather than the vague 500 Internal Server Error message.
Using the CGI::Carp package is very valuable for debugging -- I recommend it.

Line 5: Loads the LWP::Simple module. The Simple module is described as "A simplified interface to the mammoth libwww bundle."

Line 6: Creates a new CGI object and calls it $q.

Line 7: Prints the standard header for CGI scripts. The header tells the browser what kind of data it is about to receive. This line is equivalent to the following line: print "Content-type: text/html\n\n";

Lines 8-10: These lines get the data from the calling Web form and store it in variables for us.

Line 11: Creates a variable called $date using the values we obtained above. The dot (.) in Perl is the concatenation operator. It is used to join strings together. On this line, we use it to join the year, month, and day into a single string. This new variable is used below to fetch the scores for the date that we want.

Line 12: We create a new variable called $URL. This is the actual URL of the Web document we want to get.

Line 13: The get function in LWP::Simple returns undef if there is an error, so we use this line to make sure an error did not occur. The unless statement means the same thing as saying if not. So we are saying "if the return value of the operation is not defined, enter this block of code." The defined in the code checks whether the value returned from the call is defined. Remember, get returns undef if there is an error retrieving the Web page. $page = get($URL) gets the Web page, if it can, and stores the HTML from the Web page in the variable $page.

Line 14: Tells the program to die and display the message we specified. It only gets executed if there was a problem getting the Web page.

Line 15: Closes the unless code block.

Line 16: Takes the variable $page and splits it wherever it encounters a newline (\n) character. The resulting pieces are stored in the array called @page.
Line 17: Calls the page_header subroutine, which prints out the HTML we want displayed.

Line 18: Sets a variable called $start to 0. We use this variable to help us capture only the HTML that we are interested in.

Line 19: Begins a foreach loop that goes through each element in the @page array. Each time through, we store the current value in a variable called $line.

Line 20: Prints the current line if the value of $start is equal to 1. Because this test comes before the check on line 21, the line that contains Past Scores is not itself printed.

Line 21: Checks the value of $line to see if it contains Past Scores. I did this because when I went to the original site and looked at the HTML, I saw that I could discard everything up to this point in the document.

Line 22: Sets $start to 1. When $start is equal to 1, the data gets printed. When it is not 1, the data is skipped.

Line 23: Closes the if block.

Line 24: Tells Perl that this is the last time through the foreach loop if the value of $line contains --/Body--. When Perl encounters the last command, it immediately exits the loop.

Line 25: Closes the foreach loop.

Line 26: Calls the page_footer subroutine to tidy up the HTML output.

Line 27: A comment telling us we have reached the end of the program. The rest of the code is contained in subroutines.

Line 28: Begins the page_header subroutine. This subroutine prints out the HTML we want the page to begin with.

Lines 29-34: A here document that prints out the text between line 29 and the terminating HTML label on line 34.

Line 35: Ends the subroutine.

Line 36: Begins the page_footer subroutine. This subroutine prints out the HTML to finish up the page.

Lines 37-40: A here document that prints out the text between line 37 and the terminating HTML label on line 40.

Line 41: Ends the subroutine.

Wrapping it up

I bet that using Perl to grab Web pages was a lot easier than you thought it would be.
Perl is very good at making easy tasks easy, and hard tasks possible.

Fetching the information is the easy part. Once you have the information you're after, you'll need to unleash your imagination and do something with it.

Keep in mind that this is a very simple program that is really only geared for hockey lovers. You all love hockey, don't you? But it could easily be adapted to gather other sports scores, custom weather reports, news and information, etc. You are definitely not limited to sports-related information. So get those creative juices flowing and use your new knowledge to do something with this program!

See you next week!
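
One possible starting point for adapting the program is to pull its two reusable steps into subroutines: building the date string (line 11) and keeping only the lines between two markers (lines 18-25). The sketch below is not part of fetch.cgi. The names build_date and extract_between are hypothetical, the zero-padded YYYYMMDD format is an assumption about what the scoreboard URL expected, and the sample page is made up so the code runs without a network connection.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original program): build the date string
# that line 11 creates by concatenation. Zero-padding the month and day is
# an assumption about the format the remote site expected.
sub build_date {
    my ($year, $month, $day) = @_;
    # Accept digits only, so stray form input never reaches the URL.
    for my $part ($year, $month, $day) {
        return unless defined $part && $part =~ /^\d+$/;
    }
    return sprintf '%04d%02d%02d', $year, $month, $day;
}

# Hypothetical helper: mirror the $start flag loop on lines 18-25.
# Lines are kept once a line matching $start_re has been seen; the loop
# stops after the first line matching $end_re. As in the original loop,
# the start-marker line is skipped and the end-marker line is kept.
sub extract_between {
    my ($page, $start_re, $end_re) = @_;
    my @kept;
    my $started = 0;
    for my $line (split /\n/, $page) {
        push @kept, $line if $started;
        $started = 1 if $line =~ $start_re;
        last if $line =~ $end_re;
    }
    return @kept;
}

# Made-up sample standing in for the HTML that get($URL) would return.
my $sample = join "\n",
    '<html>', 'banner ad', '<b>Past Scores</b>',
    'Detroit 3, Dallas 2', '<!--/Body-->', '</html>';

my @scores = extract_between($sample, qr/Past Scores/, qr{--/Body--});
print "$_\n" for @scores;
print build_date(1998, 10, 30), "\n";
```

With the markers passed in as arguments, gathering a different kind of page should mostly be a matter of choosing two strings that bracket the data you want, which you can find the same way the article did: by reading the page's HTML.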
Web Techniques and Web Design and Development copyright © 1995-99 Miller Freeman, Inc. ALL RIGHTS RESERVED