webreview.com - Cross-Training for Web Teams

 
 

A Songline PACE Production



Searching Newsgroups with Perl

by Brent Michalski
Nov. 20, 1998
 
 

In my last column, I showed you how to grab information from a Web page and redisplay it. That was a brute-force method for accessing data from a Web server. It works fine for specific types of applications, but becomes less effective for gathering data from dynamic sites, like search engines, or for gathering more than a single, pre-determined type of data. This week I'll use another Perl module to access search engines programmatically, so rather than just pulling data off a Web page, we'll move to a lower level and work directly with the search engine's CGI scripts.

There are many times when I need to find a solution to a programming problem, and chances are that I'm not the first person who has had the problem. If others have had this problem, they may have posted a question or, hopefully, an answer to a relevant Internet newsgroup.
 
View the demo of this week's program.

As you may know, Deja News archives all newsgroup messages and also provides a good search interface to that archive. Using Deja News you can quickly search through newsgroups to find specific messages. But, I am too lazy (laziness being the number 1 characteristic of a Perl programmer) to go to Deja News and use the Deja News interface to find the answer. The solution? Write a Perl program to do the searching for me, of course!

Searching Deja News 

The WWW::Search module that we'll use for this article has interfaces for most of the major search engines. Currently it supports AltaVista (both Web and newsgroups), Deja News, Excite (Web only), HotBot (Web only), Infoseek (email, Web, and newsgroups), Lycos, Magellan, PLweb, SFgate, Simple (retrieve links on a page), Verity and Gopher. For the purposes of this article, I will only use the Deja News interface and I'll search only the comp.lang.perl.misc newsgroup.

Note: To use this module, you or your system administrator must first install the WWW::Search Perl module on your server. If you are unsure whether it is installed, check with your system administrator. Installation instructions are included in the module. For Windows users, Jim Smyser was nice enough to port it for your use! The Windows version is at: http://pubinfo.phx.primenet.com/www.search/.

I have limited the number of matches the program returns to 50. I did this to help conserve bandwidth since this is really just an educational tool. I do, however, plan on refining this script to provide a simple interface for all Perl programmers to use. The power of this simple tool is awesome when you consider how many solutions to programming problems are archived at Deja News!

Diving in 

Grabbing the results from Deja News is very simple, thanks to the module. Once we have the data we are looking for, we'll display it in a visually pleasing manner to make it easy for the users to read.

The program is run from this form interface. I have numbered the lines of code to make it easier for me to talk about the program. As always, the line numbers are not part of the program. You can also view the program without the line numbers.

 1: #!/usr/bin/perl
 2: use CGI qw/:standard/;
 3: use CGI::Carp qw(fatalsToBrowser);
 4: print header;
 5: use WWW::Search;
 6: use WWW::SearchResult;
 7: $query = param('search');
 8: &print_heading;
 9: &perform_search;
10: print "</TABLE><P><HR WIDTH=70%></BODY></HTML>";
11: exit;
12: sub perform_search{
13:   my($search) = new WWW::Search('Dejanews');
14:   $max = $search->maximum_to_retrieve(50);
15:   $search->native_query(
        WWW::Search::escape_query($query),
16:     {groups => 'comp.lang.perl.misc'});
17:   @results = $search->results();
18:   $matches = @results;
19:   if($matches == 50){
20:     $message = "<B><I>There were more than
        50 matches.<BR>
21:     You may want to refine your search
        and try again.</I></B>";
22:   } else {
23:     $message = "There were <B>$matches</B> results.";
24:   }
25:   print "<TR><TD COLSPAN=2><CENTER>
        <FONT SIZE=2 FACE=ARIAL>";
26:   print "$message<HR><P></FONT></CENTER></TD></TR>";
27:   foreach $result (@results){
28:     $url   = $result->url;
29:     $title = $result->title;
30:     $desc  = $result->description;
31:     $date  = $result->change_date;
32:     ($extra,$author) = split(/Author:/, $desc);
33:     print<<STOP;
34:      <TR>
35:       <TD><FONT SIZE=2 FACE=ARIAL>
36:        <B>Title :</B>
37:       </FONT></TD>
38:       <TD><FONT SIZE=2 FACE=ARIAL>
39:        <A HREF="$url">$title</A>
40:       </FONT></TD>
41:      </TR>
42:      <TR>
43:       <TD><FONT SIZE=2 FACE=ARIAL>
44:        <B>Author:</B>
45:       </FONT></TD>
46:       <TD><FONT SIZE=2 FACE=ARIAL>
47:        $author
48:       </FONT></TD>
49:      </TR>
50:      <TR>
51:       <TD><FONT SIZE=2 FACE=ARIAL>
52:        <B>Date:</B>
53:       </FONT></TD>
54:       <TD><FONT SIZE=2 FACE=ARIAL>
55:        $date
56:       </FONT></TD>
57:      </TR>
58:      <TR>
59:       <TD COLSPAN=2>
60:        <HR WIDTH=60%>
61:       </TD>
62:      </TR>
63: STOP
64:   } # End of foreach
65: } # End of perform_search
66: sub print_heading{
67:  print<<STOP;
68:   <HTML><HEAD><TITLE>Search Results</TITLE></HEAD>
69:    <BODY BGCOLOR="#FFFFFF">
70:     <FONT SIZE=4 FACE=ARIAL><CENTER>
71:      Search Results<BR>
72:     </FONT>
73:     <FONT SIZE=2 FACE=ARIAL>
74:      You Searched for: <FONT COLOR="#0000FF">$query
        </FONT>
75:     </FONT></CENTER>
76:    <HR WIDTH=70%>
77:    <P>
78:    <CENTER>
79:     <TABLE BORDER=0>
80: STOP
81: } # End of print_heading

Line-by-line explanation 

Line 1: Tells the program where to find Perl on the Web server. This line will vary depending on where Perl is installed on your server so you need to make any necessary changes. On a UNIX server, this line is required. If you are running this program on an NT server, this line is not required but won't hurt anything if included. 

Line 2: Loads the CGI.pm module into the program and imports the standard function definitions. By using the standard definitions, you don't have to use the $cgi->function conventions.

Line 3: Loads the CGI::Carp package. CGI::Carp is part of the standard CGI.pm distribution and gives you more graceful error messages. By using fatalsToBrowser, we get most of the error messages in the browser rather than the uninformative 500 Internal Server Error. CGI::Carp is a very valuable debugging tool, and I recommend using it.

Line 4: Prints the standard header for CGI scripts. The header tells the browser what kind of data the Web server is sending. This line is roughly equivalent to the following line:

print "Content-type: text/html\n\n";
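As a rough illustration of what that header amounts to, the sketch below builds the same kind of string by hand. The helper name content_type_header is hypothetical and exists only for this example; CGI.pm's header function accepts more options than this and may emit additional fields.

```perl
use strict;
use warnings;

# Hypothetical helper: build a minimal CGI header for a given MIME type.
# The second newline produces the blank line that separates the header
# from the page body.
sub content_type_header {
    my ($type) = @_;
    $type = 'text/html' unless defined $type;
    return "Content-type: $type\n\n";
}

print content_type_header();    # header for an HTML page
print "<HTML><BODY>Hello</BODY></HTML>\n";
```

Whatever the script prints after the blank line is treated as the document itself, which is why the header has to come before any HTML.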

Lines 5-6: Tell Perl that we are going to use the WWW::Search and WWW::SearchResult modules. The WWW::SearchResult module provides some extra functions that help us manage the results of the searches more easily.

Line 7: Creates a variable called $query and fills it with the value passed from the search variable on the calling Web page.
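To give a feel for what param is doing on our behalf, here is a minimal sketch that decodes a URL-encoded query string by hand. The helper name parse_query_string and the sample string are assumptions made up for this example; CGI.pm itself handles POST data, repeated fields, and other cases that this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical helper: decode a URL-encoded query string into a hash.
# This only illustrates the idea and is not how CGI.pm is implemented.
sub parse_query_string {
    my ($qs) = @_;
    my %form;
    foreach my $pair (split /[&;]/, $qs) {
        next unless length $pair;                  # skip empty pieces
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        foreach ($name, $value) {
            tr/+/ /;                               # '+' stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX is a hex-encoded byte
        }
        $form{$name} = $value;
    }
    return %form;
}

my %form  = parse_query_string('search=perl+%26+cgi');
my $query = $form{search};    # the decoded value of the "search" field
print "$query\n";
```

In the real program, one call to param replaces all of this, which is a good reason to let the module do the work.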

Line 8: Calls the print_heading subroutine. This subroutine takes care of setting up the initial HTML for the results page.

Line 9: Calls the perform_search subroutine. You guessed it! This subroutine does the searching and prints the results.

Line 10: Finishes off the HTML code for the results page.

Line 11: Exits the program. We are done at this point, since all of the work is done in the subroutines. The exit; is not required in Perl programs, because Perl knows when the program is finished. I like to add it because I feel it offers a little extra clarity when looking at the code.

Line 12: Begins the perform_search subroutine.

Line 13: Creates a new instance of WWW::Search (using the 'Dejanews' interface) and stores the reference in a variable called $search.

Line 14: Sets the maximum number of search results to 50. The module returns the results in the same order as you would see them if you did the search via the search engine's interface. Any results beyond the maximum are disregarded.

Lines 15-16: This sets up the search. These two lines are actually one statement that I wrapped for clarity.

native_query is the WWW::Search function that stores the query. The search itself is not sent to the search engine until we ask for the results in line 17.

WWW::Search::escape_query transforms any special characters into escaped characters so we have a proper URL to use for the search.

The groups=> option lets us specify the newsgroup(s) we want to search. The default is all newsgroups. We are only searching the comp.lang.perl.misc newsgroup.
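To make the idea of escaping concrete, here is a minimal sketch of that kind of transformation. The helper name escape_for_url is hypothetical, and this is a stand-in written for illustration, not the module's own code. It assumes plain single-byte text.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the kind of escaping described above:
# leave letters, digits, and spaces alone, hex-encode everything else,
# then turn spaces into '+'. Assumes single-byte characters.
sub escape_for_url {
    my ($text) = @_;
    $text =~ s/([^ A-Za-z0-9])/sprintf('%%%02X', ord($1))/ge;
    $text =~ s/ /+/g;
    return $text;
}

print escape_for_url('perl & cgi'), "\n";
```

Without this step, a character like & in the user's query would be read as a separator between form fields instead of as part of the search text.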

Line 17: Calls the results function from the module and stores the results in an array called @results.

Line 18: Stores the number of matches in the variable $matches.

Line 19: An if..else block. This is provided as a way to let the users know if they had "too many" matches. This line begins the block. We enter the if portion if the number of results equals 50. I didn't use > 50 because, since we limited the results in line 14, we will never get more than 50 results.

Lines 20-21: Set the $message variable to something that lets the user know that they had the max number of results.

Line 22: The else condition.

Line 23: Sets the $message variable to something that lets the user know how many results were found.

Line 24: Closes the if..else block.
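The counting and the if..else test from lines 18-24 can be sketched on their own, away from the search module. The helper name result_message is hypothetical, and the sample data is made up. Note one deliberate difference: when the cap is reached, this sketch says "at least" rather than "more than", since exactly 50 matches would also trigger it.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring lines 18-24: choose a message based on
# how many results came back and the cap that was requested.
sub result_message {
    my ($max, @results) = @_;
    my $matches = @results;    # an array in scalar context yields its length
    if ($matches == $max) {
        return "There were at least $max matches. "
             . "You may want to refine your search and try again.";
    }
    return "There were $matches results.";
}

print result_message(50, ('hit') x 3), "\n";
```

Assigning an array to a scalar, as line 18 does, is the usual Perl idiom for getting the number of elements.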

Lines 25-26: Print the user message at the top of the table.

Line 27: Begins a foreach loop that traverses each item in the @results array and stores the value in $result. The value we are storing in $result is a reference to the current search result.

Lines 28-31: Store the search results in some easy to remember variable names. 
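If looping over object references is new to you, the following sketch shows the same pattern with a made-up class. My::FakeResult, its data, and the example.com URLs are all hypothetical; only the accessor names url and title are borrowed from the program above.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a search result object, with accessor
# methods named like the ones the program calls on each result.
package My::FakeResult;

sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}
sub url   { return $_[0]{url} }
sub title { return $_[0]{title} }

package main;

my @results = (
    My::FakeResult->new(url => 'http://example.com/1', title => 'First'),
    My::FakeResult->new(url => 'http://example.com/2', title => 'Second'),
);

my @lines;
foreach my $result (@results) {
    # $result is a reference to an object; -> calls a method on it
    my $url   = $result->url;
    my $title = $result->title;
    push @lines, "$title: $url";
}
print "$_\n" for @lines;
```

Copying the values into plain scalars first, as lines 28-31 do, also makes them easy to interpolate into the here document that follows.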

Line 32: The description for this particular search engine contained some information we didn't need so I split off just the author information.
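Here is a small sketch of that split in isolation. The description string is invented for the example, and the real text returned for a Deja News result may be laid out differently. The guard and the trimming are additions that the original program does not perform.

```perl
use strict;
use warnings;

# Hypothetical description text; the real format returned for a
# Deja News result may differ.
my $desc = 'Newsgroup: comp.lang.perl.misc Author: Jane Doe';

# Everything before "Author:" lands in $extra, everything after in $author.
my ($extra, $author) = split /Author:/, $desc;

# Guard against descriptions with no "Author:" marker, then trim spaces.
$author = '' unless defined $author;
$author =~ s/^\s+|\s+$//g;

print "$author\n";
```

Because split returns a list, assigning it to ($extra, $author) lets us keep the part we want and simply ignore the rest.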

Lines 33-63: A "here document" that displays the current search result information in the Web page.
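For readers who have not used here documents, this short sketch shows one in miniature. It also escapes the value before interpolating it, which the program above does not do; the helper name escape_html and the sample title are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: make a value safe to drop into HTML text.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must come first so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

my $title = escape_html('Re: sorting <STDIN> & hashes');

# Everything up to the line containing only STOP is the string;
# $title is interpolated just as in a double-quoted string.
my $row = <<"STOP";
<TR><TD><B>Title :</B></TD><TD>$title</TD></TR>
STOP

print $row;
```

Escaping matters most for text that comes from outside the program, such as message titles or the user's own query, because stray < or & characters would otherwise be read as markup.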

Line 64: Ends the foreach loop.

Line 65: Ends the perform_search subroutine.

Line 66: Begins the print_heading subroutine.

Lines 67-80: This whole subroutine is a here document which prints out the beginning HTML for the search results page.

Line 81: Ends the print_heading subroutine.

Wrapping it up 

Wasn't that easy? Modules can sure make a Perl programmer's life easier. Imagine how much work it would take to do this same thing if we didn't have the module! This example is just a taste of the power available in the WWW::Search module. Now you are ready to make your own customized search interfaces using WWW::Search.



Web Review copyright © 1995-99 Songline Studios, Inc.
Web Techniques and Web Design and Development copyright © 1995-99 Miller Freeman, Inc.
ALL RIGHTS RESERVED