Engineering Log -- ICS 613 -- Brian Jaress: WebSpider Review

Today's assignment is WebSpider Review. I'm reviewing Ben Karsin's spider.

Installation

The package installed fine, including all the ant files. Everything verified and ant jar worked.

The one thing that gave me a little trouble was the logging. Running the program with -logging at the end of the command line didn't log anything. I was pretty confused by that until I checked the source code. All the logging code was just commented out.

Aside from that minor snag, all was well.

Code Format Fixes

All in WebSpider.java:

Line	Rule	Rule Description
13	ICS-SE-Java-4	Do not include // comments at the top of the file.
40	EJS-27	Pluralize the names of collection references.
152	EJS-36	Use standard comments to hide code without removing it.

Test Cases

Black Box

The testing wasn't really being done from a black box perspective. I think there was just a decision to go with white box.

Because of that, there isn't much partitioning into equivalence classes. The JUnit tests cover a typical case (starting at http://www.hackystat.org) and the case of a bad URL for the starting page (http://). They also cover both of the command line arguments.

Some equivalence classes that could be added:

Pages that link to themselves.
Cycles of more than one page.
"Diamond" linking (A links to B and C, B and C both link to D).
Pages with links to missing pages.
Pages with bad URLs in their links.
Pages with no links.
Running out of pages before the maximum.

These would be tested mostly for the result rather than whether or not they throw an exception. The hackystat.org test probably already has several of these, but it's only being tested for exceptions.

White Box

Test coverage is much better from a white box perspective.

Emma reports code coverage of:

Emma Coverage summary
class:   100% (2/2)
method:  100% (7/7)
block:   95%  (346/364)
line:    94%  (68/72)

A couple of white box tests that could be added:

Relative links.
Duplicate links on the same page.

Some of the things listed under black box could also be seen as white box tests because there's specific code to handle them.

Breaking

Page with no links:

$ java -jar *jar -mostpopular 'http://www2.hawaii.edu/~jaress' 10
Exception in thread "main" java.lang.NullPointerException
       at edu.hawaii.webspider.WebSpider.main(WebSpider.java:133)

Lessons Learned

My spider actually has a very similar flaw:

$ java -jar *jar -mostpopular 'http://www2.hawaii.edu/~jaress' 10
Most Popular: http://www2.hawaii.edu/~jaress/ with null

(It should say "0" instead of "null".)

Probably the best way to fix that would be initialize the count for the start page to zero.

Seeing Ben's approach also helped me decide what I should have done differently. There were parts of my spider that I wasn't happy with, and I was thinking about how I should have done it. After seeing Ben's code I decided that the root problem was over-generalizing.

I had created a helper class that was designed as if I had no clue what it was actually going to be used for. For example, it would return objects defined in HttpUnit -- with all their error-prone methods -- in case the calling code wanted to do something wacky with it like analyze the text. It also forced the calling code to decide when it had seen enough pages, in case the decision needed to be based on some complicated logic.

That was a Bad Idea, and my helper should have:

Pulled the link data out of the HttpUnit object, put it into a simpler object, and returned that.
Kept track of the maximum number of pages.

Engineering Log -- ICS 613 -- Brian Jaress

Saturday, September 29, 2007

WebSpider Review