UMBC CMSC 331 Principles of Programming Languages

	UMBC CMSC 331 Spring 2011 Principles of Programming Languages
Home · About · Schedule · HW · Exams · Notes · Code · Examples · Resources ·

Homework 6
Using Perl to Mine the Web

out: April 27, due: May 11

This assignment has two parts. Please prepare two perl scripts, which you can call hw6a.pl and hw6b.pl.

(a) Given a well-formed URL, e.g. http://www.umbc.edu, read the contents of that web page and print a list of the sites that web page is linked to. This may involve using Perl modules to fetch the URL's contents, and parse the resulting HTML. Your program will take a well-formed URL as input, and the output will be a list of URLs that appear on that web page.

(b) Duplication of web documents occurs when the contents of two different web sites are identical or close to identical. This could be the result of mirroring, for example, but it could also be copying for less noble purposes. There are several ways to compare strings for similarity, one of which being the cosine function used in most IR systems. Another somewhat more robust approach is based on compression.

We can use any of several algorithms to compress a string. It turns out that Compress::Zlib is installed on GL, and that works fine. Whatever algoerithm we use for compression, we use the notation c(x) to refer to the compressed form of the string x. In general, length(c(x)) < length(x) unless x is already compressed. Suppose we have two strings x and y, and consider their concatenation x.y in Perl. If x and y are similar, then the redundancy will cause length(c(x.y)) to be less than would be the case if x and y were different. This observation leads to something called compression based similarity. (See Li et al., "The Similarity Metric", IEEE Transactions on Information Theory, December 2004.pdf) Li gives a formula for this similarity

NCD(x,y) = (C(x.y) - min{C(x),C(y)}) / max{C(x),C(y)}

where x and y are strings, x.y is the concatenation of x and y, and C(x) refers to the length of the compressed string x. NCD is always positive, but values close to zero indicate similar strings. For example, in my prototype code, using Zlib, I got these results for different Google pages:

NCD of http://www.google.com and http://www.google.com is 0.04281
NCD of http://www.google.com and http://www.google.fr is 0.186365
NCD of http://www.google.com and http://www.google.fi is 0.238186

(These results suggest that French is closer to English than Finnish is.) Your program will take two URLs as input, and the output will be the numerical similarity score. Your results may not be identical to what is shown since the pages themselves change from time to time.

What to hand in

Submit two files containing perl code using the submit command on the gl system with course cs331 and project hw6a and hw6b.