UMBC CMSC 331 Spring 2011
Principles of Programming Languages


Homework 5
Some Fun With Perl, Forsooth

out: April 13, due: April 27

Perl is useful for many things, including file manipulation and string processing.

William Shakespeare is credited with writing almost forty plays, plus numerous sonnets and other poems. Questions about whether he wrote all the works credited to him have been debated for at least two hundred years. One way to assess authorship of the Shakespeare corpus, or indeed any collection of text, is to look at statistical properties of vocabulary.

The Shakespeare corpus is available here as a tarfile. It is also available on GL in the directory
/afs/umbc.edu/users/n/i/nicholas/pub/cs331spring2011/shakespeare

This assignment has four parts. Please prepare a Perl script (you can call it hw5.pl, for example) that does the following:

1. The first line of each file in the shakespeare directory contains the title of the play. For example, the first line of hamlet.txt is "The TRAGEDY OF HAMLET, PRINCE OF DENMARK". Read each file in the directory, and print the title of the play.

2. The word "forsooth" is no longer widely used in English, although it has been in the language for more than a thousand years. Have your Perl program answer the question: how many times does this word occur in the Shakespeare corpus? (Hint: the answer is more than forty, but fewer than fifty.) Don't just have your program invoke fgrep and wc with the appropriate parameters and print the output.

3. Using a hash table, determine how many times each word in the corpus occurs. Discard punctuation, and regard upper case and lower case versions of a word as the same word, e.g. "Fie!" is the same as "fie". Which words, if any, occur more than 5000 times? For questions 2 and 3, assume that the hyphen character '-' is not punctuation.

4. Finally, what are the ten most common words in the Shakespeare corpus? Print each such word, and the number of times that word occurs, in descending order by frequency.
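
The counting and ranking described in parts 3 and 4 can be sketched on a tiny in-memory sample. This is only one possible reading of the rules, not a reference solution: the helper names words_in, count_words and top_n are hypothetical, the two sample lines are made up, and the decisions to drop every character other than letters, digits, hyphens and whitespace, and to ignore tokens made only of hyphens, are assumptions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split one line into lower-case words.
# Assumption: "punctuation" means anything that is not a letter, digit,
# hyphen, or whitespace; the hyphen is kept, as the assignment asks.
sub words_in {
    my ($line) = @_;
    $line = lc $line;                    # "Fie" and "fie" are the same word
    $line =~ s/[^a-z0-9\s-]//g;          # discard punctuation, keep hyphens
    # Assumption: a token made only of hyphens (e.g. "--") is not a word.
    return grep { /[a-z0-9]/ } split ' ', $line;
}

# Build a hash mapping each word to the number of times it occurs.
sub count_words {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        $count{$_}++ for words_in($line);
    }
    return %count;
}

# Return the $n most frequent words, most frequent first.
# Ties are broken alphabetically so the order is repeatable.
sub top_n {
    my ($n, %count) = @_;
    my @sorted = sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}

# Made-up sample lines standing in for text read from the corpus files.
my @sample = (
    "Fie! fie, FIE upon it.",
    "Well-met, well-met; it is well.",
);

my %count = count_words(@sample);
printf "%-10s %d\n", $_, $count{$_} for top_n(3, %count);
```

On the sample lines this reports fie (3), it (2) and well-met (2). For the real corpus, the lines would come from the files in the shakespeare directory rather than from a hard-coded list, and the alphabetical tie-break is a design choice rather than something the assignment specifies.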

What to hand in

Submit a single file containing your Perl code, using the submit command on the GL system with course cs331 and project hw5.