Generating ultimate movie wishlist with Perl and IMDB

Leonid Mamchenkov

20 years ago

One of the things that will go into history with the year 2005 is the number of bad movies produced by Hollywood. IMDB says:

Hollywood is mired in its biggest box-office slump in over 20 years.

With all those bad movies around, finding something worth the time and effort becomes increasingly difficult. Luckily, there are two things that can make our lives easier – IMDB and Perl.

Combining the two, all sorts of interesting things can be achieved. In particular, an ultimate movie wishlist can be generated.

If you are here just for the script, then here is movie_wishlist.pl. If you want just the result, then here is wishlist.html. Otherwise, read on for an explanation of how it works and how you can make it better.

I’ll start from the very beginning.

Writing the script

Trying to figure out where to get good movies, I asked Martin – a DVD club owner – to bring some. Our tastes in movies differ a lot, so it was smart of him to ask for a specific list of films that I wanted to see. I already have a movie wishlist, but most of the stuff there was added because of some vague references that I don’t even remember. I needed something better.

IMDB has a number of lists with Top Movies. All I had to do was select the lists that I liked and import them into a single list that I would give to Martin. Which lists did I select? Not all of them, that’s for sure. I wasn’t interested in box office. I wasn’t interested in video. I wanted the ones with the best ratings. There is Top 250 Movies of All-Time, which is a good start. Additionally, I chose a few lists by genre – Action, Comedy, and a few others. Then I added a few lists by decade – 2000 and 1990.

Now, what do I have to do to combine them all nicely in one list? I decided to download the lists, so that I could inspect them more closely. Nothing fancy here – simple Ctrl+S to ‘Save As’ did the job.

After closer examination two things became obvious:

Dups. Some movies appeared in more than one list. That means I would have to somehow remove dups leaving a single copy in the list.
Parsing. The structure of HTML was complex. Separating the list from the rest of the page could be a problem.

Solving the first problem – removing dups – would be much easier if I had simple lists to deal with, and not complex HTML pages. So I decided to start by looking for a solution to the second problem – parsing.

As with any other task in Perl – TIMTOWTDI – There Is More Than One Way To Do It. Two possible ways to solve the problem were:

Write some regular expressions to clean-up HTML and get the list of films.
Check out CPAN for some helper module.

Since I am a lazy guy and not a big fan of writing HTML-parsing regular expressions, I chose the second way.

Let’s see what CPAN has to offer. The list of results for the query “html” at http://search.cpan.org was all that I needed. The third match was HTML::LinkExtractor – exactly what I was looking for. You see, items in all those IMDB top movie lists look like this:

Position in the list
Rating
Title of the film with the link to the film page
Year the film was produced (if available)
Number of IMDB users who voted for film rating

If I could just extract the link to film page and film’s title, I would have all I need. Hence, HTML::LinkExtractor.

I started with the code

#!/usr/bin/perl -w
use strict;
use HTML::LinkExtractor;

Since I had all the HTML files saved on disk, I could just give them as arguments to my script, and it would then process all of them, some of them, or just one. If no files are specified, the script should just exit without doing anything. Maybe with a small complaint or something…

# Get filenames from the command line
my @files = @ARGV;
die 'No files given' unless (@files);

Now, I was about to get a number of movies and sort out dups. There are again a few ways to do it. Two that came to my mind are:

Have the script run two loops. The first loop would go through all the files one by one, extract film links and titles, and save them in an array. The second loop would then go through the array removing dups.
Have the script run a single loop, which would go over all the files one by one, extract film links and titles, check whether those are already in the list, and if not, then add them.

Since I was writing the script mostly for one-time use and trying to keep it as simple as possible, I decided to go with two loops. That would be more processing for the machine, but simpler for me to write. If the script survived for more than one day, I would know where to optimize it then.
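For comparison, the single-loop alternative might look roughly like the sketch below. It assumes the links have already been extracted into records with a URL, a title and a file name. The sample records and the `add_match` helper are made up for illustration and are not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample records, standing in for links extracted from the saved pages.
my @found = (
    { url => 'http://www.imdb.com/title/tt0000001/', title => 'First Film',  file => 'action.html' },
    { url => 'http://www.imdb.com/title/tt0000002/', title => 'Second Film', file => 'action.html' },
    { url => 'http://www.imdb.com/title/tt0000001/', title => 'First Film',  file => 'top250.html' },
);

my %seen;   # keyed by URL, so a repeated film is detected with one hash lookup

# Create the entry on first sight of a URL; in every case note the file it came from.
sub add_match {
    my ($rec) = @_;
    my $entry = $seen{ $rec->{url} } ||= { title => $rec->{title}, files => [] };
    push @{ $entry->{files} }, $rec->{file};
    return;
}

add_match($_) for @found;

foreach my $url (sort keys %seen) {
    print "$seen{$url}{title}: ", join(', ', @{ $seen{$url}{files} }), "\n";
}
```

Because the hash is checked while the records arrive, no intermediate array of all matches is needed.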

So, I need an array for all my movie matches.

my @matches = ();

Now I want the script to go through the files one by one. In order to avoid weird stuff like choking on line breaks and processing empty lines, I decided to read the whole file into a string variable. That is, read a file line by line, and then combine all the lines into one really long line.

foreach my $file (@files) {
    my $data = '';
    open (FILE, "<$file") or die "Can't open $file: $!";
    while (my $line = <FILE>) {
        chomp($line);
        $data .= $line;
    }
    close(FILE);

Note that the above piece of code is part of a bigger snippet and thus won’t work on its own.
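An alternative to the line-by-line loop is to read the file in one go by temporarily unsetting the input record separator `$/`. This is only a sketch and not what the original script does. The `slurp_flat` helper and the temporary sample file are made up for the demonstration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file into one string and drop the line breaks.
sub slurp_flat {
    my ($file) = @_;
    open(my $fh, '<', $file) or die "Cannot open $file: $!";
    local $/;                          # no record separator: one read returns the whole file
    my $data = <$fh>;
    close($fh);
    $data = '' unless defined $data;   # an empty file yields undef
    $data =~ s/[\r\n]+//g;             # join the lines (also strips \r, which chomp would leave)
    return $data;
}

# Demonstration with a throwaway file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "<p>first\nsecond</p>\n";
close($tmp_fh);

print slurp_flat($tmp_name), "\n";
```

The `local` keeps the change to `$/` confined to the helper, so other reads in the program still work line by line.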

Now that I have a file in a very long line, I would want HTML::LinkExtractor to get me all those links to film pages. Wait a second! Not all of them! All genre and decade lists have 50 top movies AND 10 bottom movies. Oh, no. I don’t need those. Please remove them.

With a short regular expression, I chop off the part with bottom movies from that long string of mine.

    $data =~ s/Bottom\s+Rated\s+.*$//g;
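This substitution relies on the string having no line breaks left: without the `/s` modifier, `.` does not match a newline, so on multi-line text the pattern cannot reach the end of the string. The stand-alone sketch below shows the difference on an invented page fragment (the sample text and variable names are assumptions, not taken from the real IMDB pages).

```perl
use strict;
use warnings;

# Made-up page fragment: top entries first, then the bottom-rated section.
my $multi = "Top Rated\nGood Film\nBottom Rated Movies\nBad Film\nWorse Film\n";

# Flattened copy, the way the script builds it (line breaks removed).
(my $flat = $multi) =~ s/\n//g;

# Same substitution as in the script, applied to both versions.
(my $cut_flat  = $flat)  =~ s/Bottom\s+Rated\s+.*$//g;
(my $cut_multi = $multi) =~ s/Bottom\s+Rated\s+.*$//g;

# With /s the dot also matches newlines, so flattening is not required.
(my $cut_s = $multi) =~ s/Bottom\s+Rated\s+.*$//s;

print "flattened:     [$cut_flat]\n";
print "multi-line:    ", ($cut_multi eq $multi ? "unchanged" : "changed"), "\n";
print "multi-line /s: [$cut_s]\n";
```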

The rest of the really long string can be parsed with HTML::LinkExtractor.

    my $lx = new HTML::LinkExtractor();
    $lx->parse(\$data);

HTML::LinkExtractor parses HTML code and extracts a list of links. Since my HTML pages aren’t very clean, they can, and as a matter of fact do, contain other links too: links to other IMDB and third-party pages that I don’t need at the moment. How do I tell the right links from the wrong ones? Easy. URLs of film pages look like this: http://www.imdb.com/title/tt0057012/ . http://www.imdb.com is the address of IMDB, then comes /title/, then tt, and then the numeric ID of the film. So if I just look for URLs that contain /title/tt, I should be on the right track.

    foreach my $item (@{ $lx->links }) {
        if ($item->{'href'} && $item->{'href'} =~ m#title/tt#) {
            my $text = $item->{'_TEXT'};
            my $url  = $item->{'href'};

HTML::LinkExtractor is nice enough to provide me with both the clean URL and the HTML code of the link that it parsed. From that HTML code I can get the link text, which in my case is the movie title. Here is how I do it:

            $text =~ s#<a\b[^>]*>(.*?)</a>#$1#is;

Let’s see what information I have so far:

URL – link to the film’s page at IMDB. I think I should save it, just in case I would want to get more information about the film later on.
Film title – that’s the most important bit in my whole script.
Name of the file I found the above two in – if the list of movies comes out long, it would be nice to see where all the stuff came from. So I’ll save this one too, especially because I used descriptive file names like action.html when I was saving the lists at the beginning of the process.

Because I decided to have two passes, I don’t have to deal with dups now and can just save this data into my global array.

            my %match = ();
            # Save URL, title, and the file we found it in
            $match{'url'} = $url;
            $match{'title'} = $text;
            $match{'file'} = $file;
            push @matches, \%match;
        }
    }
}
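The filtering and title clean-up steps can be tried on their own, without the module. The sketch below uses hand-written records with the same two keys the script reads (`href` and `_TEXT`); the URLs and titles are invented, and the pattern is slightly stricter than the one above in that it also requires digits after `tt`.

```perl
use strict;
use warnings;

# Hand-written stand-ins for the link records. Only the key names come
# from the script; everything else here is made up.
my @links = (
    { href => 'http://www.imdb.com/title/tt0000001/', _TEXT => '<a href="/title/tt0000001/">First Film</a>' },
    { href => 'http://www.imdb.com/chart/top',        _TEXT => '<a href="/chart/top">Top 250</a>' },
    { _TEXT => '<a name="anchor">No href here</a>' },
);

my @films;
foreach my $item (@links) {
    # Skip records without an href and links that are not film pages.
    next unless $item->{href} && $item->{href} =~ m#/title/tt\d+#;
    (my $title = $item->{_TEXT}) =~ s#<[^>]+>##g;   # drop the tags, keep the text
    push @films, { url => $item->{href}, title => $title };
}

print "$_->{title} -> $_->{url}\n" for @films;
```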

Good! I am halfway through. I have extracted all the links with titles from all the files and saved it all into an appropriate data structure. Let’s remove dups now.

my %uniqs = ();

As I have mentioned before, some films can be seen in two or more lists. I have been saving the filename of each list with each match, so I can aggregate those into lists. For each item I will have a list of files in which I have seen this item.

foreach my $match (@matches) {
    if ($uniqs{ $match->{'url'} }) {
        $uniqs{ $match->{'url'} }{'files'} .= ', ' . $match->{'file'};
    }
    else {
        $uniqs{ $match->{'url'} }{'title'} = $match->{'title'};
        $uniqs{ $match->{'url'} }{'files'} = $match->{'file'};
    }
}
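One possible refinement, sketched below with invented data: if a page happened to link the same film twice, the string concatenation above would list that file twice. Keeping the files in a nested hash avoids this, because hash keys cannot repeat. This is a variation on the loop above, not the original script.

```perl
use strict;
use warnings;

# Invented matches; the first film is linked twice from the same file.
my @matches = (
    { url => '/title/tt0000001/', title => 'First Film',  file => 'action.html' },
    { url => '/title/tt0000001/', title => 'First Film',  file => 'action.html' },
    { url => '/title/tt0000001/', title => 'First Film',  file => '1990.html'   },
    { url => '/title/tt0000002/', title => 'Second Film', file => 'comedy.html' },
);

my %uniqs;
foreach my $match (@matches) {
    # Create the entry the first time a URL is seen, then record the file.
    my $entry = $uniqs{ $match->{url} } ||= { title => $match->{title}, files => {} };
    $entry->{files}{ $match->{file} } = 1;
}

# Turn each set of files back into a readable string.
foreach my $url (sort keys %uniqs) {
    my $files = join(', ', sort keys %{ $uniqs{$url}{files} });
    print "$uniqs{$url}{title} ($files)\n";
}
```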

So far so good. I now have a list of unique movies: film title, film URL, and the list of files (top lists) for each item. All I have to do now is generate my ultimate wishlist. Since I have URLs as part of my data, I think HTML would be the most appropriate format, as I can easily click on a film title and instantly get more information about it from IMDB. In essence, my wishlist is a simplified combination of IMDB top lists.

I could, of course, write my own HTML, but as I said, I am too lazy, so I will use the CGI module, which can generate the HTML for me.

use CGI;
my $q = new CGI;

With all these data available, I think I should put something nice at the top of my list. Everybody loves statistics. So I will add the total number of items in my list and the list of files that I got these films from.

my $wishlist_title = 'Ultimate movie wishlist';
# Put some header in HTML file
print $q->start_html($wishlist_title);
print $q->h1($wishlist_title);
print $q->p('Total number of movies: ' . scalar(keys (%uniqs)));
print $q->p('Processed files: '. join(', ', @ARGV));

Now I will print my list of films. I should probably sort it alphabetically by title. Each title should be linked to the appropriate IMDB page. The list of files should be printed next to each title. As a last cosmetic touch, I should remove the .html extensions from the file names.

foreach my $url (sort { $uniqs{$a}{'title'} cmp $uniqs{$b}{'title'} } keys %uniqs) {
    $uniqs{$url}{'files'} =~ s/\.html//ig;
    print $q->a({href=>$url},$uniqs{$url}{'title'}) . ' (' . $uniqs{$url}{'files'} . ')' . $q->br . "\n";
}
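A plain `cmp` compares character codes, so titles starting with a lower-case letter sort after all capitalized ones, and every title beginning with “The” ends up under T. If that matters, a comparison key can be derived from the title first. The sketch below is one possible way to do it; the `sort_key` helper and the sample titles are made up for illustration.

```perl
use strict;
use warnings;

# Invented sample data in the same shape as the script's %uniqs.
my %uniqs = (
    '/title/tt0000001/' => { title => 'The Zebra' },
    '/title/tt0000002/' => { title => 'apple Pie' },
    '/title/tt0000003/' => { title => 'Banana' },
);

# Build a comparison key: lower-case, without a leading "The ".
sub sort_key {
    my ($title) = @_;
    (my $key = lc $title) =~ s/^the\s+//;
    return $key;
}

my @ordered = sort { sort_key($uniqs{$a}{title}) cmp sort_key($uniqs{$b}{title}) } keys %uniqs;
print "$uniqs{$_}{title}\n" for @ordered;
```

The displayed titles are left untouched; only the order changes.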

Done. All I have to do is tell the CGI module to close the HTML document and be off.

print $q->end_html();

It’s time to generate the list. Here is the command line:

[me@here dir]$ movie_wishlist.pl *.html > wishlist.html

wishlist.html is exactly what I wanted: a sorted, standalone list of films with high ratings from the genres and decades that I am interested in. All titles link to the appropriate IMDB pages for more information.

Room for improvement

If you like this idea and want to extend it, here are a few tips.

Optimization. If you are planning to run this script more than once, you should probably rewrite the main two loops into one loop that would do everything. The changes are minimal, but can save you some CPU time and memory.
Automation. I needed to run the script only once. That’s why I simply saved the top lists into HTML files by hand and worked with local copies. If you are planning to run this script more than once, you should probably add a tiny piece of code that would get fresh top lists from IMDB. Caching can also be implemented easily.
Cosmetics. Since the resulting list of movies can be rather large, it would be difficult to spot the differences from older results. Either some form of HTML highlighting could be used, or a command line argument that would force the script to show just the differences.
Flexibility. Currently it is impossible to tell the script which genres or decades you are interested in. It goes through all HTML files in the directory and produces the result from all matches. A couple of command line arguments could make the script more attentive to your preferences and likings.
Customization. You can modify the script to provide more information about the films in the results. Ratings, casts, directors, etc. – all can be part of the generated wishlist. IMDB::Film can help you with that.
Ultimate customization coolness. The resulting list of films probably includes quite a few movies that you have already seen. If you have a list of seen movies – such as the one I can get from the list of movie reviews on this site – you can tailor the script to take that list into account and leave out the films that you have already seen.
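As an illustration of the last tip, excluding seen films comes down to a hash lookup. The sketch below assumes the seen films are available as a list of URLs; how that list is obtained is left open, and all the data shown is invented.

```perl
use strict;
use warnings;

# Invented wishlist, keyed by URL as in the script.
my %uniqs = (
    '/title/tt0000001/' => { title => 'First Film' },
    '/title/tt0000002/' => { title => 'Second Film' },
    '/title/tt0000003/' => { title => 'Third Film' },
);

# Invented list of films already seen, one URL per entry. In practice it
# could be read from a text file or collected from a page of reviews.
my @seen_urls = ('/title/tt0000002/');

# Build a lookup hash, then keep only the films that are not in it.
my %seen = map { $_ => 1 } @seen_urls;
my @todo = grep { !$seen{$_} } sort keys %uniqs;

print "$uniqs{$_}{title}\n" for @todo;
```

The same lookup, run against the URLs from a previous wishlist, would give the “show just the differences” behaviour mentioned under Cosmetics.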
