{"id":9587,"date":"2005-10-19T05:06:26","date_gmt":"2005-10-19T02:06:26","guid":{"rendered":"https:\/\/mamchenkov.net\/wordpress\/?p=9587"},"modified":"2005-10-20T07:02:05","modified_gmt":"2005-10-20T04:02:05","slug":"generating-ultimate-movie-wishlist-with-perl-and-imdb","status":"publish","type":"post","link":"https:\/\/mamchenkov.net\/wordpress\/2005\/10\/19\/generating-ultimate-movie-wishlist-with-perl-and-imdb\/","title":{"rendered":"Generating ultimate movie wishlist with Perl and IMDB"},"content":{"rendered":"<!-- google_ad_section_start -->\n<p>One of the things that will go down in history with the year 2005 is the number of bad movies produced by Hollywood.  <a href=\"http:\/\/www.imdb.com\/features\/15thanniversary\/2005\">IMDB says<\/a>:<\/p>\n<blockquote cite=\"http:\/\/www.imdb.com\/features\/15thanniversary\/2005\"><p>Hollywood is mired in its biggest box-office slump in over 20 years.<\/p><\/blockquote>\n<p>With all those bad movies around, finding something worth the time and effort becomes increasingly difficult.  Luckily, there are two things that can make our lives easier &#8211; <a href=\"http:\/\/www.imdb.com\">IMDB<\/a> and <a href=\"http:\/\/www.perl.org\/\">Perl<\/a>.<\/p>\n<p>By combining the two, all sorts of interesting things can be achieved.  In particular, an ultimate movie wishlist can be generated.<\/p>\n<p>If you are here just for the script, then here is the <a href='\/wordpress\/wp-content\/movie_wishlist.perl' title='movie_wishlist.pl'>movie_wishlist.pl<\/a>.  If you want just the result, then here is <a href='\/wordpress\/wp-content\/wishlist.html' title='wishlist.html'>wishlist.html<\/a>. Otherwise, read on for an explanation of how it works and how you can make it better.<\/p>\n<p><!--more--><\/p>\n<p>I&#8217;ll start from the very beginning.<\/p>\n<h4>Writing the script<\/h4>\n<p>Trying to figure out where to get good movies, I asked Martin &#8211; a DVD club owner &#8211; to bring some.  
Our tastes in movies differ a lot, so it was smart of him to ask for a specific list of films that I wanted to see.  I already have a movie <a href=\"\/wordpress\/2004\/06\/28\/movies-to-see\/\">wishlist<\/a>, but most of the stuff was put there because of some vague references that I don&#8217;t even remember.  I needed something better.<\/p>\n<p>IMDB has a number of lists with <a href=\"http:\/\/www.imdb.com\/Top\/\">Top Movies<\/a>.  All I had to do was select those lists that I liked and import them into a single list that I would give to Martin.  Which lists did I select?  Not all of them, that&#8217;s for sure.  I wasn&#8217;t interested in box office.  I wasn&#8217;t interested in video.  I wanted the ones with the best ratings.  There is the <a href=\"http:\/\/www.imdb.com\/top_250_films\">Top 250 Movies of All-Time<\/a>, which is a good start.  Additionally, I chose a few lists by genre &#8211; <a href=\"http:\/\/www.imdb.com\/Charts\/Votes\/action\">Action<\/a>, <a href=\"http:\/\/www.imdb.com\/Charts\/Votes\/comedy\">Comedy<\/a>, and a few others.  Then I added a few lists by decade &#8211; <a href=\"http:\/\/www.imdb.com\/chart\/2000s\">2000<\/a> and <a href=\"http:\/\/www.imdb.com\/Charts\/Votes\/1990\">1990<\/a>.<\/p>\n<p>Now, what do I have to do to combine them all nicely in one list?  I decided to download the lists, so that I could inspect them more closely.  Nothing fancy here &#8211; a simple <code>Ctrl+S<\/code> to &#8216;Save As&#8217; did the job.<\/p>\n<p>After closer examination, two things became obvious:<\/p>\n<ol>\n<li><strong>Dups.<\/strong>  Some movies appeared in more than one list.  That means I would have to somehow remove dups, leaving a single copy in the list.<\/li>\n<li><strong>Parsing.<\/strong> The structure of the HTML was complex.  
Separating the list from the rest of the page could be a problem.<\/li>\n<\/ol>\n<p>Solving the first problem &#8211; removing dups &#8211; would be much easier if I had simple lists to deal with, and not complex HTML pages.  So I decided to start by looking for a solution to the second problem &#8211; parsing.<\/p>\n<p>As with any other task in Perl &#8211; TIMTOWTDI &#8211; There Is More Than One Way To Do It.  Two possible ways to solve the problem were:<\/p>\n<ol>\n<li>Write some regular expressions to clean up the HTML and get the list of films.<\/li>\n<li>Check out <a href=\"http:\/\/www.cpan.org\">CPAN<\/a> for some helper module.<\/li>\n<\/ol>\n<p>Since I am a lazy guy and not a big fan of writing HTML-parsing regular expressions, I chose the second way.<\/p>\n<p>Let&#8217;s see what CPAN has to offer.  The <a href=\"http:\/\/search.cpan.org\/search?query=html&#038;mode=all\">list of results<\/a> for the query &#8220;html&#8221; at <a href=\"http:\/\/search.cpan.org\">http:\/\/search.cpan.org<\/a> was all that I needed.  The third match was <a href=\"http:\/\/search.cpan.org\/search?query=HTML%3A%3ALinkExtractor&#038;mode=module\">HTML::LinkExtractor<\/a> &#8211; exactly the right tool.  You see, items in all those IMDB top movie lists look like this:<\/p>\n<ul>\n<li>Position in the list<\/li>\n<li>Rating<\/li>\n<li>Title of the film with the link to the film page<\/li>\n<li>Year the film was produced (if available)<\/li>\n<li>Number of IMDB users who voted on the film&#8217;s rating<\/li>\n<\/ul>\n<p>If I could just extract the link to the film page and the film&#8217;s title, I would have all I need.  Hence, HTML::LinkExtractor.<\/p>\n<p>I started with the code:<\/p>\n<pre>\r\n#!\/usr\/bin\/perl -w\r\nuse strict;\r\nuse HTML::LinkExtractor;\r\n<\/pre>\n<p>Since I had all the HTML files saved on disk, I could just give them as the list of arguments to my script and it would then process all of them, some of them, or just one of them.  
If no files are specified, the script should just exit without doing anything.  Maybe a small complaint or something&#8230;<\/p>\n<pre>\r\n# Get filenames from the command line\r\nmy @files = @ARGV;\r\ndie 'No files given' unless (@files);\r\n<\/pre>\n<p>Now, I was about to get a number of movies and sort out dups.  Again, there are a few ways to do it.  Two that came to my mind are:<\/p>\n<ol>\n<li>Have the script run two loops.  The first loop would go through all the files one by one, extract film links and titles and save them in some array.  The second loop would then go through the array removing dups.<\/li>\n<li>Have the script run a single loop which would go over all files one by one, extract film links and titles, check if those are already in the list, and if not, then add them.<\/li>\n<\/ol>\n<p>Writing the script mostly for one-time use and trying to keep it as simple as possible, I decided to go with two loops. That would be more processing for the machine, but simpler for me to write.  If the script survived for more than one day, I would know where to optimize it then.<\/p>\n<p>So, I need an array for all my movie matches.<\/p>\n<pre>\r\nmy @matches = ();\r\n<\/pre>\n<p>Now I want the script to go through the files one by one.  In order to avoid weird stuff like choking on line breaks and processing empty lines, I decided to read the whole file into a string variable.  
That is, read a file line by line, and then combine all lines into one really long line.<\/p>\n<pre>\r\nforeach my $file (@files) {\r\n    my $data = '';\r\n    open (FILE, \"&lt;$file\") or die \"Couldn't read from $file : $!\\n\";\r\n    # Join all lines into one\r\n    while (my $line = &lt;FILE&gt;) {\r\n        chomp($line);\r\n        $data .= $line;\r\n    }\r\n    close(FILE);\r\n<\/pre>\n<p>Note that the above piece of code is part of a bigger snippet and thus won&#8217;t work on its own.<\/p>\n<p>Now that I have the file as one very long line, I want HTML::LinkExtractor to get me all those links to film pages.  Wait a second!  Not all of them!  All genre and decade lists have 50 top movies AND 10 bottom movies.  Oh, no.  I don&#8217;t need those.  Please remove them.<\/p>\n<p>With a short regular expression, I chop off the part with bottom movies from that long string of mine.<\/p>\n<pre>\r\n    $data =~ s\/Bottom\\s+Rated\\s+.*$\/\/g;\r\n<\/pre>\n<p>The rest of the really long string can be parsed with HTML::LinkExtractor.<\/p>\n<pre>\r\n    my $lx = new HTML::LinkExtractor();\r\n    $lx->parse(\\$data);\r\n<\/pre>\n<p>HTML::LinkExtractor parses HTML code and extracts a list of links.  Since my HTML pages aren&#8217;t very clean, they can, and as a matter of fact do, contain other links too.  Those are links to other IMDB and third-party pages that I don&#8217;t need at the moment.  How do I tell the right links from the wrong ones?  Easy.  URLs to film pages look like this: <code>http:\/\/www.imdb.com\/title\/tt0057012\/<\/code>.  <code>http:\/\/www.imdb.com<\/code> is the address of IMDB, then comes <code>\/title\/<\/code>, then <code>tt<\/code> and then the numeric ID of the film.  
So if I just look for URLs that contain <code>\/title\/tt<\/code> I should be on the right track.<\/p>\n<pre>\r\n    foreach my $item (@{ $lx->links }) {\r\n        if ($item->{'href'} && $item->{'href'} =~ m#title\/tt#) {\r\n            my $text = $item->{'_TEXT'};\r\n            my $url  = $item->{'href'};\r\n<\/pre>\n<p>HTML::LinkExtractor is nice enough to provide me with both a clean URL and the HTML link code that it was parsing.  From that HTML link code I can get the link text, which in my case would be the movie title.  Here is how I do it:<\/p>\n<pre>\r\n            $text =~ s#&lt;a.*?&gt;(.*?)&lt;\/a&gt;#$1#;\r\n<\/pre>\n<p>Let&#8217;s see what information I have so far:<\/p>\n<ul>\n<li><strong>URL<\/strong> &#8211; link to the film&#8217;s page at IMDB.  I think I should save it, just in case I want to get more information about the film later on.<\/li>\n<li><strong>Film title<\/strong> &#8211; that&#8217;s the most important bit in my whole script.<\/li>\n<li><strong>Name of the file I found the above two in<\/strong> &#8211; if the list of movies comes out long, it would be nice to see where all the stuff came from.  So I&#8217;ll save this one too.  Especially because I used descriptive file names like <code>action.html<\/code> when I was saving the lists at the beginning of the process.<\/li>\n<\/ul>\n<p>Because I decided to have two passes, I don&#8217;t have to deal with dups now and can just save this data into my global array.<\/p>\n<pre>\r\n            my %match = ();\r\n            # Save URL, title, and the file we found it in\r\n            $match{'url'} = $url;\r\n            $match{'title'} = $text;\r\n            $match{'file'} = $file;\r\n            push @matches, \\%match;\r\n        }\r\n    }\r\n}\r\n<\/pre>\n<p>Good!  I am halfway through.  I have extracted all the links with titles from all the files and saved it all into an appropriate data structure.  
Let&#8217;s remove dups now.<\/p>\n<pre>\r\nmy %uniqs = ();\r\n<\/pre>\n<p>As I have mentioned before, some films can be seen in two or more lists.  I have been saving the filename of each list with each match, so I can aggregate those into lists.  For each item I will have a list of files in which I have seen this item.<\/p>\n<pre>\r\nforeach my $match (@matches) {\r\n    if ($uniqs{ $match->{'url'} }) {\r\n        # Film already seen: append this file to its list of files\r\n        $uniqs{ $match->{'url'} }{'files'} .= ', ' . $match->{'file'};\r\n    }\r\n    else {\r\n        # First time we see this film: remember its title and file\r\n        $uniqs{ $match->{'url'} }{'title'} = $match->{'title'};\r\n        $uniqs{ $match->{'url'} }{'files'} = $match->{'file'};\r\n    }\r\n}\r\n<\/pre>\n<p>So far so good.  I have a list of unique movies now: film title, film URL, and the list of files (top lists) for each item.  All I have to do now is generate my ultimate wishlist.  Since I have URLs as part of my data, I think HTML would be the most appropriate format, as I can easily click on a film title and instantly get more information about it from IMDB.  In essence, my wishlist is a simplified combination of IMDB top lists.<\/p>\n<p>I could, of course, write my own HTML, but as I said, I am too lazy and so I will use the CGI module, which can generate HTML for me perfectly well.<\/p>\n<pre>\r\nuse CGI;\r\nmy $q = new CGI;\r\n<\/pre>\n<p>With all this data available, I think I should put something nice at the top of my list.  Everybody loves statistics.  So I will add the total number of items in my list and the list of files that I got these films from.<\/p>\n<pre>\r\nmy $wishlist_title = 'Ultimate movie wishlist';\r\n# Put some header in HTML file\r\nprint $q->start_html($wishlist_title);\r\nprint $q->h1($wishlist_title);\r\nprint $q->p('Total number of movies: ' . scalar(keys (%uniqs)));\r\nprint $q->p('Processed files: '. join(', ', @ARGV));\r\n<\/pre>\n<p>Now I will print my list of films.  I should probably sort it alphabetically by title.  Each title should be linked to the appropriate IMDB page.  
The list of files should be printed next to each title.  As a last cosmetic touch, I should remove the <code>.html<\/code> extensions from these file names.<\/p>\n<pre>\r\nforeach my $url (sort { $uniqs{$a}{'title'} cmp $uniqs{$b}{'title'} } keys %uniqs) {\r\n    $uniqs{$url}{'files'} =~ s\/\\.html\/\/ig;\r\n    print $q->a({href=>$url},$uniqs{$url}{'title'}) . ' (' . $uniqs{$url}{'files'} . ')' . $q->br . \"\\n\";\r\n}\r\n<\/pre>\n<p>Done.  All I have to do is tell the CGI module to close that HTML deal and be off.<\/p>\n<pre>\r\nprint $q->end_html();\r\n<\/pre>\n<p>It&#8217;s time to generate the list.  Here is the command line:<\/p>\n<pre>[me@here dir]$ movie_wishlist.perl *.html > wishlist.html<\/pre>\n<p><a href='\/wordpress\/wp-content\/wishlist.html' title='wishlist.html'>wishlist.html<\/a> is exactly what I wanted: a sorted, standalone list of films with high ratings from genres and decades that I am interested in.  All titles link to the appropriate IMDB pages for more information.<\/p>\n<h4>Room for improvement<\/h4>\n<p>If you like this idea and want to extend it, here are a few tips.<\/p>\n<ul>\n<li><strong>Optimization.<\/strong>  If you are planning to run this script more than once, you should probably rewrite the two main loops into one loop that would do everything.  The changes are minimal, but can save you some CPU time and memory.<\/li>\n<li><strong>Automation.<\/strong> I needed to run the script only once.  That&#8217;s why I simply saved the top lists into HTML files by hand and worked with local copies.  If you are planning to run this script more than once, you should probably add a tiny piece of code that would get fresh top lists from IMDB.  Caching can also be implemented easily.<\/li>\n<li><strong>Cosmetics.<\/strong> Since the resulting list of movies can be rather large, it would be difficult to find the differences from older results.  
Either some form of HTML highlighting can be used, or a command line argument can be added that would force the script to show just the differences.<\/li>\n<li><strong>Flexibility.<\/strong> Currently it is impossible to tell the script which genres or decades you are interested in.  It goes through all the HTML files it is given and produces the result from all matches.  A couple of command line arguments could make the script more attentive to your preferences and likings.<\/li>\n<li><strong>Customization.<\/strong> You can modify the script to provide more information about films in the results.  Ratings, casts, directors, etc. can all be part of the generated wishlist.  <a href=\"http:\/\/search.cpan.org\/search?query=imdb%3A%3Afilm&#038;mode=module\">IMDB::Film<\/a> can help you with that.<\/li>\n<li><strong>Ultimate customization coolness.<\/strong> The resulting list of films probably includes quite a few movies that you have already seen.  If you have a list of seen movies &#8211; like the one I can get from the list of movie reviews on this site, for example &#8211; you can tailor the script to take that list into account and not include the films that you have already seen.<\/li>\n<\/ul>\n<!-- google_ad_section_end -->\n","protected":false},"excerpt":{"rendered":"<!-- google_ad_section_start -->\n<p>One of the things that will go down in history with the year 2005 is the number of bad movies produced by Hollywood. IMDB says: Hollywood is mired in its biggest box-office slump in over 20 years. With all those bad movies around, finding something worth the time and effort becomes increasingly difficult. 
Luckily, there are &hellip; <a href=\"https:\/\/mamchenkov.net\/wordpress\/2005\/10\/19\/generating-ultimate-movie-wishlist-with-perl-and-imdb\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Generating ultimate movie wishlist with Perl and IMDB<\/span><\/a><\/p>\n<!-- google_ad_section_end -->\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_links_to":"","_links_to_target":""},"categories":[1,24,18],"tags":[551,36,25],"keyring_services":[],"class_list":["post-9587","post","type-post","status-publish","format-standard","hentry","category-general","category-movies","category-programming","tag-imdb","tag-perl","tag-wishlist"],"aioseo_notices":[],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack-related-posts":[{"id":7892,"url":"https:\/\/mamchenkov.net\/wordpress\/2004\/09\/11\/limassol-cinema-schedule\/","url_meta":{"origin":9587,"position":0},"title":"Limassol cinema schedule","author":"Leonid Mamchenkov","date":"September 11, 2004","format":false,"excerpt":"Not only Michael Stepanov has written and CPANeda Perl module to work with IMDB website comfortably, but he has also created a webpage with schedule of all Limassol cinemas based on it, which he updates weekly. Updates are usually done on Friday. 
All movies are linked to their respective pages\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":9355,"url":"https:\/\/mamchenkov.net\/wordpress\/2005\/08\/19\/to-kill-a-mockingbird\/","url_meta":{"origin":9587,"position":1},"title":"To Kill a Mockingbird","author":"Leonid Mamchenkov","date":"August 19, 2005","format":false,"excerpt":"\"To Kill a Mockingbird\". This film was on my wishlist for as long as I had a wishlist. It seems that it was one of the most important movies of the last century as everyone and their brother are referring to it. Directed by: Robert Mulligan Genres: Drama Cast: Gregory\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":9641,"url":"https:\/\/mamchenkov.net\/wordpress\/2005\/11\/03\/donnie-brasco\/","url_meta":{"origin":9587,"position":2},"title":"Donnie Brasco","author":"Leonid Mamchenkov","date":"November 3, 2005","format":false,"excerpt":"Finding movies from my wishlist is getting harder and harder and I get to watch lots of crap. Sometimes though I get lucky. Like today, when I finally watched \"Donnie Brasco\". 
Directed by: Mike Newell Genres: Crime, Drama, Thriller Cast: Al Pacino, Johnny Depp, Michael Madsen, Bruno Kirby, James Russo,\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":9107,"url":"https:\/\/mamchenkov.net\/wordpress\/2005\/06\/18\/the-anderson-tapes\/","url_meta":{"origin":9587,"position":3},"title":"The Anderson Tapes","author":"Leonid Mamchenkov","date":"June 18, 2005","format":false,"excerpt":"I had enough time for a movie today so I watched \"The Anderson Tapes\". It sounded like a movie from my wishlist, but it wasn't. Directed by: Sidney Lumet Genres: Crime, Drama, Thriller Cast: Sean Connery, Dyan Cannon, Martin Balsam, Ralph Meeker, Alan King, Christopher Walken, Val Avery, Dick Anthony\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":8939,"url":"https:\/\/mamchenkov.net\/wordpress\/2005\/05\/07\/saved\/","url_meta":{"origin":9587,"position":4},"title":"Saved!","author":"Leonid Mamchenkov","date":"May 7, 2005","format":false,"excerpt":"I should watch at least one movie a month, no? I grabbed something from my wishlist today - \"Saved!\" and enjoyed it, although I had to hold a halfsleeping baby for the second part of the film. 
Directed by: Brian Dannelly Genres: Comedy, Drama Cast: Jena Malone, Mandy Moore, Macaulay\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":9305,"url":"https:\/\/mamchenkov.net\/wordpress\/2005\/08\/05\/ed-wood\/","url_meta":{"origin":9587,"position":5},"title":"Ed Wood","author":"Leonid Mamchenkov","date":"August 5, 2005","format":false,"excerpt":"\"Ed Wood\" has been the first movie on my wishlist for more than two years now. None of the local DVD rentals had it. I couldn't find it in the shops. And I haven't seen any trace of it online. That all was very surprising considering a high rating of\u2026","rel":"","context":"In &quot;All&quot;","block_context":{"text":"All","link":"https:\/\/mamchenkov.net\/wordpress\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_sharing_enabled":true,"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts\/9587","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/comments?post=9587"}],"version-history":[{"count":0,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/posts\/9587\/revisions"}],"wp:attachment":[{"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/media?parent=9587"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/categories?post=9587"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\
/wp\/v2\/tags?post=9587"},{"taxonomy":"keyring_services","embeddable":true,"href":"https:\/\/mamchenkov.net\/wordpress\/wp-json\/wp\/v2\/keyring_services?post=9587"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}