Sunday, September 30, 2012

Extracting URLs from Google search result pages in HTML (Perl script)

If you run a site: query in Google and want to extract that site's URLs from the saved HTML source of the search engine results pages (SERPs), here is a simple Perl script that does just that:

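use strict;
use warnings;

my $file = "f:/tmp/v1.htm";
# Prefix the wanted URLs start with. It is left empty here; while it is empty,
# the pattern below matches every run of characters between double quotes.
my $url_starts_with = "";

# Slurp the whole saved results page into one string.
my $content;
{
    local $/;    # undefine the record separator only inside this block
    open my $fp, '<', $file or die "Can't open $file for reading: $!";
    $content = <$fp>;
    close $fp;
}

# Store each match as a hash key so that duplicates collapse.
my %map;
while ($content =~ m#(\Q$url_starts_with\E[^"]+)#gsi) {
    $map{$1} = 1;
}

print "$_\n" for sort keys %map;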
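
To try the matching logic without a saved page, the same idea can be wrapped in a subroutine and run against an inline string. This is only a sketch: the name extract_urls, the sample markup and the example.com addresses are made up for illustration and are not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the sorted,
# de-duplicated strings in $html that start with $prefix and run up to
# the next double quote.
sub extract_urls {
    my ($html, $prefix) = @_;
    my %seen;
    # \Q...\E escapes regex metacharacters in the prefix, such as the dots
    # in a host name.
    while ($html =~ m#(\Q$prefix\E[^"]+)#gi) {
        $seen{$1} = 1;
    }
    return sort keys %seen;
}

# Made-up markup standing in for a saved results page.
my $sample = '<a href="http://example.com/b">B</a>'
           . '<a href="http://example.com/a">A</a>'
           . '<a href="http://example.com/b">B again</a>'
           . '<a href="http://other.example.org/x">X</a>';

print "$_\n" for extract_urls($sample, "http://example.com/");
```

With this sample the repeated link is printed once, and the link on the other host is skipped because it does not start with the given prefix.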