If you have a list of URLs that are supposed to contain backlinks to your site, and you want complete statistics on them (how many of the pages still exist, how many contain a backlink, how many of those links are dofollow or nofollow, and so on), this small Perl script does just that.
Input
A file containing a single URL on each line. Ideally it contains no empty lines or partial URLs; however, the script drops any line that does not contain a slash, so empty lines are skipped.
The URL of the website you want to check for. Every URL in the input file is expected to contain a backlink to this site.
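To set these two options, define them near the top of the script:
my $url_to_check="http://www.indiacustomercare.com";
my $file_containing_backlinks= "<m:/anurag/tmp/197B.txt";
The leading < in the file name is the read mode for Perl's open, not part of the path.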
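As a rough illustration of how the input list is cleaned before any page is fetched, the sketch below trims each line, drops lines that contain no slash, and removes duplicates. The helper name clean_links and the example.com URLs are made up for this example; the script itself performs these steps inline and does not preserve the original order of the URLs, while this sketch does.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): trim each line,
# keep only lines that contain a "/", and drop duplicates while keeping
# the order in which the URLs first appeared.
sub clean_links {
    my @lines = @_;                 # work on a copy of the input
    my (%seen, @unique);
    foreach my $line (@lines) {
        $line =~ s/^\s+//;          # strip leading whitespace
        $line =~ s/\s+$//;          # strip trailing whitespace and newline
        next unless $line =~ m{/};  # no slash: empty line or not a URL
        push @unique, $line unless $seen{$line}++;
    }
    return @unique;
}

# Made-up sample input standing in for the lines of the backlinks file.
my @raw = (
    "http://example.com/page-one\n",
    "\n",
    "  http://example.com/page-two  \n",
    "http://example.com/page-one\n",
);
my @cleaned = clean_links(@raw);
print "Out of ", scalar(@raw), " lines, ", scalar(@cleaned), " unique links\n";
print "$_\n" for @cleaned;
```

With the four sample lines above, two unique links remain: the empty line is dropped and the repeated URL is counted once.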
Output
A sample shortened output looks like:
Out of 13792 links total unique=13752 and total duplicate=40
Printing STATS
Total Existing web pages, but containing no link to your site => 5
Total existing Web Page URLs which you purchased(which may/may not contain backlinks to your site) => 9
Total found backlinks to your site(multiple backlinks to your site in the same web page are counted as 1) => 4
Total web pages urls which you had purchased and exist no more(may have disappeared later) => 4
PRINTING VERBOSE REPORT
--------------------------------------------------
Existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)
--------------------------------------------------
1) http://dinnerwareforsale.net/blue-floral
2) http://nike-adidas-puma.com/how-to-constructively-criticize-a-friends-fashion-sense
3) http://blog.avlmoving.com/2010/07/packing-for-the-road
4) http://64kbyte.homeip.net/wordpress/?p=74
5) http://livingwell.wellness-studio1.com/2011/03/11/watch-watch-me-tone-week-215
6) http://iphone-case.org/iphone_news/apple/iphone-4-dissed-by-consumer-reports
7) http://iphone-case.org/iphone_news/new/%E2%98%BAnew-limera1n-jailbreak-4-2-1-untethered-iphone-4-3gs-ipod-touch-4g-3g-ipad%E2%98%BA
8) http://www.watchmenfansite.com/door-speakers
9) http://www.a-ressaca.com/?p=663
--------------------------------------------------
List of existing web pages which you purchased, but not containing ANY backlinks to your site
--------------------------------------------------
1) http://64kbyte.homeip.net/wordpress/?p=74
2) http://blog.avlmoving.com/2010/07/packing-for-the-road
3) http://livingwell.wellness-studio1.com/2011/03/11/watch-watch-me-tone-week-215
4) http://nike-adidas-puma.com/how-to-constructively-criticize-a-friends-fashion-sense
5) http://www.watchmenfansite.com/door-speakers
--------------------------------------------------
List of existing web pages which you purchased, containing atleast one backlink to your site(with nofollow/dofollow)
--------------------------------------------------
1) http://dinnerwareforsale.net/blue-floral
2) http://iphone-case.org/iphone_news/apple/iphone-4-dissed-by-consumer-reports
3) http://iphone-case.org/iphone_news/new/%E2%98%BAnew-limera1n-jailbreak-4-2-1-untethered-iphone-4-3gs-ipod-touch-4g-3g-ipad%E2%98%BA
4) http://www.a-ressaca.com/?p=663
--------------------------------------------------
Report of backlinks to your site from an existing web page link which you purchased
--------------------------------------------------
1) http://64kbyte.homeip.net/wordpress/?p=74 valid links=0 nofollow=0 dofollow=0
2) http://blog.avlmoving.com/2010/07/packing-for-the-road valid links=0 nofollow=0 dofollow=0
3) http://dinnerwareforsale.net/blue-floral valid links=1 nofollow=1 dofollow=0
4) http://iphone-case.org/iphone_news/apple/iphone-4-dissed-by-consumer-reports valid links=1 nofollow=1 dofollow=0
5) http://iphone-case.org/iphone_news/new/%E2%98%BAnew-limera1n-jailbreak-4-2-1-untethered-iphone-4-3gs-ipod-touch-4g-3g-ipad%E2%98%BA valid links=1 nofollow=1 dofollow=0
6) http://livingwell.wellness-studio1.com/2011/03/11/watch-watch-me-tone-week-215 valid links=0 nofollow=0 dofollow=0
7) http://nike-adidas-puma.com/how-to-constructively-criticize-a-friends-fashion-sense valid links=0 nofollow=0 dofollow=0
8) http://www.a-ressaca.com/?p=663 valid links=1 nofollow=1 dofollow=0
9) http://www.watchmenfansite.com/door-speakers valid links=0 nofollow=0 dofollow=0
--------------------------------------------------
Those web pages urls which you had purchased and exist no more(with error codes)
--------------------------------------------------
1) [500] -> http://www.luvrulz.com/perfume-3-4
2) [500] -> http://www.onthehookfishing.com/fishing-nets
3) [500] -> http://www.onthehookfishing.com/making-fishing-lures-spinners
4) [500] -> http://www.shopping-servant.com/index.php/2011/03/my-first-colour-shape-snap
How to Run
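Save this script as myscript.pl and run it like this:
perl myscript.pl > outfile.txt
The report goes to outfile.txt. While it runs, the script prints progress to STDERR, so it still appears on the terminal: the sequence number of the current URL out of the total, and the estimated time left in minutes: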
processing 1/13752 time left in min=0
processing 2/13752 time left in min=3437.5
processing 3/13752 time left in min=2367.88333333333
processing 4/13752 time left in min=1833.06666666667
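The estimate in these lines is a simple average: the seconds elapsed so far, divided by the link counter, multiplied by the number of links still to go. The sketch below isolates that arithmetic in a hypothetical function, minutes_left (the script computes it inline). The 30 seconds passed in is an assumed value, picked because it reproduces the second sample line.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the script's inline formula.
# $elapsed : seconds since the run started
# $index   : zero-based position of the link about to be processed
# $total   : number of unique links
sub minutes_left {
    my ($elapsed, $index, $total) = @_;
    my $seconds = $elapsed / ($index + 1) * ($total - ($index + 1));
    return $seconds / 60;
}

# Assumed: 30 seconds have elapsed when the second of 13752 links starts.
print "time left in min=", minutes_left(30, 1, 13752), "\n";   # 3437.5
```

Because the formula divides by the counter plus one rather than by the number of links already finished, the first few estimates run low; the difference shrinks as more links are processed.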
The Perl Script
#!/usr/bin/perl -w
# Checks a list of web page URLs for backlinks to one site and reports
# how many pages still exist and how many links are nofollow/dofollow.
use strict;
#@Author Anurag Gupta
#@License GNU GPL License
use LWP::UserAgent;
use HTML::Element;
use HTML::TreeBuilder;
# Remove leading and trailing whitespace from a string.
sub trim($) {
    my $string = shift;
    $string =~ s/^\s+//;
    $string =~ s/\s+$//;
    return $string;
}
my %links; #map of each link to array(no of valid links, no of nofollow)
my $url_to_check="http://www.indiacustomercare.com";
my $file_containing_backlinks= "<m:/anurag/tmp/197B.txt";
$url_to_check=~s{/$}{}; #strip the last /
$url_to_check=lc($url_to_check); #change to lower case
my @links;
{
    my $FP;
    open $FP, $file_containing_backlinks or die "Failed to open the input links file: $!";
    #open $FP, "<f:/tmp/links.txt";
    (@links)=(<$FP>);
    close $FP;
}
map { chomp; $_=trim($_); } @links;
@links=grep {m{/};} @links;
# Remove duplicate URLs by using them as hash keys.
my @unique= keys %{{map {$_=>1} @links}};
print "Out of ",scalar(@links)," links total unique=",scalar(@unique)," and total duplicate=",$#links-$#unique,"\n";
@links = @unique;
my $i=0;
my %logdata;
my %longlogdata; #The data containing lot of information
my @nonexistingpages;
my $starttime=time;
foreach my $link (@links) {
    my $currtime=time;
    # Estimate: average seconds per link so far times the links remaining.
    my $timeleft= (($currtime-$starttime)/($i+1)*(scalar(@links)-($i+1)));
    #last if($i==1000);
    ++$i;
    print STDERR "processing $i/",scalar(@links)," time left in min=",$timeleft/60,"\n";
    my $agent = LWP::UserAgent->new(env_proxy => 1,keep_alive => 1, timeout => 30, agent => "Mozilla/4.76 [en] (Win98; U)");
    # A plain GET request is enough; no extra headers are passed.
    my $request = HTTP::Request->new(GET => $link);
    my $response = $agent->request($request);
    if ($response->is_success){
        my $content = $response->decoded_content;
        my $root = HTML::TreeBuilder->new_from_content($content);
        $root->warn(1);
        my @info=(0,0); # (valid links, nofollow links)
        traversehtml($root,\@info);
        $root->delete; # free the parse tree before fetching the next page
        $links{$link}=\@info;
        push @{$longlogdata{"Existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)"}},$link;
    } else {
        #print $response->code."\n";
        push @{$longlogdata{"Those web pages urls which you had purchased and exist no more(with error codes)"}},"[".$response->code."] -> $link\n";
        ++$logdata{"Total web pages urls which you had purchased and exist no more(may have disappeared later)"};
    }
}
if( ! exists $longlogdata{"Existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)"} ) {
    die "No existing pages were found to be accessible!";
}
foreach my $link ( sort @{$longlogdata{"Existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)"}}) {
    my @info = @{$links{$link}};
    my $dofollow = $info[0]-$info[1]; # valid links that are not nofollow
    push @{$longlogdata{"Report of backlinks to your site from an existing web page link which you purchased"}}, "$link valid links=$info[0] nofollow=$info[1] dofollow=$dofollow\n";
    ++$logdata{"Total existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)"};
    if($info[0]>0) {
        ++$logdata{"Total found backlinks to your site(multiple backlinks to your site in the same web page are counted as 1)"};
        push @{$longlogdata{"List of existing web pages which you purchased, containing atleast one backlink to your site(with nofollow/dofollow)"}},$link;
    } else {
        ++$logdata{"Total Existing web pages, but containing no link to your site"};
        push @{$longlogdata{"List of existing web pages which you purchased, but not containing ANY backlinks to your site"}},$link;
    }
    if($dofollow>0) {
        ++$logdata{"Total dofollow links to your site(multiple dofollow backlinks to your site in the same web page are counted as 1)"};
        push @{$longlogdata{"List of existing web pages which you purchased, containing DOFOLLOW links to your site"}},$link;
    }
}
print "Printing STATS\n";
map { print "$_ => $logdata{$_}\n\n" } sort keys %logdata;
print "\n\nPRINTING VERBOSE REPORT\n";
foreach my $key (sort keys %longlogdata) {
    print "\n\n\n","-" x 50,"\n";
    print "$key\n";
    print "-" x 50,"\n";
    my $i=0;
    foreach my $item (@{$longlogdata{$key}}) {
        ++$i;
        print "\t$i) $item\n";
    }
}
# Walk the HTML tree and count links to $url_to_check.
# $info->[0] counts all such links, $info->[1] counts the nofollow ones.
sub traversehtml {
    my $node=$_[0];
    my $info=$_[1];
    if(ref(\$node) eq "SCALAR") {
        return; # plain text node, nothing to inspect
    } elsif ( ($node->tag() eq "a")) {
        my $href=$node->attr('href');
        # \Q...\E keeps the dots in the URL from acting as regex wildcards
        if( $href and $href=~m{^\Q$url_to_check\E}i) #defined
        {
            ++$info->[0];
            my $rel=$node->attr("rel");
            if($rel and (lc($rel) =~m{\bnofollow\b})) {
                ++$info->[1];
            }
        }
    }
    my @h = $node->content_list();
    foreach my $item (@h) {
        if(ref(\$item) ne "SCALAR") { traversehtml($item,$info); } #skip scalar items
    }
}
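The decision made for each anchor tag during the tree walk can also be looked at on its own. The sketch below is a simplified stand-in, not the script's code: classify_link is a hypothetical function that takes an href, a rel value and the site URL as plain strings, so it runs without LWP or HTML::TreeBuilder. The example.org address and the attribute values are invented.

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether one <a> tag counts as a backlink.
# Returns 'none', 'nofollow' or 'dofollow'.
sub classify_link {
    my ($href, $rel, $site) = @_;
    return 'none' unless defined $href && length $href;
    # \Q...\E makes the dots in the site URL match literally.
    return 'none' unless $href =~ m{^\Q$site\E}i;
    return 'nofollow' if defined $rel && lc($rel) =~ m{\bnofollow\b};
    return 'dofollow';
}

my $site = "http://www.indiacustomercare.com";
# Invented (href, rel) pairs standing in for the anchors found on one page.
my @anchors = (
    [ "http://www.indiacustomercare.com/contact", "nofollow" ],
    [ "http://www.indiacustomercare.com/",        undef      ],
    [ "http://example.org/elsewhere",             "nofollow" ],
);
my %count = (none => 0, nofollow => 0, dofollow => 0);
$count{ classify_link(@$_, $site) }++ for @anchors;
print "valid links=", $count{nofollow} + $count{dofollow},
      " nofollow=$count{nofollow} dofollow=$count{dofollow}\n";
```

For the three invented anchors this prints valid links=2 nofollow=1 dofollow=1, the same shape as one line of the backlink report.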