Tuesday, August 2, 2011

Copy Directory Structure using Perl utility

Do you have a directory and want to create a replica of the directory structure but without it's file contents? Then this is the utility that let you do it.

#!/usr/bin/perl -w
use strict;

use File::Find;
use File::Path;

my %files;

#CORRECT USAGE
#my $ROOT_DIR="F:/NOT TO BE BACKED UP";
#my $RELATIVE_PATH_UNDER_ROOT_DIR="INSTALL"; #That is I want the directory structure of $ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR created new at $NEW_ROOT/$RELATIVE_PATH_UNDER_ROOT_DIR
#my $NEW_ROOT="F:/Anurag";

my $ROOT_DIR="F:/STATIC/Anurag";
my $RELATIVE_PATH_UNDER_ROOT_DIR="personal"; #That is I want the directory structure of $ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR created new at $NEW_ROOT/$RELATIVE_PATH_UNDER_ROOT_DIR
my $NEW_ROOT="F:/Anurag";

sub mySub
{

my $dir=$File::Find::dir;

$dir=~s{^$ROOT_DIR}{$NEW_ROOT};

$files{$dir}++};

sub loadFiles {

find( \&mySub, "$ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR"); #custom subroutine find, parse $dir

}

die "$NEW_ROOT is not directory" if( ! -d $NEW_ROOT);

die "$ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR is not directory" if ( ! -d "$ROOT_DIR/$RELATIVE_PATH_UNDER_ROOT_DIR" );

loadFiles();

my $dirs = keys %files;
my $created=0;

foreach my $key (keys %files)
{
if( ! -d $key)
{

if( scalar( mkpath($key)) <= 0) #Add error checking here
{
print STDERR "Failed to create directory: $key!\n";
}
else
{
++$created;
}
}

}

#map { print "$_\n"; } sort keys %files;

print "Total dirs created=$created out of $dirs","\n";

Saturday, July 16, 2011

Backlink Checker - Check Your Existing Backlinks with Perl Script

If you have a list of url's which contains backlinks to your single site and you want to know the complete statistics of how many total backlinks exist, pages exist, total dofollow links, total nofollow links etc, then you can use this small perl script to do just that.

Input

A file containing single url on each line make sure that it contains no empty lines, partial url's etc. How this script takes care of empty lines which don't contain urls

Website url which you want to check for. All the url's in the input file are expected to contain backlink to your site mentioned in this.

To mention these two options define them here:

my $url_to_check="http://www.indiacustomercare.com";
my $file_containing_backlinks= "<m:/anurag/tmp/197B.txt";

Output

A sample shortened output looks like:

Out of 13792 links total unique=13752 and total duplicate=40 Printing STATS Total Existing web pages, but containing no link to your site => 5

Total existing Web Page URLs which you purchased(which may/may not contain backlinks to your site) => 9

Total found backlinks to your site(multiple backlinks to your site in the same web page are counted as 1) => 4

Total web pages urls which you had purchased and exist no more(may have disappeared later) => 4

PRINTING VERBOSE REPORT

-------------------------------------------------- Existing Web Page URLs which you purchased(which may/may not contain backlinks to your site) -------------------------------------------------- 1) http://dinnerwareforsale.net/blue-floral 2) http://nike-adidas-puma.com/how-to-constructively-criticize-a-friends-fashion-sense 3) http://blog.avlmoving.com/2010/07/packing-for-the-road 4) http://64kbyte.homeip.net/wordpress/?p=74 5) http://livingwell.wellness-studio1.com/2011/03/11/watch-watch-me-tone-week-215 6) http://iphone-case.org/iphone_news/apple/iphone-4-dissed-by-consumer-reports 7) http://iphone-case.org/iphone_news/new/%E2%98%BAnew-limera1n-jailbreak-4-2-1-untethered-iphone-4-3gs-ipod-touch-4g-3g-ipad%E2%98%BA 8) http://www.watchmenfansite.com/door-speakers 9) http://www.a-ressaca.com/?p=663

-------------------------------------------------- List of existing web pages which you purchased, but not containing ANY backlinks to your site -------------------------------------------------- 1) http://64kbyte.homeip.net/wordpress/?p=74 2) http://blog.avlmoving.com/2010/07/packing-for-the-road 3) http://livingwell.wellness-studio1.com/2011/03/11/watch-watch-me-tone-week-215 4) http://nike-adidas-puma.com/how-to-constructively-criticize-a-friends-fashion-sense 5) http://www.watchmenfansite.com/door-speakers

-------------------------------------------------- List of existing web pages which you purchased, containing atleast one backlink to your site(with nofollow/dofollow) -------------------------------------------------- 1) http://dinnerwareforsale.net/blue-floral 2) http://iphone-case.org/iphone_news/apple/iphone-4-dissed-by-consumer-reports 3) http://iphone-case.org/iphone_news/new/%E2%98%BAnew-limera1n-jailbreak-4-2-1-untethered-iphone-4-3gs-ipod-touch-4g-3g-ipad%E2%98%BA 4) http://www.a-ressaca.com/?p=663

-------------------------------------------------- Report of backlinks to your site from an existing web page link which you purchased -------------------------------------------------- 1) http://64kbyte.homeip.net/wordpress/?p=74 valid links=0 nofollow=0 dofollow=0

2) http://blog.avlmoving.com/2010/07/packing-for-the-road valid links=0 nofollow=0 dofollow=0

3) http://dinnerwareforsale.net/blue-floral valid links=1 nofollow=1 dofollow=0

4) http://iphone-case.org/iphone_news/apple/iphone-4-dissed-by-consumer-reports valid links=1 nofollow=1 dofollow=0

5) http://iphone-case.org/iphone_news/new/%E2%98%BAnew-limera1n-jailbreak-4-2-1-untethered-iphone-4-3gs-ipod-touch-4g-3g-ipad%E2%98%BA valid links=1 nofollow=1 dofollow=0

6) http://livingwell.wellness-studio1.com/2011/03/11/watch-watch-me-tone-week-215 valid links=0 nofollow=0 dofollow=0

7) http://nike-adidas-puma.com/how-to-constructively-criticize-a-friends-fashion-sense valid links=0 nofollow=0 dofollow=0

8) http://www.a-ressaca.com/?p=663 valid links=1 nofollow=1 dofollow=0

9) http://www.watchmenfansite.com/door-speakers valid links=0 nofollow=0 dofollow=0

-------------------------------------------------- Those web pages urls which you had purchased and exist no more(with error codes) -------------------------------------------------- 1) [500] -> http://www.luvrulz.com/perfume-3-4

2) [500] -> http://www.onthehookfishing.com/fishing-nets

3) [500] -> http://www.onthehookfishing.com/making-fishing-lures-spinners

4) [500] -> http://www.shopping-servant.com/index.php/2011/03/my-first-colour-shape-snap

How to Run

Save this script in myscript.pl and run it like: perl myscript.pl > outfile.txt

It shows current url sequence no/total urls and time left in minutes:

processing 1/13752 time left in min=0

processing 2/13752 time left in min=3437.5

processing 3/13752 time left in min=2367.88333333333

processing 4/13752 time left in min=1833.06666666667

The Perl Script

#!/usr/bin/perl -w
#This code automatically picks and formats all address from the company url only and not from any files.
use strict;

#@Author Anurag Gupta
#@License GNU GPL License
use LWP::UserAgent;
use HTML::Element;
use HTML::TreeBuilder;
sub trim($) {
my $string = shift;
$string =~ s/^\s+//;
$string =~ s/\s+$//;
return $string;
}
my %links; #map of each link to array(no of valid links, no of nofollow)
my $url_to_check="http://www.indiacustomercare.com";
my $file_containing_backlinks= "<m:/anurag/tmp/197B.txt";
$url_to_check=~s{/$}{};  #strip the last /
$url_to_check=lc($url_to_check); #change to lower case
 
my @links;
{
my $FP;
open $FP, $file_containing_backlinks or die "Failed to open the input links file";
#open $FP, "<f:/tmp/links.txt";

(@links)=(<$FP>);

}
map { chomp; $_=trim($_); }  @links;
@links=grep {m{/};} @links;
my @unique= keys %{{map {$_=>1} @links}};

print "Out of ",scalar(@links)," links total unique=",scalar(@unique)," and total duplicate=",$#links-$#unique,"\n";

@links = @unique;
my $i=0;

my %logdata;
my %longlogdata; #The data containing lot of information
my @nonexistingpages;

my $starttime=time;
foreach my $link (@links)
{
my $currtime=time;

my $timeleft= (($currtime-$starttime)/($i+1)*(scalar(@links)-($i+1)));


#=last if($i==1000);
++$i;
print STDERR "processing $i/",scalar(@links)," time left in min=",$timeleft/60,"\n";

my $agent = LWP::UserAgent->new(env_proxy => 1,keep_alive => 1, timeout => 30, agent => "Mozilla/4.76 [en] (Win98; U)");
my $header = HTTP::Request->new(GET => $link);
my $request = HTTP::Request->new('GET', $link, $header);
my $response = $agent->request($request);

if ($response->is_success){

my $content = $response->decoded_content;
my $root = HTML::TreeBuilder->new_from_content($content);
$root->warn(1);

my @info=(0,0);

traversehtml($root,\@info);

$links{$link}=\@info;

push @{$longlogdata{"Existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)"}},$link;

}
else
{
#print $response->code."\n";
push @{$longlogdata{"Those web pages urls which you had purchased and exist no more(with error codes)"}},"[".$response->code."] ->  $link\n"; 

++$logdata{"Total web pages urls which you had purchased and exist no more(may have disappeared later)"};


}
}
if( ! exists $longlogdata{"Existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)"} )
{
die "No existing pages were found to be accessible!";
}

foreach my $link ( sort @{$longlogdata{"Existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)"}})
{
my @info =   @{$links{$link}};

my $dofollow = $info[0]-$info[1];

push @{$longlogdata{"Report of backlinks to your site from an existing web page link which you purchased"}}, "$link  valid links=$info[0]  nofollow=$info[1]  dofollow=$dofollow\n";

++$logdata{"Total existing Web Page URLs which you purchased(which may/may not contain backlinks to your site)"};

if($info[0]>0)
{
++$logdata{"Total found backlinks to your site(multiple backlinks to your site in the same web page are counted as 1)"};
push @{$longlogdata{"List of existing web pages which you purchased, containing atleast one backlink to your site(with nofollow/dofollow)"}},$link;
}
else
{
++$logdata{"Total Existing web pages, but containing no link to your site"};
push @{$longlogdata{"List of existing web pages which you purchased, but not containing ANY backlinks to your site"}},$link;
}

if($dofollow>0)
{
++$logdata{"Total dofollow links to your site(multiple dofollow backlinks to your site in the same web page are counted as 1)"};
push @{$longlogdata{"List of existing web pages which you purchased, containing DOFOLLOW links to your site"}},$link;
}
}

print "Printing STATS\n";

map { print "$_ => $logdata{$_}\n\n" } sort keys %logdata;


print "\n\nPRINTING VERBOSE REPORT\n";


foreach my $key (sort keys %longlogdata)
{

print "\n\n\n","-" x50,"\n";
print "$key\n";
print "-" x50,"\n";
my $i=0;
foreach my $item (@{$longlogdata{$key}})
{
++$i;
print "\t$i) $item\n";
}

}


sub traversehtml
{
my $node=$_[0];
my $info=$_[1];


if(ref(\$node) eq "SCALAR")
{
return;
}
elsif ( ($node->tag() eq "a"))
{
my $href=$node->attr('href');
if( $href and $href=~m{^$url_to_check}i) #defined
{
++$info->[0];

my $rel=$node->attr("rel");

if($rel and (lc($rel) =~m{\bnofollow\b}))
{
++$info->[1]; 
}
}
}

my @h = $node->content_list();

foreach my $item (@h)
{

if(ref(\$item) ne "SCALAR")  {traversehtml($item,$info); } #skip scalar items
} 



}

Sort HTML Tables rows using perl script

Do you have a table and you want to quickly sort it based on first cell values?
This script does just that.

Input:

File containing only table elements ( remove everything else)
First column must be serial number so this will be ignored and only second column will be compared rest all also will be ignored

The Perl Script

#Mandatory: first column must be serial number so this will be ignored and only second column will be compared rest all also will be ignored
#Also keep only tables in the html file

#This assumes
# any html file containing tables which may have tbody. But th must be used to denote the culumn header.
use strict;
use feature "switch";

use Common;

use HTML::Element;

use HTML::TreeBuilder;

my $filename="F:/tmp/t1.html";

my $reverse=1;

my $toc;

sub autoincrement
{

my $table=$_[0];

#see if tboday is present
my @children=$table->content_list();

foreach my $item (@children)
{
if($item->tag() eq "tbody")
{
$table = $item;
last;
}

}

my @unsortedrows = grep { $_->tag() eq "tr" and (($_->content_list())[0]->tag() eq "td")} $table->content_list();

my @sortedrows = sort { uc(($a->content_list())[1]->as_trimmed_text()) cmp uc(($b->content_list())[1]->as_trimmed_text())} @unsortedrows;

@sortedrows = reverse @sortedrows if $reverse;

my $parent=$sortedrows[0]->parent();

for(my $i=0;$i<@unsortedrows;++$i)
{
$unsortedrows[$i]->detach();

}

for(my $i=0;$i<@sortedrows;++$i)
{
$parent->push_content($sortedrows[$i]);
}

}

die "File $filename not found" if !-r $filename;

my $tree = HTML::TreeBuilder->new();

$tree->parse_file($filename);

my @h = $tree->content_list();

my @all_elements=$h[1]->content_list();

foreach my $item (@all_elements)
{

autoincrement($item) if ref(\$item) ne "SCALAR" and $item->tag() eq "table";

}

my @list1=$tree->content_list();

my @list2=$list1[1]->content_list();

foreach my $table (@list2)
{
print $table->as_HTML();
}

# Finally:

Autoincrement an html table serial numbers using Perl script

This perl script will append a new column along with the table with each row serial number incrementing from 1.

Input: 1) A file containing single table. This should not be further nested one and also should not contain colspan or rowspan

Script:

#!/usr/bin/perl -w

#This assumes

# any html file containing tables which may have tbody. This code only adds single column and does not change anything else.

use strict;

use feature "switch";

use Common;

use HTML::Element;

use HTML::TreeBuilder;

my $filename="F:/tmp/t1.html";

my $toc;

sub autoincrement

{

my $table=$_[0];

#see if tboday is present

my @children=$table->content_list();

foreach my $item (@children)

{

if($item->tag() eq "tbody")

{

$table = $item;

last;

}

}

my @rows = grep { $_->tag() eq "tr" } $table->content_list();

#First row is header row

my HTML::Element $th = HTML::Element->new('th');

$th->push_content("SNo.");

$rows[0]->unshift_content($th);

for(my $i=1;$i<@rows;++$i)

{

my HTML::Element $td = HTML::Element->new('td');

$td->push_content($i);

$rows[$i]->unshift_content($td);

}

}

die "File $filename not found" if !-r $filename;

my $tree = HTML::TreeBuilder->new();

$tree->parse_file($filename);

my @h = $tree->content_list();

my @all_elements=$h[1]->content_list();

foreach my $item (@all_elements)

{

autoincrement($item) if ref(\$item) ne "SCALAR" and $item->tag() eq "table";

}

my @list1=$tree->content_list();

my @list2=$list1[1]->content_list();

foreach my $table (@list2)

{

print $table->as_HTML();

}

# Finally:

TOC HTML Generator in Perl - Build Table of Content for your web page

I've searched on Internet any program/script which would automatically generate me a TOC for my html code which I need for all pages of a new website: India customer care. See a sample page whose TOC has been generated using this script.

What I found on the internet

Codeproject HTML TOC Generator
This did not seem to work. Even though I correctly inserted "INSERT contents" it was giving some error. It only worked partially.
A perl TOC GeneratorBut with the following issues:
- Does not use DOM to parse or generate (means it just uses some tricks to read html)
- Did not work when I tried
- Had listed it's own limitations
one zzee company selling TOC generator costing $20. Since it's an executable, forget customizing it.

My Solution

My Needs

Each of my web page will consist of only single H1 tag which will be same as page title because of SEO reasons. Remaining of my page will consist of H2-H7 tags and I want a TOC generated for the page for those H2-H7 tags. I don't want any TOC generator which automatically replaces any previous TOC, since if I'd added any customizations it'd lose that.

My Perl Script TOC Generator

Limitations

It'd ignore any H1 tags in the HTML page.
The h2-h7 titles must be only in the following fomat:
- <h2>Title abc</h2>
- <h2><a ...>Title abc</h2>
- It must not be in the format: <h2>Title<a ..>abc</a></h2>. That is h2 must not contain any html tag inside except of single a tag which must enclose full h2 text and not partially.

How to run it

Change the $filename to point to the desired file

How it works

Please see it does not modify the input file but prints out the new file content! It always dumps TOC in the beginning and remaining web page with inserted name tags without touching anything else. TOC contains nested UL's depending on H2-H7 nesting. So you'll need to copy paste the TOC html code in the desired location.

If it's a new code never generated by this script
- It generates TOC followed by web page with name= tags inserted within h tags
If it's already processed code by this script
- Beyond what it does above it, updates/deletes the old name tags.

Testing

I've tested with CSE HTML Validator and it's generating correct syntax code. Pl. see that my test code starts with a div tag only and not <html> since my site is on a CMS.

Does not work with your HTML code?

It happened with me then I found out that the input html file was not in correct syntax, one closing div was missing.

How Output TOC looks(I'm providing a sample!)

Code

#!/usr/bin/perl -w

#Copyright anurag gupta ; free to use under GNU GPL License

use strict;

use feature "switch";

use Common;

use HTML::Element;

use HTML::TreeBuilder;

#"F:/anurag/work/indiacustomercare/airtel/recharge.html";

my $filename="F:/tmp/t9.html";

my $index=0;

my $labelprefix="anu555ltg-";

my $tocIndex=100001;

my $toc;

my @stack;

my $prevHtag="h2";

sub hTagEncountered($)

{

my $hTag=shift;

my $currLevel=(split //, $hTag)[1];

given($hTag)

{

when(/h1/)

{

break;

}

default{

my $countCurr= (split /h/,$hTag)[1];

my $countPrev= (split /h/,$prevHtag)[1];

if($countCurr>$countPrev)

{

push @stack,($currLevel);

$toc.="<ul>";

}

elsif($countCurr<$countPrev)

{

# Now check in the stack

while ( @stack and $currLevel < $stack[$#stack])

{

pop @stack;

$toc.="</ul>";

}

$prevHtag=$hTag;

}

sub getLabel

{

my $name=$labelprefix.++$tocIndex;

}

sub traversehtml

{

my $node=$_[0];

# $node->dump();

# print "-----------------\n";

# print $node->tag()."\n";

# print ref($node),"->\n";

if((ref(\$node) ne "SCALAR" )and ($node->tag() =~m/^h[2-7]$/i)) #it's an H Element!

{

my @h = $node->content_list();

if(@h==1 and ref(\$h[0]) eq "SCALAR") #H1 contains simple string and nothing else

{

hTagEncountered($node->tag());

my $label=getLabel();

my $a = HTML::Element->new('a', name => $label);

my $text=$node->as_trimmed_text();

$a->push_content($text);

$node->delete_content();

$text=HTML::Entities::encode_entities($text);

$node->push_content($a);

$toc.=<<EOF;

<li><a href="#$label">$text</a>

EOF

}

elsif ( @h==1 and ($h[0]->tag() eq "a")) # <h1><a href="abc.com">ttt</a></h1> case

{

#See if any previous label already exists

my $prevlabel = $h[0]->attr("name");

$h[0]->attr("name",undef) if(defined($prevlabel) and $prevlabel=~m/$labelprefix/); #delete previous name tag if any

#set the new label

my $label=getLabel();

$h[0]->attr("name",$label);

hTagEncountered($node->tag());

my $text=HTML::Entities::encode_entities($node->as_trimmed_text());

$toc.=<<EOF;

<li><a href="#$label">$text</a>

EOF

}

elsif (@h>1) #<h1>some text here<a href="abc.com">ttt</a></h1> case

{

die "h1 must not contain any html elements";

}

my @h = $node->content_list();

foreach my $item (@h)

{

if(ref(\$item) ne "SCALAR") {traversehtml($item); } #skip scalar items

}

die "File $filename not found" if !-r $filename;

my $tree = HTML::TreeBuilder->new();

$tree->parse_file($filename);

my @h = $tree->content_list();

traversehtml($h[1]);

while(pop @stack)

{

$toc.="</ul>";

}

$toc="<ul>$toc</ul>";

print qq{<div id="icctoc"><h2>TOC</h2>$toc</div>};

my @list1=$tree->content_list();

my @list2=$list1[1]->content_list();

for(my $i=0;$i<@list2;++$i){

if(ref(\$list2[$i]) eq "SCALAR")

{

print $list2[$i]

}

else{

print $list2[$i]->as_HTML();

}

# Finally:

Friday, July 15, 2011

Privacy Policy

If you require any more information or have any questions about our privacy policy, please feel free to contact us by email at bhasker1@hotmail.com.

At http://usefulhtmlscripts.blogspot.com/, the privacy of our visitors is of extreme importance to us. This privacy policy document outlines the types of personal information is received and collected by http://usefulhtmlscripts.blogspot.com/ and how it is used.

Log Files
Like many other Web sites, http://usefulhtmlscripts.blogspot.com/ makes use of log files. The information inside the log files includes internet protocol ( IP ) addresses, type of browser, Internet Service Provider ( ISP ), date/time stamp, referring/exit pages, and number of clicks to analyze trends, administer the site, track user’s movement around the site, and gather demographic information. IP addresses, and other such information are not linked to any information that is personally identifiable.

Cookies and Web Beacons
http://usefulhtmlscripts.blogspot.com/ does use cookies to store information about visitors preferences, record user-specific information on which pages the user access or visit, customize Web page content based on visitors browser type or other information that the visitor sends via their browser.

DoubleClick DART Cookie
.:: Google, as a third party vendor, uses cookies to serve ads on http://usefulhtmlscripts.blogspot.com/.
.:: Google's use of the DART cookie enables it to serve ads to users based on their visit to http://usefulhtmlscripts.blogspot.com/ and other sites on the Internet.
.:: Users may opt out of the use of the DART cookie by visiting the Google ad and content network privacy policy at the following URL - http://www.google.com/privacy_ads.html

Some of our advertising partners may use cookies and web beacons on our site. Our advertising partners include ....
Google Adsense
Chitika
Kontera

These third-party ad servers or ad networks use technology to the advertisements and links that appear on http://usefulhtmlscripts.blogspot.com/ send directly to your browsers. They automatically receive your IP address when this occurs. Other technologies ( such as cookies, JavaScript, or Web Beacons ) may also be used by the third-party ad networks to measure the effectiveness of their advertisements and / or to personalize the advertising content that you see.

http://usefulhtmlscripts.blogspot.com/ has no access to or control over these cookies that are used by third-party advertisers.

You should consult the respective privacy policies of these third-party ad servers for more detailed information on their practices as well as for instructions about how to opt-out of certain practices. http://usefulhtmlscripts.blogspot.com/'s privacy policy does not apply to, and we cannot control the activities of, such other advertisers or web sites.

If you wish to disable cookies, you may do so through your individual browser options. More detailed information about cookie management with specific web browsers can be found at the browsers' respective websites.